1. What is Compilation? introduction to JVM compilation

February 24, 2016 – Joe Kearney – 7 mins

An introduction to what we mean by compilation and why it has multiple stages. A brief comparison between how Scala and Java compile similar structures for use in the JVM.

Compiling is, of course, what happens while you're sword-fighting on office chairs

Compiling is the process of going from human-readable code to executable instructions that can run on a processor. These days there are many stages to this. If you watch closely enough you can see the Scala compiler going through many different phases during compilation of Scala code to JVM bytecode, and that’s just the first step, getting to the JVM’s intermediate representation.

Why not just compile straight to machine code, skipping this intermediate step? There are a lot of reasons for this. Primarily, this intermediate step is how Java implements its “write once, run anywhere” policy – it’s a level of indirection that allows the same source code to be translated eventually into machine-specific instructions, whether the machine is a laptop, phone or data-centre blade.

But the fact that ultimate compilation is deferred until runtime has other benefits too – it allows the compiler to use runtime information to guide optimisation of the code. When HotSpot was originally shipped it was described as implementing “just-in-time compilation with adaptive optimisation”. At runtime your code generates a lot of information about how it is used: the distribution of parameters, and the hot-spots (geddit?) of code that could benefit from more aggressive optimisation. It’s a profiler that automatically tunes your code, on the fly.

Compiling Scala source

At its core, scalac or any other JVM language compiler is just a function String => Seq[ByteCode], bytecode being the JVM’s internal language, and the input being your program’s source code. When you consider having multiple source files and multiple output classfiles containing the bytecode it’s more like:

Set[String] => Either[CompileError, Set[Seq[ByteCode]]]

Many Scala talks, those on subjects like types, Scalaz and monads, focus on the CompileError part of this function, and are about proving validity of code under laws around types. Here we’re going to assume validity of code completely, and focus on the successful results of a compilation.

Once we have some bytecode, the JVM can be considered “just” the following function:

Seq[ByteCode] => Seq[MachineCode]

This is another compilation step, and it occurs entirely inside the JVM at runtime. The classfiles and jars compiled from your source are the inputs. It’s worth noting that code is compiled per method, and different methods can be compiled independently and even by different compilers.

There are four compilers at work.

interpreter – interprets bytecode instruction by instruction and runs the required operations on the processor. This sounds slow, and it is, but there’s notably very little startup cost.
client, or C1 – one of two “proper” compilers in the JVM. It compiles quickly, performs some optimisations on the code but is primarily concerned with completing quickly. It’s good for programs that need quick startup. It’s often used for short-lived apps or human-facing GUIs.
server, or C2 – the more aggressive cousin of C1. This takes the other side of the tradeoff, sacrificing startup speed (compilation times) for later runtime speed thanks to much greater optimisation of code.
tiered – chains together the other three, to get the benefits of quick startup times with eventual high performance. The level of optimisation is progressive. After the first few interpreted invocations a method will be compiled by C1, and after a few more (usually on the order of 10,000) the method will be recompiled and a faster, shinier version swapped in.

Code can be compiled multiple times through the lifetime of the JVM. If an assumption used in compilation turns out not to be true then the code can be deoptimised, and compilation starts again with the new information.
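
A rough way to watch this happen, assuming you are running on a HotSpot JVM: the sketch below calls a small method far more often than the thresholds mentioned above. The class and method names are mine, purely for illustration. Running it with the -XX:+PrintCompilation flag should log methods as they are compiled; the exact log format and the point at which each method is promoted vary between JVM versions, so treat what you see as indicative only.

```java
// Hypothetical demo class; the names are illustrative and not from the talk.
class HotLoopDemo {
    // A small method that is invoked often enough to become "hot".
    static long square(long x) {
        return x * x;
    }

    // Calls square() n times and accumulates the results.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += square(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Try:  java -XX:+PrintCompilation HotLoopDemo
        // and compare against -Xint (interpreter only)
        // or -XX:TieredStopAtLevel=1 (stop after C1).
        System.out.println(sumOfSquares(1_000_000));
    }
}
```

The program's result is the same under every flag; only the time taken and the compilation log should differ.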

Types and classfiles

Bytecode is organised into classfiles, and one type is stored in each file. What’s a type, in this context?

Languages express their types in different ways. Scala has three kinds of types: objects, classes and traits; while Java has two: classes and interfaces.

(Aside: counting kinds like this is at least reasonable, but the exact numbers depend on the details. Scala’s case classes are just classes with some free sugar. Java’s enums are just classes too and annotations are really interfaces, at least at the classfile storage level. Java and the JVM have the whole extra complexity of primitives being different to reference types, too, but we’ll just ignore that for now as they’re not stored in classfiles.)
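
The claims in the aside can be checked from Java itself using reflection. This is a minimal sketch with made-up type names (Colour, Marker, Api), declared only so that we can ask the JVM how it sees them.

```java
// Hypothetical types, declared only to inspect how they look to the JVM.
class KindsDemo {
    enum Colour { RED, GREEN }

    @interface Marker {}

    interface Api {}

    public static void main(String[] args) {
        // An enum is a class whose superclass is java.lang.Enum.
        System.out.println(Colour.class.getSuperclass() == Enum.class); // true
        System.out.println(Colour.class.isInterface());                 // false

        // An annotation type is reported as an interface.
        System.out.println(Marker.class.isAnnotation());                // true
        System.out.println(Marker.class.isInterface());                 // true

        // A plain interface, for comparison.
        System.out.println(Api.class.isInterface());                    // true

        // Primitives have Class objects but are not reference types.
        System.out.println(int.class.isPrimitive());                    // true
    }
}
```

Compiling this with javac also produces a separate classfile for each nested type (KindsDemo$Colour.class and so on), which fits the one-type-per-file rule.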

How do Scala and Java types compile down to classfiles?

The mapping from Java types to the JVM is easy – the kinds correspond directly. This is because Java grew up in a pretty close relationship with the JVM. There is very little in the JVM itself that isn’t there to support some feature of Java, because for a long time Java was the only language worth noting that used the JVM. These days there are many more, and features like the invokedynamic instruction were added specifically for those others.

Scala has a less direct correspondence between its types and JVM classfiles. Traits compile to interfaces if they have no implementation and only define API. This makes sense if you think about what is allowed in a Java interface. All other Scala types compile down to classes.

The fun really starts when looking at how inheritance in Scala (which allows multiple inheritance from types with behaviour, in the form of traits) is implemented in the JVM type system (which does not). I might come back to address this in the future, but I left it out of scope of the talk, which was already growing too long!

How do Scala and Java class members compare?

There are two sorts of code that can be invoked in Scala or Java – those that exist on an instance of some type and have access to its members (the this reference, other fields), and those that have no such context other than global state. The difference between these two sorts of function is that the code for the second type exists in a single place: it has a statically-known address.

The first type is precisely the polymorphism that makes Java and Scala object-oriented languages. The method might have been overridden in a class hierarchy, and to find the right code for the function attached to an instance, the hierarchy needs to be examined in some way. This is called a virtual function call.
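
As a small illustration of a virtual call, here is a sketch in Java with invented class names. The call site in describe only knows it has an Animal; which body of sound() actually runs depends on the class of the instance passed in.

```java
// Hypothetical class hierarchy used to show a virtual call.
class DispatchDemo {
    static class Animal {
        String sound() {
            return "...";
        }
    }

    static class Dog extends Animal {
        @Override
        String sound() {
            return "Woof";
        }
    }

    // The static type of 'a' is Animal, but the implementation of sound()
    // is chosen at runtime from the actual class of the instance.
    static String describe(Animal a) {
        return a.sound();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Animal())); // prints ...
        System.out.println(describe(new Dog()));    // prints Woof
    }
}
```

If you disassemble the compiled class with javap -c, the call in describe appears as an invokevirtual instruction.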

Scala divides these two sorts of function by putting the first kind on classes or traits, and the second kind on objects. (There are classfile subtleties that we will see later.)

Java calls the first kind instance methods, and the second kind must be marked with the static keyword.
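
A minimal Java sketch of the two kinds side by side, using a made-up Counter class. The isStatic helper is my own addition: it uses reflection to ask whether a declared method carries the static modifier.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical class with one instance method and one static method.
class Counter {
    private int count = 0; // per-instance state

    // Instance method: needs a Counter to be called on, and can read its fields.
    int increment() {
        return ++count;
    }

    // Static method: no instance and no 'this'; it only sees its arguments.
    static int add(int a, int b) {
        return a + b;
    }

    // Illustrative helper: reports whether the named method declared on Counter is static.
    static boolean isStatic(String name, Class<?>... parameterTypes) {
        try {
            Method m = Counter.class.getDeclaredMethod(name, parameterTypes);
            return Modifier.isStatic(m.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + name, e);
        }
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        c.increment();
        System.out.println(c.increment());                          // 2
        System.out.println(Counter.add(2, 3));                      // 5
        System.out.println(isStatic("increment"));                  // false
        System.out.println(isStatic("add", int.class, int.class));  // true
    }
}
```

Each Counter instance keeps its own count, whereas add exists once and is called through the class name.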

The bytecode for Java is again a pretty direct translation from source – you see methods and fields, some of them are static. Scala has a more complex translation into classfiles, which we’ll see in the next section.