martes, 22 de noviembre de 2016

Plugging Bee JIT and Compiler

This post should have been written a long time ago. By now I have already forgotten many of the difficulties I found when plugging the JIT and Smalltalk compiler into Bee but, on the upside, I now have a much bigger picture of the system, so I can describe things in more detail than before.

Where can we start? A good place is to compare our self-hosted runtime against the host VM. What are the differences between the JIT in Bee and the one in the host VM? To answer that, we have to understand what a JIT is, what it is made of, and how it is plugged into the system. Good! Now we have some concrete things to describe.

What is a JIT compiler


Just-in-time compiler. The last word tells most of it: it is a compiler, and the JIT part refers to the fact that it is designed to work simultaneously with the program being run, unlike compilers such as GCC, which are used to compile programs before they are run.

Smalltalk code, in most implementations, is compiled to bytecodes. This comes from its early history: the famous Blue Book describes bytecodes extensively. When saving a method, the Smalltalk compiler is triggered to convert the text to bytecodes. So you might think that the Smalltalk compiler is a kind of JIT compiler, but it is not, at least not in the usual sense of the word, or what is generally known as an actual JIT compiler. After compilation to bytecodes, at the point where the method is actually going to be invoked, those bytecodes have to be executed in some way. That is, invoking a method means performing computations that alter the program's execution context. The approach that JIT-based VMs take to execute methods is to translate them to native code just before execution.

In the case of Bee, the VM translates the method's bytecodes to native code. Running a method is actually executing the native code derived from the method's bytecodes. Bee is not an interpreted Smalltalk, which means that a method will always be nativized before it is run. Generation of a method's native code can be done at any time before its execution. Usually nativization happens during lookup: the activation of a message send will cause the invocation of a compiled method; the lookup algorithm checks whether that method already has native code and, if not, asks the JIT to translate it from bytecodes to native code.

But lookup is not the only moment at which native code for a method can be generated. Nativization could be done ahead-of-time (AOT), and a JIT-compiler that allows for this is sometimes called an AOT-compiler. Bee not only provides, but also requires, an AOT-compiler, as we will see below.

Internals of JIT compilers


The main work that our JIT compiler does is to convert bytecodes to native code, but how does this conversion work? To get an idea, we can first see what bytecodes look like and how they relate to source code. Consider this simple method:

sum
    ^3 + 4

The Smalltalk compiler will process this text and produce a CompiledMethod with these bytecodes:

[16r18] load R with SmallInteger 3
[16r19] push SmallInteger 4
[16rF7] send selector #+
[16r48] return


Basically, what we have is a stack machine with some special registers: R is the receiver and return register; there are some more (not mentioned here). The work the JIT has to do, then, is to transform bytecodes into native instructions.

In this case, what we will get is something like:

prologue: ...
method: mov EAX, 7      ;  load R with SmallInteger 3
        push 9          ;push SmallInteger 4
        call lookup #+
epilogue: ...
        ret

The method label describes the most interesting part, where bytecodes get converted to native instructions (EAX is both the receiver and return register in our calling convention). The prologue and epilogue labels consist of a few instructions for constructing and tearing down a stack frame.

The way our JIT compiler is implemented is very straightforward: it iterates over the bytecodes, assembling the corresponding native instructions for each one:

MethodNativizer>>#translateMethod
    self emitPrologueAndAlign.
    [self bytecodeIndex < self bytecodeSize] whileTrue: [
        self saveBytecodeNativeAddress.
        self translateSingleBytecode: self nextBytecode].
    self emitEpilogue
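
To give an idea of the other half, #translateSingleBytecode: can be pictured as a big dispatch on the bytecode value. The following is just a sketch using the four bytecodes of the example above; the emit* selectors are illustrative, not Bee's actual ones:

MethodNativizer>>#translateSingleBytecode: bytecode
    "Sketch: dispatch each bytecode to the routine that assembles its native instructions"
    bytecode = 16r18 ifTrue: [^self emitLoadRWithLiteral].
    bytecode = 16r19 ifTrue: [^self emitPushLiteral].
    bytecode = 16rF7 ifTrue: [^self emitSend].
    bytecode = 16r48 ifTrue: [^self emitReturn].
    self error: 'unknown bytecode'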

Plugging the JIT compiler to a classic VM


The final step, once you have a JIT compiler that is able to translate any method, is to plug it into the VM. There are a few places that are affected by its presence, mainly the lookup mechanism and the garbage collector. In the case of a classic VM, the typical places would be:

Implementation of lookup

The VM will check if the method found has native code and, if not, trigger the nativizer, as in the following simplified code:

void lookupAndInvoke(oop *receiver, oop *selector) {
    Method *method = global_cache->lookup(receiver, selector);

    if (method->native_code() == nil)
    {
         nativize(method);
    }

    invoke(method);
}

Method *GlobalLookupCache::lookup(oop *receiver, oop *selector)
{
    Class *klass = receiver->class();
    Method *method = lookup_in_cache(klass, selector);
    if (method != nil)
        return method;

    method = klass->find_in_method_dictionary(selector);
    this->add_to_cache(klass, selector, method);
    return method;
}

We don't provide here a description of how the cache is indexed exactly, but you can think of it as if it were just a low-level C array or vector.

When methods are changed in the system

The VM needs to be told when any method is changed in the system, so that it can update the cache. This is usually done with a primitive:

MethodDictionary>>#flushFromCache: aSymbol
    <primitive: FlushFromCodeCache>

The primitive could be implemented with something like this:

void FlushFromCodeCache(oop *selector)
{
    global_cache->remove_all_entries_with(selector);
}

During GC

The low-level array pointers have to be updated, as compiled methods, classes and selectors could be moved. This just requires a special case in the GC to trace the pointers in the array.

Finally, to make all this work, the VM is compiled to native instructions; the JIT, the primitives and the GC are then just called from the corresponding places. How does the Bee JIT differ from the VM's one, then?

Plugging the JIT to Bee self-hosted runtime


The first thing to notice is that the Bee JIT is implemented in Smalltalk. Thus, it consists of a bunch of compiled methods, and not of "native code". So the Bee JIT cannot just be linked into the Bee self-hosted runtime and start nativizing methods as required during lookup. That takes a bit more effort, as it requires someone to first convert the JIT methods to native code: a chicken-and-egg problem. But this problem can be "easily" fixed: we can take our Smalltalk JIT and execute it inside the host VM, using it to translate its own methods! That is, it nativizes itself ahead of time. The result is a set of compiled methods with their associated native instructions.

We have to be careful and make our nativized JIT behave slightly differently from the host VM JIT, as it should not generate code with references to the host VM. For example, when a message-send bytecode is translated, instead of calling the host VM lookup, it has to call a different one stored in the Bee kernel. The problem of referencing external things, particularly objects in the Bee kernel, is already solved by our Smalltalk libraries framework. The final step to plug the nativizer is then to generate a Smalltalk library, one that can be loaded by the Bee kernel and that contains both the JIT methods and their corresponding native code.

As for plugging the JIT into the Smalltalk world, many things are simplified because everything is implemented within the same paradigm. The lookup is already coded in Smalltalk; the only thing needed is to call our JIT when we find a method that doesn't have native code yet:


Object>>#_lookupAndInvoke: selector
    | cm nativeCode |
    cm := self _cachedLookup: selector.
    cm == nil ifTrue: [^self doesNotUnderstandSelector: selector].
    cm prepareForExecution.
    nativeCode := cm nativeCode.
    ^self _transferControlTo: nativeCode code

CompiledMethod>>#prepareForExecution
    self isNativized ifFalse: [self nativize].
    nativeCode refresh

CompiledMethod>>#nativize
    nativeCode := BeeNativizationEnvironment current nativeCodeFor: self


Regarding the global lookup cache, as it is just a reachable Smalltalk array, there is no need to make a special case for traversing it during GC. The #flushFromCache: implementation is not a primitive anymore; it is just another Smalltalk method that traverses the cache looking for the methods that correspond to the flushed selector.

GlobalDispatchCache>>#flush: selector for: behavior
    | index |
    index := self indexOf: selector with: behavior.
    contents
        at: index put: nil;
        at: index + 2 put: nil;
        at: index + 4 put: nil
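
As a rough picture of the traversal mentioned above (not the actual Bee code: the selector name is hypothetical, and it assumes the cache array keeps each cached method next to the behavior it was found for, and that compiled methods answer their #selector), a whole-cache flush for a selector could look like this:

GlobalDispatchCache>>#flushAllFor: aSymbol
    "Sketch: clear every cached <method, behavior> pair whose method
     was found for the flushed selector"
    1 to: contents size - 1 by: 2 do: [:index | | method |
        method := contents at: index.
        method == nil ifFalse: [
            method selector == aSymbol ifTrue: [
                contents
                    at: index put: nil;
                    at: index + 1 put: nil]]]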


Interestingly, the #nativize method is dynamically bound: it is loaded together with the JIT compiler library, only if we tell Bee to load the JIT. Using the same technique we used for the JIT, we can ship any other library pre-nativized, so that we don't need to load the JIT if we don't want to, or if we don't plan to modify code dynamically:



Besides, the Smalltalk compiler is also packaged in an optional library, so we can get the following workflow:



This can be good for saving system resources, and also for performance reasons: as the code shipped with our libraries is nativized ahead of time, we can afford to spend more time optimizing it, delivering code that is much faster than naively JITted code. I'll write more about that in the future, but that's going to be another post. I hope you enjoyed this one!



miércoles, 9 de marzo de 2016

The paper from ESUG '14

As I've been really busy lately and haven't found time to write any new posts, I thought that at least I could put here the paper from ESUG '14 for people that didn't see it. You may find it interesting. Many things have improved in the last two years (especially performance and completeness), but I still find the paper nice and come back to read it when I have to explain some part of the system and don't know how. So, without further ado, here it is.

lunes, 28 de septiembre de 2015

Blogging back

Wow, one blink of an eye and around a year passes! It's been a long time since the last post... you'll have to excuse me, but I find it very hard to pause and write here when at the same time I'm facing amazing problems such as the ones Bee poses. I'll try to write here more frequently; I hope I'll have the will power needed to keep up the exercise.

Anyway, in case you were wondering: yes! There's been a whole lot of work done since last year, and I'm going to show some of the advances at Smalltalks, which happens to be hosted in Buenos Aires this year. For a while I've been rounding corners of the self-hosted system: implementing missing primitives, working on the debugger, user interface, garbage collection and more. The next posts will describe some of that work. Hope to see you soon!


viernes, 15 de agosto de 2014

Pre-releasing Bee Smalltalk

Today marks a key milestone in the Bee project: we are releasing Bee's main sources and libraries to the public! Do not get too excited yet. This is a pre-release, because many things are still missing and because the license is CC-BY-NC (for now). Also, there's no way to browse code like in a complete Smalltalk environment, nor to rebuild the kernel.

You can consider this just a small preview of what the system is going to be like internally. There are two kinds of files: SLLs are the binary Smalltalk libraries; SMLs contain the source code of their respective libraries. bee.sll is the main executable (it is just an .exe file; you can rename and double-click it). If no arguments are passed, a *REALLY* basic command line interface is shown, where you can enter some Smalltalk code:

$> bee
Welcome to bee.
command line interface is waiting for you
> 3+4+5
12
> 3 class
SmallInteger
> | o | o := Object new. o size
0


If you pass a file, like this:

$> bee script.st

bee will load the nativizer and compiler and try to compile and execute the file's sources. On Linux (with Wine) you can write a script like:

#!/usr/bin/wine bee.sll

Transcript show: 'hello bee'

You can also execute:

$> bee loadAndRunLibrary: <libname>

which will load the library (allowing you to execute code without loading the compiler and nativizer). I think the most interesting things to look at for now are the implementations of the primitives. You can go to your favourite Smalltalk and inspect

CompiledMethod allInstances select: [:cm | cm primitiveNumber > 0]

and then check in bee.sml how we implemented them (be sure to look at the bootstrapped version and not the Host VM one, we are writing both for now).

SLL files may or may not contain native code. In the case of the kernel library (bee.sll), native code is of course present (because you need some machine code to start executing). The JIT also contains native code (to generate native code you need a native native-code generator). Anything else doesn't require native code (but may contain it). For example, for now the compiler doesn't contain any native code: when it is loaded, the nativizer generates machine code just in time, as its methods are executed.

That's all for now, I'll be showing more details at ESUG, see you!



lunes, 14 de abril de 2014

Working on multithreading

A Smalltalk runtime that is able to execute concurrent sequences of code would be very useful. The rise of multicore/multiprocessor hardware and the direction the industry is taking, increasing the number of cores in processors each year, make us think we should be prepared.

The simplest way of making use of multiple cores is to just avoid threads and directly work with multiple processes, connected by some kind of IPC. This is easier because different processes share nothing, except what you explicitly choose. Isolation helps program safety, concurrency-wise, as two fully separated processes cannot have race conditions between them.

Then you might ask, why not just use processes? The main reason is that threads can be lighter and, in some ways, more straightforward (as different threads share all their memory and resources). But the question can also be stated conversely: why not threads? Giving the option of using threads would be nice, so why not? There are lots of VMs wandering around out there supporting a myriad of different languages, but not many allow multithreaded execution of code (some notable exceptions are the JVM, Jikes and the CLR). Being able to execute many threads can simplify some VM implementation details and complicate others. Then, in order to understand how we can make a multithreaded Bee, we had better dig into the inherent problems of concurrent code execution.

There's an important point to add here. If we want to explore whether it is possible to support concurrent independent Smalltalk threads, we should do it at this stage, and wait no longer. As we'll see here, threading support can affect many system design decisions, so adding it later could require lots of changes and make it too costly. Also, adding threading support to a minimal system is a lot easier than to a big one. So, to facilitate development, we should strive to make it work very well from the beginning, and then analyze how it behaves with each addition to the system.

Basic implementation and testing

The basic problems that arise when writing multithreaded code are related to resource access. Usually, race conditions start popping up because of unsynchronized concurrent accesses to memory locations. Good programmers write code in a way that avoids most pitfalls that lead to race conditions, but there are always some corners to cut. Avoiding global variables is the starting point of thread-safe code. Unfortunately, Smalltalk is plagued by global variables: symbols, classes, the Smalltalk global, all the meta-model is global in Smalltalk, so we have a bad starting point. But if we think about it a bit, many of these things are immutable and most rarely change.

The way we test the multithreading scenario is also interesting. For now, we decided to go for a simple test. Instead of implementing and testing some known multithreaded algorithm, we just take all of our tests and run them in parallel. That is,

BeeKernelTest>>#runAllConcurrent
  | selectors results |
  selectors := self allTests.
  results := selectors
    concurrentCollect: [:selector | self runTest: selector].
  ^results conform: [:result | result]

FixedSizeCollection>>#concurrentCollect: aBlock
  | answer size token |
  size := self size.
  token := Object new.
  answer := self species new: size.
  answer atAllPut: token.
  1 to: size do: [:index | 
    [| x y |
    x := self at: index.
    y := aBlock evaluateWith: x.
    answer at: index put: y] forkThread].
  [answer includes: token] whileTrue: [].
  ^answer

We still have to think about the API we are going to implement, but this is a nice starting point. Finally, we should emphasize that we are planning the 'do nothing' kind of thread support here. This means we are not going to add synchronization locks all around the system. It is the developer, who knows how threads are going to be used, who should configure the necessary locking. But there are some points where the system should know that there can be many threads out there; let's explore them.

Allocation

A good example is object allocation. In Bee, objects are allocated by a GCSpace. When receiving #new, the class calculates the amount of space needed and sends #allocate: bytes to the global gcSpace. If many threads are allocating objects in the same gcSpace, they had better synchronize the allocation. Look at this naive implementation of allocate:

GCSpace>>#allocate: size
    | answer next |
    answer := nextFree.
    next := answer + size.
    nextFree := next.
    ^answer

Now suppose two threads fall into #allocate: at roughly the same time. It could happen that both threads read the same value of nextFree (the one before updating), returning the same address for two different objects. This will quickly lead to random memory corruption and nondeterministic program crashes. There are two main ways of solving this: replacing the global gcSpace with many smaller spaces, or adding some kind of mutex. The figure shows both options; notice that while shared spaces need a mutex, split spaces get only half of the total address space each.



There's also the option of mixing both things: each gcSpace would have independent pages and wouldn't need a mutex to allocate individual objects within a page; when a page is exhausted, it could take more pages from a global space, accessed with a mutex. That would be the most efficient trade-off, but it would complicate the implementation. For now, let's check the simplest mutex solution:

GCSpace>>#allocate: size
    | answer next |
    mutex _busyWait.
    answer := nextFree.
    next := answer + size.
    nextFree := next.
    mutex _release.
    ^answer

where mutex is a 4-byte ByteArray initialized to -1 (-1 meaning the lock is free).
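
Just to make that representation concrete, a sketch of how such a lock word could be set up (not the actual Bee code; the selector is hypothetical):

GCSpace>>#initializeMutex
    "Sketch: the lock is a 4-byte buffer whose 32-bit value starts as -1
     (all bits set), meaning the lock is free; #_busyWait and #_release operate on it"
    mutex := ByteArray new: 4.
    mutex atAllPut: 255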

low-level mutex
The figure shows the implementation of both #_busyWait and #_release. The array starts with -1. xchg swaps the values of [eax] and ecx atomically. Thus, the value of ecx after xchg can be 0 or -1. If it is -1, it means the lock was free, so we are clear to go. If its value is 0, it means somebody else has already reserved the lock and we have to wait. Notice that in that case the xchg didn't make any change in memory, as both [eax] and ecx were 0. In this implementation there is no "wait": we keep reading the mutex until it becomes -1. For #_release, we have to use the lock prefix to give the dec instruction atomic semantics (xchg is always atomic and thus doesn't require the lock prefix). Also notice that just after acquiring the lock we read the value of nextFree. We have to be careful that processors don't reorder reads and writes in this sensitive code section, or the lock would become useless. We know this won't be a problem because instructions with lock semantics assure this reordering will not happen.

This was a simple solution, so we didn't need to split the global gcSpace. But there are other places that are not thread-safe. When looking at native code, for example, there are some inherently unsafe points.

Call site patching

If you recall the post about lookup, after doing lookup and before method activation comes call site patching. Suppose we are sending #yourself; before patching we have this assembly at the call site:

[68 4-bytes address] push #_yourself
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

which the call site patcher transforms to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#yourself) nativeCode

The transformation is done by overwriting the 11 bytes with three consecutive 4-byte writes (one byte is written twice). A second thread could fall in after the first 4-byte write, finding this:

[90] nop
[90] nop
[90] nop
[90] nop
[??] (#_yourself oop most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

There are not many satisfying solutions to this problem. One trick could be to insert a backward jump with the first write, like this:

[EB FE] jmp -2
[90] nop
[90] nop
[??] (#_yourself most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

but what happens if thread B was already decoding the push instruction when the write was issued? Is decoding atomic? This approach isn't satisfactory, so we didn't apply it. Another way (we haven't implemented it yet) is to avoid instruction replacement in favor of instruction mutation, always issuing the same kind of call. We could directly do this from the beginning:

[E8 4-bytes displacement] call (Object>>#_lookupAndInvokeYourself:) nativeCode

notice that this (Object>>#_lookupAndInvokeYourself:) would indirectly call lookup:

(Object>>#_lookupAndInvokeYourself:) nativeCode
    [68 4-bytes address] push #_yourself
    [FF 25 4-bytes address] jmp [(Object>>#_lookupAndInvoke:) nativeCode]

this is almost exactly the same code we had at the call site, just extracted into a separate native code, with the call changed to a jump (avoiding a double call). The patch of the call site will not replace the instruction -it will still be a direct call- but will only mutate the address:

[E8 4-bytes displacement] call (Object>>#yourself) nativeCode

the problem here is that movs are not guaranteed to be atomic unless they are aligned to the size being moved. We should then assure that the call site's 4-byte displacement is correctly aligned (inserting nops if needed), like this:

0x100000 [90] nop
0x100001 [90] nop
0x100002 [90] nop
0x100003 [E8 4-bytes displacement] call (Object>>#yourself) nativeCode

and we are still assuming this won't break instruction decoding. We shall experiment and see what happens, but luckily the AMD Application Programming manual (3.11, Cross-Modifying Code) seems to prove us right. We could also have a per-thread code cache, but I think it's a bit expensive and unnecessary. For now, we just disabled call-site patching in multithreading targets.

Global Lookup

Another problem with lookup is the global dispatch cache. This is not a thread-safe object and it never will be. Lookup is executed frequently and can be slow, so we cannot leave other threads blocked waiting for a lock. This cache contains an internal lookup table that is altered frequently, so accessing it without a lock corrupts it quickly. The solution we implemented is to have one dispatch cache per thread (you could also have one per core to optimize resources; the dispatch cache measures roughly 4kb). To allow this, we reified the Thread object. A thread object contains a dispatchCache. When created, a thread executes a block, first doing initialization:

Thread class>>#newOn: aBlock
    ^self new execute: aBlock

Thread>>#execute: aBlock
    | method entrypoint |
    method := self _lookup: #value:.
    entrypoint := [
        self setCurrent.
        aBlock value].
    method newThreadOn: entrypoint

Thread>>#value: aBlock 
    aBlock value

Thread>>#setCurrent
  KernelLibrary tlsSetValue: tlsThreadIndex with: self.
  KernelLibrary tlsSetValue: tlsGlobalDispatchCacheIndex with: globalLookup

The code is a bit complex because we have to call the CreateThread function (done by newThreadOn:). We are not executing the incoming block directly but putting it inside entrypoint, which first initializes the thread and then evaluates it. You can see in #setCurrent that we implemented Thread Local Storage support. This allows us to access thread-specific information (like the current Thread object) through fs segment register indirection.

Thread class>>#current
  ^tlsThreadIndex _getThreadValue

Windows maintains the fs segment register to always point to the TIB (thread information block) segment. The TIB is a buffer that contains a lot of thread information, and it has some free slots for storing user-specific thread data. We are making use of one of these slots to store the current thread object, from which we can obtain the dispatch cache. In the figure we can see the assembly code required.
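
In the same way, the per-thread dispatch cache can be fetched through its own TLS slot. A minimal sketch (assuming tlsGlobalDispatchCacheIndex is visible from the class side; this is roughly how the GlobalDispatchCache current used by #_cachedLookup: in the previous post could resolve to the calling thread's cache):

GlobalDispatchCache class>>#current
    "Sketch: answer the dispatch cache stored in the running thread's TLS slot"
    ^tlsGlobalDispatchCacheIndex _getThreadValue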


FFI

A little race condition we found was related to FFI. Consider this method:

CLibrary>>#strlen: aString
<api: strlen ulong ulongReturn>

it makes a call to the native strlen function in the C standard library, passing aString's address and converting the result to an integer. At compile time, the compiler sees the api: pragma and inserts an extra literal into the generated compiled method: a byte array. This byte array encodes the address of strlen and the types of the arguments and return value. The function address makes up the first 4 bytes of the byte array and is calculated lazily: it is initialized to FF FF FF FF, and set the first time the method is invoked.


atomic vs. non atomic memory writes
With multithreaded applications, many threads may be activating the method at the same time. All of these activations should read either FF FF FF FF or the correct address of strlen, and there shouldn't be any problem. But there is. If the read and write of the address are not atomic, an invocation may find that the address is something like FF FF xx xx, where the last 2 bytes have been written and the first 2 have not. As unlikely as you may think this is, it happened to us!

The simplest way to avoid this problem is twofold: first, take advantage of the x86 architecture, which guarantees atomic 32-bit reads/writes if the memory address read/written is 32-bit aligned; second, issue the read/write as one 32-bit operation (and not split it into two 16-bit or four 8-bit ones). The figure above shows two examples of non-atomic operations, and an atomic one. The first store is unaligned, so the processor doesn't guarantee atomicity. The second is split into two stores; each of them is 16-bit aligned and, as a consequence, 16-bit atomic, but not 32-bit. A second thread could read the memory just after the first 16-bit write and before the second. The third is an aligned write: the x86 architecture assures it will be issued atomically.

Going back to our problem, we know the function address is 32-bit aligned because it lies at the beginning of the byte array which, like any other object, is 32-bit aligned. But our uLongAtOffset:/uLongAtOffset:put: implementation was splitting the ulong in two, to avoid problems with big integers. The solution was to change our implementation of the uLongAtOffset: family, assuring that reads and writes are done in a single operation. As usual with race conditions, solving the problem took just minutes; finding it took many hours.

Conclusions

Here we gave many implementation details of some of the problems we faced when entering this crazy multithreading world. We are still learning, so there might still be undetected problems, but we believe no major roadblocks are going to appear. There are of course some unanswered questions. We are still thinking about how we are going to handle thread barriers, like: how are we going to stop all threads for GC? We could just maintain a counter that is increased on activations and back-jumps, and peek for events from time to time. This might reduce performance; maybe we can avoid counting by doing some tricks with SuspendThread. In that case we should be careful, as SuspendThread is not meant to be used within the same process but from the outside (mainly by debuggers).

To conclude this post, we attach a simple multithreaded test program. It just runs all the tests in parallel and shows the results in the console. To run it:

$> wine bee.sll runAllConcurrent

to run on Linux/Mac; on Windows, rename to bee.exe. There isn't much to see but the non-deterministic order in which the tests finish. If you were hoping to see a command line interface that parses Smalltalk, compiles, nativizes and finally executes it, stay tuned: we already have a working prototype and just need to find time to write about it (no joke!). See you in the next post!

Here is bee.sll




miércoles, 12 de marzo de 2014

Bee performance and lookup strategies

It's been quite some time since the last post; we've been really busy cooking Bee Smalltalk. You wouldn't believe how much work is needed in order to connect the dots!

We continued trying to make libraries load in Bee, but we felt that performance was abysmal: library loading was taking ages and tests ran so slowly that they were slowing down the whole development process. In order to measure performance, we ported a bunch of benchmarks to Bee.

Beenchmarks

The first benchmark we took was the simplest we know: tinyBenchmarks. It consists of a rather simple pair of metrics: the first one estimates bytecode speed and the second one dispatch speed. The bytecode test does a lot of arithmetic (which is executed directly without message sends) and sends only a few messages: #new:, #at:, #at:put:. On the other hand, the dispatch test calculates Fibonacci recursively:


fibonacchi
    ^self < 2
        ifTrue: [1]
        ifFalse: [(self - 1) fibonacchi + (self - 2) fibonacchi + 1]

But tinyBenchmarks are really, really basic, and not a complete benchmark suite; they are just a quick and small test to get a general idea. For that reason, we wanted to have more benchmarks and ported SMark to Bee. We didn't bother to port all the benchmarks yet, but at least we made Slopstone and Smopstone work.

So we ran a pass of the benchmarks through CodeXL (if you know a better native code profiler, leave a comment!) and analysed the results. The first thing we saw was expected: our lookup was slow, brutally slow. For that reason, we spent a couple of days improving Bee's lookup speed, and here we are going to show some of the results we got.

Bee Lookup

We could just throw the numbers here, but to give them context we want to describe the different lookup strategies we implemented in Bee. We'll give you many details now, but if you want to know more about most of these strategies, you can also read a very good post about Cog lookup written not long ago by Clément Béra.

One last remark before we start. Bee's implementation is a bit different compared to other dialects, and this affects its lookup directly. In Bee, the original Behavior class from Smalltalk was split in two: the method responsibilities were set apart and only the shape responsibilities were left there. What we usually knew as Behavior was renamed Species. The method responsibilities were reified as a new kind of object named Behavior: what we used to call the method dictionary array. A behavior is then pretty much an array filled with method dictionaries (or behaviors). In Bee the object header doesn't point to the class but to the behavior, and the object's class is given by the one found in the first method dictionary of the object's behavior.

Naive lookup


To understand how lookup works in Bee, let's have a look at a method that sends a message, and analyze the generated native code. Suppose we have:

sendYourself
    self yourself

This method only sends the #yourself message. The nativizer will usually push the arguments (in this case there aren't any) to the stack and call some lookup function. In the naive case, the lookup function receives one extra argument: the selector. The figure depicts the generated assembly. The original Bee Smalltalk implementation of _lookupAndInvoke: is this:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _lookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _lookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    self
        _transferControlDiscardingLastArgTo: cm noClassCheckEntrypoint

#_lookup: looks for the corresponding method in the receiver's behavior (former method dictionary array).

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior
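
As a rough picture (not the actual Bee code), treating the behavior as the array of method dictionaries described above, the scan could be sketched as:

Object>>#_lookup: aSymbol in: aBehavior
    "Sketch: scan the method dictionaries held by the behavior and answer
     the first method found for aSymbol, or nil if none defines it"
    aBehavior do: [:dictionary | | cm |
        cm := dictionary at: aSymbol ifAbsent: [nil].
        cm == nil ifFalse: [^cm]].
    ^nil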

If the method isn't nativized yet (which can't happen now) it should be compiled. Finally, once we've got the compiled method, we have to jump to its native code. If you pay attention to the stack, the last thing pushed is the selector (#yourself), and #_lookupAndInvoke: takes exactly one argument. There's also an implicit push of the return address in the call instruction. Notice that the return point from #yourself should be the method that sent the message (sendYourself) and not _lookupAndInvoke:; that's why we have to jump to the entrypoint and not call it.

Invoke lookup

As you can imagine, #_lookupAndInvoke: and its closure have to be specially compiled. This is because lookup must not trigger another lookup recursively, or else it would enter an infinite recursion. As the lookup method's closure is really small and lacks any polymorphism, our solution is to precalculate which methods will be used during lookup. For lookup nativization, we then alter send nativization to use method invocation instead of method lookup. For example,

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior


which originally would be nativized as

...
[68 4-bytes address] push #_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]
...

can be nativized as

[68 4-bytes address] push Object>>#_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_invoke:) nativeCode]

with (almost) this implementation of _invoke:

_invoke: aCompiledMethod
    | nativecode bytes classCheckDisplacement address |
    nativecode := aCompiledMethod _basicAt: 2.
    bytes := nativecode _basicAt: 1.
    classCheckDisplacement := 16rD.

    address := bytes _asSmallInteger + classCheckDisplacement.
    self _transferControlDiscardingLastArgTo: address


With this strategy working, we ran a pass of tinyBenchmarks. As a reference, on the same machine I get:

TinyBenchmarks:
Host VM: 151,210,868 bytecodes/sec; 15,391,934 sends/sec
Pharo: 266,112,266 bytecodes/sec; 23,850,401 sends/sec

Report for: SMarkSlopstone
Host VM: Stone total: iterations=100 runtime: 0.136ms +/-0.010

Report for: SMarkSmopstone
Host VM: Stone total: iterations=1 runtime: 259.756999999285ms


The result of the beenchmarks was (expectedly) embarrassing:

TinyBenchmarks
205507 bytecodes/sec; 379224 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 92.4ms +/-1.4

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 534259.684200004ms

This means the host VM gets 735.79X and 40.59X better results for tinyBenchmarks, and also 679.41X and 2056.77X for Slopstone and Smopstone respectively. As I said before, this result was expected because, until then, we had never optimized anything. But of course, a Smalltalk-written VM has to be optimized at some point in order to make it fast. And so we did.

Monomorphic inline cache (aka patching call site)


To solve our performance problems, we first thought of re-enabling the PIC or, at least, some kind of MIC. We already have an implementation of a PIC, but it isn't plugged into the system as it works now, and for the moment it was easier to implement a MIC. The idea is simple: at the call site, instead of calling lookup, patch the call address to point to the native code of the last method found. To make this safe, add to that method (and all others) a prologue that checks that the receiver's behavior is the right one for that method. In assembly, the previous message send would be patched to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[FF 15 4-bytes address] call [(Object>>#yourself) nativeCode]

for the call site and add

    test    al, 1
    jnz     receiverIsSmallInteger

classCmp:
    cmp     dword ptr [eax-4], Object instanceBehavior
    jnz     failedClassCheck
    jmp     realMethodStart

receiverIsSmallInteger:
    cmp     [classCmp+3], SmallInteger instanceBehavior
    jz      realMethodStart

failedClassCheck:
    pop     ecx
    push    #yourself
    push    ecx
    jmp     [(Object>>#_lookupAndInvoke:) nativeCode]

realMethodStart:
    push ebp
    mov ebp, esp
    ...

for the Object >> #yourself prologue. EAX is the receiver register, which points to the receiver's first slot. At EAX-4 is the pointer to its behavior. First in the assembly comes a special case for small integers, as they are not pointers but immediate values, so [EAX-4] doesn't point to SmallInteger instanceBehavior. Anyway, you can see that in case of a behavior mismatch, the selector is pushed to the stack, carefully popping and pushing the return address to simulate a call. This prologue is executed only after the call site has been patched, on the following executions, and assumes no selector was pushed; that's why push #yourself was replaced with nops. A possible (but less efficient) variation of all this trickery would have been not to remove the push at the call site, and to pop it in case the class check succeeded, carefully protecting the return address:

opcode bytes                  disassembly
[68 4-bytes address] push #yourself
[FF 15 4-bytes address] call [Object>>#yourself nativeCode]

with:

realMethodStart:
    pop [esp]
    push ebp
    ...

(and also not to push the selector in failedClassCheck)

But replacing the push #yourself with five nops is a bit more efficient, as the selector is only needed for #_lookupAndInvoke:

Luckily, we already had this prologue generation implemented and ready to use, we just needed to plug it. This involved some small modifications to the lookup:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _lookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _lookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    cm patchClassCheck: self behavior.
    self
        _transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint

There are two patches involved here: the obvious one is that of the call site. But patching the class check is also very important; without it, patching the call site would be much less useful. To understand why, think again of Object>>#yourself. The native code of Object>>#yourself will contain a prologue that compares the receiver's behavior against Object's behavior. But you'll probably be sending #yourself to objects of a class other than Object, so the check will fail and lookup will be executed again, when it wasn't actually needed. The simplest solution is an MRU one: change the behavior in the cmp instruction to the most recently used object's one. patchClassCheck: does exactly that, it changes:

classCmp:
    cmp     dword ptr [eax-4], Object instanceBehavior

to

classCmp:
    cmp     dword ptr [eax-4], receiver's behavior


There's still a small optimization to add. Instead of leaving an indirect call, we can replace it with a direct one:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#yourself) nativeCode

In x86, immediate calls are always relative to the call site. This complicates the nsll generation and library loading a bit. To avoid that, we prefer to generate the nslls using indirect calls (which always take absolute addresses) for now, and patch to direct calls on first lookup. It is important to put the nops before the call and not after, because that way the return address pushed onto the stack when executing the call is not modified. The return address is used as the pivot during patching, so leaving it unmodified means that subsequent patching of the call site will overwrite the same bytes as before.

There's another point where this optimization can also be applied. Remember that lookup uses a different send strategy: invoke. This means pushing the compiled method and then calling (Object >> #_invoke:) nativeCode. It would be more efficient to do a direct call to the method's native code. For example,

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior


which was being nativized as

...
[68 4-bytes address] push Object>>#_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_invoke:) nativeCode]
...

can be patched to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#_lookup:in:) nativeCode

by a small modification of invoke:

_invoke: aCompiledMethod
    | nativecode bytes classCheckDisplacement address |
    nativecode := aCompiledMethod _basicAt: 2.
    bytes := nativecode _basicAt: 1.
    classCheckDisplacement := 16rD.

    address := bytes _asSmallInteger + classCheckDisplacement.
    self _transferControlDiscardingLastArgAndPatchingUnsafeTo: address

Here an "unsafe" patch means making the direct call point not to the prologue, but directly to the real method entrypoint, as the invoke should execute directly, without class checks.

With this small optimization plugged in, we ran the tests again:

TinyBenchmarks
86428089 bytecodes/sec; 41330738 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 17.77ms +/-0.34

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 103848.336900011ms

Compared to the host VM in tinyBenchmarks, it is still 1.75X better than our speed in bytecodes, but we get 2.69X more message sends per second! About these results, of course we are cheating a bit: we don't have GC, nor are we peeking for events when entering methods or at backjumps. Slopstone and Smopstone instead showed the sad truth: Bee took 130.66X and 399.79X more time to complete than the host VM. Slop/smopstones are a much better metric of the real performance. But these results were awesome anyway, as the sends/sec result gives us a clue of what kind of performance we can expect in the end.

Global lookup cache


Being 400 times slower than the host VM was still too much, so there was one last optimization we did: the global lookup. This is a last kind of lookup in Bee that works when the MIC fails. Consider this implementation of Object class>>#new:

new
    ^self basicNew initialize

Both the receiver of basicNew (some class) and the receiver of initialize (some object of some class) are going to be of a different class on each #new send. This is because you are going to create different kinds of objects, which means that the class check will fail most of the time. What we can do is keep a cache of common methods found during lookup, which we can put in a table. Given a selector and the behavior of the receiver, there is a unique compiled method to be found. Then, we can calculate a hash from the combination of selector and behavior, and use it as an index into the table. If that position contains nil, we do the slow lookup and put the resulting method there, paired with the behavior for which it was found. If it is not nil, we check the behavior for which the method was dispatched, and if it is the one we are looking for, voila, we return the method we found. We do this for 3 consecutive slots to mitigate hash collisions.

GlobalDispatchCache >> #lookupAndCache: selector in: aBehavior
    | method |
    method := self at: selector for: aBehavior.
    method == nil ifTrue: [
        method := self _lookup: selector in: aBehavior.
        self at: selector for: aBehavior put: method].
    ^method

GlobalDispatchCache >> #at: selector for: behavior
    | index |
    index := self indexOf: selector with: behavior.
    ^self lookup: selector for: behavior startingAt: index

GlobalDispatchCache >> #indexOf: selector with: behavior
    ^(selector _asPointer bitXor: behavior _asPointer) + 1 bitAnd: self tableSize


As we need to store a <compiledMethod, behavior> pair each time, we use odd positions for compiled methods and even ones for behaviors. Oops are aligned to 4 bytes, so both addresses end in 00. _asPointer sets the last bit to 1, which gives us a smallPointer (a small integer which is also an oop divided by 2). Finally, the xor leaves the last bit (of the small integer) at 0, and the +1 assures we always get an odd hash result. Then we can use that as the index for the compiled method, and the next one for the behavior.
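
For completeness, the probing itself could look roughly like the sketch below. This is not the actual Bee code: it assumes the table is held in a contents array with methods at odd slots and their behaviors right after them, and it omits bounds and wrapping handling.

GlobalDispatchCache >> #lookup: selector for: behavior startingAt: index
    "Sketch: probe the three consecutive <method, behavior> pairs; as described
     above, only the behavior is compared"
    index to: index + 4 by: 2 do: [:i |
        (contents at: i + 1) == behavior ifTrue: [^contents at: i]].
    ^nil

GlobalDispatchCache >> #at: selector for: behavior put: method
    "Sketch: store the <method, behavior> pair in the first free probe slot,
     or else overwrite the first one"
    | index |
    index := self indexOf: selector with: behavior.
    index to: index + 4 by: 2 do: [:i |
        (contents at: i) == nil ifTrue: [
            contents at: i put: method.
            contents at: i + 1 put: behavior.
            ^method]].
    contents at: index put: method.
    contents at: index + 1 put: behavior.
    ^method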

Finally, we plug this into the lookup:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _cachedLookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _cachedLookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    cm patchClassCheck: self behavior.
    self
        _transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint

_cachedLookup: aSymbol
    ^GlobalDispatchCache current lookupAndCache: aSymbol in: self behavior

With all of this, we ran the tests for a last time:

TinyBenchmarks
86720867 bytecodes/sec; 42086788 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 2.764ms +/-0.083

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 12765.7030000091ms

TinyBenchmarks performance remained the same this time, which probably implies that there weren't any class check failures and the global lookup wasn't needed there. But Slopstone and Smopstone improved a lot, now taking only 20.32X and 49.14X the time of the host VM. Regarding tinyBenchmarks, we see that both on the host VM and on Pharo the bytecodes/sec are always around 10 times the messages/sec, while in Bee we only get 2 times more. As Bee's native code is very similar to the host VM's for jitted bytecodes, this suggests that the implementation of the few messages being sent (#new:, #at:, #at:put:) is heavily dragging down performance and should be improved.

To conclude this lengthy post: we see that there's still a long way to go performance-wise, but we think this work paints a bright outlook for Bee. It now has a performance level that is acceptable, at least, for development. For now, we are done with lookup optimizations, which were the low-hanging fruit that CodeXL showed. Now it is showing that improvements have to be made to the primitives, which are a bit slow. I think we can gain an order of magnitude more there in Smopstone with a small amount of work. After that, we'll see. We still have to plug in the PIC to replace the MIC, keep working on multithreading and more, but that's going to take another whole post!



Results

                           Bytecodes/sec   Sends/sec   Slopstone (ms)   Smopstone (ms)
Host VM                    151210868       15391934    0.136            259.757
Bee (unoptimized)          205507          379224      92.4             534259.684
Bee + MIC                  86428089        41330738    17.77            103848.337
Bee + MIC + global cache   86720867        42086788    2.764            12765.703

jueves, 2 de enero de 2014

Connecting the dots

We've been working very hard this year. The seed that was planted years ago has been growing underground and we are now starting to see the first leaves pop out. After Smalltalks 2013 we had the chance to stop the ball, have an overview and open to wider perspectives. After some time, we got back to daily work and thought about the current situation and the next steps. We opened this blog to let people know about both the progress of this project and the ideas around it. Now that the year is gone, we'd like to share some overview and vision of the future.

We have lots of stuff going on here and many things almost working that need a last push to get integrated. You know, that 1% boring stuff.

We now have this small kernel working with libraries, which is the base of the system. We also have a jitter, a garbage collector, and the starting point of a debugger, all of them written in Smalltalk, but none of them plugged in yet. So we are asking ourselves: what do we do next, what do we need, and how do we make all this actually work together? Because they are already written, we just have to plug them in.

The work on libraries mentioned in the previous posts opened a lot of doors, and from this point on we have already started plugging things in and fixing stuff that is broken without the host VM. The compiler, the jitter, the GC, browsers: what should go first? We think that we can go without GC for now, and it's an interesting idea to see how much can be done without it. I mean, what if you disabled the GC for a while? How long would your system run if you let it use as much memory as it wants without collecting? We are going to see it soon.


Connecting Bee dots!
So for now we decided to leave the GC unplugged and go for the JIT first. This will allow us to use plain SLLs, without native code, as that's what the JIT brings. After that, the compiler is next, and with it we regain a live Smalltalk. And there is also a possibility that would be awesome: to compile from a remote host and inject the compiled methods into the guest, that is, a live Smalltalk that doesn't need a compiler nor a JIT. Think of the footprint of that: it is minuscule.

There is also a lot of performance work that has to be done to tune the runtime, but we want to do a rigorous job of measuring the impact. This involves two sides of the same coin: benchmarking, in order to compare many different algorithms and language implementations, and profiling, so that we can detect the hot areas that need improvement. We have been working on porting SMark to Bee; fortunately it is very well written, an excellent piece of work. We are still missing the port of our profiler, but we almost have a small transcript-like profile log. There are some areas that we know are in need of optimization and that are very easy to improve -low-hanging fruit-. Re-enabling the PIC is the most prominent, but at least a monomorphic cache that patches the call site will do for now. Then, we also have to move to a hashed lookup instead of a linear one (yes, right now we are traversing the whole method dictionary on each and every send).

Moving to another area, we lack browsers. To make browsers work we need to rethink things related to Process, revise the signaling primitives and check some parts of the window messaging architecture.
Then we have to choose which ones we need first; I think the debugger and the inspector. We made some tests during Smalltalks 2013 to implement an out-of-process debugger, where we made an initial implementation of halt with a command loop. It was really promising but a lot was missing; we have to keep working on it. What I would like is to be able to debug local and remote images transparently. The first is going to be easier and the second super fun. Besides, the remote debugger should be able to work in two modes: assisted and unassisted. In assisted mode, the debugger issues commands to the guest asking it to step, continue, etc., by sending messages that the guest responds to. In unassisted mode, the debugger halts the guest and manipulates its memory directly. Assisted mode will be the most useful, but unassisted mode will let us do a lot of fancy things to fix broken images and make kernel changes at runtime.

In the research department there are so many ideas popping out of my brain that I feel my head is going to explode soon. We showed some bits of native multithreading at Smalltalks this year. It's very interesting, but it is going to be really hard to debug, so we'll need better tools before we continue trying that. There are a lot of experiments to do there. Before that, I'd like to experiment with the optimizations described in Ungar's Self work, mainly inlining. Also, I was thinking about compiling to different architectures. In our design, almost everything falls into the Assembler as a backend. And this is a really nice Assembler in one sense: it doesn't target an explicit architecture but a virtual one, implemented loosely following this Wirfs-Brock tech report. It's a stack-based architecture with a small number of registers to store context (receiver, argument, temporary, method context). In our case, the assembler targets Intel x86, but we could implement another one that targets ARM, or maybe LLVM bytecode or even asm.js. These last things are probably not too close yet.

Finally, our code tracking is not as good as we would like, and to go open we'll need something better. We are thinking of using git to track sources, and will probably use Cypress for that. We are porting it to Bee, but we still have to find a way to integrate git into our development process. We have to stop managing changeset files by hand. Only after this will we be able to have a continuous integration server and automatic testing and profiling.

There are so many interesting topics and we want to wrap everything up, but we must focus on one item at a time if we want to be successful. I should go back to plugging in the JIT now.

Happy new year!