Bee performance and lookup strategies

It's been quite some time since the last post; we've been really busy cooking Bee Smalltalk. You wouldn't believe how much work is needed in order to connect the dots!

We continued trying to make libraries load in Bee, but performance was abysmal: library loading was taking ages, and tests ran so slowly that they were dragging down the whole development process. In order to measure performance, we ported a bunch of benchmarks to Bee.

Beenchmarks

The first benchmark we took was the simplest we know: tinyBenchmarks. It consists of a rather simple pair of metrics: the first one estimates bytecode speed and the second one dispatch speed. The bytecode test does a lot of arithmetic (which is executed directly, without message sends) and sends only a few messages: #new:, #at:, #at:put:. The dispatch test, on the other hand, computes Fibonacci recursively:


fibonacci
    ^self < 2
        ifTrue: [1]
        ifFalse: [(self - 1) fibonacci + (self - 2) fibonacci + 1]
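
For reference, the whole suite is kicked off with a single message send; this is the Squeak/Pharo convention, and we assume the Bee port keeps it:

"Answers a report string like '... bytecodes/sec; ... sends/sec'."
0 tinyBenchmarks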

But tinyBenchmarks is really, really basic; it is not a complete benchmark suite, just a quick and small test to get a general idea. For that reason, we wanted more benchmarks and ported SMark to Bee. We didn't bother to port all of its benchmarks yet, but at least we made Slopstone and Smopstone work.

So we ran a pass of the benchmarks through CodeXL (if you know a better native code profiler, leave a comment!) and analysed the results. The first thing we saw was expected: our lookup was slow, brutally slow. For that reason, we spent a couple of days improving Bee's lookup speed, and here we are going to show some of the results we've got.

Bee Lookup

We could just throw the numbers here, but to give them context we want to describe the different lookup strategies we implemented in Bee. We'll give you many details now, but if you want to learn more about most of these strategies, you can also read a very good post about Cog's lookup written not long ago by Clément Béra.

One last note before we start: Bee's implementation is a bit different from other dialects, and this directly affects its lookup. In Bee, the original Behavior class from Smalltalk was split in two: method responsibilities were set apart and only shape responsibilities were left there. What we usually knew as Behavior was renamed Species, and the method responsibilities were reified into a new kind of object named Behavior: what we used to call the method dictionary array. A behavior, then, is pretty much an array filled with method dictionaries (or other behaviors). In Bee the object header doesn't point to the class but to the behavior, and the object's class is the one found in the first method dictionary of the object's behavior.
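
As a rough sketch of the idea (the selectors below are hypothetical, not Bee's actual API), fetching an object's class amounts to asking the first method dictionary of its behavior:

class
    "Sketch only: the behavior acts like an array of method dictionaries,
     and the first one determines the receiver's class."
    ^(self behavior at: 1) owningClass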

Naive lookup


To understand how lookup works in Bee, let's have a look at a method that sends a message and analyze the generated native code. Suppose we have:

sendYourself
    self yourself

This method only sends the #yourself message. The nativizer will usually push the arguments (in this case there aren't any) onto the stack and call some lookup function. In the naive case, the lookup function receives one extra argument: the selector. The figure depicts the generated assembly. The original Bee Smalltalk implementation of #_lookupAndInvoke: is this:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _lookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _lookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    self
        _transferControlDiscardingLastArgTo: cm noClassCheckEntrypoint

#_lookup: looks for the corresponding method in the receiver's behavior (former method dictionary array).

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior
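
#_lookup:in: itself isn't shown here; conceptually it just walks the behavior (which behaves like an array of method dictionaries) and answers the first match, or nil. A minimal sketch, assuming an array-like enumeration protocol on behaviors rather than Bee's actual code:

_lookup: aSymbol in: aBehavior
    "Sketch only: scan each method dictionary of the behavior for aSymbol."
    aBehavior do: [:methodDictionary |
        (methodDictionary includesKey: aSymbol)
            ifTrue: [^methodDictionary at: aSymbol]].
    ^nil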

If the method isn't nativized yet (which can't happen now), it has to be compiled. Finally, once we've got the compiled method, we have to jump to its native code. If you pay attention to the stack, the last thing pushed is the selector (#yourself), and #_lookupAndInvoke: takes exactly one argument; there is also an implicit push of the return address in the call instruction. Notice that the return point of #yourself should be the method that sent the message (sendYourself) and not #_lookupAndInvoke:, and that's why we have to jump to the entrypoint instead of calling it.

Invoke lookup

As you can imagine, #_lookupAndInvoke: and its closure have to be specially compiled. This is because lookup must not trigger another lookup recursively, or else it would enter an infinite recursion. As the lookup method closure is really small and lacks any polymorphism, our solution is to precalculate which methods will be used during lookup. When nativizing the lookup methods, we therefore alter send nativization to use method invocation instead of method lookup. For example,

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior


which originally would be nativized as

...
[68 4-byte address] push #_lookup:in:
[FF 15 4-byte address] call [(Object>>#_lookupAndInvoke:) nativeCode]
...

can be nativized as

[68 4-byte address] push Object>>#_lookup:in:
[FF 15 4-byte address] call [(Object>>#_invoke:) nativeCode]

with (almost) this implementation of _invoke:

_invoke: aCompiledMethod
    | nativecode bytes classCheckDisplacement address |
    nativecode := aCompiledMethod _basicAt: 2.    "the method's native code object"
    bytes := nativecode _basicAt: 1.              "its raw machine code bytes"
    classCheckDisplacement := 16rD.               "entrypoint offset just past the class-check prologue"

    address := bytes _asSmallInteger + classCheckDisplacement.
    self _transferControlDiscardingLastArgTo: address
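
Putting it together: when the nativizer compiles a method that belongs to the lookup closure, each send in it is bound statically to the single method it can reach and emitted as an invoke; everything else keeps using normal lookup. A rough sketch of that decision, with hypothetical selectors (this is not Bee's actual nativizer code):

nativizeSend: aSelector
    "Sketch only: sends inside the precalculated lookup closure are bound
     statically and invoked; other sends go through the usual lookup."
    self isNativizingLookupClosure
        ifTrue: [self emitInvokeOf: (self methodReachedBySending: aSelector)]
        ifFalse: [self emitLookupOf: aSelector]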


With this strategy working, we ran a pass of tinyBenchmarks. As a reference, on the same machine I get:

TinyBenchmarks:
Host VM: 151210868 bytecodes/sec; 15391934 sends/sec
Pharo: 266112266 bytecodes/sec; 23850401 sends/sec

Report for: SMarkSlopstone
Host VM: Stone total: iterations=100 runtime: 0.136ms +/-0.010

Report for: SMarkSmopstone
Host VM: Stone total: iterations=1 runtime: 259.756999999285ms


The result of the beenchmarks was (expectedly) embarrassing:

TinyBenchmarks
205507 bytecodes/sec; 379224 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 92.4ms +/-1.4

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 534259.684200004ms

This means the Host VM is 735.79X and 40.59X faster in the two tinyBenchmarks, and 679.41X and 2056.77X faster in Slopstone and Smopstone respectively. As I said before, this result was expected because, until then, we had never optimized anything. But of course, a VM written in Smalltalk has to be optimized at some point in order to make it fast. And so we did.

Monomorphic inline cache (aka patching call site)


To solve our performance problems, we first thought of re-enabling the PIC, or at least some kind of MIC. We already have an implementation of a PIC, but it isn't plugged into the system as it works now, and for the moment it was easier to implement a MIC. The idea is simple: at the call site, instead of calling lookup, patch the call address to point to the native code of the last method found. To make this safe, add to that method (and all others) a prologue that checks whether the receiver's behavior is the right one for that method. In assembly, the previous message send should be patched to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[FF 15 4-byte address] call [(Object>>#yourself) nativeCode]

for the call site and add

    test    al, 1
    jnz     receiverIsSmallInteger

classCmp:
    cmp     dword ptr [eax-4], Object instanceBehavior
    jnz     failedClassCheck
    jmp     realMethodStart

receiverIsSmallInteger:
    cmp     [classCmp+3], SmallInteger instanceBehavior
    jz      realMethodStart

failedClassCheck:
    pop     ecx
    push    #yourself
    push    ecx
    jmp     [(Object>>#_lookupAndInvoke:) nativeCode]

realMethodStart:
    push ebp
    mov ebp, esp
    ...

for the Object>>#yourself prologue. EAX is the receiver register, which points to the receiver's first slot; at EAX-4 is the pointer to its behavior. The first thing in the assembly is a special case for small integers: they are not pointers but immediate values, so [EAX-4] doesn't point to SmallInteger instanceBehavior. Anyway, you can see that in case of a behavior mismatch, the selector is pushed onto the stack, carefully popping and pushing the return address to simulate a call. This prologue is only executed in the executions that follow the patching of the call site, and it assumes no selector was pushed; that's why the push #yourself was replaced with nops. A possible (but less efficient) variation of all this trickery would have been not to remove the push at the call site, and to pop it when the class check succeeds, carefully protecting the return address:

opcode bytes                  disassembly
[68 4-byte address] push #yourself
[FF 15 4-byte address] call [(Object>>#yourself) nativeCode]

with:

realMethodStart:
    pop [esp]
    push ebp
    ...

(and also not to push the selector in failedClassCheck)

But replacing the push #yourself with five nops is a bit more efficient, as the selector is only needed by #_lookupAndInvoke:.

Luckily, we already had this prologue generation implemented and ready to use; we just needed to plug it in. This involved some small modifications to the lookup:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _lookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _lookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    cm patchClassCheck: self behavior.
    self
        _transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint

There are two patches involved here. The obvious one is the patch of the call site. But patching the class check is also very important; without it, patching the call site would be much less useful. To understand why, think again of Object>>#yourself. The native code of Object>>#yourself contains a prologue that compares the receiver's behavior against Object's instance behavior. But you'll probably be sending #yourself to objects of classes other than Object, so the check would fail and lookup would be executed again when it wasn't actually needed. The simplest solution is an MRU one: change the behavior in the cmp instruction to that of the most recently used receiver. patchClassCheck: does exactly that; it changes:

classCmp:
    cmp     dword ptr [eax-4], Object instanceBehavior

to

classCmp:
    cmp     dword ptr [eax-4], receiver's behavior
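
In Smalltalk terms, patchClassCheck: only needs to overwrite the 4-byte immediate operand of that cmp, which starts 3 bytes past the classCmp label (the receiverIsSmallInteger case above relies on the same fact when it reads [classCmp+3]). A minimal sketch with hypothetical accessors, not the actual implementation:

patchClassCheck: aBehavior
    "Sketch only: store aBehavior over the imm32 operand of the prologue's
     cmp instruction, 3 bytes past the classCmp label."
    self nativeCode pointerAtOffset: self classCmpOffset + 3 put: aBehavior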


There's still a small optimization to add. Instead of leaving an indirect call, we can replace it with a direct one:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-byte displacement] call (Object>>#yourself) nativeCode

In x86, immediate calls are always relative to the call site. This complicates nsll generation and library loading a bit. To avoid that, for now we prefer to generate the nslls using indirect calls (which always take absolute addresses) and patch them to direct calls on the first lookup. It is important to put the nops before the call and not after, because that way the return address pushed onto the stack when executing the call is not modified. The return address is used as the pivot during patching, so leaving it unmodified means that subsequent patchings of the call site will overwrite the same bytes as before.
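
The displacement arithmetic behind that patch is simple: an E8 call encodes a signed 32-bit offset measured from the end of the 5-byte call instruction. A sketch of the patching, with made-up memory accessors:

patchDirectCallAt: callAddress to: targetAddress
    "Sketch only: E8 rel32 is relative to the instruction that follows the
     call, so the displacement is target - (call address + 5)."
    | displacement |
    displacement := targetAddress - (callAddress + 5).
    self byteAt: callAddress put: 16rE8.
    self longAt: callAddress + 1 put: displacement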

There's another point where this optimization can also be applied. Remember that lookup uses a different send strategy: invoke. This means pushing the compiled method and then calling (Object>>#_invoke:) nativeCode. It would be more efficient to do a direct call to the method's native code. For example,

_lookup: aSymbol
    ^self _lookup: aSymbol in: self behavior


which was being nativized as

...
[68 4-byte address] push Object>>#_lookup:in:
[FF 15 4-byte address] call [(Object>>#_invoke:) nativeCode]
...

can be patched to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-byte displacement] call (Object>>#_lookup:in:) nativeCode

by a small modification of invoke:

_invoke: aCompiledMethod
    | nativecode bytes classCheckDisplacement address |
    nativecode := aCompiledMethod _basicAt: 2.    "the method's native code object"
    bytes := nativecode _basicAt: 1.              "its raw machine code bytes"
    classCheckDisplacement := 16rD.               "entrypoint offset just past the class-check prologue"

    address := bytes _asSmallInteger + classCheckDisplacement.
    self _transferControlDiscardingLastArgAndPatchingUnsafeTo: address

Here, an "unsafe" patch means making the direct call point not to the prologue but directly to the real method entrypoint, since an invoke must execute the method directly, without class checks.

With this small optimization plugged, we ran the tests again:

TinyBenchmarks
86428089 bytecodes/sec; 41330738 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 17.77ms +/-0.34

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 103848.336900011ms

Comparing with the Host VM on tinyBenchmarks, it is still 1.75X faster than us at bytecodes, but we get 2.69X more message sends per second! About these results, of course we are cheating a bit: we don't have a GC, nor are we checking for events when entering methods or at backjumps. Slopstone and Smopstone, instead, showed the sad truth: Bee took 130.66X and 399.79X as much time to complete as the Host VM. Slop/Smopstones are a much better metric of real performance. But these results were awesome anyway, as the sends/sec figure gives us a clue of what kind of performance we can expect in the end.

Global lookup cache


Being 400 times slower than the Host VM was still too much, so there was one last optimization we did: the global lookup cache. This is a last kind of lookup in Bee that kicks in when the MIC fails. Consider this implementation of Object class>>#new:

new
    ^self basicNew initialize

Both the receiver of basicNew (some class) and the receiver of initialize (some instance of some class) are going to be of a different class on each #new send, because you are creating different kinds of objects; this means the class check will fail most of the time. What we can do is keep a cache of the methods commonly found by lookup, stored in a table. Given a selector and the behavior of the receiver, there is a unique compiled method to be found, so we can calculate a hash from the combination of selector and behavior and use it as an index into the table. If that position contains nil, we do the slow lookup and store the resulting method there, paired with the behavior for which it was found. If it is not nil, we look at the behavior for which the cached method was dispatched and, if it is the one we are looking for, voilà, we return the method we found. We can do this for 3 consecutive slots to mitigate hash collisions.

GlobalDispatchCache >> #lookupAndCache: selector in: aBehavior
    | method |
    method := self at: selector for: aBehavior.
    method == nil ifTrue: [
        method := self _lookup: selector in: aBehavior.
        self at: selector for: aBehavior put: method].
    ^method

GlobalDispatchCache >> #at: selector for: behavior
    | index |
    index := self indexOf: selector with: behavior.
    ^self lookup: selector for: behavior startingAt: index

GlobalDispatchCache >> #indexOf: selector with: behavior
    ^(selector _asPointer bitXor: behavior _asPointer) + 1 bitAnd: self tableSize


As we need to store a <compiledMethod, behavior> pair each time, we use odd positions for compiled methods and even ones for behaviors. Oops are aligned to 4 bytes, so both addresses end in 00. _asPointer sets the last bit to 1, which gives us a smallPointer (a small integer which is also an oop divided by 2). Finally, the xor leaves the last bit (of the small integer) at 0, and the +1 ensures we always get an odd hash result. Then we can use that as the index for the compiled method, and the next one for the behavior.
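
For completeness, the probing methods referenced above (lookup:for:startingAt: and at:for:put:) are not shown in this post. A sketch of the read side, following the layout just described (the contents accessor and the probing step are assumptions):

GlobalDispatchCache >> #lookup: selector for: behavior startingAt: index
    "Sketch only: probe up to 3 consecutive <method, behavior> pairs; as
     described above, a hit is detected by matching the cached behavior."
    0 to: 2 do: [:i | | position |
        position := index + (i * 2).
        (self contents at: position + 1) == behavior
            ifTrue: [^self contents at: position]].
    ^nil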

Finally, we plug this into the lookup:

_lookupAndInvoke: aSymbol
    | cm |
    cm := self _cachedLookup: aSymbol.
    cm == nil ifTrue: [
        cm := self _cachedLookup: #doesNotUnderstand:.
        self _transferControlTo: cm noClassCheckEntrypoint].
    cm isNativized ifFalse: [self compile: cm].
    cm patchClassCheck: self behavior.
    self
        _transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint

_cachedLookup: aSymbol
    ^GlobalDispatchCache current lookupAndCache: aSymbol in: self behavior

With all of this, we ran the tests one last time:

TinyBenchmarks
86720867 bytecodes/sec; 42086788 sends/sec

Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 2.764ms +/-0.083

Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 12765.7030000091ms

TinyBenchmarks performance remained the same this time, which probably means that there weren't any class check failures and that the global lookup wasn't needed there. But Slopstone and Smopstone improved a lot, now taking only 20.32X and 49.14X the time of the Host VM. Regarding tinyBenchmarks, we see that on both the Host VM and Pharo the bytecodes/sec figure is always around 10 times the sends/sec one, whereas in Bee it is only about 2 times. As Bee's native code is very similar to the Host VM's jitted bytecodes, this suggests that the implementation of the few messages being sent (#new:, #at:, #at:put:) is heavily dragging down performance and should be improved.

To conclude this lengthy post: there's still a long way to go performance-wise, but we think this work paints a bright outlook for Bee. It now has a performance level that is acceptable, at least, for development. For now, we are done with lookup optimizations, which were the low-hanging fruit that CodeXL showed. It is now pointing at the primitives, which are a bit slow; I think we can gain another order of magnitude in Smopstone there with a small amount of work. After that, we'll see. We still have to plug in the PIC to replace the MIC, keep working on multithreading and more, but that's going to take another whole post!



Results

                          Bytecodes/sec  Sends/sec  Slopstone (ms)  Smopstone (ms)
Host VM                   151210868      15391934   0.136           259.757
Bee (unoptimized)         205507         379224     92.4            534259.684
Bee + MIC                 86428089       41330738   17.77           103848.337
Bee + MIC + global cache  86720867       42086788   2.764           12765.703
