Bee performance and lookup strategies
It's been quite some time since the last post; we've been really busy cooking Bee Smalltalk. You wouldn't believe how much work is needed in order to connect the dots!
We continued trying to make libraries load in bee, but performance was abysmal: library loading was taking ages, and the tests ran so slowly that they were dragging down the whole development process. In order to measure performance, we ported a bunch of benchmarks to bee.
So we ran a pass of the benchmarks through CodeXL (if you know a better native code profiler, leave a comment!) and analysed the results. The first thing we saw was expected: our lookup was slow, brutally slow. For that reason, we spent a couple of days improving bee's lookup speed, and here we are going to show some of the results we got.
Beenchmarks
The first benchmark we took was the simplest we know: tinyBenchmarks. It consists of a rather simple pair of metrics: the first estimates bytecode speed and the second dispatch speed. The bytecode test does a lot of arithmetic (which is executed directly, without message sends) and sends only a few messages: #new:, #at:, #at:put:. The dispatch test, on the other hand, calculates Fibonacci recursively:
^self < 2
ifTrue: [1]
ifFalse: [(self - 1) fibonacchi + (self - 2) fibonacchi + 1]
But tinyBenchmarks is really basic and not a complete benchmark suite; it is just a quick and small test to get a general idea. For that reason, we wanted more benchmarks and ported SMark to bee. We haven't bothered to port all the benchmarks yet, but at least we made Slopstone and Smopstone work.
Bee Lookup
We could just throw the numbers here, but to give them context we want to describe the different lookup strategies we implemented in bee. We'll give many details now, but if you want to know more about most of these strategies, you can also read the very good post about Cog lookup written not long ago by Clément Béra.
One last note before we start: Bee's implementation is a bit different from other dialects, and this affects its lookup directly. In Bee, the original Behavior class from Smalltalk was split in two: method responsibilities were set apart and only shape responsibilities were left there. What we usually knew as Behavior was renamed Species. The method responsibilities were reified as a new kind of object named Behavior: what we used to call the method dictionary array. A behavior, then, is pretty much an array filled with method dictionaries (or behaviors). In Bee the object header doesn't point to the class but to the behavior, and the object's class is the one found in the first method dictionary of the object's behavior.
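To make that description concrete, here is a minimal sketch of what searching a behavior could look like, assuming a behavior can be enumerated like an array and holds ordinary method dictionaries. This is only an illustration of the idea above, not Bee's actual code (the primitive that does this job later in the post is #_lookup:in:):
lookupSelector: aSymbol in: aBehavior
"Illustrative sketch: walk the method dictionaries that make up the behavior, in order, answering the first method found for the selector, or nil if there is none"
aBehavior do: [:methodDictionary |
(methodDictionary includesKey: aSymbol)
ifTrue: [^methodDictionary at: aSymbol]].
^nil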
Naive lookup
To understand how lookup works in bee, let's have a look at a method that sends a message and analyze the generated native code. Suppose we have:
sendYourself
self yourself
This method only sends the #yourself message. The nativizer will usually push the arguments (in this case there aren't any) onto the stack and call some lookup function. In the naive case, the lookup function receives one extra argument: the selector (the generated call site is sketched after the lookup methods below). The original bee smalltalk implementation of _lookupAndInvoke: is this:
_lookupAndInvoke: aSymbol
| cm |
cm := self _lookup: aSymbol.
cm == nil ifTrue: [
cm := self _lookup: #doesNotUnderstand:.
self _transferControlTo: cm noClassCheckEntrypoint].
cm isNativized ifFalse: [self compile: cm].
self
_transferControlDiscardingLastArgTo: cm noClassCheckEntrypoint
#_lookup: looks for the corresponding method in the receiver's behavior (the former method dictionary array):
_lookup: aSymbol
^self _lookup: aSymbol in: self behavior
If the method isn't nativized yet (which can't happen now) it should be compiled. Finally, once we've got the compiled method, we have to jump to its native code. If you pay attention to the stack, the last thing pushed is the selector (#yourself), and #_lookupAndInvoke: takes exactly one argument. There's also an implicit push of the return address in the call instruction. Notice that the return point from yourself should be the method that sent the message (sendYourself) and not _lookupAndInvoke:; that's why we have to jump to the entrypoint and not call it.
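As a sketch of what the call site for self yourself would look like under this naive scheme (same notation as the listings later in the post, addresses elided):
[68 4-bytes address] push #yourself
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]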
Invoke lookup
As you can imagine, #_lookupAndInvoke: and its closure have to be compiled specially: lookup must not trigger another lookup recursively, or it would enter an infinite recursion. As the lookup method closure is really small and lacks any polymorphism, our solution is to precalculate which methods will be used during lookup. When nativizing lookup, we then alter send nativization to use method invocation instead of method lookup. For example,
_lookup: aSymbol
^self _lookup: aSymbol in: self behavior
which originally would be nativized as
...
[68 4-bytes address] push #_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]
...
can be nativized as
[68 4-bytes address] push Object>>#_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_invoke:) nativeCode]
with (almost) this implementation of _invoke:
_invoke: aCompiledMethod
| nativecode bytes classCheckDisplacement address |
nativecode := aCompiledMethod _basicAt: 2.
bytes := nativecode _basicAt: 1.
classCheckDisplacement := 16rD.
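"16rD skips the class-check prologue at the start of the native code; the computed address is the method's no-class-check entrypoint (the prologue is shown in the MIC section below)"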
address := bytes _asSmallInteger + classCheckDisplacement.
self _transferControlDiscardingLastArgTo: address
With this strategy working, we ran a pass of tinyBenchmarks. As a reference, on the same machine I get:
TinyBenchmarks:
Host VM: 151.210.868 bytecodes/sec; 15.391.934 sends/sec
Pharo: 266.112.266 bytecodes/sec; 23.850.401 sends/sec
Report for: SMarkSlopstone
Host VM: Stone total: iterations=100 runtime: 0.136ms +/-0.010
Report for: SMarkSmopstone
Host VM: Stone total: iterations=1 runtime: 259.756999999285ms
The result of the beenchmarks was (expectedly) embarrassing:
TinyBenchmarks
205507 bytecodes/sec; 379224 sends/sec
Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 92.4ms +/-1.4
Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 534259.684200004ms
This means the Host VM gets 735.79X and 40.59X better results in tinyBenchmarks (bytecodes and sends), and 679.41X and 2056.77X better results for slopstone and smopstone respectively. As we said before, this result was expected because until then we had never optimized anything. But of course, a VM written in Smalltalk has to be optimized at some point in order to make it fast. And so we did.
Monomorphic inline cache (aka patching call site)
To solve our performance problems, we first thought of re-enabling the PIC, or at least some kind of MIC. We already have an implementation of a PIC, but it isn't plugged into the system as it works now, and for the moment it was easier to implement a MIC. The idea is simple: at the call site, instead of calling lookup, patch the call address to point to the native code of the last method found. To make this safe, add to that method (and all others) a prologue that checks that the receiver's behavior is the right one for that method. In assembly, the previous message send would be patched to:
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[FF 15 4-bytes address] call [(Object>>#yourself) nativeCode]
for the call site and add
test al, 1
jnz receiverIsSmallInteger
classCmp:
cmp dword ptr [eax-4], Object instanceBehavior
jnz failedClassCheck
jmp realMethodStart
receiverIsSmallInteger:
cmp [classCmp+3], SmallInteger instanceBehavior
jz realMethodStart
failedClassCheck:
pop ecx
push #yourself
push ecx
jmp [(Object>>#lookupAndInvoke:) nativeCode]
realMethodStart:
push ebp
mov ebp, esp
...
for the Object >> #yourself prologue. EAX is the receiver register, which points to the receiver's first slot; at EAX-4 is the pointer to its behavior. The first thing in the assembly is a special case for small integers: they are not pointers but immediate values, so [EAX-4] doesn't point to SmallInteger instanceBehavior. You can see that in case of a behavior mismatch, the selector is pushed to the stack, carefully popping and pushing the return address to simulate a call. This prologue is only executed after the call site has been patched (that is, on the following executions) and assumes no selector was pushed; that's why push #yourself was replaced with nops. A possible (but less efficient) variation of all this trickery would have been to keep the push at the call site, and to pop it when the class check succeeds, carefully protecting the return address:
opcode bytes disassembly
[68 4-bytes address] push #yourself
[FF 15 4-bytes address] call [Object>>#yourself nativeCode]
with:
realMethodStart:
pop [esp]
push ebp
...
(and also not pushing the selector in failedClassCheck; the pop [esp] overwrites the pushed selector with the return address, discarding the selector while keeping the return address on top of the stack). But replacing the push #yourself with five nops is a bit more efficient, as the selector is only needed by #_lookupAndInvoke:.
Luckily, we already had this prologue generation implemented and ready to use; we just needed to plug it in. This involved some small modifications to the lookup:
_lookupAndInvoke: aSymbol
| cm |
cm := self _lookup: aSymbol.
cm == nil ifTrue: [
cm := self _lookup: #doesNotUnderstand:.
self _transferControlTo: cm noClassCheckEntrypoint].
cm isNativized ifFalse: [self compile: cm].
cm patchClassCheck: self behavior.
self
_transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint
There are two patches involved here. The obvious one is that of the call site. But patching the class check is also very important; without it, patching the call site would be much less useful. To understand why, think again of Object>>#yourself. The native code of Object>>#yourself contains a prologue that compares the receiver's behavior against Object's behavior. But you'll probably be sending #yourself to objects of a class that is not Object, so the check will fail and lookup will be executed again even when it wasn't actually needed. The simplest solution is an MRU one: change the behavior in the cmp instruction to the most recently used receiver's one. patchClassCheck does exactly that; it changes:
classCmp:
cmp dword ptr [eax-4], Object instanceBehavior
to
classCmp:
cmp dword ptr [eax-4], receiver's behavior
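As a rough idea of how such a patch could be implemented, here is a sketch under assumptions: the receiver is taken to be the method's native code, and #codeUint32At:put:, #classCmpOffset and #addressOf: are illustrative names, not Bee's actual API. The 4-byte immediate of the cmp starts 3 bytes into the instruction, the same classCmp + 3 offset that the smallinteger path in the prologue reads:
patchClassCheck: aBehavior
"Sketch only: overwrite the 4-byte immediate of the class-check cmp with the most recently used receiver's behavior"
self
codeUint32At: self classCmpOffset + 3
put: (self addressOf: aBehavior)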
There's still a small optimization to add: instead of leaving an indirect call, we can replace it with a direct one:
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#yourself) nativeCode
In x86, immediate calls are always relative to the call site. This complicates nsll generation and library loading a bit. To avoid this, for now we prefer to generate the nslls using indirect calls (which always take absolute addresses) and patch them to direct calls on the first lookup. It is important to put the nops before the call and not after, because that way the return address pushed onto the stack when executing the call is not modified. The return address is used as the pivot during patching, so leaving it unmodified means that subsequent patching of the call site will overwrite the same bytes as before.
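To make the byte surgery concrete, here is a minimal sketch of the direct-call patch, assuming hypothetical low-level accessors (#codeByteAt:put: and #codeUint32At:put: are illustrative names, not Bee's actual API). The call site occupies 11 bytes: originally a 5-byte push imm32 plus a 6-byte indirect call; afterwards 6 nops plus a 5-byte direct call, so the return address pushed by the call stays put:
patchToDirectCall: returnAddress target: nativeCodeAddress
| callSite |
callSite := returnAddress - 11.
0 to: 5 do: [:i | self codeByteAt: callSite + i put: 16r90]. "six nops over the old push"
self codeByteAt: callSite + 6 put: 16rE8. "direct call opcode"
self
codeUint32At: callSite + 7
put: nativeCodeAddress - returnAddress "rel32 is relative to the instruction after the call"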
There's another point where this optimization can also be applied. Remember that lookup uses a different send strategy: invoke. This means pushing the compiled method and then calling (Object>>#_invoke:) nativeCode. It would be more efficient to do a direct call to the method's native code. For example,
_lookup: aSymbol
^self _lookup: aSymbol in: self behavior
which was being nativized as
...
[68 4-bytes address] push Object>>#_lookup:in:
[FF 15 4-bytes address] call [(Object>>#_invoke:) nativeCode]
...
can be patched to:
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#_lookup:in:) nativeCode
by a small modification of _invoke:
_invoke: aCompiledMethod
| nativecode bytes classCheckDisplacement address |
nativecode := aCompiledMethod _basicAt: 2.
bytes := nativecode _basicAt: 1.
classCheckDisplacement := 16rD.
address := bytes _asSmallInteger + classCheckDisplacement.
self _transferControlDiscardingLastArgAndPatchingUnsafeTo: address
Where are an "unsafe" patch means making the direct call not to the prologue, but directly to the real method entrypoint, as the invoke should execute directly without class checks.
TinyBenchmarks
86428089 bytecodes/sec; 41330738 sends/sec
Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 17.77ms +/-0.34
Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 103848.336900011ms
Comparing with the Host VM in tinyBenchmarks, it is still 1.75X better than us at bytecodes, but we get 2.69X more message sends per second! About these results: of course we are cheating a bit, as we don't have GC, nor are we checking for events when entering methods or at back-jumps. Slopstone and smopstone instead showed the sad truth: bee took 130.66X and 399.79X more time to complete than the Host VM. Slop/smopstones are a much better metric of the real performance. But these results were awesome anyway, as the sends/sec result gives us a clue of what kind of performance we can expect in the end.
Global lookup cache
Being 400 times slower than the Host VM was still too much, so there was one last optimization we did: the global lookup. This is a final kind of lookup in bee that kicks in when the MIC fails. Consider this implementation of Object class>>#new:
new
^self basicNew initialize
Both the receiver of basicNew (some class) and the receiver of initialize (some object of some class) are going to be of a different class on each #new send. This is because you are going to create different kinds of objects, which means that the class check will fail most of the time. What we can do is keep a cache of common methods found by lookup, which we can put in a table. Given a selector and the behavior of the receiver, there is a unique compiled method to be found. We can therefore calculate a hash from the combination of selector and behavior and use it as an index into the table. If that position contains nil, we do the slow lookup and store the resulting method there, paired with the behavior for which it was found. If it is not nil, we check the behavior for which the cached method was dispatched, and if it is the one we are looking for, voilà, we return the method we found. We can do this for 3 consecutive slots to mitigate hash collisions.
GlobalDispatchCache >> #lookupAndCache: selector in: aBehavior
| method |
method := self at: selector for: aBehavior.
method == nil ifTrue: [
method := self _lookup: selector in: aBehavior.
self at: selector for: aBehavior put: method].
^method
GlobalDispatchCache >> #at: selector for: behavior
| index |
index := self indexOf: selector with: behavior.
^self lookup: selector for: behavior startingAt: index
GlobalDispatchCache >> #indexOf: selector with: behavior
^(selector _asPointer bitXor: behavior _asPointer) + 1 bitAnd: self tableSize
As we need to store a <compiledMethod, behavior> pair each time, we use odd positions for compiled methods and even ones for behaviors. Oops are aligned to 4 bytes, so both addresses end in 00. _asPointer sets the last bit to 1, which gives us a smallPointer (a small integer whose value is the oop divided by 2). The xor then leaves the last bit (of the small integer) as 0, and the +1 assures we always get an odd hash result. We can then use that as the index for the compiled method and the next position for the behavior.
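The helpers #lookup:for:startingAt: and #at:for:put: referenced above are not shown in the post; here is a hedged sketch of what they could look like under the layout just described. The table instance variable, the extra selector check and the lack of bounds and eviction handling are assumptions of this sketch, not necessarily what Bee does:
GlobalDispatchCache >> #lookup: selector for: behavior startingAt: index
"Probe up to 3 consecutive <method, behavior> pairs starting at an odd index"
| method |
index to: index + 4 by: 2 do: [:i |
method := table at: i.
(method notNil
and: [(table at: i + 1) == behavior
and: [method selector == selector]])
ifTrue: [^method]].
^nil
GlobalDispatchCache >> #at: selector for: behavior put: method
"Store the pair at the first probed slot; no eviction policy in this sketch"
| index |
index := self indexOf: selector with: behavior.
table at: index put: method.
table at: index + 1 put: behavior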
Finally, we plug this into the lookup:
_lookupAndInvoke: aSymbol
| cm |
cm := self _cachedLookup: aSymbol.
cm == nil ifTrue: [
cm := self _cachedLookup: #doesNotUnderstand:.
self _transferControlTo: cm noClassCheckEntrypoint].
cm isNativized ifFalse: [self compile: cm].
cm patchClassCheck: self behavior.
self
_transferControlDiscardingLastArgAndPatchingTo: cm noClassCheckEntrypoint
_cachedLookup: aSymbol
^GlobalDispatchCache current lookupAndCache: aSymbol in: self behavior
With all of this plugged in, we ran the tests one last time:
TinyBenchmarks
86720867 bytecodes/sec; 42086788 sends/sec
Report for: SMarkSlopstone
Stone total: iterations=100 runtime: 2.764ms +/-0.083
Report for: SMarkSmopstone
Stone total: iterations=1 runtime: 12765.7030000091ms
TinyBenchmarks performance remained the same this time, which probably means there weren't any class check failures and the global lookup wasn't needed there. But slopstone and smopstone improved a lot, now taking only 20.32X and 49.14X the time of the Host VM. Regarding tinyBenchmarks, we see that on both the Host VM and Pharo the bytecodes/sec are always around 10 times the sends/sec, whereas in Bee we only get about 2 times more. As Bee native code is very similar to the Host VM's jitted bytecodes, this suggests that the implementation of the few messages being sent (#new:, #at:, #at:put:) is heavily dragging down performance and should be improved.
To conclude this lengthy post: there's still a long way to go performance-wise, but we think this work paints a bright outlook for bee. It now has a performance level that is acceptable, at least, for development. For now we are done with lookup optimizations, which were the low-hanging fruit that CodeXL showed. Now it shows that improvements have to be made to primitives, which are a bit slow. I think we can gain an order of magnitude more in smopstones there with a small amount of work. After that, we'll see. We still have to plug in the PIC to replace the MIC, keep working on multithreading and more, but that's going to take another whole post!
Results
|                          | Bytecodes/sec | Sends/sec | Slopstone (ms) | Smopstone (ms) |
| Host VM                  | 151210868     | 15391934  | 0.136          | 259.757        |
| Bee (unoptimized)        | 205507        | 379224    | 92.4           | 534259.684     |
| Bee + MIC                | 86428089      | 41330738  | 17.77          | 103848.337     |
| Bee + MIC + Global cache | 86720867      | 42086788  | 2.764          | 12765.703      |