Working on multithreading

A Smalltalk runtime that is able to execute concurrent sequences of code would be very useful. Seeing the rise of multicore/multiprocessor hardware and the direction the industry is taking, increasing the amount of cores in processors each year, makes us think we should be prepared.

The simplest way of making use of multiple cores is to just avoid threads and directly work on multiple processes, connected by some kind of IPC. This is easier because different processes share nothing, except what you explicitly choose. Isolation helps program safety concurrency wise, as two fully separated processes cannot have race conditions between them.

Then you would ask, why not just use processes? The main reason is that threads can be lighter and, in some way, more straightforward (as different threads share all their memory and resources). But also the question can be stated conversely, why not threads? Giving the option of using threads would be nice, so why not? There are lots of VMs wandering around out there supporting a myriad of different languages, but not many allow multithreaded execution of code (some notable exceptions are JVM, Jikes and CLR). Being able to execute many threads can simplify some VM implementation details and complicate others. Then, in order to understand how we can make a multithreaded Bee, we should better deep into what are the inherent problems of concurrent code execution.

There's an important point to add here. If we want to explore if it is possible to support concurrent independent smalltalk threads, we should do it a this stage, and wait no longer. As we'll see here, threading support can affect many system design decisions, so adding it later can require lots of changes and make it too costly. Also, adding threading support to a minimal system is a lot easier than to a big one. Then, to facilitate development, we should strive to make it work very well from the beginning, and then analyze how it behaves with each addition to the system.

Basic implementation and testing

The basic problems that arise when writing multithreaded code are related to resource access. Usually, race conditions start popping, because of unsynchronized concurrent accesses to memory locations. Good programmers write code in a way that avoids most pitfalls that lead to race conditions, but there always are some corners to cut. Avoiding global variables is the starting point of thread-safe code. Unfortunately, Smalltalk is plagued by global variables. Symbols, classes, the Smalltalk global, all the meta-model is global in Smalltalk, so we have a bad starting point. But if we think a bit, many of these things are immutable and most rarely change.

The way we test the multithreading scenario is also interesting. For now, we decided to go for a simple test. Instead of implementing and testing some known multithreaded algorithm, we just take all of our tests and run them in parallel. That is,

BeeKernelTest>>#runAllConcurrent
  | selectors results |
  selectors := self allTests.
  results := selectors
    concurrentCollect: [:selector | self runTest: selector].
  ^results conform: [:result | result]

FixedSizeCollection>>#concurrentCollect: aBlock
  | answer size token |
size := self size.
token := Object new.
answer := self species new: size.
answer atAllPut: token.
1
  to: size
  do: [:index | [| x y |
    x := self at: index.
    y := aBlock evaluateWith: x.
    answer at: index put: y] forkThread].
  [answer includes: token] whileTrue: [].
^answer

We still have to think the API we are going to implement but this is a nice starting point. Finally, we should emphasize that are planning the 'do nothing' kind of thread support here. This means, we are not going to add synchronization locks all around the system. It is the developer who knows how threads are going to be used, should him be the one that configures the necessary locking. But there are some points were the system should know that there can be many threads out there, lets explore them.

Allocation

A good example is object allocation. In bee, objects are allocated by aGCSpace. When receiving new, the class calculates the amount of space needed and sends #allocate: bytes to the global gcSpace. If many threads are allocating objects in the same gcSpace, they should better synchronize the allocation. Look at this naive implementation of allocate:

GCSpace>>#allocate: size
| answer next |
answer := nextFree.
next := answer + size.
nextFree := next.
^answer

Now suppose two threads fall into #allocate: at roughly the same time. It could happen that both threads read the same value of nextFree (the one before updating), returning a same address for two different objects. This will quickly lead to random memory corruption and undeterministic program crashes. There are two main ways of solving this: replacing the global gcSpace with many smaller spaces, or adding some kind of mutex. The figure shows both options, notice that while shared spaces need a mutex, split spaces get only half of the total address space for each one.



There's also the option to mix both things, each gcSpace having independent pages, wouldn't need a mutex to allocate individual objects withing the page. When the page is exhausted it could take more pages from a global space, accessed with a mutex. That would be the most efficient trade-off, but would complicate implementation. For now, lets check the simplest mutex solution:

GCSpace>>#allocate: size
    | answer next |
    mutex _busyWait.
    answer := nextFree.
    next := answer + size.
    nextFree := next.
    mutex _release.
    ^answer

where mutex is a 4-bytes byteArray initialized to -1, (-1 meaning lock is free).

low-level mutex
The figure shows the implementation of both #_busyWait and #_release. The array starts with -1. xchg swaps the values of [eax] and ecx, atomically. Thus, the value of ecx after xchg can be 0 or -1. If it is -1, it means the lock was free, so we are clear to go. If its value is 0, it means somebody else has already reserved the lock and we have to wait. Notice that in that case the xchg didn't make any change in memory, as both [eax] and ecx were 0. In this implementation there is no "wait", we continue reading the mutex until it becomes -1. For #_release, we have to use lock prefix to give dec instruction atomic semantics (xchg is always atomic thus doesn't require lock prefix). Also notice that just after acquiring the lock we read the value of next free. We have to be careful that processors don't reorder reads and writes in this sensitive code sections, or the lock would become useless. We know this won't be a problem because instructions with lock semantics assure this reordering will not happen.

This was a simple solution so we didn't need to split the global gcSpace. But there are other not thread-safe places. When looking at native code for example, there are some inherently unsafe points.

Call site patching

If you recall the post about lookup, after doing lookup and before method activation comes call site patching. Supose we are sending #yourself, before patching we have this assembly at call site:

[68 4-bytes address] push #_yourself
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

which the call site patcher transforms to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#yourself) nativeCode

The transformation is done by overwriting the 11 bytes, with three consecutive 4-byte writes (one byte is written twice). A second thread could fall after the first patch of 4 bytes, finding this

[90] nop
[90] nop
[90] nop
[90] nop
[??] (#_yourself oop most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

the are not many satisfying solutions to this problem. One trick could be to insert a back jump at first write, like this

[EB FE] jmp -2
[90] nop
[90] nop
[??] (#_yourself most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

but what happens if thread B was already decoding the push instruction when the write was issued. Is decoding atomic? This approach isn't satisfactory so we didn't apply it. Another way (we haven't implemented it yet), is to avoid instruction replacement in favor of instruction mutation, always issuing the same kind of call. We could directly do this from the beginning:

[E8 4-bytes address] call (Object>>#_lookupAndInvokeYourself:) nativeCode

notice that this (Object>>#_lookupAndInvokeYourself:) would indirectly call lookup:

(Object>>#_lookupAndInvokeYourself:) nativeCode
    [68 4-bytes address] push #_yourself
    [FF 15 4-bytes address] jmp [(Object>>#_lookupAndInvoke:) nativeCode]

this is almost exactly the same code we had at call site, just that we have extracted it into a separate native code, and changed the call for a jump (avoiding a double call). The patch of the call site will not replace the instruction -it will still be a direct call- but will only mutate the address:

[E8 4-bytes address] call (Object>>#yourself:) nativeCode

the problem here is that movs are not guaranteed to be atomic, unless aligned to the size you are moving. Then we should assure that the call site 4-byte displacement is correctly aligned (inserting nops if needed), like this:

0x100000 [90] nop
0x100001 [90] nop
0x100002 [90] nop
0x100003 [E8 4-bytes displacement] call (Object>>#yourself) nativeCode

and we are still assuming this won't break instruction decoding. We shall experiment and see what happens, but luckily AMD Application Programming manual (3.11 Cross-Modifying Code), seems to prove us right. We could also have a per-thread code cache, but I think it's a bit expensive and unnecesary. For now, we just disabled call-site patching in multithreading targets.

Global Lookup

Other problem with lookup is the global dispatch cache. This is not a thread-safe object and it will never be. Lookup is executed frequently and can be slow, so we cannot leave other threads blocked waiting for a lock. This cache contains an internal lookup table that is altered on frequently so accessing it without a lock corrupts it quickly. The solution we implemented is to have one dispatch cache per thread (you could also have one per core to optimize resources, the dispatch cache measures roughly 4kb). To allow this, we reified the Thread object. A thread object contains a dispatchCache. When created, a thread executes a block, first doing initialization:

Thread class>>#newOn: aBlock
    ^self new execute: aBlock

Thread>>#execute: aBlock
    | method entrypoint |
    method := self _lookup: #value:.
    entrypoint := [
        self setCurrent.
        aBlock value].
    method newThreadOn: entrypoint

Thread>>#value: aBlock 
    aBlock value

Thread>>#setCurrent
  KernelLibrary tlsSetValue: tlsThreadIndex with: self.
  KernelLibrary tlsSetValue: tlsGlobalDispatchCacheIndex with: globalLookup

The code is a bit complex because we have to call CreateThread function (done by newThreadOn:). We are not executing the incoming block directly but putting it inside entryPoint, which first initializes the thread and then evaluates it. You can see in #setCurrent that we implemented Thread Local Storage support. This allows us to access thread-specific information (like the currentThread object), though fs segment register indirection.

Thread class>>#current
  ^tlsThreadIndex _getThreadValue

Windows mantains the fs segment register to always point to the TIB (thread information block) segment. This TIB is a buffer that contains a lot of thread information, and has some free slots for storing user-specific thread data. We are making use of one of these slots to store the current thread object, from which we can obtain the dispatch cache. In the figure we can see the assembly code required.


FFI

A little race condition we found was related to FFI. Consider this method:

CLibrary>>#strlen: aString
<api: strlen ulong ulongReturn>

it makes a call to the native strlen function in C standard library, passing aString address and converting the result to an integer. At compile time, the compiler sees the api: pragma and insterts an extra literal to the generated compiled method, a byte array. This byte array encodes the address of strlen and the types of the arguments and return value. The function address makes the first 4 bytes of the byte array, and is calculated lazily: it is initialized to FF FF FF FF, and set the first time the method is invoked.


atomic vs. non atomic memory writes
With multithreaded applications, many methods may be activating the method at the same time. All of these activations should read FF FF FF FF or the correct address of strlen, and there shouldn't be any problem. But there is. If the read and write of the address are not atomic, an invocation may find that the address is something like FF FF xx xx, where the last 2 bytes have been written and the first 2 not. As unlikely you think this can be, it happened to us!

The simplest way to avoid this problem is two fold: first, take advantage of x86 architecture, which guarantees atomic 32-bit read/writes if the memory address read/written is 32-bit aligned; second, to issue the read/write as one 32-bits operation (and not to split it in 2 16-bits or 4 8-bits). The figure above shows two examples of not-atomic operations, and an atomic one. The first store is unaligned so the processor doesn't guarantee atomicity. The second is split in two stores. Each of them is 16-bit aligned and, as a consequence, 16-bit atomic, but not 32-bit. A second thread could read the memory just after first 16-bit write and before the second 16-bit write. The third is an aligned write. x86 arquitecture assures this will be issued atomically.

Going back to our problem, we know the function address is 32-bit aligned because it lays at the beginning of the byte array which, as any other object, is 32-bit aligned. But our uLongAtOffset:/uLongAtOffset:put: implementation was splitting the ulong in two, to avoid problems with big integers. The solution was to change our implementation of uLongAtOffset family assuring that reads and writes are done in one operation. As usual with race conditions, solving the problem took just minutes, finding it out, many hours.

Conclusions

Here we gave many implementation details of some of the problems we faced when entering this crazy multithreading world. We are still learning, so there might still be undetected problems, but we believe there aren't going to appear major road blocks. There are of course some unanswered questions. We are still thinking how we are going to face thread barriers, like how are we going to stop all threads for GC? We could just maintain a counter that is increased on activations and back-jumps, and peek for events from time to time. This might reduce performance, maybe we can avoid counting by doing some tricks with SuspendThread. In this latter case we should be careful as SuspendThread is not thought to be used within the same process but from the outside (mainly for debuggers).

To conclude this post, we attach a simple multithreaded test program. This just runs all tests in parallel and shows the results in console. To run,

$> wine bee.sll runAllConcurrent

to run on linux/mac; on windows, rename to bee.exe. There isn't much too see but the non-deterministic order of finalization of the tests. If you were hoping to see a command line interface that parses smalltalk, compiles, nativizes and finally executes it, stay tuned, we already have a working prototype and need to find time to write(no joke!). See you in next post!

Here is bee.sll




Comentarios

Entradas populares de este blog

Plugging Bee JIT and Compiler

Blogging back

Pre-releasing Bee Smalltalk