Tuesday, February 14, 2012

Blaise Pascal Magazine Rerun #6: Synchronisation in Multithreaded Environment

This article was originally written for the Blaise Pascal Magazine and was published in Issue #11.
Relevant Delphi code is available at
http://17slon.com/blogs/gabr/BPM/Synchronisation.zip.

Welcome back to the third installment of our short introduction to multithreading. You already know what multithreading is and how to start multiple threads. I also mentioned some problems that can (and will) occur while writing multithreaded code, but I didn't spend much time on the topic. Today is the day when I fix this. You'll hear about synchronisation: what it's good for and which synchronisation tools are available to you.

If we say that multithreading is a way of executing multiple methods at once, then synchronisation is the very opposite: a way to prevent multiple methods from executing simultaneously. At first glance this doesn't look very reasonable. We want to execute many methods at once, but then we seek a way to prevent this from happening? Why?

The truth is that there are not many problems that can be split into multiple independent execution units (with the stress on “independent”). There's almost always some small part of code where global data must be accessed or global state must be modified, and in all such cases we have to ensure that only one code path can modify this global state at any moment. Allow two of them to do the same and everything will break. For example, last time I told you that it is imperative to use Synchronize if you want to execute VCL code from a thread, and the reason is exactly that: the VCL modifies global state a lot, and if another method starts to do the same while the VCL is in the middle of an update, well, then anything can happen.

To illustrate this situation, I’ve put together a small test project, TestTThread. It consists of two threads, one executing

for i := 1 to 100000 do begin
  a := GCounter;
  GCounter := a + 1;
end;

and another

for i := 1 to 100000 do begin
  a := GCounter;
  GCounter := a - 1;
end;

GCounter is a global variable (accessible from both threads) initialized to 0 while a and i are local variables.

The mission, dear reader, should you choose to accept it, is to tell what the final value of GCounter will be. Will it be zero? After all, one thread adds one 100,000 times and the other subtracts the same amount.

Well, no. The result will be 70495. Or 78841. Or -61505. Or any other number between -100,000 and +100,000!

This happens because the modification of GCounter by one thread can (and will) be interrupted by the other. In other words, this modification is not atomic. (“Atomic” is a word often used in multithreaded programming and has the same meaning as in chemistry: indivisibility.)

If you still can’t see how GCounter can result in any value but 0, here’s a possible scenario.

  • First thread starts execution and loads GCounter (0) into the local variable a.
  • First thread is then suspended. Maybe the system is very busy, or you're running the code on a single-core, single-CPU machine and the operating system needs the CPU to start the second thread.
  • Second thread starts and loads GCounter (still 0) into the local variable a.
  • Second thread then proceeds with the execution. For each i it decrements GCounter by 1.
  • When the second thread terminates, the value in GCounter is -100,000.
  • First thread then resumes execution. It adds 1 to the local variable a (which still contains 0!) and stores the result into GCounter. Bang! In one operation GCounter jumped from -100,000 to 1!
  • First thread then continues execution and increments GCounter another 99,999 times.
  • The resulting value in GCounter is 100,000. Wroooong!

Now it is obvious: multithreading allows this code to execute in multiple threads at the same time, and synchronisation will be used to prevent this from happening.

Now we are ready to take a journey through the wonderful world of synchronisation primitives, the tools provided by the operating system that will help you synchronise multiple threads. For good measure I'll also throw in some higher-level synchronisation tools, that is, tools built on top of the synchronisation primitives to perform more complicated tasks.

Critical section

Let's start with the simplest (and in many cases the most appropriate) primitive: the critical section. This is a Windows-implemented object that can be used to lock access to a critical section (“critical section” also being the generic name for a part of code that should execute atomically, i.e. uninterrupted).

To use a critical section you have to initialize it first. Then it is passed to all threads that will compete for the resource (the two lines of code inside the for loop, in our case). Each thread will acquire the critical section, execute the critical code and release the critical section. The operating system makes sure that at most one thread can have the critical section acquired at any time. At the end, the critical section object must be deleted.

Because the native Windows API for critical sections is somewhat clumsy, and because Delphi has contained a wrapper for that API practically forever (since version 2, I believe), the test project uses Delphi's TCriticalSection instead of the native API. TCriticalSection is just a simple wrapper around that API.

procedure TfrmTestTThread.btnTestCriticalSectionClick(Sender: TObject);
var
  thread: TThread;
begin
  GCounter := 0;
  FCriticalSection := TCriticalSection.Create;
  thread := TCSThread.Create(-1, FCriticalSection);
  thread.FreeOnTerminate := true;
  thread.OnTerminate := ReportThreadTerminated;
  thread.Resume;
  thread := TCSThread.Create(1, FCriticalSection);
  thread.FreeOnTerminate := true;
  thread.OnTerminate := ReportThreadTerminated;
  thread.Resume;
end;

constructor TCSThread.Create(increment: integer; critSect: TSynchroObject);
begin
  inherited Create(true);
  FIncrement := increment;
  FCriticalSection := critSect;
end;

procedure TCSThread.Execute;
var
  a: integer;
  i: integer;
begin
  for i := 1 to CNumRepetitions do begin
    FCriticalSection.Acquire;
    a := GCounter;
    GCounter := a + FIncrement;
    FCriticalSection.Release;
  end;
end;

The TButton event handler with the long name first creates a critical section object, TCriticalSection (defined in the SyncObjs unit). Then it creates two threads and passes this critical section to both of them. Each thread is configured in “one shot” mode: it will be destroyed automatically, and a global event handler will be called when this happens. This event handler reports the GCounter value and execution time to the user.

The thread's constructor doesn't do anything special either; it just saves the increment (+1 or -1) and the critical section object for later use.

The main difference lies in the Execute method, where the critical increment operation is wrapped between FCriticalSection.Acquire and FCriticalSection.Release. That small change ensures that only one thread at a time executes this critical code and the result will be zero, as expected.

Let’s walk through the first execution scenario one more time, but now with critical sections.

  • First thread starts execution, acquires the critical section and loads GCounter (0) into the local variable a.
  • First thread is suspended.
  • Second thread starts, calls FCriticalSection.Acquire and blocks (stops executing).
  • First thread is then resumed, continues execution and sets GCounter to 1.
  • First thread calls FCriticalSection.Release.
  • Only then will the second thread proceed with the execution. The call to Acquire returns and the second thread decrements GCounter back to 0.
  • The execution continues to jump between the two threads. The for loops execute in parallel but the critical part, the modification of GCounter, does not. It is entirely possible that multiple loop iterations in one thread will execute while the other thread is waiting for its turn, but that presents no problem. The calculations are protected and the final result will be 0. Hurrah!

A few other facts deserve to be mentioned about critical sections. In Windows XP and before they were fairly slow, and because of that many spinlock implementations appeared. (A spinlock is a weird multithreaded approach to locking where it is assumed to be better for the thread to execute some short loop for a while, testing some global value and retrying, i.e. spinning, instead of pausing the thread. Read more about it on Wikipedia.) Since Windows Vista, critical sections are extremely fast (internally they are implemented as spinlocks) and it is hard to compete with them, at least performance-wise. Still, one spinlock implementation is included in the test project for comparison, but only if the project is compiled with the conditional symbol TestWithOTL (and if you have downloaded the OmniThreadLibrary source and added it to the project search path).

Critical sections only prevent other threads from accessing the protected resource; the thread that already owns the critical section is not blocked. For example, the following code will work just fine.

for i := 1 to CNumRepetitions do begin
  FCriticalSection.Acquire;
  FCriticalSection.Acquire;
  a := GCounter;
  GCounter := a + FIncrement;
  FCriticalSection.Release;
  FCriticalSection.Release;
end;

When Acquire is called for the second time, the OS notices that the current thread already owns the critical section and allows the second Acquire to return. This can be quite useful as it allows a method to call another method which uses its own Acquire/Release pair to protect the same resource. Just keep in mind that every Acquire needs a corresponding Release.

Fine-grained is better

While critical sections are a perfectly useful tool and can be extremely helpful in multithreaded programming, many programs use them in a wrong way. Because it is “hard work” to declare multiple critical sections to protect multiple unrelated parts of the code, many programmers declare just one critical section (a “big giant lock”) and use it to protect everything. That, of course, is not wrong by itself, except that it can slow the program down considerably. For example, if this “one to bind them” critical section is used to lock access to logging code, and the logging code writes to a network drive that becomes unavailable, the whole program (including the GUI, if this master critical section is also acquired from it) can lock up for a long time.

That's why you should always remember Critical Sections Rule One: “Use fine-grained critical sections.” Each critical section should protect access to only one resource. Oh, wait, isn't that Rule Two? Maybe Rule One is: “Always pair Acquire and Release.” Doesn't matter, just keep both in mind.

Because declaring and creating multiple critical sections is such a chore, smart programmers have invented various approaches that will help you minimize typing.

For example, some Delphi classes come with critical sections built in. TThreadList is an example of such a class.

If you are using Delphi 2009 or newer, the System unit includes the TMonitor class, which allows you to automatically create a critical section around any TObject. (Actually, it is not a critical section but a spinlock, but that is not really important at the moment.) Just use TMonitor.Enter(object) to lock it and TMonitor.Exit(object) to unlock it. The test code demonstrates this approach in the TMonitorThread thread class.
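A minimal sketch of this approach could look like the following. (The FLock field is my own invention for illustration; the test project may organize things differently. The try/finally block guarantees the lock is released even if the protected code raises an exception.)

```pascal
var
  FLock: TObject;  // any TObject instance can serve as the lock

procedure TMonitorThread.Execute;
var
  a: integer;
  i: integer;
begin
  for i := 1 to CNumRepetitions do begin
    // fully qualified to avoid the clash with Forms.TMonitor
    System.TMonitor.Enter(FLock);
    try
      a := GCounter;
      GCounter := a + FIncrement;
    finally
      System.TMonitor.Exit(FLock);
    end;
  end;
end;
```

FLock must, of course, be created (FLock := TObject.Create) before any thread starts using it and destroyed only after all threads are done.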

There's a problem with TMonitor, though. Well, there are two. Three, actually. The first one is that TMonitor is a badly chosen name, because another TMonitor has lived in the Forms unit since forever. Therefore you'll have to write System.TMonitor.Enter(object) if you use the Forms unit.

The second problem is the choice of method names for Acquire/Release, errr, Enter/Exit. Quite a few versions of Delphi back, the VCL defined the TSynchroObject class as a basis for all synchronisation classes, and that class names its access methods Acquire and Release. A different naming convention in TMonitor is, well, unfortunate.

The third and biggest problem I have with TMonitor is that it's slow. Somehow (and I don't yet know why), TMonitor.Enter runs about 30 times slower than TCriticalSection.Acquire. I intend to drill into this, and by the time this issue of Blaise Pascal Magazine is in your hands I will surely have published the results of the analysis on my blog. Check www.thedelphigeek.com for more details.

As I like to say, however, a design that requires fast locking is a bad design. Instead of speeding up the locking, it is always better to redesign the algorithm and remove locking altogether. And the first two “problems” are only my nitpicks. If you’re using Delphi 2009 or better you definitely want to look into TMonitor.

Execution time for 100,000 loops in 2 threads, Delphi 2010:

  Unsafe               1 ms
  critical section    27 ms
  TMonitor           611 ms
  mutex             1206 ms
  named mutex       1214 ms
  event             1116 ms
  semaphore         1102 ms
  spinlock             8 ms
  TOmniCS             27 ms

For the users of Delphi 2007 there's another solution: my TOmniCS. This is a smart record that implements Acquire and Release via a Windows API critical section and is initialized on first use. Just declare it and use it; there's no need to create or destroy the object. It is even compatible with the TSynchroObject approach. Because the Windows critical section is used internally, TOmniCS is fast, no slower than the pure Windows approach. The implementation is too complicated for this article, but you can read about it on my blog.
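Usage might look like this sketch (assuming the OtlSync unit from OmniThreadLibrary is on the search path; the FOmniCS field name and the surrounding method are mine):

```pascal
uses
  OtlSync;

var
  FOmniCS: TOmniCS;  // no Create needed; initialized on first use

procedure IncrementSafely(increment: integer);
var
  a: integer;
begin
  FOmniCS.Acquire;
  try
    a := GCounter;
    GCounter := a + increment;
  finally
    FOmniCS.Release;
  end;
end;
// no explicit destruction either; the record cleans up after itself
```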

Mutex

Because critical sections can only be used inside a single process, Windows provides another mechanism for inter-process synchronisation. While designed for inter-process use, mutexes can also be useful in a multithreaded, single-process environment.

The usage pattern is exactly the same: initialize, acquire, release and delete. The mutex is initialized by calling the CreateMutex API, which takes three parameters: attributes (leave it at nil), initial owner (set it to true if you want the mutex to be created in the acquired state) and the name. The last parameter can be nil (the mutex is unnamed) or can be set to some name. I'll return to mutex names later.

Mutexes can also be used via the TMutex class (SyncObjs unit), which is not implemented in older Delphis. If you're using the TMutex class, the code looks exactly the same as in the TCriticalSection approach. In the test project, however, I'm using the Windows API to show you that it is equally simple and slightly more powerful.

Back to the topic ... CreateMutex returns a THandle value (which is 0 if CreateMutex failed) which must be destroyed with the CloseHandle API when the mutex is no longer used.

To acquire the mutex you have to wait on it by calling WaitForSingleObject or some other variant of this API. To release the mutex, just call the ReleaseMutex API.

for i := 1 to CNumRepetitions do begin
  if WaitForSingleObject(FMutex, INFINITE) <> WAIT_OBJECT_0 then
    Exit;
  a := GCounter;
  GCounter := a + FIncrement;
  ReleaseMutex(FMutex);
end;

As with the critical sections, a mutex can be acquired multiple times by a single thread. And again as with the critical sections, each acquire must be paired with a corresponding release.

Mutexes have some advantages over critical sections. Firstly, the wait can be more flexible. Instead of a simple blocking Acquire there is the much more flexible WaitFor family of APIs, which can time out after some period and can wait on multiple synchronisation primitives at the same time. (Which is something I'll pretend doesn't exist, as the topic is definitely not appropriate for beginners.) Secondly, mutexes can be named. If you know a mutex's name, you don't have to create it globally and pass it around. Each thread can create its own mutex instead and the OS will take care of reusing the same mutex for all threads using the same name. The test project demonstrates this in the TNamedMutexThread thread.

procedure TNamedMutexThread.Execute;
var
  a: integer;
  i: integer;
  mutex: THandle;
begin
  mutex := CreateMutex(nil, false,
    '/Gp/D34ED6D0-B679-42AB-A470-9330F7190B7A/Blaise/Threading/3/TestMutex');
  Win32Check(mutex <> 0);
  for i := 1 to CNumRepetitions do begin
    if WaitForSingleObject(mutex, INFINITE) <> WAIT_OBJECT_0 then
      Exit;
    a := GCounter;
    GCounter := a + FIncrement;
    ReleaseMutex(mutex);
  end;
  CloseHandle(mutex);
end;

If you're using this approach you have to make sure that the names don't clash with names used by other applications. Named mutexes are shared across applications: if two apps use the same mutex name, they will lock each other out of the mutex-protected code. Naming a mutex “GlobalLock”, for example, is a very bad idea. I tend to use a company/personal prefix combined with a GUID (press Ctrl-Shift-G in the Delphi editor to generate one), the application name and the mutex name.
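Incidentally, this name sharing across applications also has a classic constructive use: detecting that another instance of your program is already running. A sketch of the idea (the name below is a made-up example following the convention above, not one from the test project):

```pascal
var
  singleInstanceMutex: THandle;
begin
  // hypothetical name: prefix + GUID + application name + mutex name
  singleInstanceMutex := CreateMutex(nil, false,
    '/MyCompany/8E2F0D14-3A57-4C21-9B64-0D1E5F7A2C33/MyApp/SingleInstance');
  Win32Check(singleInstanceMutex <> 0);
  if GetLastError = ERROR_ALREADY_EXISTS then begin
    // another copy of the application has already created this mutex
    ShowMessage('Application is already running');
    Halt;
  end;
  // ... run the application; call CloseHandle(singleInstanceMutex) at shutdown
end;
```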

There's also a downside to mutexes. Because they are implemented in the OS kernel, they are inherently slow: about 50 times slower than critical sections, by my measurements (your numbers may vary). That's why critical sections are preferred in most situations.

Semaphore

And now to something entirely more complicated ... semaphores. I’ll only touch them in passing and show how they can be used to protect critical resources. You should keep in mind that semaphores are much more powerful than this simple usage will show. If you find the multithreading topic interesting, you should continue by reading “The Little Book of Semaphores”, available free at http://www.greenteapress.com/semaphores/. Great stuff!

The most important semaphore feature is that semaphores can count. A semaphore can be used to limit access to one thread only, but it can also limit access to more than one thread, for example to 10 threads (the eleventh would be blocked).

As with the mutexes, named semaphores can be used to synchronise processes.

The semaphore API has a wrapper in the SyncObjs unit (not available in older Delphis) but again I've chosen to use the plain Windows API. The test project uses only unnamed semaphores.

To create a semaphore, call the CreateSemaphore API (or TSemaphore.Create). This API takes four parameters: attributes (nil is OK), initial count, maximum count and name (again, nil is OK). The semaphore count will be set to the initial count (and we'll see immediately what that means). To close the semaphore, use CloseHandle.
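A minimal creation/teardown sketch for the mutex-style usage shown below (the FSemaphore field name matches the loop fragment; the comments reflect the Windows API semantics, where a successful wait decrements a positive count, so the initial count must be 1 here):

```pascal
// create an unnamed semaphore that admits at most one thread at a time
FSemaphore := CreateSemaphore(
  nil,   // security attributes
  1,     // initial count: 1, so the first WaitFor succeeds
  1,     // maximum count: at most one thread inside
  nil);  // no name
Win32Check(FSemaphore <> 0);

// ... threads use WaitForSingleObject/ReleaseSemaphore on FSemaphore ...

CloseHandle(FSemaphore);  // when the semaphore is no longer needed
```

Passing, say, 10 for both the initial and maximum count would instead admit up to ten threads at once.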

To acquire the semaphore, use one of the WaitFor functions; WaitForSingleObject is used in the test project. If the semaphore count is greater than zero, it will be decremented by one and execution will continue. Otherwise, the program will wait in WaitForSingleObject until the semaphore count is incremented (by some thread releasing the semaphore) or until a timeout occurs.

To release the semaphore, call the ReleaseSemaphore API. This API takes three parameters: the semaphore handle, a release count (you can release more than one “count” of the semaphore in one operation) and a pointer to a variable receiving the previous semaphore count (before it was incremented). A nil pointer can be passed as the last parameter if you don't need to know the previous count.

for i := 1 to CNumRepetitions do begin
  if WaitForSingleObject(FSemaphore, INFINITE) <> WAIT_OBJECT_0 then
    Exit;
  a := GCounter;
  GCounter := a + FIncrement;
  ReleaseSemaphore(FSemaphore, 1, nil);
end;

When used as a mutex, the semaphore's initial count is set to 1 and the maximum count to 1. That way, the first WaitFor decrements the semaphore count to 0 and locks the critical code. ReleaseSemaphore is then called with a release count of 1 to set the semaphore's count back to 1.

Because they are kernel objects, semaphores work as fast (or as slow) as mutexes. Unlike a mutex, though, a semaphore cannot always be reacquired by a thread that already owns it. (It can be if its count is still greater than zero; in other words, the second WaitFor behaves just as if it was called from a different thread.)

BTW, if you dig semaphores, make sure to read my post on inverse semaphore.

Event

The last of the “big three” kernel synchronisation primitives is the event. In a way, it is similar to a mutex: it can be acquired and released, and you use WaitFor to access the critical code. The main difference is conceptual. An event is usually used inversely to a mutex: one doesn't wait on it to get access to some critical code, one waits for it to become signalled. As kernel synchronisation primitives, events can have names, can be used for inter-process synchronisation and are slow.

Although typically used to signal some change in global state, events can be used for resource protection. The test project uses events to lock access to critical code in the TEventThread thread.

procedure TfrmTestTThread.btnTestEventClick(Sender: TObject);
var
  thread: TThread;
begin
  GCounter := 0;
  FEvent := CreateEvent(nil, false, false, nil);
  Win32Check(FEvent <> 0);
  thread := TEventThread.Create(-1, FEvent);
  thread.FreeOnTerminate := true;
  thread.OnTerminate := ReportThreadTerminated;
  thread.Resume;
  thread := TEventThread.Create(1, FEvent);
  thread.FreeOnTerminate := true;
  thread.OnTerminate := ReportThreadTerminated;
  thread.Resume;
  SetEvent(FEvent);
  btnTestEvent.Enabled := false;
end;

procedure TEventThread.Execute;
var
  a: integer;
  i: integer;
begin
  for i := 1 to CNumRepetitions do begin
    if WaitForSingleObject(FEvent, INFINITE) <> WAIT_OBJECT_0 then
      Exit;
    a := GCounter;
    GCounter := a + FIncrement;
    SetEvent(FEvent);
  end;
end;

The thread creation code is very similar to the mutex version, except that at the end the global event is set, which puts the event into the signalled state.

The two threads protect the critical code by calling WaitForSingleObject on the event. If the event is signalled, WaitFor puts the event back into the non-signalled (reset) state and execution continues. Otherwise, WaitFor blocks until the event becomes signalled (set) or a timeout occurs.

At the end of the critical section, SetEvent sets the event and allows another increment or decrement to execute.

An event can also be configured as a “manual reset” event at creation time. When created like this, WaitFor will not reset the event; such an event can only be reset by calling the ResetEvent API.
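A sketch of the manual-reset variant (the parameters of CreateEvent are, in order: security attributes, manual reset, initial state, name; the comments show the behavioural difference from the auto-reset event used above):

```pascal
FEvent := CreateEvent(
  nil,   // security attributes
  true,  // manual reset: WaitFor will NOT reset the event
  false, // initially non-signalled
  nil);  // no name
Win32Check(FEvent <> 0);

SetEvent(FEvent);    // ALL waiting threads are released, and the event
                     // stays signalled; later WaitFor calls return at once
ResetEvent(FEvent);  // only an explicit ResetEvent makes it non-signalled again
```

This makes manual-reset events useful as broadcast flags (“shutdown requested”, “data ready for everyone”) rather than as locks.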

Advanced synchronisation

With that we're leaving the area of “basic and simple” synchronisation tools and entering the “messy and practical” realm. I don't have space to write a long description of the other synchronisation primitives, so I'll only give them a few words each.

A very useful tool in practical situations is the “MREW” (multiple readers, exclusive writer) lock. It allows multiple readers of a shared resource but only one writer. More strictly, the writer can only get access if there are no readers, and readers can only get access if there is no writer. This type of synchronisation is excellent when global structures are mostly read from and rarely modified, as it allows multiple reading threads to execute without blocking. Delphi's implementation, TMultiReadExclusiveWriteSynchronizer, lives in the SysUtils unit.
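A minimal sketch of MREW usage (the FMREW field and the protected list are my own invention; both would be created at startup and destroyed at shutdown):

```pascal
uses
  SysUtils, Classes;

var
  FMREW: TMultiReadExclusiveWriteSynchronizer;
  FSharedList: TStringList;  // the shared resource being protected

function ReadFirstItem: string;
begin
  FMREW.BeginRead;   // many readers may hold the lock at once
  try
    if FSharedList.Count > 0 then
      Result := FSharedList[0]
    else
      Result := '';
  finally
    FMREW.EndRead;
  end;
end;

procedure AddItem(const value: string);
begin
  FMREW.BeginWrite;  // exclusive; waits until all readers are done
  try
    FSharedList.Add(value);
  finally
    FMREW.EndWrite;
  end;
end;
```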

The slim reader/writer lock (SRW) is the Windows implementation of the MREW lock. It is fast, but it is implemented only in Vista and newer, which limits its use in generic code. I haven't had an opportunity to use SRW locks in practice and cannot yet comment on their usefulness.

Asynchronous Procedure Call (APC) is a Windows mechanism that allows a piece of code to be executed in another thread. You can think of it as a Synchronize that also works between background threads. (Synchronize is limited to executing code in the main thread.) The target thread cannot, however, be interrupted at any point; it must be sitting in an alertable wait, i.e. in some variant of the Sleep or WaitFor APIs.
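The mechanics can be sketched like this (the procedure and variable names are mine; QueueUserAPC and SleepEx are the actual Windows APIs, and the APC routine must use the stdcall convention):

```pascal
// the procedure to run inside the target thread
procedure MyAPCProc(param: ULONG_PTR); stdcall;
begin
  // executes in the thread that entered the alertable wait
end;

// from some other thread, queue the call; targetThreadHandle is
// a THandle for the target thread (e.g. TThread.Handle)
Win32Check(QueueUserAPC(@MyAPCProc, targetThreadHandle, 0) <> 0);

// meanwhile, the target thread must enter an alertable wait,
// for example:
SleepEx(INFINITE, true {alertable});
// SleepEx returns WAIT_IO_COMPLETION after the queued APC has run
```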

Condition variables are synchronisation primitives that enable threads to wait until a particular condition occurs. They can be used together with critical sections in atomic operations (the SleepConditionVariableCS API) but they are somewhat complicated to understand and are implemented only in Windows Vista and newer. Delphi wraps this API in the TConditionVariableCS class (SyncObjs unit), at least in the newest versions.
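The basic producer/consumer pattern, sketched via the raw API (variable names are mine; on older Delphis these Vista-only API declarations may have to be imported manually):

```pascal
var
  CondVar: CONDITION_VARIABLE;     // initialized with InitializeConditionVariable
  CritSect: TRTLCriticalSection;   // initialized with InitializeCriticalSection
  GDataReady: boolean;

// consumer thread:
EnterCriticalSection(CritSect);
while not GDataReady do
  // atomically releases CritSect, sleeps, then reacquires CritSect on wakeup
  SleepConditionVariableCS(CondVar, CritSect, INFINITE);
// ... consume the data while still holding the critical section ...
LeaveCriticalSection(CritSect);

// producer thread:
EnterCriticalSection(CritSect);
GDataReady := true;
LeaveCriticalSection(CritSect);
WakeConditionVariable(CondVar);  // wake one waiting consumer
```

The while loop (rather than a simple if) guards against spurious wakeups, which the API explicitly allows.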

Signoff

This concludes our short tour of synchronisation primitives. Next time I’ll cover another very important multithreading topic – communication between threads. After that, you’ll be ready to start banging out your own multithreading code.
