Friday, June 10, 2011

Lock-free vs. Locking: Rematch!

My article on lock-free vs. locking data queues was well-received but there were also some comments and complaints. To answer the critics (your comments were all very appreciated!) and to fix some issues with the previous test I’ve decided to stage a rematch.

New contestants

Both Iztok and Chris have provided me with new versions of their queue implementations. Iztok’s code was tuned a little and changed to use TOmniValue instead of TAnyValue for better comparison with TOmniQueue. Chris has created three different variants of his code.
The new contestants are:

  • TOmniQueue [TOQ] – A lock-free queue implementation from the OmniThreadLibrary
  • TSimpleThreadedQueueSem [TSTQS] – Variant of Chris’ original code that uses TCriticalSection and TSemaphore instead of TMonitor.
  • TSimpleThreadedQueueLL [TSTQLL] – Variant of TSTQS that uses linked list instead of TQueue and has spinning enabled on critical sections.
  • TSimpleThreadedQueueNoWait [TSTQNW] – A simplified variant of TSTQLL that doesn’t implement Dequeue’s waiting functionality.
  • TThreadSafeQueue [TTSQ] – Iztok’s code with TOmniValue as a base data transfer element.
  • TBlockingCollection [TBC] – A wrapper around TOQ that adds multiple features useful for multithreaded programming (timeouts, blocking enumerator, throttling), included in the OmniThreadLibrary.

I should point out that a direct comparison between all those implementation is not completely fair. Some of them are necessarily more complex because they implement additional functionality. In the interest of fairness we should split them in two groups, first containing only “simple” queue implementations (TOQ, TSTQNW, TTSQ) and another containing implementation that support timeout in Dequeue (TSTQS, TSTQLL, TBC).
There’s also a big difference between “polymorphic” queues (TOQ, TTSQ, TBC) that can contain almost any data (except records) and “typed” queues (TSTQS, TSTQLL, TSTQNW) which can contain only data of any type (but it can be almost any type – as allowed by the Delphi’s generics implementation).
TTSQ supports additional optimization – preallocation of memory used for linked list storage. In my tests, 50,000 elements were preallocated. The time needed to allocated them is not included in test measurements (as this happened in the main thread before the test suite was initialized) which makes a huge improvement in TTSQ performance.

New test suite

The new test suite was changed a little to produce more accurate results. Thread setup and teardown is now fully ignored and readers/writers are started all at once.
In addition to the original integer-sending code two string tests were added. First (string-1) is sending the same string multiple times and the second (string-2) is sending newly allocated string in each iteration. For example, the writer code in TOmniBlockingCollection test is now implemented as a following (anonymous) method:

procedure
var
  i   : integer;
  guid: TGUID;
  s   : string;
begin
  CoCreateGuid(guid);
  s := GUIDToString(guid);
  for i := 1 to CNumMessagesPerThread do
    if subtest = 1 then
      queue.Add(i)
    else begin
      if subtest = 3 then begin
        CoCreateGuid(guid);
        s := GUIDToString(guid);
      end;
      queue.Add(s);
    end;
end

Subtest 1 (integer) sends integers, subtest 2 (string-1) sends always the same string and subtest 3 (string-2) sends a fresh string each time.
This of course affects the code. In the integer case, most of the time is spent in the queue code and only a little in the wrapper executing the code. Although this is fine for stress-testing, it is not a realistic scenario. In real world, readers and writers would do more than just send and receive data. Therefore, string-1 and string-2 test may provide more measurements that are more appropriate for your usage.

The whole benchmark (6 test * 8 reader configurations * 8 writer configurations * 3 subtests * 10 repetitions) now takes almost 9 hours to execute on my not-so-slow computer (2 x E5430 Xeon @ 2.67 GHz). It was executed on freshly rebooted machine that was not running any additional programs.

Benchmark source, executable, result files and Excel file used for analysis can be downloaded from http://17slon.com/blogs/gabr/files/lock-free-vs-locking-rematch.zip.

Results

As you can see from the charts below, blocking TTSQ raised to the top in more complicated tests while lock-free TOQ always performs at least satisfyingly good although it is usually not the best performer. I’m guessing that the big difference occurs because of the memory allocations, which occur before the test in the TTSQ but during the test in the TOQ. [There may be a way to speed up TOQ by reducing memory allocations. If I can find one, I’ll repeat the test.]

If you need Dequeue with timeout then TBC is still the best bet by far.

Results by subtest

integer @ 1, 4, 8 readers

Integer subtest results did not change much from the previous benchmark except that the numbers (messages/ms) are much higher because of better benchmark implementation. TSTQNW is the fastest when there are less then five writers and both lock-free implementations are fastest with five writers and more. TTSQ is always the slowest.

imageimageimage

string-1 @ 1, 4, 8 readers

With the string-1 subtest, situation is changed. TSTQS is the slowest in all tests. Most of the tests are comparably fast when only one reader is running with an exception of TSTQNW which flies when there are at most four writers. The situation is different when four and eight readers are running. In those cases, TTSQ (which was the slowest in the previous test) starts pulling high above the competition.

imageimageimage

string-2 @ 1, 4, 8 readers

In string-2 test with one reader, TOQ and TBC have a slight lead over TTSQ, TBC is definitely the slowest one while TSTQLL and TSTQNW are holding the middle. When number of readers are increasing, TTSQ again takes the lead and all three TSTQ implementations become very slot. Lock-free implementations are holding the middle ground.

imageimageimage

Results by implementation

TOmniQueue – integer, string-1 and string-2

In the integer test, TOQ performs well in all configurations. In string-1 test, there’s a big difference between one reader and more than one reader (the reason is unknown). In string-2 test the total throughput of system actually improves with more threads, which is a good sign for real-world applications.

image imageimage

TSimpleThreadedQueueSem – integer, string-1 and string-2

TSTQS improves with increasing number of writers in integer and string-1 tests, but performs very bad in all configurations (except with one reader and one writer) in string-2 test.

imageimageimage

TSimpleThreadedQueueLL – integer, string-1 and string-2

TSTQLL reaches its peak in the integer test with three to four writers. In string-1 and string-2 tests the speed drops drastically when number of readers increases.

imageimageimage

TSimpleThreadedQueueNoWait – integer, string-1 and string-2

The TSTQNW version is very fast with small number of writers in both integer and string-2 but the numbers drop when there are more than eight threads in the system. String-2 also shows big speed decrease when there are too many readers.

imageimageimage

TThreadSafeQueue – integer, string-1 and string-2

TTSQ shows constant but slow results in the integer test and very good results all over the scale in the string-1 and string-2 tests.

imageimageimage

TBlockingCollection – integer, string-1 and string-2

TBC performs very well in the integer test while the numbers in string-1 and string-2 tests are lower. Nevertheless, the performance doesn’t decrease with increasing number of tasks.

imageimageimage

7 comments:

  1. Hey, what happened to your site layout? I almost didn't recognize who it was at first...

    ReplyDelete
  2. Surprise! :)

    I was tired of the old look and wanted something fresher.

    ReplyDelete
  3. Primož, thanks for all the work you put into this. It is a very detailed test, far better I would ever make. And the results are interesting.

    The only thing I just don't understand is how it is possible that TTSQ is actually slower with integers as it is with strings. It is also slower then last time when TAnyValue was used. And TAnyValue it slower then TOmniValue. I guess it comes to some weird state that one of the readers is blocking or something like this :)

    Anyway I thinks these are good result and you took into consideration all the factors you could. It also shows how hard it is to correctly test such things.

    ReplyDelete
  4. Iztok, I also don't understand this weird result. (And sadly I don't have a day or two to further analyze the integer behaviour of TTSQ).

    On an unrelated not, I implemented better memory handling in TOQ, but the results did not improve much. Too bad :(

    ReplyDelete
  5. I did not imply that you should look this any further. You did more than enough. Lets wrap this up and call it a good test.

    ReplyDelete
  6. Naah, no way! ;)

    I want to test another aspect - how well are readers and writers working in parallel. With the current test, it is entirely possible for an implementation to block all readers and writers but one of each and still get good total throughput.

    But first I have to enhance OTL a little as I can't pass parameters to Parallel.Async at the moment ...

    ReplyDelete
  7. Ok then I would really like to get to the bottom of the Integer mystery. ;) Tell me if I can help.

    I will also have a look at TAnyValue and see if there is some room for improvement.

    ReplyDelete