My article on lock-free vs. locking data queues was well-received but there were also some comments and complaints. To answer the critics (your comments were all very appreciated!) and to fix some issues with the previous test I’ve decided to stage a rematch.
Both Iztok and Chris have provided me with new versions of their queue implementations. Iztok’s code was tuned a little and changed to use TOmniValue instead of TAnyValue for better comparison with TOmniQueue. Chris has created three different variants of his code.
The new contestants are:
- TOmniQueue [TOQ] – A lock-free queue implementation from the OmniThreadLibrary
- TSimpleThreadedQueueSem [TSTQS] – Variant of Chris’ original code that uses TCriticalSection and TSemaphore instead of TMonitor.
- TSimpleThreadedQueueLL [TSTQLL] – Variant of TSTQS that uses linked list instead of TQueue and has spinning enabled on critical sections.
- TSimpleThreadedQueueNoWait [TSTQNW] – A simplified variant of TSTQLL that doesn’t implement Dequeue’s waiting functionality.
- TThreadSafeQueue [TTSQ] – Iztok’s code with TOmniValue as a base data transfer element.
- TBlockingCollection [TBC] – A wrapper around TOQ that adds multiple features useful for multithreaded programming (timeouts, blocking enumerator, throttling), included in the OmniThreadLibrary.
I should point out that a direct comparison between all those implementation is not completely fair. Some of them are necessarily more complex because they implement additional functionality. In the interest of fairness we should split them in two groups, first containing only “simple” queue implementations (TOQ, TSTQNW, TTSQ) and another containing implementation that support timeout in Dequeue (TSTQS, TSTQLL, TBC).
There’s also a big difference between “polymorphic” queues (TOQ, TTSQ, TBC) that can contain almost any data (except records) and “typed” queues (TSTQS, TSTQLL, TSTQNW) which can contain only data of any type (but it can be almost any type – as allowed by the Delphi’s generics implementation).
TTSQ supports additional optimization – preallocation of memory used for linked list storage. In my tests, 50,000 elements were preallocated. The time needed to allocated them is not included in test measurements (as this happened in the main thread before the test suite was initialized) which makes a huge improvement in TTSQ performance.
New test suite
The new test suite was changed a little to produce more accurate results. Thread setup and teardown is now fully ignored and readers/writers are started all at once.
In addition to the original integer-sending code two string tests were added. First (string-1) is sending the same string multiple times and the second (string-2) is sending newly allocated string in each iteration. For example, the writer code in TOmniBlockingCollection test is now implemented as a following (anonymous) method:
procedure var i : integer; guid: TGUID; s : string; begin CoCreateGuid(guid); s := GUIDToString(guid); for i := 1 to CNumMessagesPerThread do if subtest = 1 then queue.Add(i) else begin if subtest = 3 then begin CoCreateGuid(guid); s := GUIDToString(guid); end; queue.Add(s); end; end
Subtest 1 (integer) sends integers, subtest 2 (string-1) sends always the same string and subtest 3 (string-2) sends a fresh string each time.
This of course affects the code. In the integer case, most of the time is spent in the queue code and only a little in the wrapper executing the code. Although this is fine for stress-testing, it is not a realistic scenario. In real world, readers and writers would do more than just send and receive data. Therefore, string-1 and string-2 test may provide more measurements that are more appropriate for your usage.
The whole benchmark (6 test * 8 reader configurations * 8 writer configurations * 3 subtests * 10 repetitions) now takes almost 9 hours to execute on my not-so-slow computer (2 x E5430 Xeon @ 2.67 GHz). It was executed on freshly rebooted machine that was not running any additional programs.
Benchmark source, executable, result files and Excel file used for analysis can be downloaded from http://17slon.com/blogs/gabr/files/lock-free-vs-locking-rematch.zip.
As you can see from the charts below, blocking TTSQ raised to the top in more complicated tests while lock-free TOQ always performs at least satisfyingly good although it is usually not the best performer. I’m guessing that the big difference occurs because of the memory allocations, which occur before the test in the TTSQ but during the test in the TOQ. [There may be a way to speed up TOQ by reducing memory allocations. If I can find one, I’ll repeat the test.]
If you need Dequeue with timeout then TBC is still the best bet by far.
Results by subtest
integer @ 1, 4, 8 readers
Integer subtest results did not change much from the previous benchmark except that the numbers (messages/ms) are much higher because of better benchmark implementation. TSTQNW is the fastest when there are less then five writers and both lock-free implementations are fastest with five writers and more. TTSQ is always the slowest.
string-1 @ 1, 4, 8 readers
With the string-1 subtest, situation is changed. TSTQS is the slowest in all tests. Most of the tests are comparably fast when only one reader is running with an exception of TSTQNW which flies when there are at most four writers. The situation is different when four and eight readers are running. In those cases, TTSQ (which was the slowest in the previous test) starts pulling high above the competition.
string-2 @ 1, 4, 8 readers
In string-2 test with one reader, TOQ and TBC have a slight lead over TTSQ, TBC is definitely the slowest one while TSTQLL and TSTQNW are holding the middle. When number of readers are increasing, TTSQ again takes the lead and all three TSTQ implementations become very slot. Lock-free implementations are holding the middle ground.
Results by implementation
TOmniQueue – integer, string-1 and string-2
In the integer test, TOQ performs well in all configurations. In string-1 test, there’s a big difference between one reader and more than one reader (the reason is unknown). In string-2 test the total throughput of system actually improves with more threads, which is a good sign for real-world applications.
TSimpleThreadedQueueSem – integer, string-1 and string-2
TSTQS improves with increasing number of writers in integer and string-1 tests, but performs very bad in all configurations (except with one reader and one writer) in string-2 test.
TSimpleThreadedQueueLL – integer, string-1 and string-2
TSTQLL reaches its peak in the integer test with three to four writers. In string-1 and string-2 tests the speed drops drastically when number of readers increases.
TSimpleThreadedQueueNoWait – integer, string-1 and string-2
The TSTQNW version is very fast with small number of writers in both integer and string-2 but the numbers drop when there are more than eight threads in the system. String-2 also shows big speed decrease when there are too many readers.
TThreadSafeQueue – integer, string-1 and string-2
TTSQ shows constant but slow results in the integer test and very good results all over the scale in the string-1 and string-2 tests.
TBlockingCollection – integer, string-1 and string-2
TBC performs very well in the integer test while the numbers in string-1 and string-2 tests are lower. Nevertheless, the performance doesn’t decrease with increasing number of tasks.