Thursday, May 26, 2011

Lock-free vs. locking

For the past few days I have been running tests on three different queue implementations – one lock-free, one based on Delphi’s TQueue<T> and TMonitor, and one implemented using a critical section and a linked list. For good measure I’ve also thrown a blocking wrapper around the lock-free queue into the mix. [Mostly because I’m using it a lot (it has useful features for multithreading work) and I wondered how much of a time penalty it adds to my applications.]
During that time I discovered a showstopper bug in TMonitor which makes TSimpleThreadedQueue practically useless with more than one reader. Still, I’m including partial results for that implementation too, as they provide a good comparison.

Test suite

The test is very straightforward. It uses Parallel.Async (requiring D2009 to recompile) to start a few reader and a few writer tasks. Writers write 100,000 messages each into a shared queue and readers read from the shared queue until there’s nothing more to read. A boolean flag is used to indicate this condition. [It is toggled when all writers complete their work.] IOmniResourceCount is used in the main thread to detect when all readers/writers have completed their work.
The writer part is simple.

  Parallel.Async(
    procedure
    var
      i: integer;
    begin
      WaitForSingleObject(startWriting, INFINITE);
      firstWriter.CAS(0, DSiTimeGetTime64);
      for i := 1 to CNumMessagesPerThread do
        queue.Enqueue(i);
      countWriters.Allocate;
    end);
There are a few tricks here. Firstly, an event is used to start all writers at once. Secondly, the first writer to start will update the firstWriter variable to record the time when the test started. [CAS will fail for all other writers because firstWriter will no longer contain zero.] Thirdly, each writer “logs out” by calling countWriters.Allocate at the end so that the main test method knows when all writers have completed their work.
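For context, the main test method ties these fragments together roughly as follows. This is only a sketch, not the actual test code: variable names follow the fragments above, and it assumes that IOmniResourceCount’s Handle becomes signalled when the count drops to zero, which is how the Allocate calls above appear to be used.

```delphi
// Sketch of the main test method (hypothetical; the real code may differ).
startWriting := CreateEvent(nil, true, false, nil); // manual-reset event
countWriters := CreateResourceCount(CNumWriters);
countReaders := CreateResourceCount(CNumReaders);
allWritersDone := false;
// ... start CNumReaders reader and CNumWriters writer tasks here ...
SetEvent(startWriting);                             // release all writers at once
WaitForSingleObject(countWriters.Handle, INFINITE); // all writers “logged out”
allWritersDone := true;                             // readers may now exit
WaitForSingleObject(countReaders.Handle, INFINITE); // all readers done
```

Using a manual-reset event means every writer blocked in WaitForSingleObject is released by the single SetEvent call, so all writers start hammering the queue at (nearly) the same moment.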
Readers are only slightly more complicated.

  Parallel.Async(
    procedure
    var
      value: TOmniValue;
    begin
      repeat
        while queue.TryDequeue(value) do
          ;
      until allWritersDone;
      lastReader.Value := DSiTimeGetTime64;
      countReaders.Allocate;
    end);
A reader tries to dequeue elements from the queue. When the queue is empty, it checks the allWritersDone flag and exits only if the flag is set. If it is not set, writers are still doing their job but the readers have emptied the queue (maybe all writers are blocked), so the reader keeps polling.
At the end, the lastReader value is updated to contain the time when the last reader completed its work (TGp8AlignedInt64 takes care of the atomic update here, so it’s not possible for this value to contain a mix of two times from two threads) and the reader “logs out”.
When using TOmniBlockingCollection this test becomes much simpler.

  Parallel.Async(
    procedure
    var
      value: TOmniValue;
    begin
      while queue.Take(value) do
        ;
      lastReader.Value := DSiTimeGetTime64;
      countReaders.Allocate;
    end);
With blocking collection, Take will block if the collection is empty but writers are still doing their work.
At the end, the test method calculates message throughput – the number of messages that travelled through the system in one millisecond. [Note: this is the throughput for all threads together. When multiple readers/writers are present, each reader/writer thread read/wrote less than this number of messages per millisecond.]
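The calculation itself is trivial; something along these lines (a sketch, not the actual test code – names follow the fragments above):

```delphi
// Test time spans from the moment the first writer started (firstWriter,
// set via CAS) to the moment the last reader finished (lastReader).
totalMessages := CNumWriters * CNumMessagesPerThread;
durationMS    := lastReader.Value - firstWriter.Value; // milliseconds
throughput    := totalMessages / durationMS;           // msg/ms, all threads together
```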
The test is run with 1 to 8 readers and 1 to 8 writers (all in all, 64 combinations). The test for each reader/writer combination is repeated ten times. The slowest measurement is ignored (to remove noise caused by system activity) and the other nine measurements are averaged.
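In other words, each reader/writer combination is reduced to a single number like this (a hypothetical helper, not the actual test code; it uses Delphi’s Math unit and assumes runs holds the ten measured times, so the slowest run is the largest value):

```delphi
uses Math;

function AverageWithoutSlowest(const runs: array of Double): Double;
begin
  // Drop the slowest (largest) time and average the remaining nine runs.
  Result := (Sum(runs) - MaxValue(runs)) / (Length(runs) - 1);
end;
```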
All measurements for all tests are dumped into the file speedtest.csv (included in the test download). Columns indicate the number of writers and rows the number of readers. Each test block starts with a header row. The first result block belongs to TOmniQueue, the second to TSimpleThreadedQueue (only the first line contains values because this queue breaks down when multiple readers are in use, due to TMonitor bugs), the third to TThreadSafeQueue and the fourth to TOmniBlockingCollection.

[image: speedtest.csv results]
All test times are logged to the file speedtest.log (also included in the test download).
All tests were executed on a dual-CPU system containing 8 cores (4 per CPU).

Comparison

The two charts below represent a cross-section of all the data. The first chart shows the throughput of the complete system plotted against the number of writers. Data is shown in nine series: TOmniQueue (marked TOQ in the chart), TThreadSafeQueue (TTSQ) and TOmniBlockingCollection (TOBC) show data for 1, 4 and 8 readers, while TSimpleThreadedQueue (TSTQ) shows only data for 1 reader.

[image: throughput vs. number of writers]
The lock-free implementation (TOQ series) starts above all locking implementations and stays there as the number of writers increases. Interestingly, TOBC starts quite low (which I expected, due to its quite complicated implementation) but then improves as the system becomes congested. [Keep in mind that the test system had 8 cores and this chart shows data for as many as 16 parallel tasks (8 readers + 8 writers).] What is even more interesting is that all TOQ series approach 750 msg/ms as the number of writers increases, while all locking implementations approach 200 msg/ms. TSimpleThreadedQueue is the slowest throughout – one more reason to stay away from TQueue and TMonitor…
If we transpose the chart and plot throughput against the number of readers, with series representing the number of writers, we can see that for a small number of writers (red line, 1 writer, TTSQ) TThreadSafeQueue actually works quite well, but that the throughput drops almost linearly as the number of readers increases.

[image: throughput vs. number of readers]
The Excel file used to generate those charts is included in the downloadable test.

TOmniQueue

The TOmniQueue-specific charts show that the queue is much faster when the total number of readers + writers is small (no surprise) and that it degrades gracefully when there are more readers and writers than cores.

[image: TOmniQueue throughput vs. number of writers]
[image: TOmniQueue throughput vs. number of readers]

TThreadSafeQueue

TThreadSafeQueue seems to have a sweet spot with two writers. Its speed begins to drop with more writers and quickly settles in the 200 to 300 msg/ms range.

[image: TThreadSafeQueue throughput vs. number of writers]

The transposed chart shows an interesting drop as the number of readers increases when there’s only one writer. I’m guessing that the writer keeps getting locked out and cannot put data into the queue fast enough.

[image: TThreadSafeQueue throughput vs. number of readers]

TOmniBlockingCollection

TOmniBlockingCollection clearly shows a big overhead when many readers but few writers are present; the cause is most probably the same as with the single writer in TThreadSafeQueue. When there are enough writers, total throughput stays high at all times.

[image: TOmniBlockingCollection throughput vs. number of writers]
[image: TOmniBlockingCollection throughput vs. number of readers]

Conclusion

Use lock-free queues as there’s really no reason not to use them.

11 comments:

  1. Very nice! I will definitely try this.

    Warren

    ReplyDelete
  2. Great analysis as always. But could you please rerun the tests with a string constant as the payload instead of an Integer? As I mentioned in the comments, the difference in throughput I measured came from TOmniValue being a lot faster than TAnyValue for numerical values. When both use strings as payload (then both use interfaces), the speed is roughly the same with 8 writers and one reader.

    That is the way I tested. I am afraid much of the difference you saw came from TAnyValue being slow.

    ReplyDelete
  3. Oh I tested with generated GUID values for each item to eliminate any memory and compiler manipulation and to achieve diversification.

    ReplyDelete
  4. @Iztok: Great idea, will do.

    ReplyDelete
  5. Great, thanks for your effort, I appreciate it. I also uploaded a new version of my threading unit where I removed one unneeded Enter / Leave block. My tests show that with a string payload and 8 writers / one reader, TThreadSafeQueue is actually faster than TOmniQueue. But I only have a quad-core processor, so I am very interested in what your tests will show.

    ReplyDelete
  6. Iztok, you may also want to update http://www.cromis.net/blog/downloads/cromis-threading/. (I found the new threading unit in full download so no hurry.)

    ReplyDelete
  7. This highlights some huge performance differences, well done.

    I guess it also shows that either one should do multi-threading "right", or should stick to single-threaded code. Half-hearted multi-threading efforts just don't pay back.

    ReplyDelete
  8. @gabr: updated the page, thanks
    @Eric: Can you be more elaborate :)

    ReplyDelete
  9. This isn't necessarily comparing like with like. When I fix TSimpleThreadedQueue to use a critical section and a semaphore instead of a monitor, its performance is still crap. Replacing TQueue with a simple linked list sped things up a bit, and adding a spin count to the critical section a bit more. However, TThreadSafeQueue was still beating it, so I had a look at TThreadSafeQueue's code and realised it wasn't implementing any waiting functionality in its Dequeue method (quite a significant absence, no?). Removing the semaphore from the revised TSimpleThreadedQueue (now quite different from the original, admittedly), and TThreadSafeQueue was now toast.

    ReplyDelete
  10. Good point, Chris, I totally missed that. I'm preparing a rematch - would you like to contribute your new implementation?

    ReplyDelete
  11. See here: http://delphihaven.wordpress.com/code/tsimplethreadedqueue-variants/.
    I've worked around a couple compiler limitations/bugs re generic records, but the code should still deserve the epithet 'simple' all told.

    ReplyDelete