During that time I discovered a showstopper bug in TMonitor which made TSimpleThreadedQueue practically useless when more than one reader is used. Still, I’m including partial results for that implementation too, as they provide a good comparison.
Test suite

The test is very straightforward. It uses Parallel.Async (requiring D2009 to recompile) to start a few reader and a few writer tasks. Each writer writes 100,000 messages into a shared queue and the readers read from that queue until there’s nothing more to read. A boolean flag is used to indicate this condition. [It is toggled when all writers complete their work.] IOmniResourceCount is used in the main thread to detect when all readers/writers have completed their work.
The writer code is simple.
```delphi
Parallel.Async(
  procedure
  var
    i: integer;
  begin
    WaitForSingleObject(startWriting, INFINITE);
    firstWriter.CAS(0, DSiTimeGetTime64);
    for i := 1 to CNumMessagesPerThread do
      queue.Enqueue(i);
    countWriters.Allocate;
  end);
```

There are a few tricks here. Firstly, an event is used to start all writers at once. Secondly, the first writer to run updates the firstWriter variable to record the time when the test started. [The CAS will fail for all other writers because firstWriter no longer contains zero.] Thirdly, each writer “logs out” by calling countWriters.Allocate at the end so that the main test method knows when all writers have completed their work.
Readers are only slightly more complicated.
```delphi
Parallel.Async(
  procedure
  var
    value: TOmniValue;
  begin
    repeat
      while queue.TryDequeue(value) do
        ;
    until allWritersDone;
    lastReader.Value := DSiTimeGetTime64;
    countReaders.Allocate;
  end);
```

The reader tries to dequeue elements from the queue. If the queue is empty, it checks the allWritersDone flag and exits only if it is set. If not, the writers are still doing their job but the readers have emptied the queue (perhaps all writers are blocked).
At the end, the lastReader value is updated to contain the time when the last reader completed its work (TGp8AlignedInt64 takes care of the atomic update here, so it is not possible for this value to contain a mix of two times from two threads) and the reader “logs out”.
When using TOmniBlockingCollection this test becomes much simpler.
```delphi
Parallel.Async(
  procedure
  var
    value: TOmniValue;
  begin
    while queue.Take(value) do
      ;
    lastReader.Value := DSiTimeGetTime64;
    countReaders.Allocate;
  end);
```

With a blocking collection, Take blocks if the collection is empty while the writers are still doing their work.
At the end, the test method calculates message throughput – the number of messages that travelled through the system in one millisecond. [Note: this is the throughput of all threads together. When multiple readers/writers are present, every reader/writer thread read/wrote fewer messages than this in one millisecond.]
The test is run with 1 to 8 readers and 1 to 8 writers (64 combinations in all). The test for each reader/writer combination is repeated ten times. The slowest measurement is discarded (to remove noise caused by system activity) and the other nine measurements are averaged.
All measurements for all tests are dumped into the file speedtest.csv (included in the test download). Columns indicate the number of writers and rows the number of readers. Each test contains a header row. The first result block belongs to TOmniQueue, the second to TSimpleThreadedQueue (only the first line contains values, because this queue breaks down when multiple readers are in use due to the TMonitor bugs), the third to TThreadSafeQueue and the fourth to TOmniBlockingCollection.
All test times are logged to file speedtest.log (also included in the test download).
All tests were executed on a dual-CPU system with 8 cores (4 per CPU).
Comparison

The two charts below represent a cross-section of all the data. The first chart shows the throughput of the complete system plotted against the number of writers. The data is shown in ten series: TOmniQueue (marked TOQ in the chart), TThreadSafeQueue (TTSQ) and TOmniBlockingCollection (TOBC) show data for 1, 4 and 8 readers, while TSimpleThreadedQueue (TSTQ) shows data for 1 reader only.
The lock-free implementation (TOQ series) starts above all locking implementations and stays there as the number of writers increases. Interestingly, TOBC starts quite low (which I expected, due to its rather complicated implementation) but then improves as the system becomes congested. [Keep in mind that the test system had 8 cores, while this chart shows data for as many as 16 parallel tasks (8 readers + 8 writers).] Even more interesting is that both TOmniQueue-based implementations (TOQ and TOBC) approach 750 msg/ms as the number of writers increases, while all locking implementations approach 200 msg/ms. TSimpleThreadedQueue is the slowest throughout – one more reason to stay away from TQueue and TMonitor…
If we transpose the chart and plot throughput against the number of readers, with series representing the number of writers, we can see that with a small number of writers (red line, 1 writer, TTSQ) TThreadSafeQueue actually works quite well, but the throughput drops almost linearly as the number of readers increases.
The Excel file used to generate those charts is included in the downloadable test.
TOmniQueue

The TOmniQueue-specific chart shows that the queue is much faster when the total number of readers + writers is small (no surprise there) and that it degrades gracefully when there are more readers and writers than cores.
TThreadSafeQueue

TThreadSafeQueue seems to have a sweet spot at two writers. Its speed begins to drop with more writers and quickly settles in the 200–300 msg/ms range.
The transposed chart shows an interesting drop when the number of readers increases and there’s only one writer. I’m guessing that the writer keeps getting locked out and cannot put data into the queue fast enough.