Wednesday, April 25, 2018

Multithreading is (still) hard!

Multithreading is hard!

No matter how long you have dealt with it and how good you (think you) are, you will make a mistake. Usually, it will be a problem that only shows itself in rare circumstances, most probably on a hard-to-reach customer machine. With some (bad) luck it will only appear on Friday afternoons or during your vacation.

That is why I always introduce multithreading with the "Don't do it yourself!" motto. Use a standard library! (And by that I mean OmniThreadLibrary, of course. ;) ) Like your code, it too will have bugs. Unlike your code, it has 1000+ users running it in very different environments, which means that at least it is tested as thoroughly as possible.

There are, nevertheless, bugs that escape detection for a long time. In 2011, for example, I fixed a well-hidden problem in TOmniBlockingCollection ("Multithreading is Hard!"). As nasty as that was, it was nothing compared to the bug I found recently!

As it turned out, the implementation of the bounded (fixed-size) multiple-producer multiple-consumer lock-free queue TOmniBoundedQueue (try saying that in one breath!) had been buggy since its inception! As this queue is the basis of all OmniThreadLibrary communication channels, it is really surprising that the bug hid from everybody for 9 (yes, nine!) long years.

The bug in question raised its head when a producer was writing data to the queue at the same time as a consumer was reading the last data item from it. There was a race condition in head and tail pointer manipulation, where one part of the code conveniently ignored the fact that the pointer could be modified by the other side at the same time. Because of that, TOmniBoundedQueue.Enqueue could return False, indicating that the queue was full, while the queue was actually empty!
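To make this kind of race easier to picture, here is a toy model in Python. It is not the TOmniBoundedQueue code and does not reproduce the actual bug. It only shows, under simplifying assumptions, how a fullness check that reads head and tail at two different moments can report "full" for a queue that is empty. The class ToyBoundedQueue, the between_reads hook and the other_threads_run helper are all made up for this illustration. The hook replays one unlucky thread interleaving deterministically, so no real threads are needed.

```python
class ToyBoundedQueue:
    """Toy bounded queue using monotonically increasing head/tail counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0  # total number of items dequeued so far
        self.tail = 0  # total number of items enqueued so far

    def dequeue(self):
        if self.head == self.tail:
            return None  # queue is empty
        item = self.slots[self.head % self.capacity]
        self.head += 1
        return item

    def enqueue(self, item, between_reads=None):
        # Flawed check: head and tail are read at two different moments.
        head_snapshot = self.head
        if between_reads is not None:
            between_reads()  # stands in for other threads running right here
        tail_snapshot = self.tail
        if tail_snapshot - head_snapshot >= self.capacity:
            return False  # reports "queue full"
        self.slots[self.tail % self.capacity] = item
        self.tail += 1
        return True


def other_threads_run(queue):
    # Another producer and a consumer pass two items through the queue.
    # The queue never holds more than one item and ends up empty.
    for value in ("a", "b"):
        assert queue.enqueue(value)
        assert queue.dequeue() == value


queue = ToyBoundedQueue(capacity=2)
accepted = queue.enqueue("x", between_reads=lambda: other_threads_run(queue))
is_empty = queue.head == queue.tail
print(accepted, is_empty)  # prints "False True": rejected as full, yet empty
```

In this model the stale head value (0) is compared against a fresh tail value (2), a combination that never existed at any single point in time. A lock-free implementation has to either read both values consistently or re-validate the first one after reading the second.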

The problem was so basic and so unexpected that no tests were written to cover this case, and so it escaped attention. It was hidden even better because OTL-based code typically uses SendWait instead of Send to handle situations where the queue consumer is temporarily stopped, and SendWait worked just fine. By the time it retried the Send, the consumer had already exited the problematic code section and everything was fine.
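A rough sketch of why a retrying send hides a spurious "queue full" result is shown here, again as a Python toy and not as the OTL implementation. The send_wait function, its parameters and the SpuriouslyFullOnce class are hypothetical names invented for this example. The only thing taken from the description above is the idea that a failed send gets retried until a timeout expires.

```python
import time


def send_wait(try_send, item, timeout_s=1.0, poll_s=0.001):
    """Keep retrying try_send(item) until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if try_send(item):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)  # give the other side time to finish its operation


class SpuriouslyFullOnce:
    """Hypothetical channel that wrongly reports 'full' on the first attempt."""

    def __init__(self):
        self.attempts = 0
        self.delivered = []

    def try_send(self, item):
        self.attempts += 1
        if self.attempts == 1:
            return False  # the spurious failure
        self.delivered.append(item)
        return True


channel = SpuriouslyFullOnce()
ok = send_wait(channel.try_send, "msg")
print(ok, channel.attempts)  # prints "True 2"
```

The caller only ever sees success, so a one-off wrong answer from the underlying queue costs one extra attempt and otherwise goes unnoticed. A plain, non-retrying send would have surfaced the failure immediately.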

While fixing such fundamental code is always a problem - it is so easy to break something - I have high hopes that my fix for this situation will not introduce new problems. I've been running fixed OTL code on our test servers for the last week and no weird problems were reported in our applications.

Still, it doesn't hurt to do more testing. If you are using OTL 3.07.5, please update from GitHub and test your applications. If you run into problems, report them as soon as possible. Thank you!

If everything goes fine, I'll be releasing version 3.07.6 early in May.

2 comments:

  1. http://17slon.com/blogs/gabr/presentations/zlot2017/Zlot-Parallel.pdf
    is a wonderful ppt for your multithreading patterns.

  2. Multithreading is hard
    “New programmers
    are drawn to multithreading
    like moths to flame,
    with similar results.”
    - Danny Thorpe

    I like this.
