Tuesday, May 24, 2011

TMonitor bug?

When Chris Rolliston implemented a simple threaded queue class using TMonitor, I was quite ecstatic. Finally I will be able to compare my lock-free queue with a threaded implementation! (That’s something I wanted to do for a long time but I never found a willpower to write a locking queue.)
My elation continued while I was coding my test app and while I tested my own queue. And then I plugged in Chris’ queue and … everything went downhill :( Immediately, my test app started crashing. At first it looked like the bug was in Chris’ code, but when I enabled debug DCUs, debugger pointed to a very unlikely location – TMonitor. Unlikely because of two reason – because I know that Delphi people take quality to the heart and test the code and because I know Allen is an experienced guy who writes excellent code.
Please note that I’m not saying that there’s a bug in TMonitor – at least at the moment I’m not. It just looks that way. I would have to do some heavy debugging to prove this and at the moment my head is too full of pollen and antihistamine to do such work. That’s why I’m asking experienced Delphi programmers to jump in and help me find the true reason of the problem - TMonitor, TSimpleThreadedQueue or my test program.
To run the test program you would need fairly fresh OmniThreadLibrary (at least release 899) and Delphi XE. Download the source, make sure that path includes OmniThreadLibrary, recompile and run. Most probably it would help to have four cores to exhibit the problem although I was able to reproduce it on only one.
Anyway, run the application and click the “Bug?” button. If you see something like
image
popping up, then you have repeated the problem.
In my testing, clicking Break opens debugger in System.pas. If you get SimpleThreadedQueue.pas or speedtest1.pas instead, then most probably you don’t have debug DCUs enabled.
image
The test code in StartTest2 creates one writer and two reader threads. In my testing the problem occurs when one writer and one reader already completed their work and only the last reader is running. When this last reader calls Dequeue, TMonitor.Wait gets called. This one calls RemoveWaiter and crash occurs.
This is all I know about the problem at this moment. If you feel capable and can spare yourself some time, please download the app and try to help me finding the real cause for this crash. Thanks!

Updated 2011-10-11

This bug was fixed in Delphi XE2.

21 comments:

  1. Gabr, see this link
    stackoverflow.com/questions/4856306/tthreadedqueue-not-capable-of-multiple-consumers

    May be related, the error in TThreadedQueue seems to be in the TMonitor.
    I will look into your code later today.

    ReplyDelete
  2. Thierry Pellen15:21

    When the queue contains only 2 elements, Walker gets initialised with FWaitQueue.Next.Next (ie FWaitQueue), isn't it ?
    As a consequence, the while loop is never executed, and the queue either gets screwed up (FWaitQueue := Last is executed, but FWaitQueue.Next is never fixed up), or it stays untouched...

    I have the same problem under Delphi 2010...

    ReplyDelete
  3. I have written a locking queue some time ago. I tested it on quad core processor and found out that in almost all cases it is just slightly slower then you own lock free implementation. There is very small difference. Much of the difference comes from TAnyValue being slower most of the time then TOmniValue, because I use interfaces. I wrote about this improvement briefly here: http://www.cromis.net/blog/2011/03/new-version-of-simplestorage-and-other-updates/

    You can find the code on my download page. I also have the test somewhere still and can mail it to you if you like.

    ReplyDelete
  4. Oh the code is in the threading unit from download.

    ReplyDelete
  5. Thierry Pellen16:47

    Maybe the "while Walker <> FWaitQueue do" loop could be replaced by something like this:

    repeat
    if Walker = @WaitingThread then
    begin
    Last.Next := Walker.Next;
    Break;
    end;
    Last := Walker;
    Walker := Walker.Next;
    until Last = FWaitQueue.Next;

    (tested only with a pen and a sheet of paper !)

    ReplyDelete
  6. Hi,
    I tested your code with the TThreadedQueue and the same error appeared. This is the same error as I reported in the link given above. I'm not the man to pinpoint what's going on in TMonitor. Looks like a job for Allen Bauer.

    ReplyDelete
  7. Funnily enough, I actually asked a similar question back in February: https://forums.embarcadero.com/thread.jspa?messageID=320898&tstart=0#320898. Leaving it for a bit, a few months later I looked at TMonitor again and came to the conclusion it did actually work if used more simply than I was originally attempting to use it. Obviously not though!

    ReplyDelete
  8. @Leif: Yes, it looks like this usage of TMonitor doesn't support multiple consumers. Too bad - especially as it makes TThreadedQueue useless. :(

    @Iztok: I'll make sure to test with your queue too.

    @Thierry: Your guess is as good as mine (at this moment).

    ReplyDelete
  9. This report might be interesting:

    Report No: 78415 Status: Open
    AV in TInternalConditionVariable.RemoveWaiter
    http://qc.embarcadero.com/wc/qcmain.aspx?d=78415
    QCWIN:Defect_No=78415

    ReplyDelete
  10. Sorry, now I see that the discussion Chris pointed to also mentions this report.

    ReplyDelete
  11. @Moritz: Yes, that seems to be the same problem. Shame on Embarcadero for not fixing this with a hotfix!

    ReplyDelete
  12. Thierry Pellen11:15

    @Chris, Moritz, gabr
    So the same bug appears in SynObjs.pas AND in System.pas... **Sigh** Code duplication...

    In the end, 2 bugs in 2 different units could be solved with just one shot ! That's pretty good news !

    ReplyDelete
  13. That's a 2009 bug, ouch, I guess it means that TMonitor hasn't been used much (if any) in production code out there since it was introduced... Give us our 4 bytes per instance back! ;)

    ReplyDelete
  14. When I initially posted the bug back in 2009, I've also attached a patch that fixes at least the error in SyncObjs.pas. That patch, however, disappeared from the report (presumably because it got overridden by an attachment that got added later). I'll have a look if I can re-attach the patch.

    ReplyDelete
  15. Is XE or 2010 required to compile/run the test app? Or could it work with Delphi 2009?

    ReplyDelete
  16. @Michel: I think the test app should work in 2009 (but I may be wrong).

    ReplyDelete
  17. With Delphi 2009 (after commenting out TStopWatch related code and changing Line 79 to "if not TMonitor.Wait(FQueue, LongWord(Timeout)) then Exit(wrTimeout);", no exception is thrown. Instead the application freezes, the log shows:

    Debug Output: Exiting writer Process speedtest.exe (3928)
    Debug Output: Awaited writers Process speedtest.exe (3928)
    Debug Output: Exiting reader Process speedtest.exe (3928)

    Looks like it is broken in a different way?

    ReplyDelete
  18. Hi!

    Here's the patch that can be applied to SyncObjs.pas:

    diff --git a/SyncObjs.pas b/SyncObjs.pas
    index 4239a0b..f5382d9 100644
    --- a/SyncObjs.pas
    +++ b/SyncObjs.pas
    @@ -953,7 +953,7 @@ begin
    begin
    WaitQueue := LockQueue;
    try
    - Last := WaitQueue.Next;
    + Last := WaitQueue;
    Walker := Last.Next;
    while Walker <> WaitQueue do
    begin
    @@ -969,7 +969,10 @@ begin
    if Walker.Next = Walker then
    WaitQueue := nil
    else
    - WaitQueue := Last;
    + begin
    + WaitQueue := Walker.Next;
    + Last.Next := WaitQueue;
    + end;
    finally
    UnlockQueue(WaitQueue);
    end;

    System.TMonitor.RemoveWaiter can be patched accordingly.

    ReplyDelete
  19. Hi!

    Here's the patch that can be applied to SyncObjs.pas:

    diff --git a/SyncObjs.pas b/SyncObjs.pas
    index 4239a0b..f5382d9 100644
    --- a/SyncObjs.pas
    +++ b/SyncObjs.pas
    @@ -953,7 +953,7 @@ begin
    begin
    WaitQueue := LockQueue;
    try
    - Last := WaitQueue.Next;
    + Last := WaitQueue;
    Walker := Last.Next;
    while Walker <> WaitQueue do
    begin
    @@ -969,7 +969,10 @@ begin
    if Walker.Next = Walker then
    WaitQueue := nil
    else
    - WaitQueue := Last;
    + begin
    + WaitQueue := Walker.Next;
    + Last.Next := WaitQueue;
    + end;
    finally
    UnlockQueue(WaitQueue);
    end;


    System.TMonitor.RemoveWaiter can be fixed accordingly.

    ReplyDelete
  20. Anonymous10:37

    Hi

    We seem to have this problem affecting us with Delphi XE.

    It is affecting the datasnap callbacks in our application.

    we can simulate with a small app, is there any way we can do these changes in Delphi XE system.pas and compile the DCU

    ReplyDelete
  21. Hi! I have created a patch for TMonitor in Delphi XE, based on Christian Gudrian's comment above. It can be downloaded from here: http://href.hu/x/hdu3

    ReplyDelete