Monday, July 21, 2008

Lock-free stack problem

WiPIn my last treatise on OmniThreadLibrary internals you could read about the lock-free stack structure, included in the OmniThreadLibrary. The code was relatively simple to understand (after studying it for hours ;) but alas, it didn't work. The problem only appeared when multiple readers or multiple writers were operating on the same structure and that's why the initial tests which used only one reader and one writer failed to detect it.

I had no idea what could go wrong (at the moment I thought I understand the lock-free stuff in the OtlContainters) but luckily the author of the stack code was smarter and located the problem. He explained it to me so I can explain it to you ...

Remember the PopLink code from the previous article? Probably not, so I'll reprint it here in a slightly modified form. [For the original code and comments, see the Implementing lock-free stack article.]

function TOmniBaseContainer.PopLink(var chainHead: POmniLinkedData): POmniLinkedData;
label
Spin;
var
second: POmniLinkedData;
begin
// mov eax, [edx]
Result := chainHead^;
Spin:
// test eax, eax
// jz @Exit
if not assigned(Result) then
Exit;
// mov ecx, [eax]
second := Result.Next;
// lock cmpxchg [edx], ecx
{.$ATOMIC}
if Result = chainHead^ then
chainHead := second
{.$END ATOMIC}
else
// jnz @Spin
goto Spin
end; { TOmniBaseContainer.PopLink }

I must emphasize again that this is not a working code. It might even compile (yes, Pascal/Delphi does have labels and goto!) but it surely won't work as the part between {.$ATOMIC} and {.$END ATOMIC} won't be executed atomically. Still, this is a nice approximation that allows us to understand the problem that also occurs with the assembler version.

The code is (now that it's coded in Delphi) pretty simple: it accesses first element in the chain, follows its Next pointer to the second element, and then atomically checks whether the chain head still points to the first element and if that is true, changes the chain head to point to the second element.

What can go wrong there? After all, we're checking the chain head and making the switch atomically! When we have only one reader and one writer, nothing much. Problems occur when there are multiple readers or multiple writers working at the same time.

The single point of failure is the lock cmpxchg instruction (the atomic part in Delphi code). There is a problem with the value that is stored in the chain head - when lock cmpxchg is executed, we cannot be sure that the ecx register (or the second variable) still points to a node that is indeed second in the chain.

Let's examine one of many possible failure scenarios:

  • A thread executes the code above up to the atomic part. Immediately after the second := first.Next is executed, thread is stopped and another thread gets the CPU slice.

At that point we have following data structure: chainHead -> first -> second -> ...

  • Another thread executes the same PopLink code and pops node first from the chain.

The data structure now looks like this: chainHead -> second -> ...

  • Yet another thread pushes a completely different node (let's call it newnode) into the chain.

chainHead -> newnode -> second -> ...

  • The first node is pushed back into the chain.

chainHead -> first -> newnode -> second -> ...

  • Now the first thread continues execution with the lock cmpxchg. The condition still holds - chainHead indeed points to the first node. Therefore, chainHead is modified to point to the second node. But this is not the node chainHead should be pointing to! It should point to the newnode node!

And so everything falls apart. Sad

Can the code be fixed? Yes, and the fix is already in the repository and is included in the latest snapshot. Next time I'll explain the new approach (and maybe even the operation of the lock-free queue). Stay tuned!

11 comments:

  1. I not only appreciate the library itself but I'm really enjoying your writing about it here -- i hope you'll keep doing it.

    ReplyDelete
  2. I wish people would stop calling code "lock free" - it just isn't so.

    This code is just uses much lighter locking that a mutex or critical section allows when done correctly, but using a flag like this makes the flag the lock.

    Sorry, this bit of angst and annoyance was brought to you by the letters x and y and the number 3.

    ReplyDelete
  3. Depends on how you look at it.

    It is lock-free. It doesn't acquire a lock and it doesn't require centralized service that manages locks (yep, locking can be decentralized, I know).

    It aint wait-free. It would be nice to have a lock-free wait-free structure but I don't think that's achievable on the Intel32 architecture.

    ReplyDelete
  4. But yes, you could call the traditional approach 'pesimistic locking' and lock-free approach 'optimistic locking', if you wish.

    ReplyDelete
  5. @Xepol:
    This is typical 100% "lock free" method.
    There is no locking of the memory buffer. So all involved threads can read and write at any time to the some location in memory buffer. Of course the memory conflicts exist. If some thread interrupts the writing of another thread the write operation will be repeated!

    If you still doubt then check the OmniThreadLibrary performance.

    ReplyDelete
  6. The approach you describe above is basically similar to how "Spin Locks" work. ("jnz @Spin" is where it, well, spins)

    Spin Locks are much faster (actually /may be/ much faster) if used in situations where locks are held for *very short* times (as in your example). The advantage come from saved context switches - which outweighs the actual code execution time that needs synchronized (in examples like this one).

    Spin locks may come at a cost though: If for some reason the code to be synchronized runs too long, the OS may decide to *force* a context switch, as the time slice is "used up". At that point spin locks start burning cpu cycles - so you have to very carefully here.

    @GJ: Yes, the OmniThreadLibrary has a very good performance - but it's not lock free, but uses fast locks ;-)

    ReplyDelete
  7. @olaf: We are using spinlocks in OTL, you know :)

    There is a big difference between using spinlocked stack and lock-free stack.

    In spinlock version, you do:
    spinlock.Acquire;
    stack.Push; //without spinning
    spinlock.Release;

    Spinning is used to acquire the lock.

    In lock-free version (sorry, but that's how the term is used - as I've said before, this approach is lock-free but not stack-free), no acquire/release cycle is used. The code goes:

    stack.Push; //with spinning if required

    However, you are correct that the approach is similar and that the speed is on the same order of magnitude.

    ReplyDelete
  8. Why do I always think of one more statement just after I click the Publish button???

    You could say that the lock-free approach is equivalent to a spinlock that locks only few instructions (pointer swapping) and which automatically releases the lock.

    ReplyDelete
  9. mov eax, [edx]

    is actually

    Result := chainHead;

    because it's a var parameter

    ReplyDelete
  10. Indeed, indeed. After all, chainHead^ is a record, not a pointer.

    Thanks for the catch, I've updated the main article.

    ReplyDelete
  11. BTW, Wikipedia agrees with 'my' definition of lock-free.

    See also description of the ABA problem where a while loop is used in a 'lock-free' algorithm.

    ReplyDelete