Wednesday, July 23, 2008

A working lock-free stack implementation

WiP

This is the third post in the mini-series on a lock-free stack, included in the OmniThreadLibrary. You've learned how the lock-free stack was implemented in the first place and which problem was hidden in that implementation. I've also mentioned that the problem was fixed in the meantime and that the working implementation is already available in the repository and as a snapshot, but I haven't said a word about the new implementation. This will now be rectified.

In the initial implementation lied an excellent example of the ABA problem.  You can read more about it in the Wikipedia, where you'll also learn that ABA problem is commonly solved by adding ‘tag’ bits. That's exactly what was done in the OTL, except that we didn't use just few measly bits, but a 4-byte counter.

Instead of containing only a pointer to the first node, each chain head is now composed of two parts — a pointer to the first node and a 4-byte spin count. This second part will make sure that the ABA problem cannot occur.

type
TOmniHeadAndSpin = packed record
Head: POmniLinkedData;
Spin: cardinal;
end;{ TOmniHeadAndSpin }

As the problem lied in the PopLink method, we'll start today's exploration just there. This is the new version in all its glory. (Of course, assembler is still provided by the ‘GJ’, I had nothing to do with it.)

class function TOmniBaseContainer.PopLink(var chain: TOmniHeadAndSpin): POmniLinkedData;
asm
push edi
push ebx
mov edi, eax //edi = @chain
@Spin:
mov ecx, 1 //Increment spin reference for 1
lock xadd [edi + 4], ecx //Get old spin reference to ecx
mov eax, [edi] //eax := chain.Head
mov edx, [edi +4] //edx := chain.Spin
test eax, eax
jz @Exit //Is Empty?
inc ecx //Now we are ready to cmpxchg8b
cmp edx, ecx //Is reference the some?
jnz @Spin
mov ebx, [eax] //ebx := Result.Next
lock cmpxchg8b [edi] //Now try to xchg
jnz @Spin //Do spin ???
@Exit:
pop ebx
pop edi
end;

You may have noticed that this is now a class function (a class static function, to be more precise). That made more sense as it needs no information from the class — it only works on a chain, which is passed as a parameter. Also, the code is much longer and not very easy to understand, so I'll try the trick that helped in the previous installment. I'll write an approximation of the PopLink in Delphi.

class function TOmniBaseContainer.PopLink(var chain: TOmniHeadAndSpin): POmniLinkedData;
label
Spin;
var
chainSpin: cardinal;
nextNode : POmniLinkedData;
spinRef : cardinal;
begin
// push edi
// push ebx
// mov edi, eax //edi = @chain
Spin:
// mov ecx, 1 //Increment spin reference for 1
// lock xadd [edi + 4], ecx //Get old spin reference to ecx
{.$ATOMIC}
spinRef := chain.Spin;
chain.Spin := chain.Spin + 1;
{.$END ATOMIC}
// mov eax, [edi] //eax := chain.Head
// mov edx, [edi +4] //edx := chain.Spin
{.$ATOMIC}
Result := chain.Head;
chainSpin := chain.Spin;
// test eax, eax
// jz @Exit //Is Empty?
if not assigned(Result) then
Exit;
// inc ecx //Now we are ready to cmpxchg8b
Inc(spinRef);
// cmp edx, ecx //Is reference the some?
// jnz @Spin
if spinRef <> chainSpin then
goto Spin;
// mov ebx, [eax] //ebx := Result.Next
// nextNode = ebx
nextNode := Result.Next;
// lock cmpxchg8b [edi] //Now try to xchg
{.$ATOMIC}
if (Result = chain.Head) and (chainSpin = chain.Spin) then begin
chain.Head := nextNode;
chain.Spin := spinRef;
end
else begin
chainSpin := chain.Spin;
Result := chain.Head;
{.$END ATOMIC}
// jnz @Spin //Do spin ???
goto Spin;
end;
//@Exit:
// pop ebx
// pop edi
end;

In the beginning, a little housekeeping is done (two registers stored away because Delphi compiler expects us not to modify them) and address of the chain record is loaded into the edi register.

With a little help of locked xadd, current version of the header spin value is loaded into the spinRef (ecx register) and is incremented in the header.

//  mov   ecx, 1                            //Increment spin reference for 1
// lock xadd [edi + 4], ecx //Get old spin reference to ecx
{.$ATOMIC}
spinRef := chain.Spin;
chain.Spin := chain.Spin + 1;
{.$END ATOMIC}

Xadd instructions stores first operand ([edi + 4], which equals to chain.Spin) into the second operand (ecx) and sum of both operands (original value of the second operand is used for summation) into the first operand.

Then we fetch current chain.Head and chain.Spin and store them into the Result (eax) and chainSpin (edx), respectively. This is a non-atomic fetch and doesn't guarantee that both values really belong to the same state of the chain record. Another thread may have been invoked between those two movs and it could call PopLink on the same chain and modify the header.

//  mov   eax, [edi]                        //eax := chain.Head
// mov edx, [edi +4] //edx := chain.Spin
Result := chain.Head;
chainSpin := chain.Spin;
// test eax, eax
// jz @Exit //Is Empty?
if not assigned(Result) then
Exit;

Because of the same reason there is no guarantee  that chainSpin is indeed equal to spinRef+1 so this is now checked for.

//  inc   ecx                               //Now we are ready to cmpxchg8b
Inc(spinRef);
// cmp edx, ecx //Is reference the some?
// jnz @Spin
if spinRef <> chainSpin then
goto Spin;

Now we're getting ready for the pointer swap. First we need an address of the second node in the ebx register.

//  mov   ebx, [eax]                        //ebx := Result.Next
nextNode := Result.Next;

Finally, we can use cmpxchg8b instruction to do the swap. Cmpxchg8b is a very special beast. It is similar to the cmpxchg, but it compares and swaps full 8 bytes in one go!

//  lock cmpxchg8b [edi]                    //Now try to xchg
{.$ATOMIC}
if (Result = chain.Head) and (chainSpin = chain.Spin) then begin
chain.Head := nextNode;
chain.Spin := spinRef;
end
else begin
chainSpin := chain.Spin;
Result := chain.Head;
{.$END ATOMIC}
// jnz @Spin //Do spin ???
goto Spin;
end;

Cmpxchg8b compares combination of edx and eax registers with its operand [edi]. In our case, edi is pointing to the chain (8 byte value in which first 4 bytes are a node pointer and second 4 a spin count), and eax and edx were loaded from the same structure just few instructions back. In other words, just like in the initial implementation we're checking if our internal values are still relevant. If that is not true, cmpxchg8b reloads edx and eax from the [edi] and the code then jumps back to the Spin label.

If values match, registers ecx and ebx are copied into the operand [edi]. Here, value in the ecx register is equal to the edx (we just verified that) and ebx contains the pointer to the next node. In other words, chain head is modified to point to the second node in chain.

There's not much difference between this implementation and the original one. The biggest and most important change is introduction of spin counter, which makes sure that the ABA problem, mentioned in the previous post, cannot occur.

[To be completely frank, ABA situation can still occur, at least in theory. To experience a problem with the new code, one thread would have to stop just before cmpxchg8b and wait there for a very long time. So long, in fact, that other threads would be able to wrap the spin counter from $FFFFFFFF to 0 and back to the original value. A little less than 4,3 billion PushLinks would have to be called. If the original thread is then awaken — and the spin counter contains the exact same value as it had before the thread was paused — the ABA problem would occur. That will never happen in practice, I'm totally sure.]

Besides the PopLink and introduction of spin counters, nothing much has changed. The PushLink method was slightly modified, as it was also changed from a normal method to a class static one, but it still works exactly as before.

class procedure TOmniBaseContainer.PushLink(const link: POmniLinkedData;
var chain: TOmniHeadAndSpin);
asm
mov ecx, eax
mov eax, [edx] //edx = chain.Head
@Spin:
mov [ecx], eax //link.Next := chain.Head
lock cmpxchg [edx], ecx //chain.Head := link
jnz @Spin
end; { TOmniBaseStack.PushLink }

That's it. This version of the lock-free stack has successfully passed 4-hour stress test, which makes me believe that it really works. Winking

Next time on the agenda: How the lock-free queue is made. This time I'll really blog about it. Nothing more will distract me. Promise!

8 comments:

  1. Anonymous22:53

    This is really interesting! Keep up the great work!

    I'm eager to try it out as soon as I make meet my current deadline.

    Also, would love to hear your impression of AsyncCalls 2.21:

    http://andy.jgknet.de/async/

    Any estimate on when you'll tag OmniThreadLibrary as 1.0? It'll help me get permission to use it at work.

    ReplyDelete
  2. Andy, hi!

    I know of AsyncCalls (and I'm recommending it around), but I never took a deeper look. Maybe one of those days ...

    As of version 1.0, hmmm. I want to finish the thread pool (not much work here) and I also want to take another try at the IOmniCommunicationEndpoint. Sending objects and interfaces is much too complicated at the moment.

    The problem here is that I'll be going on vacation in a week. I don't think version 1.0 will be released before that. Maybe somewhere near the end of the August.

    Better to get it right then to release it quick, I think.

    ReplyDelete
  3. Yuhong Bao16:47

    Note that the XADD and CMPXCHG instructions were introduced in the 486 processor and CMPXCHG8B in the Pentium. Older processors that do not have these instructions will crash with an illegal opcode exception.

    ReplyDelete
  4. This is late, but you should probably describe more of the test environment. If you can afford it developing on a multi-core, multi-CPU machine is best since you can run your tests in the most stressful environment likely to be found right on your development machine. But since most of us still just have multi-core single CPU machines there are still places where this sort of threaded code can fall apart that will never happen on our development machines.

    In my limited experience I haven't seen a bug that occurs on multi-core/single CPU but not on multi-core, multi-CPU machines, and I wonder if one can exist. I prefer to test on both just to be sure.

    ReplyDelete
  5. I'm doing all my tests on a machine with two quad-core Intel Xeon E5430 processors. My tests run a different number of threads (from 1 to 16 - two times the number of cores) on a different number of cores (from 1 to 8).

    ReplyDelete
  6. Hi,

    Does 'lock' before 'cmpxchg8b' make it atomic ?

    Also, would it be possible to know what extra work (eg. energy consumption or time in cycles) do specialized atomic instructions like cmpxchg8b (or generic compare-and-swap, fetch-and-store etc.) require in terms of basic atomic instructions such as LOAD and STORE ?

    -shahid.

    ReplyDelete
  7. @Shahid:

    1. Yes, it does.

    2. I have no idea.

    ReplyDelete
  8. Hello fellow Delphi geek. I am intrigued by your article and I think I should check out your OmniThread library. I have built some fancy multi-threaded libraries of my own that you might be interested in, including a fancy ThreadPool, ManagedThread, Command Processor, and memory heap manager. I'm very interested in parallel programming and am looking to collaborate with other like-minded people.

    I just wanted to point out one thing. Use of the 'lock' asm instruction, doesn't really make it atomic per-se. Whereas, yes, it protected the next instruction, however it does it using a locking mechanism, which IMO is slightly different than the true definition of "atomic"... splitting hairs with definitions there. The lock instruction locks the memory bus, not just for the current thread, not just for the current CPU, but for all CPUs and CPU cores... it even blocks the system kernel, so overuse of it can really murder performance... however since it is so lightweight and supported at the processor level, in many situations it is more appropriate than using critical sections etc. I think the bottom line is whether you're looking for performance or scalability and how often whatever you're protecting is predicted to get hit. I often prefer to take a single-thread performance hit if it leads to increased scalability across 8 or more cores... but the lock instruction doesn't scale, by design.

    I am slightly angry with the Delphi changes to strings that caused the lock instruction to be peppered all over my code. I'm not sure what they're doing with the move to Automatic Reference Counting changes either as there's no windows compiler for it yet. There are papers published on how to do reference counting without using any locks of any sort using an approach similar to an approach I implemented for a company in the 90s... but I'm not sure if that would really translate when you want a general-purpose critical-section-like object... and I don't know if Embarcadero used such an approach with their iOS/Android compilers (they made so many changes with huge consequences, it's going to take a long time before I get any real cross-platform code running). I have my own blog that might have some interesting topics to you, but I won't shamelessly promote it on your site.

    ReplyDelete