Tuesday, July 15, 2008

Implementing lock-free stack

[I'm cheating. The title should be “OmniThreadLibrary internals - OtlContainers”, but I wanted to attract more readers.Winking]

WiPToday I'll be describing the OtlContainers unit. Part of that unit, actually, as there is lot to tell and lot to show. I'll focus on the lock-free stack today and describe other stuff from the same unit in a day or two.

To give the credit where it is due — most of the code you'll see today was written by a Slovenian developer ‘GJ’. His is the implementation of the lock-free stack and all the assembler parts. I only wrapped them in higher-level structures. GJ, thanks for your contribution!

If you want to follow this article with the OtlContainers loaded in IDE, please make sure that you have the latest source [quick links: snapshot, repository, OtlContainers.pas].

The topic of today's presentation is TOmniBaseContainer. This is a base class that implements lock-free size-limited stack. Multiple simultaneous writers are supported, as are multiple readers. Although this is a fully functional class, you would usually want to use more feature-rich descendant TOmniStack.

A note regarding the limited size. It is much simpler to implement size-limited lock-free structure than to deal with dynamic allocation. In most cases, dynamic allocation would lead to possible locks inside the memory manager and some of the advantage gained by the lock-free algorithm would go away. Even more, you usually don't want the lock-free structure to grow without any control. Sure, OTL implementation won't be the right solution to all your needs but still we do believe that it is good enough for many usage scenarios.

Back to the story ... TOmniBaseContainer stores data into TOmniLinkedData nodes. First four bytes in the node point to the next node and other bytes (as many as the application requested) contain application data. Only first of those bytes is actually included in the TOmniLinkedData structure. It is only used as a placeholder for address calculation in Push and Pop operations.

type
POmniLinkedData = ^TOmniLinkedData;
TOmniLinkedData = packed record
Next: POmniLinkedData;
Data: byte; //user data, variable size
end; { TLinkedOmniData }

In the following diagrams I'll use this symbol to represent one TOmniLinkedData node.

data node

Nodes are connected into chains. Address of the first node is stored in a chain header. The Next field of the last node in chain contains value nil

data chain

As the TOmniBaseContainer implementation is size-limited, it can preallocate all nodes in one go. The Initialize method first allocates one big buffer to store all nodes and connects them into one chain, pointed to by the recycle chain header. This chain contains unused nodes. The other chain, public chain, contains nodes that are in use and is initialized to nil.

You'll see that the initialization code rounds up the size of node data to the nearest multiplier of 4. This is required because operations that modify node pointers (Next field) require them to be aligned on addresses that are divisible by 4.

  // calculate element size, round up to next 4-aligned value
bufferElementSize := ((SizeOf(POmniLinkedData) + elementSize) + 3) AND NOT 3;
GetMem(obcBuffer, bufferElementSize * cardinal(numElements));
Assert(cardinal(obcBuffer) AND 3 = 0);
//Format buffer to recycleChain, init orbRecycleChain and orbPublicChain.
//At the beginning, all elements are linked into the recycle chain.
obcRecycleChain := obcBuffer;
nextElement := nil; // to remove compiler warning in nextElement.Next := nil assignment below
currElement := obcRecycleChain;
for iElement := 0 to obcNumElements - 2 do begin
nextElement := POmniLinkedData(cardinal(currElement) + bufferElementSize);
currElement.Next := nextElement;
currElement := nextElement;
end;
nextElement.Next := nil; // terminate the chain
obcPublicChain := nil;

Structure, created by the initialization code is depicted below. Recycle chain points to the first node, which points to the next node and so on. The last node in the chain has nil stored in the Next pointer. Public chain is empty.

empty stack

Let's see what happens when code pushes new data onto the stack.

function TOmniBaseContainer.Push(const value): boolean;
var
linkedData: POmniLinkedData;
begin
linkedData := PopLink(obcRecycleChain);
Result := assigned(linkedData);
if not Result then
Exit;
Move(value, linkedData.Data, ElementSize);
PushLink(linkedData, obcPublicChain);
end; { TOmniBaseContainer.Push }

PopLink removes first node from the recycle chain. Chain header is updated to point to the next node in the chain. Pointer to the removed node is stored in the linkedData variable. If recycle chain is empty, linkedData contains nil and Push returns False.

The code next moves application data from the value parameter to the preallocated buffer. linkedData.Data conveniently addresses first data byte in the buffer.

At the end, the code inserts linkedData node at the beginning of the public chain and updates the public chain header to point to the linkedData node.

one pushed

The real workhorses here are PopLink and PushLink. Former pops first node from a chain and latter inserts the node into the head of the chain. The trick here is that they are written with multithreading in mind — they both expect that data structures may change at any time because another thread may be accessing the structure simultaneously.

Let's take a look at the PopLink first.

function TOmniBaseContainer.PopLink(var chainHead: POmniLinkedData): POmniLinkedData;
//nil << Link.Next << Link.Next << ... << Link.Next
//FILO buffer logic ^------ < chainHead
asm
mov eax, [edx] //Result := chainHead
@Spin:
test eax, eax
jz @Exit
mov ecx, [eax] //ecx := Result.Next
lock cmpxchg [edx], ecx //chainHead := Result.Next
jnz @Spin //Do spin ???
@Exit:
end; { TOmniBaseContainer.PopLink }

At beginning, edx register contains the address of the chainHead parameter. mov eax, [edx] moves chainHead, i.e. the node that chainHead is pointing to, into the eax. Chain may be empty, in which case chainHead is nil, or in assembler terms, eax is 0. test eax, eax checks for this condition.

If there's at least one node in the chain, we can continue. mov ecx, [eax] moves the address of the next node into the ecx register. Remember that eax points to the node and [eax] is the same as accessing the node's Next field.

Now we have an address of the first node in the eax register, address of the second node in the ecx register (there may be no second node, in which case the ecx is 0) and chainHead in the edx register. And now we can do the heavy magic.

lock cmpxchg [edx], ecx does few things, all at once. Well, not exactly in once, but processor (and level one cache and all weird hardware wrapping the CPUs) makes sure that this operation won't be interrupted by another core or CPU attempting to do the same thing at the same time. Firstly, cmpxchg compares eax with [edx]. Remember, we loaded eax from [edx] so those two values should be the same, yes? Well, no. Between mov eax, [edx] and lock cmpxchg, thread may be interrupted and stopped and another thread may have modified the chain header by executing PopLink. That's why we have to recheck.

If eax and [edx] are still the same, ecx is loaded into [edx]. In other words, address of the second node (which may be nil) is stored into the chainHead. Because of the lock prefix, all that (testing and assignment) happens atomically. Uninterruptible. In other words, another thread can not step on our (digital) toes.

If eax and [edx] are not the same, cmpxchg loads eax from [edx]. In other words, eax is refreshed from the chainHead. If that happens, jnz @Spin instruction will jump to the @Spin label and repeat the whole procedure.

At the end we have address of the second node stored in chainHead and address of the first node stored in eax, which is fine as pointer Results are returned in eax.

You can read more on cmpxchg here.

PushLink is much simpler.

procedure TOmniBaseContainer.PushLink(const link: POmniLinkedData; var chainHead:
POmniLinkedData);
asm
mov eax, [ecx] //ecx = chainHead
@Spin:
mov [edx], eax //link := chainHead.Next
lock cmpxchg [ecx], edx //chainHead := link
jnz @Spin
end; { TOmniBaseContainer.PushLink }

At the beginning, edx contains the value of the link parameter and ecx contains the address of the chainHead parameter.

The code first loads the address of the first node in the chain into the eax register. Nil pointers are fine in this case as we will not be following (dereferencing) them.

Then the code loads this same value into the [edx]. Remember, edx contains the value of the link, therefore [edx] represents link.Next. The mov [edx], eax line sets link.Next to point to the first node in the chain. If the chain is empty, link.Next will be set to nil, which is exactly the correct thing to do.

lock cmpxchg [ecx], edx then compares eax and [ecx] to ensure that underlying data haven't changed since PushLink started its execution. If values are not equal, eax is reloaded from [ecx] and code execution continues from the @Spin label. If values are equal, edx is loaded into [ecx]. At that moment, edx still contains the value stored in the link parameter and ecx contains the address of the chainHead. In other words, chainHead is set to link. As link.Next was set to old chainHead in the previous line, we have successfully linked link at the beginning of the chain.

That's all, the hard part is over. If you understand PopLink and PushLink, everything else is simple.

When we push the second value into the stack, it is inserted at the beginning of the public chain and recycle chain points to the next free node.

two pushed

The process continues until all nodes are linked into the public chain and recycle chain is nil.

four pushed

To pop a value from the stack, TOmniBaseContainer.Pop is used. It first gets a topmost allocated node (atomically, of course). Then it moves node data into the method parameter and pushes node into the recycle chain (again, atomically).

function TOmniBaseContainer.Pop(var value): boolean;
var
linkedData: POmniLinkedData;
begin
linkedData := PopLink(obcPublicChain);
Result := assigned(linkedData);
if not Result then
Exit;
Move(linkedData.Data, value, ElementSize);
PushLink(linkedData, obcRecycleChain);
end; { TOmniBaseContainer.Pop }

One of the important things to note is that nodes are not moved around. They are allocated at the beginning and for the whole lifetime of the TOmniBaseContainer they stay immovabe. Only Next fields are modified (and of course the Data is copied) and that's why this lock-free stack implementation is extremely fast. First results indicate that it can move about 800.000 integers per second between two threads on a 1,67 GHz Core2 Duo (T2300) machine.

base container test

Next post will discuss the lock-free ring buffer that is built on top of the lock-free stack. Stay tuned!

9 comments:

  1. Anonymous14:21

    perhaps problem with no desallocate variant (if is a string for exemple) in TOmniMessage = record
    MsgID : word;
    MsgData: TOmniValue;
    end; { TOmniMessage }
    The destructor in TOmniBaseContainer do just FreeMem(obcBuffer) not "desallocate" variant in record.

    ReplyDelete
  2. Anonymous19:50

    I'd really like to thank you for writing on this topic, I'm learning a bunch and the timing couldn't be better since I'm now focused on learning about lock free structures.

    Thanks!!!

    ReplyDelete
  3. Sadly, the PopLink as implemented in this post, doesn't work :(

    It fails very rarely and only under very stressful conditions, but it fails nevertheless :(

    Our kung-fu master GJ has created better version, which passed all current tests. I'll post it here as soon as I manage to understand it.

    ReplyDelete
  4. Anonymous13:27

    Maybe a "crash" in

    nextElement := nil; // to remove compiler warning in nextElement.Next := nil assignment below currElement := obcRecycleChain; for iElement := 0 to obcNumElements - 2 do begin nextElement := POmniLinkedData(cardinal(currElement) + bufferElementSize); currElement.Next := nextElement; currElement := nextElement; end; nextElement.Next := nil; // terminate the chain obcPublicChain := nil;

    if obcNumElements = 1
    nextElement.Next := nil; -> nil.Next := nil;

    ReplyDelete
  5. @mp:

    a) No, Variants are reference counted, just as interfaces. Variant message goes automatically 'out of scope' and string is destroyed when that happens. I've extended 8_RegisterComm test with a button that sends a string and checked with FastMM that everything is indeed released. Result: no leaks.

    b) You're right, Initialize doesn't work when numElements = 1. Hardly a realistic case, but still worth fixing. Thanks for the catch!

    Both updates are already in the repository.

    ReplyDelete
  6. Of course, since you're using the "lock" assembler instruction, you can't really claim to be "lock-free". If there are multiple processors, you're taking all the overhead of synchronizing them and their caches every time you go through this code.

    A true low-lock stack would avoid the "lock" instruction, and its overhead, on read. You almost certainly still need it on write, though. (I already posted a comment with a link to an MS article on low-lock programming -- won't spam you by posting the link again here.)

    ReplyDelete
  7. Depends on how you depend lock-free, of course. This code is lock-free in the sense that it doesn't acquire a lock (and that is the prevailing definition as far as I can tell).

    If you manage to write/find a less-locked stack on the Intel processor, I'll happily incorporate it into OTL. As far as I know, the current solution is the best you can do.

    ReplyDelete
  8. I enjoyed your writing, and have purchased 3 of your books.

    I first read this post 10~12 years ago. Now I read it again, and I wonder if you could directly use the Win32 API InterlockedCompareExchange, rather than those ASM code?
    https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange

    ReplyDelete
    Replies
    1. Indeed, assembler could be removed from the implementation without a problem.

      Delete