[I'm cheating. The title should be “OmniThreadLibrary internals - OtlContainers”, but I wanted to attract more readers.]
Today I'll be describing the OtlContainers unit. Part of that unit, actually, as there is lot to tell and lot to show. I'll focus on the lock-free stack today and describe other stuff from the same unit in a day or two.
To give the credit where it is due — most of the code you'll see today was written by a Slovenian developer ‘GJ’. His is the implementation of the lock-free stack and all the assembler parts. I only wrapped them in higher-level structures. GJ, thanks for your contribution!
If you want to follow this article with the OtlContainers loaded in IDE, please make sure that you have the latest source [quick links: snapshot, repository, OtlContainers.pas].
The topic of today's presentation is TOmniBaseContainer. This is a base class that implements lock-free size-limited stack. Multiple simultaneous writers are supported, as are multiple readers. Although this is a fully functional class, you would usually want to use more feature-rich descendant TOmniStack.
A note regarding the limited size. It is much simpler to implement size-limited lock-free structure than to deal with dynamic allocation. In most cases, dynamic allocation would lead to possible locks inside the memory manager and some of the advantage gained by the lock-free algorithm would go away. Even more, you usually don't want the lock-free structure to grow without any control. Sure, OTL implementation won't be the right solution to all your needs but still we do believe that it is good enough for many usage scenarios.
Back to the story ... TOmniBaseContainer stores data into TOmniLinkedData nodes. First four bytes in the node point to the next node and other bytes (as many as the application requested) contain application data. Only first of those bytes is actually included in the TOmniLinkedData structure. It is only used as a placeholder for address calculation in Push and Pop operations.
type
POmniLinkedData = ^TOmniLinkedData;
TOmniLinkedData = packed record
Next: POmniLinkedData;
Data: byte;
end;
In the following diagrams I'll use this symbol to represent one TOmniLinkedData node.
Nodes are connected into chains. Address of the first node is stored in a chain header. The Next field of the last node in chain contains value nil.
As the TOmniBaseContainer implementation is size-limited, it can preallocate all nodes in one go. The Initialize method first allocates one big buffer to store all nodes and connects them into one chain, pointed to by the recycle chain header. This chain contains unused nodes. The other chain, public chain, contains nodes that are in use and is initialized to nil.
You'll see that the initialization code rounds up the size of node data to the nearest multiplier of 4. This is required because operations that modify node pointers (Next field) require them to be aligned on addresses that are divisible by 4.
bufferElementSize := ((SizeOf(POmniLinkedData) + elementSize) + 3) AND NOT 3;
GetMem(obcBuffer, bufferElementSize * cardinal(numElements));
Assert(cardinal(obcBuffer) AND 3 = 0);
obcRecycleChain := obcBuffer;
nextElement := nil;
currElement := obcRecycleChain;
for iElement := 0 to obcNumElements - 2 do begin
nextElement := POmniLinkedData(cardinal(currElement) + bufferElementSize);
currElement.Next := nextElement;
currElement := nextElement;
end;
nextElement.Next := nil;
obcPublicChain := nil;
Structure, created by the initialization code is depicted below. Recycle chain points to the first node, which points to the next node and so on. The last node in the chain has nil stored in the Next pointer. Public chain is empty.
Let's see what happens when code pushes new data onto the stack.
function TOmniBaseContainer.Push(const value): boolean;
var
linkedData: POmniLinkedData;
begin
linkedData := PopLink(obcRecycleChain);
Result := assigned(linkedData);
if not Result then
Exit;
Move(value, linkedData.Data, ElementSize);
PushLink(linkedData, obcPublicChain);
end;
PopLink removes first node from the recycle chain. Chain header is updated to point to the next node in the chain. Pointer to the removed node is stored in the linkedData variable. If recycle chain is empty, linkedData contains nil and Push returns False.
The code next moves application data from the value parameter to the preallocated buffer. linkedData.Data conveniently addresses first data byte in the buffer.
At the end, the code inserts linkedData node at the beginning of the public chain and updates the public chain header to point to the linkedData node.
The real workhorses here are PopLink and PushLink. Former pops first node from a chain and latter inserts the node into the head of the chain. The trick here is that they are written with multithreading in mind — they both expect that data structures may change at any time because another thread may be accessing the structure simultaneously.
Let's take a look at the PopLink first.
function TOmniBaseContainer.PopLink(var chainHead: POmniLinkedData): POmniLinkedData;
asm
mov eax, [edx]
@Spin:
test eax, eax
jz @Exit
mov ecx, [eax]
lock cmpxchg [edx], ecx
jnz @Spin
@Exit:
end;
At beginning, edx register contains the address of the chainHead parameter. mov eax, [edx] moves chainHead, i.e. the node that chainHead is pointing to, into the eax. Chain may be empty, in which case chainHead is nil, or in assembler terms, eax is 0. test eax, eax checks for this condition.
If there's at least one node in the chain, we can continue. mov ecx, [eax] moves the address of the next node into the ecx register. Remember that eax points to the node and [eax] is the same as accessing the node's Next field.
Now we have an address of the first node in the eax register, address of the second node in the ecx register (there may be no second node, in which case the ecx is 0) and chainHead in the edx register. And now we can do the heavy magic.
lock cmpxchg [edx], ecx does few things, all at once. Well, not exactly in once, but processor (and level one cache and all weird hardware wrapping the CPUs) makes sure that this operation won't be interrupted by another core or CPU attempting to do the same thing at the same time. Firstly, cmpxchg compares eax with [edx]. Remember, we loaded eax from [edx] so those two values should be the same, yes? Well, no. Between mov eax, [edx] and lock cmpxchg, thread may be interrupted and stopped and another thread may have modified the chain header by executing PopLink. That's why we have to recheck.
If eax and [edx] are still the same, ecx is loaded into [edx]. In other words, address of the second node (which may be nil) is stored into the chainHead. Because of the lock prefix, all that (testing and assignment) happens atomically. Uninterruptible. In other words, another thread can not step on our (digital) toes.
If eax and [edx] are not the same, cmpxchg loads eax from [edx]. In other words, eax is refreshed from the chainHead. If that happens, jnz @Spin instruction will jump to the @Spin label and repeat the whole procedure.
At the end we have address of the second node stored in chainHead and address of the first node stored in eax, which is fine as pointer Results are returned in eax.
You can read more on cmpxchg here.
PushLink is much simpler.
procedure TOmniBaseContainer.PushLink(const link: POmniLinkedData; var chainHead:
POmniLinkedData);
asm
mov eax, [ecx]
@Spin:
mov [edx], eax
lock cmpxchg [ecx], edx
jnz @Spin
end;
At the beginning, edx contains the value of the link parameter and ecx contains the address of the chainHead parameter.
The code first loads the address of the first node in the chain into the eax register. Nil pointers are fine in this case as we will not be following (dereferencing) them.
Then the code loads this same value into the [edx]. Remember, edx contains the value of the link, therefore [edx] represents link.Next. The mov [edx], eax line sets link.Next to point to the first node in the chain. If the chain is empty, link.Next will be set to nil, which is exactly the correct thing to do.
lock cmpxchg [ecx], edx then compares eax and [ecx] to ensure that underlying data haven't changed since PushLink started its execution. If values are not equal, eax is reloaded from [ecx] and code execution continues from the @Spin label. If values are equal, edx is loaded into [ecx]. At that moment, edx still contains the value stored in the link parameter and ecx contains the address of the chainHead. In other words, chainHead is set to link. As link.Next was set to old chainHead in the previous line, we have successfully linked link at the beginning of the chain.
That's all, the hard part is over. If you understand PopLink and PushLink, everything else is simple.
When we push the second value into the stack, it is inserted at the beginning of the public chain and recycle chain points to the next free node.
The process continues until all nodes are linked into the public chain and recycle chain is nil.
To pop a value from the stack, TOmniBaseContainer.Pop is used. It first gets a topmost allocated node (atomically, of course). Then it moves node data into the method parameter and pushes node into the recycle chain (again, atomically).
function TOmniBaseContainer.Pop(var value): boolean;
var
linkedData: POmniLinkedData;
begin
linkedData := PopLink(obcPublicChain);
Result := assigned(linkedData);
if not Result then
Exit;
Move(linkedData.Data, value, ElementSize);
PushLink(linkedData, obcRecycleChain);
end;
One of the important things to note is that nodes are not moved around. They are allocated at the beginning and for the whole lifetime of the TOmniBaseContainer they stay immovabe. Only Next fields are modified (and of course the Data is copied) and that's why this lock-free stack implementation is extremely fast. First results indicate that it can move about 800.000 integers per second between two threads on a 1,67 GHz Core2 Duo (T2300) machine.
Next post will discuss the lock-free ring buffer that is built on top of the lock-free stack. Stay tuned!