Today, I want to talk about a small multithreading kernel (framework, library, or whatever you want to call it) that I've recently helped develop for the Spring4D project.
ComputeCore is a lightweight parallel processing framework for Delphi that lets you easily run CPU-intensive tasks across multiple threads. Its goals are straightforward: to run (possibly compute-heavy) tasks concurrently, allow tasks to spawn child tasks, and expose a very simple interface for the caller.
It will eventually be published as Spring.Threading, but until that happens, we can examine the almost-final version at https://github.com/gabr42/ComputeCore. In Spring4D, it will run the parallel implementation of the new sorting algorithm, but that doesn't mean it is only useful for one specific job! It is indeed a completely open (although intentionally simple) framework that can be used for your projects too.
[This article was written with the help of an AI. My next article will explain this decision and the writing process.]
Goals
· Parallel execution of computationally intensive tasks. Run many tasks concurrently to utilize multiple CPU cores, speeding up computational workloads. Tasks will not spend (much) time sleeping or waiting for external events.
·
Nested task support. Any task
may itself schedule subtasks, allowing parallel algorithms to spawn subtasks (e.g. a
parallel quicksort).
·
Simple interface. Only two
methods: one to run a task and one to wait for it. There’s no complex
configuration or result-passing – just fire-and-forget tasks.
We also limited the complexity of the framework by setting the following limits:
·
No return value: Tasks are
procedures (TProc
)
that return nothing. If you need a result, your task can write to a shared
variable or object that you manage externally.
·
No cancellation: Once started,
a task runs to completion. This avoids dealing with cancellation tokens or
cooperative aborting. (If you need to stop long-running tasks, you’d have to
handle that inside the task code itself.)
These choices make the core simpler. You just queue a TProc
and let it run. There’s no need to
manage futures or handle cancellation flags in the library itself.
Why a Custom Solution?
You might wonder why we chose to write a custom solution when Delphi already offers one, namely PPL, which implements everything needed plus some extra features.
The reason is that Spring4D supports older Delphi versions, starting from XE, which did not include PPL. (PPL was only introduced in XE7.)
Top-Level Design
There is one global compute core, accessible via the interface IComputeCore
. By default, the first time
you call TTask.Run(...)
,
ComputeCore framework automatically creates a global instance using CPUCount - 1
worker threads. If you
want a different setup (for example, you only have 2 cores but still want 2
worker threads), you can override this by explicitly assigning:
GlobalComputeCore := TComputeCore.Create(MyThreadCount);
Once set, all Run
and WaitFor
calls
go through that global instance. The interface IComputeCore
is fully thread-safe,
meaning you can call Run
from any thread (even from within a task) without additional locking.
ComputeCore’s public API is simple:
IComputeCore = interface ['{FF061A68-2B81-40EB-A374-5C94CF13B610}'] // Schedules a procedure to be executed in a task as soon as possible. function Run(const taskProc: TProc): ITask; // Waits for a task to complete execution procedure WaitFor(const task: ITask); end;
For compatibility
with Delphi’s Parallel Programming Library (PPL), ComputeCore also provides a class TTask
with static
methods. For example, TTask.Run(procedure
begin ... end)
internally ensures the global compute core is
created and then calls IComputeCore.Run
.
If you want, you can create multiple compute cores (by calling TComputeCore.Create
multiple times), but doing so would mean the convenience of TTask.Run
and TTask.WaitFor
(which assume a single
global core) would no longer apply. Typically you just let the singleton
instance serve the whole application.
Data Structures
Internally, TComputeCore
manages its threads and tasks using a few basic data structures:
·
Worker thread list (FThreads
).
An array (fixed at creation time) of thread objects. Each one runs the worker
loop (see below). This array is only accessed during construction and
destruction of the core, so it doesn’t require additional locking.
·
Inactive thread stack (FInactiveThreads
).
An array (used as a stack) of threads that are currently idle and
waiting. When a thread finishes a task and has no more work immediately, it
pushes itself onto this list. When a new task arrives, if there’s an idle
thread available here, the core pops it and signals that thread to wake
up.
·
Task queue (FTasks
). A
first-in-first-out queue of pending tasks (each task is encapsulated in an
object that implements an ITask
interface). When you call Run
,
the TProc
is
wrapped in a task object and enqueued here.
Because multiple threads (the main thread and any of the worker threads) can
be calling Run
, WaitFor
, or finishing
tasks concurrently, access to FTasks
and FInactiveThreads
must be synchronized. ComputeCore uses Delphi’s TMonitor
(via MonitorEnter
/MonitorExit
) on the compute core object (Self
) to protect these
shared structures. In other words, most methods grab the core’s monitor, do a
quick operation on the queue or array, and release it. Delphi’s monitor can be
thought of as a ready-to-use critical section tied to any TObject
.
In practice, methods that acquire the lock end in _U
(unsafe) if they assume the lock is
already held, or without _U
if they do the locking themselves. For example, GetTask_U
assumes the monitor is held
and just pops a task from the queue. AllocateTask
is a higher-level method that holds the lock
while calling GetTask_U
and also updates the thread status (active/inactive) inside the same lock. This
“big lock” strategy ensures there are no race conditions between pulling a task
and updating the idle list. (See more below.)
Running a Task
When you call GlobalComputeCore.Run(MyProc)
,
ComputeCore does something like:
1. Enqueue
the task. The TProc
is wrapped in an ITask
interface and is added to FTasks.
2. Wake
a worker if idle. If FInactiveThreads
is not empty, pop an idle thread index and signal that thread’s event (using
something like TEvent.SetEvent
)
so it wakes up immediately to grab tasks.
3. Return
the ITask. You get back an ITask
interface you can later pass to WaitFor
.
If no thread was idle (because all workers were busy), the task just sits in
the queue. As soon as any worker thread finishes its current task, it will loop
and notice there’s a new task, so it will start working on it.
Worker Threads
Each worker is a simple Delphi TThread
running the following code:
procedure TCCThread.Execute; var task: ITask; begin NameThreadForDebugging('ComputeCore worker'); // Wait until signaled or timeout while (not Terminated) and (FSignal.WaitFor <> wrTimeout) do begin // Process tasks until no more are available while (not Terminated) and FOwner_ref.AllocateTask(Self, task) do task.Execute; end; end;
Here, FSignal
is a synchronization event that a main thread pulses whenever a new task is
enqueued or a thread should check for work. AllocateTask
(called inside the inner
loop) does something like: take the core’s lock, get a task from FTasks
, mark this thread active (or put
it on the idle list if no task was found), then release the lock. If a task was
returned, the thread exits the lock and calls task.Execute
(which runs your TProc
). It then loops to
try AllocateTask
again, draining as many queued tasks as it can before going back to wait on FSignal
.
This simple loop means threads stay active as long as work is available, and
otherwise don’t spin or consume CPU.
Race Conditions and Thread Safety
ComputeCore avoids such races by doing multiple related actions under one
lock. For instance, the AllocateTask
method (inside the lock) both takes a task from the queue (GetTask_U
) and updates the thread’s
state (marks it active or pushes it on FInactiveThreads
) before releasing the lock. This atomic
update prevents the scenario where a thread might check the queue, see nothing,
and go inactive, while another thread enqueues a task in the tiny gap. By
bundling those steps, either the task is seen by an already-active thread or an
idle thread is immediately woken.
Likewise, signaling new tasks to workers is done only after enqueuing under
the same lock that protects FTasks
.
This coordination ensures that any new task always has a chance to trigger
exactly one thread. Because all shared state changes (queue and idle list)
happen within protected critical sections, there’s no chance for a “blind spot”
where a task is sitting in the queue but all threads are sleeping.
Another place where race condition could occur is intial creation of the GlobalComputeCore
singleton. Technically, this code could be called from two places (two different threads that were not created by the ComputeCore) at the same time. To solve this, ComputeCore uses optimistic lazy initialization.
It doesn’t lock before creating the instance; it simply creates it and tries to store it in the global variable atomically. This is a common concurrency pattern described in the OmniThreadLibrary documentation. Essentially: if two threads try to initialize simultaneously, they may both create an instance but only one will “win” and the other will be discarded. This avoids a locking bottleneck on startup while still ensuring exactly one instance ends up in use.
Preventing Deadlocks with Nested Tasks
A classical problem in thread pools is deadlock by saturation:
suppose every worker thread is busy running a task, and those tasks do
something like TTask.Run
and then immediately WaitFor
the child tasks. If tasks spawn a new task and then wait, it’s possible for all
worker threads to be tied up waiting on subtasks, with no thread left to
actually execute those subtasks – a deadlock. This is exactly what can happen
with a naive parallel QuickSort implementation if not handled carefully.
ComputeCore solves this by letting waiting threads “help out.” Instead of
truly blocking when you call WaitFor(task)
,
the waiting thread will loop and execute other tasks from the queue while it’s
waiting for task
to finish. In pseudo-code, WaitFor
looks like:
while task.WorkDone.WaitFor(0) = wrTimeout do begin __Acquire; GetTask_U(newTask); __Release; if not assigned(newTask) then TThread.Yield else begin newTask.Execute; if assigned(newTask.ExceptObj) then raise Exception(newTask.ExceptObj); end; end;
In words, while the WorkDone
event of our awaited task is not yet signaled (signifying that the task is still active), the thread takes another pending task if available, and executes it. If there’s no other
task, it just yields briefly (allowing other threads to run). This way, threads
waiting on child tasks will process unrelated tasks. This effectively keeps the
pipeline full and avoids deadlock. In practice this means a depth-N
nested chain of tasks
can execute without deadlock as long as each waiting thread helps execute work.
(This is sometimes called a helping or work-stealing
strategy, though here it’s a simple single-queue version.)
With this approach, even parallel QuickSort won’t hang as long as there is
at least one “helping” check in the wait loop. In fact, this “wait by executing
pending tasks” trick is a common pattern in custom thread pools to avoid
starvation in nested parallel workloads.
Unit Testing
Let's talk about unit testing and stress testing in a separate article.
Conclusion
ComputeCore demonstrates that you can implement a custom parallel solution in a reasonable time (and without too many grey hairs) if you follow a few simple rules:
·
Keep it simple. A minimal
interface make the code understandable and maintainable.
·
Minimize sharing. Only the task
queue and idle-thread list are shared, and only for short critical sections.
Most of the work (executing tasks) happens outside the lock.
· Think about race conditions. Every insertion into the queue or change to the idle list is done under a lock to prevent missed signals. Combining actions (e.g. pulling a task and marking thread state) into one locked block prevents subtle bugs.
·
Test thoroughly. Concurrent
code is tricky. Even with careful design, always write unit tests and stress
tests to exercise scheduling, nested waits, and edge cases.
By following these ideas you can build your own lightweight
parallel framework for Delphi. It’s not that hard, and the result (like
ComputeCore) can give you fine-grained control over task scheduling without the
overhead of a heavier library.
No comments:
Post a Comment