Saturday, June 28, 2025

ComputeCore - A Simple Parallel Task Framework

Today, I want to talk about a small multithreading kernel (framework, library, or whatever you want to call it) that I've recently helped develop for the Spring4D project.

ComputeCore is a lightweight parallel processing framework for Delphi that lets you easily run CPU-intensive tasks across multiple threads. Its goals are straightforward: to run (possibly compute-heavy) tasks concurrently, allow tasks to spawn child tasks, and expose a very simple interface for the caller.

It will eventually be published as Spring.Threading, but until that happens, we can examine the almost-final version at https://github.com/gabr42/ComputeCore. In Spring4D it will power the parallel implementation of the new sorting algorithm, but that doesn't mean it is only useful for that one job! It is indeed a completely open (although intentionally simple) framework that you can use in your own projects too.

[This article was written with the help of an AI. My next article will explain this decision and the writing process.]

Goals

Before writing any code, we (Stefan, who maintains Spring4D, and I) set up the following goals:

·        Parallel execution of computationally intensive tasks. Run many tasks concurrently to utilize multiple CPU cores, speeding up computational workloads. Tasks will not spend (much) time sleeping or waiting for external events.

·        Nested task support. Any task may itself schedule subtasks, allowing parallel algorithms to spawn subtasks (e.g. a parallel quicksort).

·        Simple interface. Only two methods: one to run a task and one to wait for it. There’s no complex configuration or result-passing – just fire-and-forget tasks.

We also kept the framework’s complexity in check by imposing the following limitations:

·        No return value. Tasks are procedures (TProc) that return nothing. If you need a result, your task can write to a shared variable or object that you manage externally.

·        No cancellation. Once started, a task runs to completion. This avoids dealing with cancellation tokens or cooperative aborting. (If you need to stop long-running tasks, you’d have to handle that inside the task code itself.)

These choices make the core simpler. You just queue a TProc and let it run. There’s no need to manage futures or handle cancellation flags in the library itself.

Why a Custom Solution?

You might wonder why we chose to write a custom solution when Delphi already offers one, namely the Parallel Programming Library (PPL), which implements everything needed plus some extra features.

The reason is that Spring4D supports older Delphi versions, starting from XE, which did not include PPL. (PPL was only introduced in XE7.)

Top-Level Design

There is one global compute core, accessible via the interface IComputeCore. By default, the first time you call TTask.Run(...), the ComputeCore framework automatically creates a global instance using CPUCount - 1 worker threads. If you want a different setup (for example, you only have 2 cores but still want 2 worker threads), you can override this by explicitly assigning:

GlobalComputeCore := TComputeCore.Create(MyThreadCount);

Once set, all Run and WaitFor calls go through that global instance. The interface IComputeCore is fully thread-safe, meaning you can call Run from any thread (even from within a task) without additional locking.

ComputeCore’s public API is simple:

  IComputeCore = interface ['{FF061A68-2B81-40EB-A374-5C94CF13B610}']
    // Schedules a procedure to be executed in a task as soon as possible.
    function Run(const taskProc: TProc): ITask;
    // Waits for a task to complete execution.
    procedure WaitFor(const task: ITask);
  end;

For compatibility with Delphi’s PPL, ComputeCore also provides a class TTask with static methods. For example, TTask.Run(procedure begin ... end) internally ensures the global compute core is created and then calls IComputeCore.Run.
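For illustration, here is how a typical call site might look. This is a minimal sketch, assuming the ComputeCore unit is in your uses clause; since tasks have no return value, the result is communicated through a captured variable:

var
  sum: Int64;
  task: ITask;
begin
  sum := 0;
  task := TTask.Run(
    procedure
    var
      i: Integer;
      total: Int64;
    begin
      total := 0;
      for i := 1 to 1000000 do
        Inc(total, i);
      sum := total; // single writer; safe to read after WaitFor
    end);
  // ... do other work ...
  TTask.WaitFor(task);
  Writeln('Sum = ', sum);
end;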

If you want, you can create multiple compute cores (by calling TComputeCore.Create multiple times), but doing so would mean the convenience of TTask.Run and TTask.WaitFor (which assume a single global core) would no longer apply. Typically you just let the singleton instance serve the whole application.

Data Structures

Internally, TComputeCore manages its threads and tasks using a few basic data structures:

·        Worker thread list (FThreads). An array (fixed at creation time) of thread objects. Each one runs the worker loop (see below). This array is only accessed during construction and destruction of the core, so it doesn’t require additional locking.

·        Inactive thread stack (FInactiveThreads). An array (used as a stack) of threads that are currently idle and waiting. When a thread finishes a task and has no more work immediately, it pushes itself onto this list. When a new task arrives, if there’s an idle thread available here, the core pops it and signals that thread to wake up.

·        Task queue (FTasks). A first-in-first-out queue of pending tasks (each task is encapsulated in an object that implements an ITask interface). When you call Run, the TProc is wrapped in a task object and enqueued here.

Because multiple threads (the main thread and any of the worker threads) can be calling Run, WaitFor, or finishing tasks concurrently, access to FTasks and FInactiveThreads must be synchronized. ComputeCore uses Delphi’s TMonitor (via MonitorEnter/MonitorExit) on the compute core object (Self) to protect these shared structures. In other words, most methods grab the core’s monitor, do a quick operation on the queue or array, and release it. Delphi’s monitor can be thought of as a ready-to-use critical section tied to any TObject.

In practice, a method name ends in _U (unsafe) if the method assumes the lock is already held; methods without the suffix do the locking themselves. For example, GetTask_U assumes the monitor is held and just pops a task from the queue. AllocateTask is a higher-level method that holds the lock while calling GetTask_U and also updates the thread status (active/inactive) inside the same lock. This “big lock” strategy ensures there are no race conditions between pulling a task and updating the idle list. (See more below.)
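To make this concrete, here is a simplified sketch of what AllocateTask does; MarkThreadActive_U and PushInactiveThread_U are illustrative helper names, not the actual ComputeCore internals:

function TComputeCore.AllocateTask(thread: TCCThread; var task: ITask): Boolean;
begin
  MonitorEnter(Self);
  try
    Result := GetTask_U(task);       // pop a pending task; assumes the lock is held
    if Result then
      MarkThreadActive_U(thread)     // illustrative helper
    else
      PushInactiveThread_U(thread);  // illustrative helper
  finally
    MonitorExit(Self);
  end;
end;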

Running a Task

When you call GlobalComputeCore.Run(MyProc), ComputeCore does something like:

1.     Enqueue the task. The TProc is wrapped in an ITask interface and is added to FTasks.

2.     Wake a worker if idle. If FInactiveThreads is not empty, pop an idle thread index and signal that thread’s event (using something like TEvent.SetEvent) so it wakes up immediately to grab tasks.

3.     Return the ITask. You get back an ITask interface you can later pass to WaitFor.

If no thread was idle (because all workers were busy), the task just sits in the queue. As soon as any worker thread finishes its current task, it will loop and notice there’s a new task, so it will start working on it.
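In simplified (and purely illustrative) code, the Run path could look roughly like this; TCCTask, PopInactiveThread_U, and Signal are stand-ins for the actual internal names:

function TComputeCore.Run(const taskProc: TProc): ITask;
var
  idleThread: TCCThread;
begin
  Result := TCCTask.Create(taskProc);   // step 1: wrap the TProc in an ITask
  MonitorEnter(Self);
  try
    FTasks.Enqueue(Result);             // enqueue under the lock
    idleThread := PopInactiveThread_U;  // nil if every worker is busy
    if Assigned(idleThread) then
      idleThread.Signal.SetEvent;       // step 2: wake the idle worker
  finally
    MonitorExit(Self);
  end;
end;                                    // step 3: caller receives the ITask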

Worker Threads

Each worker is a simple Delphi TThread running the following code:

procedure TCCThread.Execute;
var
  task: ITask;
begin
  NameThreadForDebugging('ComputeCore worker');
  // Wait until signaled or timeout
  while (not Terminated) and (FSignal.WaitFor <> wrTimeout) do begin
    // Process tasks until no more are available
    while (not Terminated) and FOwner_ref.AllocateTask(Self, task) do
      task.Execute;
  end;
end;

Here, FSignal is a synchronization event that the core sets whenever a new task is enqueued and this thread should check for work. AllocateTask (called inside the inner loop) does something like this: take the core’s lock, get a task from FTasks, mark this thread active (or put it on the idle list if no task was found), then release the lock. If a task was returned, the thread exits the lock and calls task.Execute (which runs your TProc). It then loops to try AllocateTask again, draining as many queued tasks as it can before going back to wait on FSignal.

This simple loop means threads stay active as long as work is available, and otherwise don’t spin or consume CPU.

Race Conditions and Thread Safety

In any multithreaded queue, careful attention to race conditions is crucial. For example, consider what happens if one task is enqueued at the exact same time another thread finishes a task. The core must ensure that the newly enqueued task isn’t “lost” (i.e. enqueued but no thread ever wakes up to handle it).

ComputeCore avoids such races by doing multiple related actions under one lock. For instance, the AllocateTask method (inside the lock) both takes a task from the queue (GetTask_U) and updates the thread’s state (marks it active or pushes it on FInactiveThreads) before releasing the lock. This atomic update prevents the scenario where a thread might check the queue, see nothing, and go inactive, while another thread enqueues a task in the tiny gap. By bundling those steps, either the task is seen by an already-active thread or an idle thread is immediately woken.

Likewise, signaling new tasks to workers is done only after enqueuing under the same lock that protects FTasks. This coordination ensures that any new task always has a chance to trigger exactly one thread. Because all shared state changes (queue and idle list) happen within protected critical sections, there’s no chance for a “blind spot” where a task is sitting in the queue but all threads are sleeping.

Another place where a race condition could occur is the initial creation of the GlobalComputeCore singleton. Technically, this code could be called from two places (two different threads that were not created by ComputeCore) at the same time. To solve this, ComputeCore uses optimistic lazy initialization.

It doesn’t lock before creating the instance; it simply creates it and tries to store it in the global variable atomically. This is a common concurrency pattern described in the OmniThreadLibrary documentation. Essentially: if two threads try to initialize simultaneously, they may both create an instance but only one will “win” and the other will be discarded. This avoids a locking bottleneck on startup while still ensuring exactly one instance ends up in use.
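A minimal sketch of the pattern, using TInterlocked from System.SyncObjs (the helper name is illustrative and the details differ from the actual source):

function EnsureGlobalComputeCore: IComputeCore;
var
  newCore: IComputeCore;
begin
  if not Assigned(GlobalComputeCore) then begin
    newCore := TComputeCore.Create(CPUCount - 1);
    // Publish atomically; if another thread already stored its instance,
    // ours simply goes out of scope and is destroyed.
    if TInterlocked.CompareExchange(
         PPointer(@GlobalComputeCore)^, Pointer(newCore), nil) = nil
    then
      Pointer(newCore) := nil; // we won: transfer the reference without releasing it
  end;
  Result := GlobalComputeCore;
end;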

Preventing Deadlocks with Nested Tasks

A classical problem in thread pools is deadlock by saturation: suppose every worker thread is busy running a task, and each of those tasks spawns a child task with TTask.Run and then immediately calls WaitFor on it. It’s possible for all worker threads to be tied up waiting on subtasks, with no thread left to actually execute those subtasks – a deadlock. This is exactly what can happen with a naive parallel QuickSort implementation if it is not handled carefully.

ComputeCore solves this by letting waiting threads “help out.” Instead of truly blocking when you call WaitFor(task), the waiting thread will loop and execute other tasks from the queue while it’s waiting for task to finish. In pseudo-code, WaitFor looks like:

  while task.WorkDone.WaitFor(0) = wrTimeout do begin
    __Acquire;
    GetTask_U(newTask);
    __Release;
    if not assigned(newTask) then
      TThread.Yield
    else begin
      newTask.Execute;
      if assigned(newTask.ExceptObj) then
        raise Exception(newTask.ExceptObj);
    end;
  end;

In words, while the WorkDone event of our awaited task is not yet signaled (signifying that the task is still active), the thread takes another pending task if available, and executes it. If there’s no other task, it just yields briefly (allowing other threads to run). This way, threads waiting on child tasks will process unrelated tasks. This effectively keeps the pipeline full and avoids deadlock. In practice this means a depth-N nested chain of tasks can execute without deadlock as long as each waiting thread helps execute work. (This is sometimes called a helping or work-stealing strategy, though here it’s a simple single-queue version.)

With this approach, even parallel QuickSort won’t hang as long as there is at least one “helping” check in the wait loop. In fact, this “wait by executing pending tasks” trick is a common pattern in custom thread pools to avoid starvation in nested parallel workloads.
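As a toy illustration of why this matters, here is a quicksort-style skeleton; Partition is a hypothetical helper, and a real implementation would tune the sequential cutoff:

uses
  System.Generics.Collections, System.Generics.Defaults;

procedure ParallelSort(data: TArray<Integer>; lo, hi: Integer);
var
  mid: Integer;
  child: ITask;
begin
  if hi - lo < 1024 then begin
    // Small range: sort sequentially.
    TArray.Sort<Integer>(data, TComparer<Integer>.Default, lo, hi - lo + 1);
    Exit;
  end;
  mid := Partition(data, lo, hi); // hypothetical partition helper
  child := TTask.Run(
    procedure
    begin
      ParallelSort(data, lo, mid - 1);
    end);
  ParallelSort(data, mid + 1, hi);
  // Safe even when every worker is busy: WaitFor executes pending
  // tasks instead of blocking this thread.
  TTask.WaitFor(child);
end;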

Unit Testing

Let's talk about unit testing and stress testing in a separate article.

Conclusion

ComputeCore demonstrates that you can implement a custom parallel solution in a reasonable time (and without too many grey hairs) if you follow a few simple rules:

·        Keep it simple. A minimal interface makes the code understandable and maintainable.

·        Minimize sharing. Only the task queue and idle-thread list are shared, and only for short critical sections. Most of the work (executing tasks) happens outside the lock.

·        Think about race conditions. Every insertion into the queue or change to the idle list is done under a lock to prevent missed signals. Combining actions (e.g. pulling a task and marking thread state) into one locked block prevents subtle bugs.

·        Test thoroughly. Concurrent code is tricky. Even with careful design, always write unit tests and stress tests to exercise scheduling, nested waits, and edge cases.

By following these ideas you can build your own lightweight parallel framework for Delphi. It’s not that hard, and the result (like ComputeCore) can give you fine-grained control over task scheduling without the overhead of a heavier library.

 
