Recently, I had to find bottlenecks in one of our applications, which does all sorts of things related to DVB and handles real-time reading and sending of data streams over IP at bitrates of up to 80 Mb/s (in our lab; in real life possibly even more). Our customer created a configuration which essentially brought the app to a crawl, and I had to fix it.
It quickly turned out that although the program was not able to handle the load, the CPU was not very busy - the busiest core was only at about 30%. I suspected thread contention problems in FastMM, switched it for SapMM, and indeed the problem went away: CPU load went up and the application could again handle the load.
Crisis averted, I took the time to find the real problem - excessive GetMem/FreeMem calls in this program. As far as I could tell, no tool existed to find that, so in true DIY manner I created my own ;)
After some consideration I decided that the easiest way to find such bottlenecks would be to modify FastMM4 itself. I won’t go into details of the FastMM implementation (you can find some info here), but suffice it to say that during allocation/reallocation/release of memory blocks the memory manager has to lock some internal structures, and if that fails (because they are locked from another thread), it tries in a loop to lock them again.
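The locking pattern in question looks roughly like this - a minimal sketch with illustrative names, not the exact FastMM4 code (FastMM4 uses its own LockCmpxchg helper rather than TInterlocked):

```pascal
uses
  System.SyncObjs;

var
  FLocked: Integer; // 0 = free, 1 = held; one such flag per internal structure

procedure AcquireStructureLock;
begin
  // Atomically set FLocked to 1 if it was 0; a non-zero result means
  // another thread already holds the lock.
  while TInterlocked.CompareExchange(FLocked, 1, 0) <> 0 do
  begin
    // Contended case - back off and retry. This retry loop is exactly
    // the place where my modification hooks in and logs the call stack.
    TThread.Yield;
  end;
end;
```

Under low contention the loop body is never entered, so logging only inside the loop captures precisely the contended calls.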
My modification detects that condition: each time this retry loop is entered, the code logs the call stack (with the same mechanism FullDebugMode uses). A static data collector TStaticCollector, implemented in FastMM4DataCollector.pas, collects those call stacks and sorts them by number of occurrences. It performs no memory allocations at all - an important property, since it runs inside the memory manager itself.
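Conceptually, the collector's interface boils down to something like the following sketch (illustrative names and types, not the actual FastMM4DataCollector.pas declarations; the real unit differs in detail):

```pascal
const
  CMaxStackDepth = 11;    // assumed depth; FastMM uses a fixed stack trace size
  CMaxCollected  = 1024;  // preallocated slot count - no heap growth, ever

type
  TCallStack = array [0..CMaxStackDepth-1] of NativeUInt;

  TStackEntry = record
    Stack: TCallStack;
    Count: Integer;       // number of times this exact stack was seen
  end;

  TStaticCollector = record
  private
    FEntries: array [0..CMaxCollected-1] of TStackEntry; // static storage only
    FCount: Integer;
  public
    // Record one occurrence of a call stack; must be allocation-free and
    // safe to call from inside GetMem/FreeMem.
    procedure Add(const AStack: TCallStack);
    // Copy out the entries sorted by descending Count, for end-of-run reporting.
    procedure GetSorted(var AEntries: array of TStackEntry; var ACount: Integer);
  end;
```

Everything lives in a fixed-size record, so calling Add from within an allocation routine can never recurse back into the memory manager.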
When the program ends, FastMM walks the sorted lists of collected stacks, prints the most frequent ones on screen (top 3) and logs them to the _EventLog.txt file (top 10).
For example, our program produced something like this (I also had to define RawStackTraces to get the call stack below TInterfacedObject._Release):
(With this information at hand it was not hard to find and fix the bottleneck, but I won’t go into that.)
In reality, there are multiple TStaticCollector instances, and the data is aggregated at the end: one collector for large blocks, one for medium blocks, and one for each small block list.
This data collection & logging is controlled with the LogLockContention define, which works with or without FullDebugMode. It requires the FullDebugMode DLL to be present (as it is used for call stack collection) and forces the Pascal implementation of the allocation routines. In theory, the new code works on both 32- and 64-bit Windows, but I have only tested the 32-bit part, as the problematic application doesn’t compile in 64-bit yet.
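To enable the feature, it should be enough to add the defines to FastMM4Options.inc or to the project's conditional defines, along the lines of:

```pascal
// In FastMM4Options.inc (or as project-level conditional defines).
// LogLockContention works with or without FullDebugMode, but the
// FullDebugMode DLL must be deployed alongside the executable.
{$define LogLockContention}

// Optional: needed to see frames below TInterfacedObject._Release,
// as mentioned above.
{$define RawStackTraces}
```

This is a configuration sketch; check the comments in FastMM4Options.inc for the authoritative description of each define.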
This change has already been merged into the official FastMM4 repository, which Pierre recently moved to GitHub.
During my testing I found that the sleep & retry loop is almost always entered from one of the small block lists during a FreeMem operation.
I had an idea for improving this situation: add a small lock-free stack to each memory list. This stack caches memory blocks that should be released when the list cannot be locked. First tests, however, failed to show any performance improvement. Sometimes, at least in synthetic tests, the code performs even worse than before.
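The idea can be sketched like this - hedged pseudocode with made-up helper names (TryLockBlockList, ReleaseStack and friends are illustrative, not the actual identifiers from the Locking_Improvements branch):

```pascal
// Sketch of the UseReleaseStack idea for freeing a small block.
procedure FreeSmallBlock(APtr: Pointer);
begin
  if TryLockBlockList then
  begin
    // Uncontended path: free our block and also drain anything that
    // other threads queued while the list was locked.
    ReleaseBlock(APtr);
    while ReleaseStack.Pop(APtr) do
      ReleaseBlock(APtr);
    UnlockBlockList;
  end
  else if not ReleaseStack.Push(APtr) then
    // The lock-free stack is bounded; if it is full, fall back to the
    // original blocking retry loop.
    LockBlockListAndFree(APtr);
end;
```

The hoped-for win is that a contended FreeMem returns immediately after a single lock-free push instead of spinning; the observed slowdowns suggest the extra drain work and cache traffic can outweigh that, at least in synthetic workloads.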
If you have a multithreaded application and would like to test this approach, you can download the experimental FastMM4 from the Locking_Improvements branch and compile with the UseReleaseStack define. If you do, please send me your findings – I’m interested in both positive (improvements) and negative (slower performance) results.
By the way, you should always download from Pierre’s GitHub, not from my own FastMM4 fork, which should be considered experimental and unstable! I’m contributing all tested changes to Pierre, and he is very prompt at merging pull requests.