Sunday, February 21, 2016

Finding Memory Allocation Bottlenecks with FastMM4

Recently, I had to find bottlenecks in one of our applications that does all sorts of things related to DVB and handles real-time reading and sending of data streams over IP at bitrates of up to 80 Mb/s (in our lab; in real life maybe even more). Our customer created a configuration which essentially brought the app to a crawl, and I had to fix it.

It quickly turned out that although the program was not able to handle the load, the CPU was not very busy; the busiest core was only about 30% loaded. So I suspected thread contention problems in FastMM, swapped it for SapMM, and indeed the problem went away: CPU load went up and the application could again handle the load.

Crisis averted, I took the time to find the real problem: excessive GetMem/FreeMem calls in this program. As far as I could tell, no tool existed to find them, so in true DIY manner I created my own ;)

After some consideration I decided that the easiest way to find such bottlenecks would be to modify FastMM4 itself. I won't go into details of the FastMM implementation (you can find some info here), but suffice it to say that during allocation/reallocation/release of memory blocks the memory manager has to lock some internal structures, and if that fails (because they are locked from another thread), it retries in a loop.
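
Conceptually, the lock-and-retry pattern looks something like this (a much simplified sketch, not the actual FastMM4 source; all names are invented for illustration):

// Simplified, hypothetical sketch of the lock-and-retry pattern described
// above; this is not the actual FastMM4 code.
uses
  Winapi.Windows;

const
  InitialSleepTime    = 0; // first back-off: just give up the time slice
  AdditionalSleepTime = 1; // later back-off: sleep for a millisecond

var
  BlockListLocked: Integer = 0; // 0 = unlocked, 1 = locked

procedure LockBlockList;
begin
  // Try to flip the flag from 0 to 1 atomically; if that fails, another
  // thread holds the lock, so back off and retry in a loop.
  while InterlockedCompareExchange(BlockListLocked, 1, 0) <> 0 do
  begin
    // This retry branch is where my modification captures the call stack.
    Sleep(InitialSleepTime);
    if InterlockedCompareExchange(BlockListLocked, 1, 0) = 0 then
      Exit;
    Sleep(AdditionalSleepTime);
  end;
end;

Every pass through the retry branch means that one thread was stalled by another, which is exactly the event worth counting.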

My modifications detect that condition, and each time this loop is entered the code logs the call stack (with the same mechanism FullDebugMode uses). A static data collector, TStaticCollector (implemented in FastMM4DataCollector.pas; it performs no memory allocations at all), collects those call stacks and sorts them by number of occurrences.
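
To illustrate what "performs no memory allocations at all" means, here is a hypothetical, heavily simplified collector in the same spirit (names, sizes and API are invented for this sketch and do not match the real FastMM4DataCollector.pas; thread safety is also left out for brevity):

// Hypothetical "static collector": all storage is preallocated, so
// recording a call stack never allocates memory.
uses
  System.SysUtils; // CompareMem

const
  CMaxStackDepth = 11;   // return addresses kept per call stack
  CMaxCollected  = 1024; // distinct call stacks kept

type
  TCollectedStack = record
    Stack: array [0 .. CMaxStackDepth - 1] of NativeUInt;
    Count: Integer;
  end;

  TSketchCollector = record
    Data: array [0 .. CMaxCollected - 1] of TCollectedStack;
    Used: Integer;
    procedure Add(const AStack: array of NativeUInt);
  end;

procedure TSketchCollector.Add(const AStack: array of NativeUInt);
var
  entry: TCollectedStack;
  i, j: Integer;
begin
  // Normalize the incoming stack into a fixed-size, zero-padded record.
  FillChar(entry, SizeOf(entry), 0);
  for j := 0 to CMaxStackDepth - 1 do
    if j <= High(AStack) then
      entry.Stack[j] := AStack[j];

  // If the same stack was recorded before, just increment its counter ...
  for i := 0 to Used - 1 do
    if CompareMem(@Data[i].Stack, @entry.Stack, SizeOf(entry.Stack)) then
    begin
      Inc(Data[i].Count);
      Exit;
    end;

  // ... otherwise store it in the next free slot. When the preallocated
  // table is full, the stack is simply dropped - still no allocation.
  if Used < CMaxCollected then
  begin
    entry.Count := 1;
    Data[Used] := entry;
    Inc(Used);
  end;
end;

Sorting by Count and formatting of the most frequent entries only has to happen at program shutdown, outside of any allocation hot path.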

When the program ends, FastMM accesses the sorted lists of collected stacks, prints the most frequent ones on screen (top 3), and logs them to the _EventLog.txt file (top 10).

For example, our program produced something like this (I also had to define RawStackTraces to get the call stack below TInterfacedObject._Release):

293207 x
41B604 [FastMM4.pas][FastMM4][DebugFreeMem$qqrpv][9002]
406D96 [System.pas][System][@FreeMem$qqrpv][4644]
408E31 [System.pas][System][TObject.FreeInstance$qqrv][16187]
409669 [System.pas][System][@ClassDestroy$qqrxp14System.TObject][17536]
408FA6 [System.pas][System][TObject.$bdtr$qqrv][16247]
40F461 [System.pas][System][TInterfacedObject._Release$qqsv][37535]
40F36B [System.pas][System][@IntfCopy$qqrr44System.%DelphiInterface$17System.IInterface%x44System.%DelphiInterface$17System.IInterface%][36628]
4BD8CB [System.Classes.pas][System.Classes][Classes.TInterfaceList.Put$qqrix44System.%DelphiInterface$17System.IInterface%][5139]
4BD442 [System.Classes.pas][System.Classes][Classes.TInterfaceList.Delete$qqri][5009]
58800B [GpLifecycleEventList.pas][GpLifecycleEventList][TGpLifecycleEventList.PruneEventList$qqrv][505]
587857 [GpLifecycleEventList.pas][GpLifecycleEventList][TGpLifecycleEventList.AddEvent$qqrx20System.UnicodeString][353]

215749 x
41B604 [FastMM4.pas][FastMM4][DebugFreeMem$qqrpv][9002]
406D96 [System.pas][System][@FreeMem$qqrpv][4644]
40994A [System.pas][System][TMonitor.Destroy$qqrv][17794]
409922 [System.pas][System][TMonitor.Destroy$qqrxp14System.TObject][17786]
409038 [System.pas][System][TObject.CleanupInstance$qqrv][16399]
408E2A [System.pas][System][TObject.FreeInstance$qqrv][16186]
409669 [System.pas][System][@ClassDestroy$qqrxp14System.TObject][17536]
408FA6 [System.pas][System][TObject.$bdtr$qqrv][16247]
408FB3 [System.pas][System][TObject.Free$qqrv][16255]
4D80DD [System.Generics.Collections.pas][System.Classes][Generics.Collections.%TThreadList__1$44System.%DelphiInterface$17System.IInterface%%.$bdtr$qqrv][6947]
408FB3 [System.pas][System][TObject.Free$qqrv][16255]

(With this information at hand it was not hard to find and fix the bottleneck, but I won’t go into that.)

In reality there are multiple TStaticCollector instances (one for large blocks, one for medium blocks, and one for each small block list), and the data is aggregated at the end.

This data collection and logging is controlled with the LogLockContention define, which works with or without FullDebugMode. It requires the FullDebugMode DLL to be present (it is used for call stack collection) and forces the Pascal implementation of the allocation routines. In theory the new code works on both 32- and 64-bit Windows, but I have only tested the 32-bit part, as the problematic application doesn't compile in 64-bit yet.
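
To give an idea of how this is switched on (assuming the defines go into FastMM4Options.inc; adding them to the project's conditional defines works as well):

// In FastMM4Options.inc (or in the project's conditional defines):
{$define LogLockContention} // collect and log lock contention call stacks
{$define RawStackTraces}    // optional: deeper call stacks (see above)

(The FullDebugMode DLL mentioned above is typically named FastMM_FullDebugMode.dll.)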

This change has already been merged into the official FastMM4 repository, which Pierre recently moved to GitHub.

Release Stack

During my testing I found that the sleep-and-retry loop is almost always entered from one of the small block lists during a FreeMem operation.

I had an idea for improving this situation: add a small lock-free stack to each memory list. This stack caches memory blocks that should be released when the list cannot be locked. First tests, however, failed to show any performance improvement; sometimes, at least in synthetic tests, the code performed even worse than before.
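
For the curious, the idea can be sketched like this (hypothetical names, not the actual code from the Locking_Improvements branch): when FreeMem cannot lock the block list, it pushes the block onto a small lock-free stack, and whichever thread next acquires the lock pops the whole chain and releases the pending blocks.

// Hypothetical sketch of a per-list "release stack". The freed block
// itself stores the link to the next pending block, so the stack needs
// no extra memory.
uses
  Winapi.Windows;

type
  PPendingBlock = ^TPendingBlock;
  TPendingBlock = record
    NextPending: PPendingBlock;
  end;

  TReleaseStack = record
    Head: Pointer; // most recently pushed pending block
    procedure Push(ABlock: Pointer);
    function PopAll: PPendingBlock;
  end;

procedure TReleaseStack.Push(ABlock: Pointer);
var
  oldHead: Pointer;
begin
  // Lock-free push: link the block to the current head and publish it
  // with a compare-and-swap, retrying if another thread got in between.
  repeat
    oldHead := Head;
    PPendingBlock(ABlock).NextPending := oldHead;
  until InterlockedCompareExchangePointer(Head, ABlock, oldHead) = oldHead;
end;

function TReleaseStack.PopAll: PPendingBlock;
begin
  // Atomically detach the whole pending chain; the thread that currently
  // owns the block list lock then walks the chain and frees each block.
  Result := PPendingBlock(InterlockedExchangePointer(Head, nil));
end;

In other words, instead of sleeping when the list is locked, FreeMem would hand the block to Push; a thread that does acquire the lock calls PopAll first and releases whatever has accumulated before doing its own work.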

If you have a multithreaded application and would like to test this approach, you can download the experimental FastMM4 from the Locking_Improvements branch and compile it with the UseReleaseStack define. If you do, please send me your findings – I’m interested in both positive (improvements) and negative (slower performance) results.

By the way, you should always download from Pierre’s GitHub, not from my own FastMM4 fork, which should be considered experimental and unstable! I’m contributing all tested changes to Pierre, and he is very prompt about merging pull requests.

9 comments:

  1. Very nice! Good to see improvements in FastMM4!
    It's a good tool to verify and improve application performance & stability.
    Will you improve FastMM4 for MacOS also?
    Best regards
    Dirk

    1. All current changes are Windows-only, but I'll certainly try to make everything as OS X-compatible as possible.

  2. You are right. Avoiding memory allocation when possible is always a good idea when performance matters, especially in multi-threaded apps. See http://blog.synopse.info/post/2011/05/20/How-to-write-fast-multi-thread-Delphi-applications and also http://blog.synopse.info/post/2011/06/02/Fast-JSON-parsing about how in-place JSON parsing and decoding (instead of creating nodes as most JSON libraries do) avoids almost all memory allocation and results in huge multi-thread scaling ability.

  3. I made a quick fix for the problem: the pending free queue wasn't cleared properly so it kept sleeping. With my fix I get the following results (when running this test: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/)

                        single   multi
    Delphi Seattle X    24.643   11.695
    Primoz's FastMM     25.468    7.711

    So almost perfect scaling and 95% CPU usage on my quad core! Kudos for the great work!

    https://github.com/andremussche/FastMM4/commit/589c87ab55997837156a3bbe9637a691d3be03fb

    1. Anonymous 14:34

      Did some testing today on FastMM with improved locking. Cannot comment on speed; I guess there are not enough memory allocations in my threads to notice the difference. But there is a problem with memory consumption: the "released" memory is not returned to the OS.

    2. And how did you notice that problem?

    3. Anonymous 15:03

      Task manager ... I know it is not the most appropriate way to measure this. During an operation the application allocates about 300 MB of memory; after the operation is completed, the usage does not drop as expected. The difference between the "stable" and "locking improved" versions is about 160 MB.

    4. Anonymous 17:16

      Maybe I was a little too quick. The problem described was with Andre's version of FastMM. The version from https://github.com/pleriche/FastMM4/tree/Locking_Improvements seems OK. I will do some more testing.
