Wednesday, September 28, 2011

NeverSleepOnThreadContention–NOT!

FastMM is a wonderful memory manager, but it can slow down quite a lot when used in a multithreaded environment. While Pierre has implemented some conditional defines that could help multithreaded code, namely NeverSleepOnThreadContention and SwitchToThread, I’m now making the point that you shouldn’t ever use them! Just see for yourself.
Below is a performance graph of a service I wrote. During testing, it was running 9 very active threads (none of them CPU intensive; they spent lots of time sleeping) on 2 cores (limited down from 8 for testing).
[Image: performance graph of the service built with default FastMM settings]
Given the amount of data the application is processing (three DVB inputs with 80 Mb/s on each) the CPU usage is not high at all.
Compiled with /dNeverSleepOnThreadContention (regardless of whether /dSwitchToThread is also present), the performance graph goes wild.
[Image: performance graph of the service built with NeverSleepOnThreadContention]
Those CPU spikes go to 100% for about 5 seconds!
Looking at the application in Process Explorer, I can say that about 6 threads want to do something with the memory at the same time, and the NeverSleepOnThreadContention code goes berserk and spends many seconds in a tight loop.
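
To make the difference between the strategies concrete, here is a rough sketch of a lock that either sleeps, spins, or calls SwitchToThread when it finds the lock taken. This is not FastMM's actual code. The names TContentionPolicy, TryLock, AcquireLock, ReleaseLock and PoolLock are made up for illustration, and the sketch assumes Delphi 2009 or later, where the Windows unit (Winapi.Windows in XE2 and later) declares the Interlocked functions with Integer parameters.

```pascal
program ContentionSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

type
  // Hypothetical names, not FastMM's real identifiers.
  TContentionPolicy = (cpSleep, cpSpin, cpSwitchToThread);

var
  PoolLock: Integer = 0; // 0 = free, 1 = held

// Tries to take the lock once; True when this thread now owns it.
function TryLock(var Lock: Integer): Boolean;
begin
  Result := InterlockedCompareExchange(Lock, 1, 0) = 0;
end;

// Retries until the lock is taken; Policy decides what happens between tries.
procedure AcquireLock(var Lock: Integer; Policy: TContentionPolicy);
begin
  while not TryLock(Lock) do
    case Policy of
      cpSleep: Sleep(1);                // give up the time slice, CPU can idle
      cpSwitchToThread: SwitchToThread; // yield to a ready thread on this processor
      cpSpin: ;                         // busy-wait until the owner releases
    end;
end;

procedure ReleaseLock(var Lock: Integer);
begin
  InterlockedExchange(Lock, 0);
end;

begin
  AcquireLock(PoolLock, cpSpin);
  Writeln('lock held: ', PoolLock);
  ReleaseLock(PoolLock);
end.
```

With more runnable threads than cores, the cpSpin branch is the dangerous one: every waiting thread burns its whole time slice while the thread that owns the lock may not even be scheduled.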
This is, of course, the nice version of the story without the PG rating. In reality the trouble started at a customer and we had absolutely no idea what was going on. Sometimes the input and output would start stuttering and the software would complain that there was no input connected and that it could not transmit. After quite some time I managed to reproduce the problem in the lab but I still had no idea what was going on. I spent two days looking for problems in my code before I spotted the real culprit. And then I spent two days cooling off so I could write this story without a bunch of expletives.
The moral? We really need a good multithreaded memory manager. I know that there are ScaleMM and SynScaleMM and TopMM but as far as I know the first two are still not bug-free and the third is a real memory hog. And I really love all the debugging features FastMM has built in.
As Embarcadero has no interest in hardcore Windows programming anymore, I don’t think we can expect them to step up and pay somebody to improve FastMM. Maybe we (as a community) can convince Pierre to do the work? Would you donate some hard-earned cash to get a better FastMM? I did it before just to tell Pierre that his work helped me a lot and I would gladly do it again.

Update 2011-10-13: See also http://delphitools.info/2011/10/13/memory-manager-investigations/

23 comments:

  1. Yes. I can donate $100 - just going to donation page!)

  2. Done) Please, everyone - go and donate! $100 is not too much!)

  3. While I totally support any donation towards a good cause, please keep in mind that I have absolutely no idea whether Pierre has the time and the will to do anything in the direction of improving FastMM for the multithreaded environment.

  4. Steffen 13:08

    There is also the Nexus MM, but I have no idea how good it is.

  5. @gabr You are bang on here. I have also encountered severe problems with NeverSleepOnThreadContention. For what it's worth my conclusions were that the best MM I found was Emery Berger's Hoard. You have to do a bit of grovelling in C to hook it up with Delphi but once you do it is stunning. I have also discovered that malloc from the msvcrt.dll that ships with Windows performs very well under thread contention. Again some grovelling required. I personally use FastMM when developing and debugging for its debug features but I ship code built against msvcrt.

  6. @David All great ideas. Is there a publicly available Delphi MM stub for Hoard and msvcrt?

  7. Donating would be OK, but it would also mean that the sleeping guys at Embarcadero would get a free (for them) boost for their expensive product. I would probably prefer to buy a commercial FastMM with a license that forbids Delphi from using it as its default MM. IMHO they should fire DevRel wholesale and use that money to pay for development like this.

  8. As Pierre mentions with the NeverSleepOnThreadContention switch, this may have a positive effect only with a *low* (i.e. lower than 2) thread-to-core ratio. You said you had 9 threads with 2 cores, which means that performance hits are to be expected.

    BTW: Without "SwitchToThread" you would basically have busy-waiting; with it you would switch to threads on the *same* core only. In other words, the use cases for these options are very, very specific!

  9. @Olaf while I agree with you, I have to point out that in many cases you don't know how many threads will run on how many cores. My application could run 3 threads on 8 cores, or 6 threads on 4 cores, or any other combination depending on the configured input channels and the motherboard. As FastMM doesn't allow for runtime configuration, I would have to distribute two versions of the exe and then dynamically load one of them depending on the configuration.

    As this is ugly and hard to maintain, I'd be happy to run a version that performs well on recommended configurations (we recommend two cores for one input (= three active threads)) while failing gracefully on configurations with a lower core/thread ratio. I thought that NeverSleepOnThreadContention would give me that but, alas, I could hardly call spinning in a tight loop for 5 seconds a "graceful failover".

  10. We performed many memory-intensive tests of FastMM vs. TopMM. TopMM seems to be a possible replacement for FastMM in a multithreaded app running on a modern machine. TopMM is a little slower for single-threaded memory operations, but it scales very well under heavy multithreaded operations. Yes, it takes more memory, but with at least 2 GB of RAM on every workstation I don't think it's such a big problem. But still, I really want Pierre to improve FastMM's scaling for multithreading.

    PS As for Hoard, it is not free for commercial use. And there seems to be no Delphi stub for it. Can any C developer help with bringing Hoard and Delphi together?

  11. Careful. Hoard is GPL, which makes it worthless for most Delphi development since so many good and widely-used Delphi libraries are MPL licensed. Getting around that requires a commercial license for Hoard, which will run you hundreds of dollars.

  12. @David - did you try jemalloc or TCMalloc?

    http://www.canonware.com/jemalloc/
    http://code.google.com/p/google-perftools/

  13. I'm a little confused...

    The first performance graph was also obtained with FastMM, yes ?

    And you were happy with that performance, yes ?

    If I read it correctly, the problem came from using a conditional define which identified certain constraints for its effective use, and in this particular case those constraints were exceeded. You said yourself you do not know what configurations your code might encounter at runtime, but this conditional define was known to be applicable to only a subset of those possible configurations.

    It seems to me that you simply should not have been using the conditional define in the first place.


    Although it may not be applicable in your specific case, your post reminded me of a system I worked on many years ago where we found that we had to provide a mechanism for "tuning" our application via configuration files to get the best performance on different hardware we encountered.

    I started explaining it all here, but then realised it was far too meaty and would make for a good blog post instead. :)

    http://www.deltics.co.nz/blog/?p=807

    Like I say, I'm not saying this is absolutely what you should be doing or that it is relevant in your case.

  14. >>>>It seems to me that you simply should not have been using the conditional define in the first place.

    That's right, but without this define FastMM _does_not_scale_ at all, i.e. it degrades to almost single-threaded performance under intensive multithreaded memory operations because of the locking mechanism it uses. TopMM uses per-thread memory pools, allowing it to avoid locking.

  15. @Jolyon, basically, yes. And now I'm warning other people that they should not use NeverSleepOnThreadContention too.

  16. For msvcrt I published my barebones MM wrapper on SO: http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager/6072362#6072362

    For Hoard you build a C DLL which links against Hoard and then export malloc/realloc/free. Wrap it in Delphi the same way as for msvcrt.
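
    A minimal sketch of what such a wrapper can look like follows. It is not the code from the linked answer. The unit name MsvcrtMM and the Crt* function names are made up, and the sketch assumes Delphi XE2 or later, where the TMemoryManagerEx callbacks take NativeInt sizes (older versions use Integer). For a Hoard DLL built as described, only the CRuntime constant would change.

    ```pascal
    unit MsvcrtMM; // hypothetical unit name

    interface

    implementation

    const
      CRuntime = 'msvcrt.dll';

    // C runtime allocation routines exported by msvcrt.dll.
    function malloc(Size: NativeUInt): Pointer; cdecl; external CRuntime;
    function calloc(Count, Size: NativeUInt): Pointer; cdecl; external CRuntime;
    function realloc(P: Pointer; Size: NativeUInt): Pointer; cdecl; external CRuntime;
    procedure free(P: Pointer); cdecl; external CRuntime;

    function CrtGetMem(Size: NativeInt): Pointer;
    begin
      Result := malloc(Size);
    end;

    function CrtFreeMem(P: Pointer): Integer;
    begin
      free(P);
      Result := 0; // 0 tells the RTL that the block was released
    end;

    function CrtReallocMem(P: Pointer; Size: NativeInt): Pointer;
    begin
      Result := realloc(P, Size);
    end;

    function CrtAllocMem(Size: NativeInt): Pointer;
    begin
      Result := calloc(1, Size); // zero-filled, as AllocMem requires
    end;

    function CrtIgnoreLeak(P: Pointer): Boolean;
    begin
      Result := False; // expected-leak registration is not supported here
    end;

    const
      CrtMemoryManager: TMemoryManagerEx = (
        GetMem: CrtGetMem;
        FreeMem: CrtFreeMem;
        ReallocMem: CrtReallocMem;
        AllocMem: CrtAllocMem;
        RegisterExpectedMemoryLeak: CrtIgnoreLeak;
        UnregisterExpectedMemoryLeak: CrtIgnoreLeak
      );

    initialization
      // Must run before anything allocates through the default manager.
      SetMemoryManager(CrtMemoryManager);

    end.
    ```

    The unit has to be the very first entry in the uses clause of the project file, so that no memory is allocated by one manager and freed by another.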

  17. @David: I tried this solution for msvcrt, but it was extremely slow (10 or more times slower than FastMM). Maybe I did something wrong.

  18. Because of this article, I looked around and found out that even though Intel's Threading Building Blocks allocator performs quite well, there's another allocator that beats it hands down: "Scalable High Speed Heap Project". See http://www.velocityreviews.com/forums/t745415-a-fast-malloc-free-implementation-and-benchmarks.html and http://arstechnica.com/civis/viewtopic.php?f=20&t=1136974&start=0 for details, and you can find the project here: http://sourceforge.net/projects/shshp/

    Will do some tests with it - but I'm already wondering how it will perform under 64 bit W7, compared to 'just' using msvcrt's malloc...

  19. @Anton What code were you using to test?

  20. I've been using TopMM in a commercial project for 3 weeks and I like it :) During this time I’ve made two modifications:
    1) Removed the references to the SysUtils and Classes units from TopMM. This is needed because SysUtils allocates memory from the heap (such as the strings in procedure GetFormatSettings) using the default memory manager. When other code later tries to change this memory (change the format settings) with TopMM installed, an Access Violation happens.
    2) Added a condition to activate TopMM only when the computer has 512+ MB of RAM and 2+ CPU cores. I chose these parameters for my application because TopMM performs worse on a 1-CPU system and it takes more memory. It required just one change in the TopInstall unit:
    function NeedInstall: Boolean; // <- my function
    var
      vMemory: TMemoryStatusEx;
    begin
      Result := CPUCount > 1;
      if Result then
      begin
        FillChar(vMemory, SizeOf(vMemory), 0);
        vMemory.dwLength := SizeOf(vMemory);
        if not GlobalMemoryStatusEx(vMemory) then
          Assert(False);
        Result := vMemory.ullTotalPhys >= 512 * 1024 * 1024;
      end;
    end;

    initialization
      if NeedInstall then
      begin
        {$IFDEF TOPDEBUG}
        PatchINT3;
        {$ENDIF}
        TopMMInstall;
      end;

    If anybody needs these changes, write to me at altush at yandex.ru

  21. @David: the code was from a real-time app. It was intensively creating and freeing relatively small objects (<100 bytes) and performing string operations from multiple threads.

  22. Leslie 20:48

    Has anyone tested Nexus MM?

  23. Anonymous 07:42

    FastMM4 has improved since 2010, but it still has room for improvement when it comes to multi-threaded applications.

    I have made a fork to improve multi-threaded work of FastMM4. See https://github.com/maximmasiutin/FastMM4

    Here is a comparison of the original FastMM4 version 4.992, with default
    options, compiled for Win64 by Delphi 10.2 Tokyo (Release with Optimization),
    and the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch
    is more than twice as fast as the original FastMM4. The tests
    were run on two different computers: one with a Xeon E6-2543v2 with 2 CPU
    sockets, each with 6 physical cores (12 logical threads), with only 5 physical
    cores per socket enabled for the test application. The other test was done on
    an i7-7700K CPU.

    Used the "Multi-threaded allocate, use and free" and "NexusDB"
    test cases from the FastCode Challenge Memory Manager test suite,
    modified to run under 64-bit.



                            Xeon E6-2543v2 2*CPU       i7-7700K CPU
                            (allocated 20 logical      (allocated 8 logical
                            threads, 10 physical       threads, 4 physical
                            cores, NUMA)               cores)

                             Orig.  AVX-br.   Ratio     Orig.  AVX-br.   Ratio
                            ------  -------  ------    ------  -------  ------
    02-threads realloc       96552    59951  62.09%     65213    49471  75.86%
    04-threads realloc       97998    39494  40.30%     64402    47714  74.09%
    08-threads realloc       98325    33743  34.32%     64796    58754  90.68%
    16-threads realloc      116708    45855  39.29%     71457    60173  84.21%
    16-threads realloc      116273    45161  38.84%     70722    60293  85.25%
    31-threads realloc      122528    53616  43.76%     70939    62962  88.76%
    64-threads realloc      137661    54330  39.47%     73696    64824  87.96%
    NexusDB 02 threads      122846    90380  73.72%     79479    66153  83.23%
    NexusDB 04 threads      122131    53103  43.77%     69183    43001  62.16%
    NexusDB 08 threads      124419    40914  32.88%     64977    33609  51.72%
    NexusDB 12 threads      181239    55818  30.80%     83983    44658  53.18%
    NexusDB 16 threads      135211    62044  43.61%     59917    32463  54.18%
    NexusDB 31 threads      134815    48132  33.46%     54686    31184  57.02%
    NexusDB 64 threads      187094    57672  30.25%     63089    41955  66.50%


    (the tests have been done on 14-Jul-2017)
