> There is no garbage collector thread or something similar in some
> worker thread. But maybe something similar could be done in your
> code.

No, there is nothing of the kind in the application.

> You might experiment with a tradeoff speed vs memory usage. The
> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> see 'man mallopt' for other options.

MALLOC_ARENA_MAX=1 made a huge difference. The initial memory
allocation went down by an order of magnitude. In fact, I do not see
much of an application slowdown, but this will need more profiling.
The steady allocation growth is still ~2 MB/second.

I am going to investigate your idea about long-lived objects causing
contention and memory fragmentation. This sounds very plausible, even
though I do not see real memory leaks even after all the aux threads
have died. I have TLS instances in use; maybe those really do get in
the way.
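For reference, the same knob can also be set from inside the program
with mallopt(), and malloc_trim() can be used to ask glibc to hand free
heap memory back to the kernel. A rough, untested sketch of such an
experiment (the dedicated thread and the 10-second interval are
arbitrary placeholders, not a recommendation):

/* Pin the malloc arena count and periodically trim the heap.
 * Build with: gcc -pthread trim.c   (file name is just an example) */
#include <malloc.h>     /* mallopt, M_ARENA_MAX, malloc_trim */
#include <pthread.h>
#include <unistd.h>

static void *trim_loop(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(10);
        malloc_trim(0);   /* release free heap memory back to the kernel */
    }
    return NULL;
}

int main(void)
{
    /* Same effect as exporting MALLOC_ARENA_MAX=1, provided it runs
     * before the worker threads start allocating. */
    mallopt(M_ARENA_MAX, 1);

    pthread_t t;
    pthread_create(&t, NULL, trim_loop, NULL);

    /* ... start the real application threads here ... */

    pthread_join(t, NULL);
    return 0;
}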
Thanks a lot for your help. If/when I find something new or
interesting, I will send an update - hope it will help someone else,
too.

Regards,
- Nikolay

On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
> Em ter., 7 de fev. de 2023 às 12:07, Nikolay Shustov via Libc-alpha
> escreveu:
>> Hi,
>> I have a question about the malloc() behavior which I observe.
>> The synopsis is that during stress load, the application
>> aggressively allocates virtual memory without any upper limit.
>> Just to note, after the application has been through the peak of
>> activity and goes idle, its virtual memory doesn't scale back (I do
>> not expect much of that, though - should I?).
> There is no garbage collector thread or something similar in some
> worker thread. But maybe something similar could be done in your
> code.
>
>> The application is heavily multithreaded; at the peak of its activity
>> it creates new threads and destroys them at a pace of approx. 100/second.
>> After a long and tedious investigation I dare to say that there are no
>> memory leaks involved.
>> (Well, there were memory leaks and I first went after those; found and
>> fixed - but the result did not change much.)
> You might experiment with a tradeoff speed vs memory usage. The
> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> see 'man mallopt' for other options.
>
>> The application is cross-platform and runs on Windows and some other
>> platforms too.
>> There is an OS abstraction layer that provides a unified thread and
>> memory allocation API for the business logic, but the business logic
>> that triggers memory allocations is platform-independent.
>> There are no progressive memory allocations in the OS abstraction
>> layer which could be blamed for the memory growth.
>>
>> The thing is, on Windows, for the same activity there is no such
>> application memory growth at all.
>> It allocates memory moderately and scales back after the peak of
>> activity.
>> This makes me think it is not the business logic that is to blame
>> (to the extent that it does not leak memory).
>>
>> I used valgrind to profile for memory leaks and heap usage.
>> Please see the massif outputs attached (some callstacks had to be
>> trimmed out).
>> I am also attaching the memory map for the application (run without
>> valgrind); the snapshot is taken after all threads but main were
>> destroyed and the application is idle.
>>
>> The pace of the virtual memory growth is not quite linear.
> Most likely there are long lived objects doing contention and also
> probably memory fragmentation, preventing returning memory to
> the system after a free call.
>
>> From my observation, it allocates a big chunk at the beginning of the
>> peak load, then after some time starts to grow in steps of ~80 MB / 10
>> seconds, then after some more time starts to grow steadily at a pace
>> of ~2 MB/second.
>>
>> Some stats from the host:
>>
>> OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
>>
>> ldd --version
>>
>> ldd (GNU libc) 2.17
>> Copyright (C) 2012 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> Written by Roland McGrath and Ulrich Drepper.
>>
>> uname -a
>>
>> Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16
>> 10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>
>> At peak load, the number of application threads is ~180.
>> If the application is left running, I did not observe it hit any
>> maximum virtual memory threshold; it eventually ends up hitting the
>> ulimit.
>>
>> My questions are:
>>
>> - Is this memory growth an expected behavior?
> It should eventually stabilize. But it is possible that some allocation
> pattern is causing both fragmentation and long lived objects preventing
> consolidation of memory chunks.
>
>> - What can be done to prevent it from happening?
> A first approach is MALLOC_ARENA_MAX. After that, some coding
> patterns might help, for example, having large long lived objects
> allocated from the same thread, preferably at startup.
> You can also attempt to cache some memory, but note that caching is
> also an easy way to get contention. To avoid this, you could use
> buffers obtained from mmap.
>
> Depending on your code, you can also experiment with jemalloc or
> tcmalloc. I would suggest tcmalloc, as its main feature is to work
> in multithreaded environments:
>
> https://gperftools.github.io/gperftools/tcmalloc.html
>
> Glibc newer than 2.17 has a per-thread cache, but the issue you
> are experiencing is not malloc being slow, but memory usage. AFAIK
> tcmalloc has a kind of garbage collector, but it should not be much
> different from glibc's consolidation logic; it should only run during
> free, and if there is some contention, it might not be able to
> release memory.
>
>> Thanks in advance,
>> - Nikolay
> Thanks!
>
> Paulo
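P.S. In case it helps someone else who finds this thread later: my
reading of the "buffers obtained from mmap" suggestion above is roughly
the sketch below. It is untested and the names and sizes are made up;
the idea is that each worker thread keeps one private scratch region
taken straight from the kernel, so its short-lived allocations never
touch the malloc arenas, and the whole region goes back to the kernel
when the thread exits.

/* Hypothetical per-thread scratch buffer backed directly by mmap. */
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS */
#include <stddef.h>
#include <sys/mman.h>

#define SCRATCH_SIZE (4u * 1024u * 1024u)   /* 4 MiB per worker thread */

struct scratch {
    unsigned char *base;
    size_t         used;
};

/* Map one anonymous region for the calling thread. */
static int scratch_init(struct scratch *s)
{
    void *p = mmap(NULL, SCRATCH_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    s->base = p;
    s->used = 0;
    return 0;
}

/* Simple bump allocation; individual frees are not supported. */
static void *scratch_alloc(struct scratch *s, size_t n)
{
    n = (n + 15) & ~(size_t)15;              /* keep 16-byte alignment */
    if (n > SCRATCH_SIZE - s->used)
        return NULL;
    void *p = s->base + s->used;
    s->used += n;
    return p;
}

/* On thread exit the whole region is returned to the kernel at once,
 * so nothing is left behind to fragment the malloc heap. */
static void scratch_destroy(struct scratch *s)
{
    munmap(s->base, SCRATCH_SIZE);
    s->base = NULL;
    s->used = 0;
}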