Hello Carlos,

many thanks for your support.

On 11/25/21 7:20 PM, Carlos O'Donell wrote:
> How many cpus does the system have?

We have 8 CPUs.

> How many threads do you create?

Our computations run in 2 threads. However, the lifetime of both
threads is limited to the duration of the computation: both threads
exit after the calculations are complete. The next round of
computations is performed by 2 other, newly started threads.

> Is this 10GiB of RSS or VSS?

It is 10 GiB of RSS memory. That is, the calculations allocate 10 GiB
of memory each time, which is then used and therefore becomes resident
(RSS) memory. After the calculations are done, we free() this memory
again. At this point, the memory is returned to the glibc allocator.

> This coalescing and freeing is prevented if there are in-use chunks in the heap.
>
> Consider this scenario:
> - Make many large allocations that have a short lifetime.
> - Make one small allocation that has a very long lifetime.
> - Free all the large allocations.
>
> The heap cannot be freed downwards because of the small long-lifetime allocation.
>
> The call to malloc_trim() walks the heap chunks and frees page-sized chunks or
> larger without the requirement that they come from the top of the heap.
>
> In glibc's allocator, mixing lifetimes for allocations will cause heap growth.

I think that is exactly what is happening in our case. Thanks for the
explanation!

> I have an important question to ask now:
>
> Do you use aligned allocations?
>
> We have right now an outstanding defect where aligned allocations create small
> residual free chunks, and when free'd back and allocated again as an aligned
> chunk, we are forced to split chunks again, which can lead to ratcheting effects
> with certain aligned allocations.
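For reference, the scenario you describe can be reproduced with a small
sketch like the one below (function name and sizes are illustrative, not
from our real workload): many short-lived ~512 KiB blocks, one small
long-lived block above them, then a trim.

```c
/* Sketch of the mixed-lifetime scenario: the small in-use chunk keeps
 * the heap from shrinking downwards after the large blocks are freed,
 * until malloc_trim() returns the freed pages to the kernel.
 * mixed_lifetime_demo() is a hypothetical name for this experiment. */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

/* Returns the result of malloc_trim(): 1 if memory was released back
 * to the system, 0 otherwise. */
int mixed_lifetime_demo(void)
{
    enum { NLARGE = 256, LARGE_SZ = 512 * 1024 };
    static char *large[NLARGE];

    /* Many large allocations with a short lifetime. */
    for (int i = 0; i < NLARGE; i++) {
        large[i] = malloc(LARGE_SZ);
        if (large[i] != NULL)
            memset(large[i], 0, LARGE_SZ);   /* touch it so it counts as RSS */
    }

    /* One small allocation with a very long lifetime. */
    char *small = malloc(64);

    /* Free all the large allocations.  The top of the heap cannot be
     * freed downwards past the small in-use chunk, so RSS stays high. */
    for (int i = 0; i < NLARGE; i++)
        free(large[i]);

    /* malloc_trim() walks the heap and frees page-sized or larger free
     * runs regardless of their position in the heap. */
    int trimmed = malloc_trim(0);

    free(small);
    return trimmed;
}
```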
> We had a prototype patch for this in Fedora in 2019:
> https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/2PCHP5UWONIOAEUG34YBAQQYD7JL5JJ4/

No, the 512 KiB allocations for the computation are not aligned; we
just request them using malloc(). The application is a Java application
that runs some native C++ code, and I don't know whether Java allocates
some aligned memory. But the vast majority of allocations are ~512 KiB,
and these are not aligned.

>> And then we also have one other problem. The first run of the
>> computations is always fine: we allocate 10 GB of memory and the
>> application grows to 10 GB. Afterwards, we release those 10 GB of memory
>> since the computations are now done and at this point the freed memory
>> is returned back to the allocator (however, the size of the process
>> remains 10 GB unless we call malloc_trim()). But if we now re-run the
>> same computations again a second time (this time using different
>> threads), a problem occurs. In this case, the size of the application
>> grows well beyond 10 GB. It can get 20 GB or larger and the process is
>> eventually killed because the system runs out of memory.
>
> You need to determine what is going on under the hood here.
>
> You may want to just use malloc_info() to get a routine dump of the heap state.
>
> This will give us a starting point to see what is growing.

To make it easier to run multiple rounds of calculations, I have now
modified the code a bit so that only ~5 GiB of memory is allocated each
time we perform the computations. The 5 GiB are still allocated in
chunks of about 512 KiB. After the calculations, all this memory is
free()'ed again. After running the calculations for the first time, we
see that the application consumes about 5 GiB of RSS memory.
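In case it is useful, this is roughly how we capture the dumps after
each round (a minimal helper; `dump_heap_state` and the filename are
just our naming, the real capture code is slightly more involved):

```c
/* Dump the glibc allocator state (all arenas, as XML) to a file,
 * one file per computation round. */
#include <malloc.h>
#include <stdio.h>

/* Returns 0 on success, -1 on failure (like malloc_info itself). */
static int dump_heap_state(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    /* 0 is currently the only defined value for the options argument. */
    int rc = malloc_info(0, f);

    fclose(f);
    return rc;
}
```

We call something like dump_heap_state("after_run_no_1.txt") at the end
of each iteration; those are the attached traces.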
But our problem is that if we run the computations a second and a third
time, the memory usage increases even beyond 5 GiB, even though we
release all memory that the calculation consumed after each iteration.
After running the same workload 12 times, our processing workstation
runs out of memory and gets very slow.

In detail, the memory consumption after each round of calculations is
captured in the table below.

  After Iteration   Memory Consumption
        1               5.13 GiB
        2               8.9  GiB
        3              10.48 GiB
        4              14.48 GiB
        5              18.11 GiB
        6              16.03 GiB
       ...                ...
       12              21.79 GiB

As you can see, the RSS memory usage of our application increases
continuously, especially during the first few rounds of calculations.
Our expectation would be that the RSS usage remains at 5 GiB, as the
computations only allocate about 5 GiB of memory each time. After
running the computations 12 times, the glibc allocator caches have
grown to over 20 GiB. All this memory can only be reclaimed by calling
malloc_trim().

I have also attached the traces from malloc_info() after each iteration
of the computations. The first trace ("after_run_no_1.txt") was
captured after running the computations once and shows a relatively low
memory usage of 5 GiB. But in the subsequent traces, memory consumption
increases.

Our application has 64 arenas (eight times the eight CPUs). As I
mentioned, the lifetime of the 2 computation threads is limited to the
computation itself; the threads are restarted with each run of the
computations. Once a computation starts, each thread is assigned to an
arena. Could it be that these two threads are assigned to different
arenas on each run of the computations, and could that explain why the
glibc allocator caches grow from run to run?
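If the arena assignment is indeed the cause, one experiment we could
try is capping the number of arenas so that the short-lived worker
threads are forced to reuse the same few heaps. A minimal sketch
(`cap_arenas` is a hypothetical helper name; the cap of 2 is just an
example matching our 2 worker threads):

```c
/* Limit the glibc allocator to a small number of arenas so successive
 * generations of short-lived threads reuse them instead of touching a
 * fresh arena each run.  Must be called before the threads make their
 * first allocations. */
#include <malloc.h>

/* mallopt() returns 1 on success, 0 on error. */
int cap_arenas(void)
{
    return mallopt(M_ARENA_MAX, 2);
}
```

The same effect should be achievable without a code change by setting
the MALLOC_ARENA_MAX environment variable before starting the JVM,
which may be easier to test on our side.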
> We have a malloc allocation tracer that you can use to capture a workload and
> share a snapshot of the workload with upstream:
> https://pagure.io/glibc-malloc-trace-utils
>
> Sharing the workload might be hard because this is a full API trace and it gets
> difficult to share.

For now, I haven't done that yet (as it would be difficult to share,
just as you said). I hope that the malloc_info() traces already give a
picture of what is happening. But if we need them, I will be happy to
capture these traces.

Best regards,
   Christian