Hello Carlos,

many thanks for your support.

On 11/25/21 7:20 PM, Carlos O'Donell wrote:
> How many cpus does the system have?

We have 8 CPUs.

> How many threads do you create?

Our computations run in 2 threads. However, the lifetime of both
threads is limited to the duration of the computation: both threads
exit after the calculations are complete. The next round of
computations is performed by 2 other, newly started threads.

> Is this 10GiB of RSS or VSS?

It is 10 GiB of RSS memory. That is, the calculations allocate 10 GiB
of memory each time, which is then used and therefore becomes resident
(RSS) memory. After the calculations are done, we free() this memory
again. At this point, the memory is returned to the glibc allocator.

> This coalescing and freeing is prevented if there are in-use chunks in the heap.
>
> Consider this scenario:
> - Make many large allocations that have a short lifetime.
> - Make one small allocation that has a very long lifetime.
> - Free all the large allocations.
>
> The heap cannot be freed downwards because of the small long-lifetime allocation.
>
> The call to malloc_trim() walks the heap chunks and frees page-sized chunks or
> larger without the requirement that they come from the top of the heap.
>
> In glibc's allocator, mixing lifetimes for allocations will cause heap growth.

I think that is exactly what is happening in our case. Thanks for the
explanation!

> I have an important question to ask now:
>
> Do you use aligned allocations?
>
> We have right now an outstanding defect where aligned allocations create small
> residual free chunks, and when free'd back and allocated again as an aligned
> chunk, we are forced to split chunks again, which can lead to ratcheting effects
> with certain aligned allocations.
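For reference, the scenario you describe can be reproduced with a small
sketch like the one below (function name and sizes are illustrative, not
from our real workload): many short-lived ~512 KiB blocks, one small
long-lived block above them, then a trim.

```c
/* Sketch of the mixed-lifetime scenario: the small in-use chunk keeps
 * the heap from shrinking downwards after the large blocks are freed,
 * until malloc_trim() returns the freed pages to the kernel.
 * mixed_lifetime_demo() is a hypothetical name for this experiment. */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

/* Returns the result of malloc_trim(): 1 if memory was released back
 * to the system, 0 otherwise. */
int mixed_lifetime_demo(void)
{
    enum { NLARGE = 256, LARGE_SZ = 512 * 1024 };
    static char *large[NLARGE];

    /* Many large allocations with a short lifetime. */
    for (int i = 0; i < NLARGE; i++) {
        large[i] = malloc(LARGE_SZ);
        if (large[i] != NULL)
            memset(large[i], 0, LARGE_SZ);   /* touch it so it counts as RSS */
    }

    /* One small allocation with a very long lifetime. */
    char *small = malloc(64);

    /* Free all the large allocations.  The top of the heap cannot be
     * freed downwards past the small in-use chunk, so RSS stays high. */
    for (int i = 0; i < NLARGE; i++)
        free(large[i]);

    /* malloc_trim() walks the heap and frees page-sized or larger free
     * runs regardless of their position in the heap. */
    int trimmed = malloc_trim(0);

    free(small);
    return trimmed;
}
```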
> We had a prototype patch for this in Fedora in 2019:
> https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/2PCHP5UWONIOAEUG34YBAQQYD7JL5JJ4/

No, the 512 KiB allocations for the computation are not aligned; we
just request them using malloc(). The application is a Java application
that runs some native C++ code, and I don't know whether Java allocates
some aligned memory. But the vast majority of allocations are ~512 KiB,
and these are not aligned.

>> And then we also have one other problem. The first run of the
>> computations is always fine: we allocate 10 GB of memory and the
>> application grows to 10 GB. Afterwards, we release those 10 GB of memory
>> since the computations are now done and at this point the freed memory
>> is returned back to the allocator (however, the size of the process
>> remains 10 GB unless we call malloc_trim()). But if we now re-run the
>> same computations again a second time (this time using different
>> threads), a problem occurs. In this case, the size of the application
>> grows well beyond 10 GB. It can get 20 GB or larger and the process is
>> eventually killed because the system runs out of memory.
>
> You need to determine what is going on under the hood here.
>
> You may want to just use malloc_info() to get a routine dump of the heap state.
>
> This will give us a starting point to see what is growing.

To make it easier to run multiple rounds of calculations, I have now
modified the code a bit so that only ~5 GiB of memory is allocated each
time we perform the computations. The 5 GiB are still allocated in
chunks of about 512 KiB. After the calculations, all this memory is
free()'ed again. After running the calculations for the first time, we
see that the application consumes about 5 GiB of RSS memory.
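In case it is useful, this is roughly how we capture the dumps after
each round (a minimal helper; `dump_heap_state` and the filename are
just our naming, the real capture code is slightly more involved):

```c
/* Dump the glibc allocator state (all arenas, as XML) to a file,
 * one file per computation round. */
#include <malloc.h>
#include <stdio.h>

/* Returns 0 on success, -1 on failure (like malloc_info itself). */
static int dump_heap_state(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    /* 0 is currently the only defined value for the options argument. */
    int rc = malloc_info(0, f);

    fclose(f);
    return rc;
}
```

We call something like dump_heap_state("after_run_no_1.txt") at the end
of each iteration; those are the attached traces.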
But our problem is that if we run the computations a second and a third
time, the memory usage increases even beyond 5 GiB, even though we
release all memory that the calculation consumed after each iteration.
After running the same workload 12 times, our processing workstation
runs out of memory and gets very slow.

In detail, the memory consumption after each round of calculations is
captured in the table below.

  After Iteration   Memory Consumption
        1               5.13 GiB
        2               8.9  GiB
        3              10.48 GiB
        4              14.48 GiB
        5              18.11 GiB
        6              16.03 GiB
       ...                ...
       12              21.79 GiB

As you can see, the RSS memory usage of our application increases
continuously, especially during the first few rounds of calculations.
Our expectation would be that the RSS usage remains at 5 GiB, as the
computations only allocate about 5 GiB of memory each time. After
running the computations 12 times, the glibc allocator caches have
grown to over 20 GiB. All this memory can only be reclaimed by calling
malloc_trim().

I have also attached the traces from malloc_info() after each iteration
of the computations. The first trace ("after_run_no_1.txt") was
captured after running the computations once and shows a relatively low
memory usage of 5 GiB. But in the subsequent traces, memory consumption
increases.

Our application has 64 arenas (eight times the eight CPUs). As I
mentioned, the lifetime of the 2 computation threads is limited to the
computation itself; the threads are restarted with each run of the
computations. Once a computation starts, each thread is assigned to an
arena. Could it be that these two threads are assigned to different
arenas on each run of the computations, and could that explain why the
glibc allocator caches grow from run to run?
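If the arena assignment is indeed the cause, one experiment we could
try is capping the number of arenas so that the short-lived worker
threads are forced to reuse the same few heaps. A minimal sketch
(`cap_arenas` is a hypothetical helper name; the cap of 2 is just an
example matching our 2 worker threads):

```c
/* Limit the glibc allocator to a small number of arenas so successive
 * generations of short-lived threads reuse them instead of touching a
 * fresh arena each run.  Must be called before the threads make their
 * first allocations. */
#include <malloc.h>

/* mallopt() returns 1 on success, 0 on error. */
int cap_arenas(void)
{
    return mallopt(M_ARENA_MAX, 2);
}
```

The same effect should be achievable without a code change by setting
the MALLOC_ARENA_MAX environment variable before starting the JVM,
which may be easier to test on our side.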
> We have a malloc allocation tracer that you can use to capture a workload and
> share a snapshot of the workload with upstream:
> https://pagure.io/glibc-malloc-trace-utils
>
> Sharing the workload might be hard because this is a full API trace and it gets
> difficult to share.

For now, I haven't done that yet (as it would be difficult to share,
just as you said). I hope that the malloc_info() traces already give a
picture of what is happening. But if we need them, I will be happy to
capture these traces.

Best regards,
   Christian