> There is no garbage collector thread or something similar in some
> worker thread. But maybe something similar could be done in your
> code.

No, there is nothing of the kind in the application.

> You might experiment with a tradeoff speed vs memory usage. The
> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> see 'man mallopt' for other options.

MALLOC_ARENA_MAX=1 made a huge difference. The initial memory
allocation went down by an order of magnitude. In fact, I do not see
much of an application slowdown, but this will need more profiling.
The steady allocation growth is still ~2 MB/second.

I am going to investigate your idea about long-lived objects causing
contention and memory fragmentation. This sounds very plausible, even
though I do not see real memory leaks even after all the aux threads
have died. I have TLS instances in use; maybe those really do get in
the way.
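For reference, the same knob can also be set from inside the program
with mallopt(), and malloc_trim() can be used to ask glibc to hand free
heap memory back to the kernel. A rough, untested sketch of such an
experiment (the dedicated thread and the 10-second interval are
arbitrary placeholders, not a recommendation):

/* Pin the malloc arena count and periodically trim the heap.
 * Build with: gcc -pthread trim.c   (file name is just an example) */
#include <malloc.h>     /* mallopt, M_ARENA_MAX, malloc_trim */
#include <pthread.h>
#include <unistd.h>

static void *trim_loop(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(10);
        malloc_trim(0);   /* release free heap memory back to the kernel */
    }
    return NULL;
}

int main(void)
{
    /* Same effect as exporting MALLOC_ARENA_MAX=1, provided it runs
     * before the worker threads start allocating. */
    mallopt(M_ARENA_MAX, 1);

    pthread_t t;
    pthread_create(&t, NULL, trim_loop, NULL);

    /* ... start the real application threads here ... */

    pthread_join(t, NULL);
    return 0;
}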
Thanks a lot for your help. If/when I find something new or
interesting, I will send an update - hope it will help someone else,
too.

Regards,
- Nikolay

On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
> Em ter., 7 de fev. de 2023 às 12:07, Nikolay Shustov via Libc-alpha
> escreveu:
>> Hi,
>> I have a question about the malloc() behavior which I observe.
>> The synopsis is that during stress load, the application
>> aggressively allocates virtual memory without any upper limit.
>> Just to note, after the application has been through the peak of
>> activity and goes idle, its virtual memory doesn't scale back (I do
>> not expect much of that, though - should I?).
> There is no garbage collector thread or something similar in some
> worker thread. But maybe something similar could be done in your
> code.
>
>> The application is heavily multithreaded; at the peak of its activity
>> it creates new threads and destroys them at a pace of approx. 100/second.
>> After a long and tedious investigation I dare to say that there are no
>> memory leaks involved.
>> (Well, there were memory leaks and I first went after those; found and
>> fixed - but the result did not change much.)
> You might experiment with a tradeoff speed vs memory usage. The
> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> see 'man mallopt' for other options.
>
>> The application is cross-platform and runs on Windows and some other
>> platforms too.
>> There is an OS abstraction layer that provides a unified thread and
>> memory allocation API for the business logic, but the business logic
>> that triggers memory allocations is platform-independent.
>> There are no progressive memory allocations in the OS abstraction
>> layer which could be blamed for the memory growth.
>>
>> The thing is, on Windows, for the same activity there is no such
>> application memory growth at all.
>> It allocates memory moderately and scales back after the peak of
>> activity.
>> This makes me think it is not the business logic that is to blame
>> (to the extent that it does not leak memory).
>>
>> I used valgrind to profile for memory leaks and heap usage.
>> Please see the massif outputs attached (some callstacks had to be
>> trimmed out).
>> I am also attaching the memory map for the application (run without
>> valgrind); the snapshot is taken after all threads but main were
>> destroyed and the application is idle.
>>
>> The pace of the virtual memory growth is not quite linear.
> Most likely there are long lived objects doing contention and also
> probably memory fragmentation, preventing returning memory to
> the system after a free call.
>
>> From my observation, it allocates a big chunk at the beginning of the
>> peak load, then after some time starts to grow in steps of ~80 MB / 10
>> seconds, then after some more time starts to grow steadily at a pace
>> of ~2 MB/second.
>>
>> Some stats from the host:
>>
>> OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
>>
>> ldd --version
>>
>> ldd (GNU libc) 2.17
>> Copyright (C) 2012 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> Written by Roland McGrath and Ulrich Drepper.
>>
>> uname -a
>>
>> Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16
>> 10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>
>> At peak load, the number of application threads is ~180.
>> If the application is left running, I did not observe it hit any
>> maximum virtual memory threshold; it eventually ends up hitting the
>> ulimit.
>>
>> My questions are:
>>
>> - Is this memory growth an expected behavior?
> It should eventually stabilize. But it is possible that some allocation
> pattern is causing both fragmentation and long lived objects preventing
> consolidation of memory chunks.
>
>> - What can be done to prevent it from happening?
> A first approach is MALLOC_ARENA_MAX. After that, some coding
> patterns might help, for example, having large long lived objects
> allocated from the same thread, preferably at startup.
> You can also attempt to cache some memory, but note that caching is
> also an easy way to get contention. To avoid this, you could use
> buffers obtained from mmap.
>
> Depending on your code, you can also experiment with jemalloc or
> tcmalloc. I would suggest tcmalloc, as its main feature is to work
> in multithreaded environments:
>
> https://gperftools.github.io/gperftools/tcmalloc.html
>
> Glibc newer than 2.17 has a per-thread cache, but the issue you
> are experiencing is not malloc being slow, but memory usage. AFAIK
> tcmalloc has a kind of garbage collector, but it should not be much
> different from glibc's consolidation logic; it should only run during
> free, and if there is some contention, it might not be able to
> release memory.
>
>> Thanks in advance,
>> - Nikolay
> Thanks!
>
> Paulo
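P.S. In case it helps someone else who finds this thread later: my
reading of the "buffers obtained from mmap" suggestion above is roughly
the sketch below. It is untested and the names and sizes are made up;
the idea is that each worker thread keeps one private scratch region
taken straight from the kernel, so its short-lived allocations never
touch the malloc arenas, and the whole region goes back to the kernel
when the thread exits.

/* Hypothetical per-thread scratch buffer backed directly by mmap. */
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS */
#include <stddef.h>
#include <sys/mman.h>

#define SCRATCH_SIZE (4u * 1024u * 1024u)   /* 4 MiB per worker thread */

struct scratch {
    unsigned char *base;
    size_t         used;
};

/* Map one anonymous region for the calling thread. */
static int scratch_init(struct scratch *s)
{
    void *p = mmap(NULL, SCRATCH_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    s->base = p;
    s->used = 0;
    return 0;
}

/* Simple bump allocation; individual frees are not supported. */
static void *scratch_alloc(struct scratch *s, size_t n)
{
    n = (n + 15) & ~(size_t)15;              /* keep 16-byte alignment */
    if (n > SCRATCH_SIZE - s->used)
        return NULL;
    void *p = s->base + s->used;
    s->used += n;
    return p;
}

/* On thread exit the whole region is returned to the kernel at once,
 * so nothing is left behind to fragment the malloc heap. */
static void scratch_destroy(struct scratch *s)
{
    munmap(s->base, SCRATCH_SIZE);
    s->base = NULL;
    s->used = 0;
}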