public inbox for libc-alpha@sourceware.org
From: Nikolay Shustov <nikolay.shustov@gmail.com>
To: "Paulo César Pereira de Andrade"
	<paulo.cesar.pereira.de.andrade@gmail.com>
Cc: Ben Woodard <woodard@redhat.com>, libc-alpha@sourceware.org
Subject: Re: GLIBC malloc behavior question
Date: Tue, 7 Feb 2023 18:41:34 -0500	[thread overview]
Message-ID: <032010f8-1df3-4d97-810f-fd77c92b5de5@gmail.com> (raw)
In-Reply-To: <CAHAq8pGG+aqUgGKcTHFRTqWy0KV+QRiuoHZKYFKA3zMsFUmQKg@mail.gmail.com>


Makes sense, thank you.

On 2/7/23 16:38, Paulo César Pereira de Andrade wrote:
> Em ter., 7 de fev. de 2023 às 17:56, Nikolay Shustov
> <nikolay.shustov@gmail.com>  escreveu:
>> I was able to increase the ratio of thread reuse in the application (i.e. reuse a running thread instead of destroying it and creating a new one), and it seems to have a positive effect on the amount of allocated memory.
>> Valgrind reports some small memory leaks - and I know from experience these reports are bogus, as they stem from its inability to detect thread-local storage key destructor invocations.
>> I was able to ensure these are invoked as they are supposed to be. The amount of reported leaks is really small and could in no way account for gigabytes of VIRT being held in allocations.
>>
>> I experimented more and at this stage most of my suspicions are towards heap fragmentation.
>>
>> However, if all the used memory has been released (as I expect it to be once all but the main thread have exited), should I have expected the mmaped regions to be released and the process virtual memory size to shrink?
>    For glibc-2.17, what I see when free is called:
> 1. If not in the main arena, it will munmap a large chunk that was
> allocated with mmap.
> 2. If in the main arena, it can trim the heap with sbrk and a negative
>     argument when the merged free memory at the top is 128K or larger.
> 3. If not in the main arena, it will call madvise(addr, size, MADV_DONTNEED)
>     if it finds 64K of contiguous unused memory.
>
>    In the first two cases (munmap or sbrk), the process virtual size
> should decrease; madvise(MADV_DONTNEED) keeps the mapping but releases
> the resident pages.
>
>> Or GLIBC would do it upon some special event? (can it be forced somehow?)
>    You can call malloc_trim().
>
>    You can also use a smaller value for MALLOC_MMAP_THRESHOLD_ to
> have more memory allocated/released with mmap. This value should not be
> too small. Basically, tell it to use mmap for large blocks. The default is 128k.
>
>    Fragmentation usually happens when objects of different sizes are
> allocated and, due to the resulting memory layout, too many small objects
> not released with free prevent finding contiguous free blocks.
>
>> Thanks,
>> - Nikolay
>>
>>
>> On 2/7/23 13:01, Nikolay Shustov wrote:
>>
>> Got it, thanks.
>> For now, I am using this tunable merely to help surface whatever I might have missed in terms of long-lived objects and heap fragmentation.
>> But I am definitely going to play with it when I am reasonably sure that it is what does the major impact.
>>
>> On 2/7/23 12:26, Ben Woodard wrote:
>>
>>
>> On 2/7/23 08:55, Nikolay Shustov via Libc-alpha wrote:
>>
>>   There is no garbage collector thread or something similar in some
>> worker thread. But maybe something similar could be done in your
>> code.
>>
>>
>> No, there is nothing of the kind in the application.
>>
>> You might experiment with a tradeoff speed vs memory usage. The
>> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1
>> see 'man mallopt' for other options.
>>
>>
>> MALLOC_ARENA_MAX=1 made a huge difference.
>>
>> I just wanted to point out that it isn't 1 or the default. That was most likely a simple test of a hypothesis about what could be going wrong. This is a tunable knob, and your application could have a sweet spot. For some of the applications that I help support, we have empirically found that a good number is slightly lower than the number of processors the system has; e.g., if there are 16 cores, giving it 12 arenas doesn't impact speed but makes the memory footprint more compact.
>>
>> The initial memory allocations went down by an order of magnitude.
>> In fact, I do not see that much of the application slowdown but this will need more profiling.
>> The stable allocations growth is still ~2Mb/second.
>>
>> I am going to investigate your idea of long living objects contention/memory fragmentation.
>>
>> This sounds very probable, even though I do not see real memory leaks even after all the aux threads have died.
>> I have TLS instances in use; maybe those really get in the way.
>>
>> Thanks a lot for your help.
>> If/when I find something new or interesting, I will send an update - hope it will help someone else, too.
>>
>> Regards,
>> - Nikolay
>>
>> On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
>>
>> Em ter., 7 de fev. de 2023 às 12:07, Nikolay Shustov via Libc-alpha
>> <libc-alpha@sourceware.org>   escreveu:
>>
>> Hi,
>> I have a question about the malloc() behavior which I observe.
>> The synopsis is that during stress load, the application
>> aggressively allocates virtual memory without any upper limit.
>> Just to note, after the application passes its peak of activity
>> and goes idle, its virtual memory doesn't scale back (I do not
>> expect much of that, though - should I?).
>>
>>     There is no garbage collector thread or something similar in some
>> worker thread. But maybe something similar could be done in your
>> code.
>>
>> The application is heavily multithreaded; at the peak of its activity
>> it creates new threads and destroys them at a pace of approx. 100/second.
>> After the long and tedious investigation I dare to say that there are no
>> memory leaks involved.
>> (Well, there were memory leaks and I first went after those; found and
>> fixed - but the result did not change much.)
>>
>>     You might experiment with a tradeoff speed vs memory usage. The
>> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1
>> see 'man mallopt' for other options.
>>
>> The application is cross-platform and runs on Windows and some other
>> platforms too.
>> There is an OS abstraction layer that provides the unified thread and
>> memory allocation API for business logic, but the business logic that
>> triggers memory allocations is platform-independent.
>> There are no progressive memory allocations in OS abstraction layer
>> which could be blamed for the memory growth.
>>
>> The thing is, on Windows, for the same activity there is no such
>> application memory growth at all.
>> It allocates memory moderately and scales back after peak of activity.
>> This makes me think it is not the business logic to be blamed (to the
>> extent of that it does not leak memory).
>>
>> I used valgrind to profile for memory leaks and heap usage.
>> Please see massif outputs attached (some callstacks had to be trimmed out).
>> I am also attaching the memory map for the application (run without
>> valgrind); snapshot is taken after all the threads but main were
>> destroyed and application is idle.
>>
>> The pace of the virtual memory growth is not quite linear.
>>
>>     Most likely there are long lived objects doing contention and also
>> probably memory fragmentation, preventing returning memory to
>> the system after a free call.
>>
>>    From my observation, it allocates a big chunk at the beginning of the
>> peak load, then after a while starts to grow in steps of ~80Mb / 10
>> seconds, and then after some time settles into steady growth of
>> ~2Mb/second.
>>
>> Some stats from the host:
>>
>>       OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
>>
>> ldd -version
>>
>>       ldd (GNU libc) 2.17
>>       Copyright (C) 2012 Free Software Foundation, Inc.
>>       This is free software; see the source for copying conditions. There
>>       is NO
>>       warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>       PURPOSE.
>>       Written by Roland McGrath and Ulrich Drepper.
>>
>> uname -a
>>
>>       Linux <skipped> 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16
>>       10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> At a peak load, the number of application threads is ~180.
>> If the application is left running, its virtual memory growth does not
>> stop at any threshold; it eventually ends up hitting the ulimit.
>>
>> My questions are:
>>
>> - Is this memory growth an expected behavior?
>>
>>     It should eventually stabilize. But it is possible that some allocation
>> pattern is causing both fragmentation and long-lived objects that prevent
>> consolidation of memory chunks.
>>
>> - What can be done to prevent it from happening?
>>
>>     The first approach is MALLOC_ARENA_MAX. After that, some coding
>> patterns might help; for example, have large long-lived objects allocated
>> from the same thread, preferably at startup.
>>     You can also attempt to cache some memory, but note that caching is an
>> easy way to get contention. To avoid this, you could use buffers obtained
>> directly from mmap.
>>
>>     Depending on your code, you can also experiment with jemalloc or
>> tcmalloc. I would suggest tcmalloc, as its main feature is to work
>> in multithreaded environments:
>>
>> https://gperftools.github.io/gperftools/tcmalloc.html
>>
>>     Glibc newer than 2.17 has a per-thread cache, but the issue you
>> are experiencing is not slowness but memory usage. AFAIK tcmalloc
>> has a kind of garbage collector, but it should not be much different from
>> glibc's consolidation logic; it only runs during free, and if there is
>> some contention, it might not be able to release memory.
>>
>> Thanks in advance,
>> - Nikolay
>>
>> Thanks!
>>
>> Paulo
>>
>>
>>

      reply	other threads:[~2023-02-07 23:41 UTC|newest]

Thread overview: 8+ messages
2023-02-07 15:06 Nikolay Shustov
2023-02-07 16:16 ` Paulo César Pereira de Andrade
2023-02-07 16:55   ` Nikolay Shustov
2023-02-07 17:26     ` Ben Woodard
2023-02-07 18:01       ` Nikolay Shustov
2023-02-07 20:56         ` Nikolay Shustov
2023-02-07 21:38           ` Paulo César Pereira de Andrade
2023-02-07 23:41             ` Nikolay Shustov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=032010f8-1df3-4d97-810f-fd77c92b5de5@gmail.com \
    --to=nikolay.shustov@gmail.com \
    --cc=libc-alpha@sourceware.org \
    --cc=paulo.cesar.pereira.de.andrade@gmail.com \
    --cc=woodard@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).