From: Paulo César Pereira de Andrade
Date: Tue, 7 Feb 2023 18:38:55 -0300
Subject: Re: GLIBC malloc behavior question
To: Nikolay.Shustov@gmail.com
Cc: Ben Woodard, libc-alpha@sourceware.org

On Tue, Feb 7, 2023 at 17:56, Nikolay Shustov wrote:
>
> I was able to increase the ratio of thread reuse in the application
> (i.e., reuse a running thread instead of destroying it and creating a
> new one), and it seemed to have a positive effect on the amount of
> allocated memory.
> Valgrind reports some small memory leaks - and I know from experience
> these reports are bogus, as they stem from its inability to detect
> thread-local storage key destructor invocations.
> I was able to ensure these are invoked as they are supposed to be.
> The amount of the reported leaks is really small and in no way could
> contribute to the gigabytes of VIRT being held in allocations.
>
> I experimented more, and at this stage most of my suspicion is
> towards heap fragmentation.
>
> However, if all the used memory has been released (as I expect it to
> be when all threads but the main one have exited), should I have
> expected the mmapped regions to be released and the process virtual
> memory size to shrink?

For glibc-2.17, what I see is, when free is called:

1. If not the main arena, it will munmap if it is a large chunk
   allocated with mmap.
2. If the main arena, it can trim the memory with sbrk and a negative
   argument, if the top free memory, when merged, is 128K or larger.
3. If not the main arena, it will call madvise(addr, size,
   MADV_DONTNEED) if it finds 64K of contiguous unused memory.

When releasing with munmap or sbrk, the process virtual size should
decrease. madvise(MADV_DONTNEED) returns the pages to the kernel but
keeps the mapping, so it shrinks the resident size rather than the
virtual size.

> Or would GLIBC do it upon some special event? (Can it be forced
> somehow?)

You can call malloc_trim(). You can also use a smaller value for
MALLOC_MMAP_THRESHOLD_ to have more memory allocated/released with
mmap; this value should not be too small. Basically, tell it to use
mmap for large blocks. The default is 128K.

Fragmentation usually happens when allocating objects of different
sizes; due to the memory layout, with too many small objects not
released with free, malloc cannot find contiguous free blocks.
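To illustrate the malloc_trim()/MALLOC_MMAP_THRESHOLD_ idea, here is a
rough, untested sketch; the 64K threshold is only an example, not a
recommendation:

    #include <malloc.h>   /* mallopt, malloc_trim (glibc) */

    int main(void)
    {
        /* Serve allocations of 64K and larger with mmap, so freeing
         * them returns memory to the kernel immediately.  Setting the
         * threshold explicitly also disables glibc's dynamic
         * threshold adjustment. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);

        /* ... run the workload ... */

        /* After the peak of activity, ask malloc to hand free memory
         * back to the system (it trims the top of the heap and, on
         * this glibc, madvises unused pages inside the arenas). */
        malloc_trim(0);
        return 0;
    }

The same can be tried without code changes, from the environment
(yourapp standing in for the real binary):

    MALLOC_MMAP_THRESHOLD_=65536 ./yourapp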
> Thanks,
> - Nikolay
>
> On 2/7/23 13:01, Nikolay Shustov wrote:
> > Got it, thanks.
> > For now, I am using this tunable merely to help surface whatever I
> > might have missed in terms of long-lived objects and heap
> > fragmentation.
> > But I am definitely going to play with it when I am reasonably sure
> > that it is what makes the major impact.
> >
> > On 2/7/23 12:26, Ben Woodard wrote:
> > > On 2/7/23 08:55, Nikolay Shustov via Libc-alpha wrote:
> > > > > There is no garbage collector thread or something similar in
> > > > > some worker thread. But maybe something similar could be done
> > > > > in your code.
> > > >
> > > > No, there is nothing of the kind in the application.
> > > >
> > > > > You might experiment with a tradeoff of speed vs memory
> > > > > usage. The minimum memory usage should be achieved with
> > > > > MALLOC_ARENA_MAX=1; see 'man mallopt' for other options.
> > > >
> > > > MALLOC_ARENA_MAX=1 made a huge difference.
> > >
> > > I just wanted to point out that the best value isn't necessarily
> > > 1 or the default. That was most likely a simple test of a
> > > hypothesis about what could be going wrong. This is a tunable
> > > knob, and your application could have a sweet spot. For some of
> > > the applications that I help support, we have empirically found
> > > that a good number is slightly lower than the number of
> > > processors the system has, e.g. if there are 16 cores, giving it
> > > 12 arenas doesn't impact speed but makes the memory footprint
> > > more compact.
> >
> > The initial memory allocations went down by an order of magnitude.
> > In fact, I do not see that much of an application slowdown, but
> > this will need more profiling.
> > The stable allocation growth is still ~2MB/second.
> >
> > I am going to investigate your idea of long-lived objects
> > contention/memory fragmentation.
> > This sounds very probable, even though I do not see real memory
> > leaks even after all the aux threads died.
> > I have TLS instances in use; maybe those really get in the way.
> >
> > Thanks a lot for your help.
> > If/when I find something new or interesting, I will send an update
> > - hope it will help someone else, too.
> >
> > Regards,
> > - Nikolay
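Regarding the MALLOC_ARENA_MAX discussion quoted above: besides the
environment variable, the cap can also be set from the program itself.
A minimal sketch; the value 12 is just the 16-core example from above,
not a recommendation:

    #include <malloc.h>   /* mallopt, M_ARENA_MAX (glibc >= 2.10) */

    int main(void)
    {
        /* Same effect as running with MALLOC_ARENA_MAX=12 in the
         * environment: cap the number of malloc arenas.  Call this
         * early, before any threads are started. */
        mallopt(M_ARENA_MAX, 12);

        /* ... create the worker threads and run the workload ... */
        return 0;
    }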
> On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
> > On Tue, Feb 7, 2023 at 12:07, Nikolay Shustov via Libc-alpha wrote:
> > > Hi,
> > > I have a question about the malloc() behavior which I observe.
> > > The synopsis is that during the stress load, the application
> > > aggressively allocates virtual memory without any upper limit.
> > > Just to note, after the application is loaded just with the peak
> > > of activity and goes idle, its virtual memory doesn't scale back
> > > (I do not expect much of that, though - should I?).
> >
> > There is no garbage collector thread or something similar in some
> > worker thread. But maybe something similar could be done in your
> > code.
> >
> > > The application is heavily multithreaded; at the peak of its
> > > activity it creates new threads and destroys them at a pace of
> > > approx. 100/second.
> > > After the long and tedious investigation, I dare to say that
> > > there are no memory leaks involved.
> > > (Well, there were memory leaks and I first went after those;
> > > found and fixed - but the result did not change much.)
> >
> > You might experiment with a tradeoff of speed vs memory usage. The
> > minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> > see 'man mallopt' for other options.
> >
> > > The application is cross-platform and runs on Windows and some
> > > other platforms too.
> > > There is an OS abstraction layer that provides a unified thread
> > > and memory allocation API for the business logic, but the
> > > business logic that triggers memory allocations is
> > > platform-independent.
> > > There are no progressive memory allocations in the OS abstraction
> > > layer which could be blamed for the memory growth.
> > >
> > > The thing is, on Windows, for the same activity there is no such
> > > application memory growth at all.
> > > It allocates memory moderately and scales back after the peak of
> > > activity.
> > > This makes me think it is not the business logic to be blamed (to
> > > the extent that it does not leak memory).
> > >
> > > I used valgrind to profile for memory leaks and heap usage.
> > > Please see the massif outputs attached (some callstacks had to be
> > > trimmed out).
> > > I am also attaching the memory map for the application (run
> > > without valgrind); the snapshot is taken after all the threads
> > > but main were destroyed and the application is idle.
> > >
> > > The pace of the virtual memory growth is not quite linear.
> >
> > Most likely there are long-lived objects causing contention, and
> > probably also memory fragmentation, preventing memory from being
> > returned to the system after a free call.
> >
> > > From my observation, it allocates a big chunk at the beginning of
> > > the peak load, then after a while starts to grow in steps of
> > > ~80MB / 10 seconds, then after some more time starts to grow
> > > steadily at a pace of ~2MB/second.
> > >
> > > Some stats from the host:
> > >
> > > OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
> > >
> > > ldd --version
> > > ldd (GNU libc) 2.17
> > >
> > > uname -a
> > > Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16 10:19:28 UTC
> > > 2021 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > At peak load, the number of application threads is ~180.
> > > If the application is left running, I did not observe it hit any
> > > maximum virtual memory threshold; it just eventually ends up
> > > hitting the ulimit.
> > >
> > > My questions are:
> > >
> > > - Is this memory growth an expected behavior?
> >
> > It should eventually stabilize. But it is possible that some
> > allocation pattern is causing both fragmentation and long-lived
> > objects preventing consolidation of memory chunks.
> >
> > > - What can be done to prevent it from happening?
> >
> > The first approach is MALLOC_ARENA_MAX. After that, some coding
> > patterns might help; for example, have large, long-lived objects
> > allocated from the same thread, preferably at startup.
> > You can also attempt to cache some memory, but note that caching
> > is also an easy way to get contention. To avoid this, you could use
> > buffers obtained directly from mmap.
> >
> > Depending on your code, you can also experiment with jemalloc or
> > tcmalloc. I would suggest tcmalloc, as its main feature is to work
> > well in multithreaded environments:
> >
> > https://gperftools.github.io/gperftools/tcmalloc.html
> >
> > Glibc newer than 2.17 has a per-thread cache, but the issue you are
> > experiencing is not malloc being slow, but memory usage. AFAIK
> > tcmalloc has a kind of garbage collector, but it should not be much
> > different from glibc's consolidation logic; it should only run
> > during free, and if there is some contention, it might not be able
> > to release memory.
> >
> > > Thanks in advance,
> > > - Nikolay
> >
> > Thanks!
> >
> > Paulo
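P.S.: A rough, untested sketch of what I mean above by using buffers
obtained directly from mmap; the names are only illustrative:

    #include <stddef.h>
    #include <sys/mman.h>

    /* A large scratch buffer taken straight from the kernel,
     * bypassing malloc: it does not touch the arenas, so it cannot
     * contribute to fragmentation, and munmap shrinks the process
     * virtual size immediately. */
    void *scratch_alloc(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    void scratch_free(void *p, size_t size)
    {
        if (p != NULL)
            munmap(p, size);
    }

Since each call is a syscall, this only pays off for large, long-lived
or reusable buffers, not for small per-object allocations.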