From: Paulo César Pereira de Andrade
Date: Tue, 7 Feb 2023 18:38:55 -0300
Subject: Re: GLIBC malloc behavior question
To: Nikolay.Shustov@gmail.com
Cc: Ben Woodard, libc-alpha@sourceware.org

On Tue, Feb 7, 2023 at 17:56, Nikolay Shustov wrote:
>
> I was able to increase the ratio of thread reuse in the application
> (i.e., reuse a running thread instead of destroying it and creating a
> new one), and it seemed to have a positive effect on the amount of
> allocated memory.
> Valgrind reports some small memory leaks - and I know from experience
> these reports are bogus, as they stem from its inability to detect
> thread-local storage key destructor invocations.
> I was able to ensure these are invoked as they are supposed to be.
> The amount of the reported leaks is really small and in no way could
> contribute to the gigabytes of VIRT being held in allocations.
>
> I experimented more, and at this stage most of my suspicion is
> towards heap fragmentation.
>
> However, if all the used memory has been released (as I expect it to
> be when all threads but the main one have exited), should I have
> expected the mmapped regions to be released and the process virtual
> memory size to shrink?

For glibc-2.17, what I see is, when free is called:

1. If not the main arena, it will munmap if it is a large chunk
   allocated with mmap.
2. If the main arena, it can trim the memory with sbrk and a negative
   argument, if the top free memory, when merged, is 128K or larger.
3. If not the main arena, it will call madvise(addr, size,
   MADV_DONTNEED) if it finds 64K of contiguous unused memory.

When releasing with munmap or sbrk, the process virtual size should
decrease. madvise(MADV_DONTNEED) returns the pages to the kernel but
keeps the mapping, so it shrinks the resident size rather than the
virtual size.

> Or would GLIBC do it upon some special event? (Can it be forced
> somehow?)

You can call malloc_trim(). You can also use a smaller value for
MALLOC_MMAP_THRESHOLD_ to have more memory allocated/released with
mmap; this value should not be too small. Basically, tell it to use
mmap for large blocks. The default is 128K.

Fragmentation usually happens when allocating objects of different
sizes; due to the memory layout, with too many small objects not
released with free, malloc cannot find contiguous free blocks.
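To illustrate the malloc_trim()/MALLOC_MMAP_THRESHOLD_ idea, here is a
rough, untested sketch; the 64K threshold is only an example, not a
recommendation:

    #include <malloc.h>   /* mallopt, malloc_trim (glibc) */

    int main(void)
    {
        /* Serve allocations of 64K and larger with mmap, so freeing
         * them returns memory to the kernel immediately.  Setting the
         * threshold explicitly also disables glibc's dynamic
         * threshold adjustment. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);

        /* ... run the workload ... */

        /* After the peak of activity, ask malloc to hand free memory
         * back to the system (it trims the top of the heap and, on
         * this glibc, madvises unused pages inside the arenas). */
        malloc_trim(0);
        return 0;
    }

The same can be tried without code changes, from the environment
(yourapp standing in for the real binary):

    MALLOC_MMAP_THRESHOLD_=65536 ./yourapp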
> Thanks,
> - Nikolay
>
> On 2/7/23 13:01, Nikolay Shustov wrote:
> > Got it, thanks.
> > For now, I am using this tunable merely to help surface whatever I
> > might have missed in terms of long-lived objects and heap
> > fragmentation.
> > But I am definitely going to play with it when I am reasonably sure
> > that it is what makes the major impact.
> >
> > On 2/7/23 12:26, Ben Woodard wrote:
> > > On 2/7/23 08:55, Nikolay Shustov via Libc-alpha wrote:
> > > > > There is no garbage collector thread or something similar in
> > > > > some worker thread. But maybe something similar could be done
> > > > > in your code.
> > > >
> > > > No, there is nothing of the kind in the application.
> > > >
> > > > > You might experiment with a tradeoff of speed vs memory
> > > > > usage. The minimum memory usage should be achieved with
> > > > > MALLOC_ARENA_MAX=1; see 'man mallopt' for other options.
> > > >
> > > > MALLOC_ARENA_MAX=1 made a huge difference.
> > >
> > > I just wanted to point out that the best value isn't necessarily
> > > 1 or the default. That was most likely a simple test of a
> > > hypothesis about what could be going wrong. This is a tunable
> > > knob, and your application could have a sweet spot. For some of
> > > the applications that I help support, we have empirically found
> > > that a good number is slightly lower than the number of
> > > processors the system has, e.g. if there are 16 cores, giving it
> > > 12 arenas doesn't impact speed but makes the memory footprint
> > > more compact.
> >
> > The initial memory allocations went down by an order of magnitude.
> > In fact, I do not see that much of an application slowdown, but
> > this will need more profiling.
> > The stable allocation growth is still ~2MB/second.
> >
> > I am going to investigate your idea of long-lived objects
> > contention/memory fragmentation.
> > This sounds very probable, even though I do not see real memory
> > leaks even after all the aux threads died.
> > I have TLS instances in use; maybe those really get in the way.
> >
> > Thanks a lot for your help.
> > If/when I find something new or interesting, I will send an update
> > - hope it will help someone else, too.
> >
> > Regards,
> > - Nikolay
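Regarding the MALLOC_ARENA_MAX discussion quoted above: besides the
environment variable, the cap can also be set from the program itself.
A minimal sketch; the value 12 is just the 16-core example from above,
not a recommendation:

    #include <malloc.h>   /* mallopt, M_ARENA_MAX (glibc >= 2.10) */

    int main(void)
    {
        /* Same effect as running with MALLOC_ARENA_MAX=12 in the
         * environment: cap the number of malloc arenas.  Call this
         * early, before any threads are started. */
        mallopt(M_ARENA_MAX, 12);

        /* ... create the worker threads and run the workload ... */
        return 0;
    }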
> On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
> > On Tue, Feb 7, 2023 at 12:07, Nikolay Shustov via Libc-alpha wrote:
> > > Hi,
> > > I have a question about the malloc() behavior which I observe.
> > > The synopsis is that during the stress load, the application
> > > aggressively allocates virtual memory without any upper limit.
> > > Just to note, after the application is loaded just with the peak
> > > of activity and goes idle, its virtual memory doesn't scale back
> > > (I do not expect much of that, though - should I?).
> >
> > There is no garbage collector thread or something similar in some
> > worker thread. But maybe something similar could be done in your
> > code.
> >
> > > The application is heavily multithreaded; at the peak of its
> > > activity it creates new threads and destroys them at a pace of
> > > approx. 100/second.
> > > After the long and tedious investigation, I dare to say that
> > > there are no memory leaks involved.
> > > (Well, there were memory leaks and I first went after those;
> > > found and fixed - but the result did not change much.)
> >
> > You might experiment with a tradeoff of speed vs memory usage. The
> > minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
> > see 'man mallopt' for other options.
> >
> > > The application is cross-platform and runs on Windows and some
> > > other platforms too.
> > > There is an OS abstraction layer that provides a unified thread
> > > and memory allocation API for the business logic, but the
> > > business logic that triggers memory allocations is
> > > platform-independent.
> > > There are no progressive memory allocations in the OS abstraction
> > > layer which could be blamed for the memory growth.
> > >
> > > The thing is, on Windows, for the same activity there is no such
> > > application memory growth at all.
> > > It allocates memory moderately and scales back after the peak of
> > > activity.
> > > This makes me think it is not the business logic to be blamed (to
> > > the extent that it does not leak memory).
> > >
> > > I used valgrind to profile for memory leaks and heap usage.
> > > Please see the massif outputs attached (some callstacks had to be
> > > trimmed out).
> > > I am also attaching the memory map for the application (run
> > > without valgrind); the snapshot is taken after all the threads
> > > but main were destroyed and the application is idle.
> > >
> > > The pace of the virtual memory growth is not quite linear.
> >
> > Most likely there are long-lived objects causing contention, and
> > probably also memory fragmentation, preventing memory from being
> > returned to the system after a free call.
> >
> > > From my observation, it allocates a big chunk at the beginning of
> > > the peak load, then after a while starts to grow in steps of
> > > ~80MB / 10 seconds, then after some more time starts to grow
> > > steadily at a pace of ~2MB/second.
> > >
> > > Some stats from the host:
> > >
> > > OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
> > >
> > > ldd --version
> > > ldd (GNU libc) 2.17
> > >
> > > uname -a
> > > Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16 10:19:28 UTC
> > > 2021 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > At peak load, the number of application threads is ~180.
> > > If the application is left running, I did not observe it hit any
> > > maximum virtual memory threshold; it just eventually ends up
> > > hitting the ulimit.
> > >
> > > My questions are:
> > >
> > > - Is this memory growth an expected behavior?
> >
> > It should eventually stabilize. But it is possible that some
> > allocation pattern is causing both fragmentation and long-lived
> > objects preventing consolidation of memory chunks.
> >
> > > - What can be done to prevent it from happening?
> >
> > The first approach is MALLOC_ARENA_MAX. After that, some coding
> > patterns might help; for example, have large, long-lived objects
> > allocated from the same thread, preferably at startup.
> > You can also attempt to cache some memory, but note that caching
> > is also an easy way to get contention. To avoid this, you could use
> > buffers obtained directly from mmap.
> >
> > Depending on your code, you can also experiment with jemalloc or
> > tcmalloc. I would suggest tcmalloc, as its main feature is to work
> > well in multithreaded environments:
> >
> > https://gperftools.github.io/gperftools/tcmalloc.html
> >
> > Glibc newer than 2.17 has a per-thread cache, but the issue you are
> > experiencing is not malloc being slow, but memory usage. AFAIK
> > tcmalloc has a kind of garbage collector, but it should not be much
> > different from glibc's consolidation logic; it should only run
> > during free, and if there is some contention, it might not be able
> > to release memory.
> >
> > > Thanks in advance,
> > > - Nikolay
> >
> > Thanks!
> >
> > Paulo
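P.S.: A rough, untested sketch of what I mean above by using buffers
obtained directly from mmap; the names are only illustrative:

    #include <stddef.h>
    #include <sys/mman.h>

    /* A large scratch buffer taken straight from the kernel,
     * bypassing malloc: it does not touch the arenas, so it cannot
     * contribute to fragmentation, and munmap shrinks the process
     * virtual size immediately. */
    void *scratch_alloc(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    void scratch_free(void *p, size_t size)
    {
        if (p != NULL)
            munmap(p, size);
    }

Since each call is a syscall, this only pays off for large, long-lived
or reusable buffers, not for small per-object allocations.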