From: Nikolay Shustov <Nikolay.Shustov@gmail.com>
Date: Tue, 7 Feb 2023 18:41:34 -0500
Subject: Re: GLIBC malloc behavior question
To: Paulo César Pereira de Andrade
Cc: Ben Woodard, libc-alpha@sourceware.org

Makes sense, thank you.

On 2/7/23 16:38, Paulo César Pereira de Andrade wrote:
> On Tue, Feb 7, 2023 at 17:56, Nikolay Shustov wrote:
>> I was able to increase the ratio of thread reuse in the application
>> (i.e. reusing a running thread instead of destroying it and creating a
>> new one) and it seemed to have a positive effect on the amount of
>> allocated memory.
>> Valgrind reports some small memory leaks - and I know from experience
>> these reports are bogus, as they stem from its inability to detect the
>> invocation of thread-local storage key destructor functions.
>> I was able to ensure these are invoked as they are supposed to be. The
>> amount of the reported leaks is really small and in no way could
>> account for gigabytes of VIRT being held in allocations.
>>
>> I experimented more, and at this stage most of my suspicions are
>> towards heap fragmentation.
>>
>> However, if all the used memory has been released (as I expect it to
>> be when all but the main thread have exited), should I have expected
>> the mmapped regions to be released and the process virtual memory size
>> to shrink?
> For glibc-2.17, what I see is, when free is called:
> 1. If not the main arena, it will munmap if it is a large chunk
> allocated with mmap.
> 2. If the main arena, it can trim the memory with sbrk and a negative
> argument if the top free memory, when merged, is 128K or larger.
> 3. If not the main arena, it will call madvise(addr, size, MADV_DONTNEED)
> if it finds 64K of contiguous unused memory.
>
> In either case, when releasing with sbrk or madvise, the process virtual
> size should decrease.
>
>> Or would GLIBC do it upon some special event? (Can it be forced somehow?)
> You can call malloc_trim().
>
> You can also use a smaller value for MALLOC_MMAP_THRESHOLD_ to
> have more memory allocated/released with mmap. This value should not be
> too small. Basically, tell it to use mmap for large blocks. The default
> is 128K.
>
> Fragmentation usually happens when allocating different-sized objects;
> due to memory layout, with too many small objects not released with
> free, it cannot find contiguous free blocks.
>
>> Thanks,
>> - Nikolay
>>
>>
>> On 2/7/23 13:01, Nikolay Shustov wrote:
>>
>> Got it, thanks.
>> For now, I am using this tunable merely to help surface whatever I
>> might have missed in terms of long-lived objects and heap fragmentation.
>> But I am definitely going to play with it when I am reasonably sure
>> that it is what makes the major impact.
>>
>> On 2/7/23 12:26, Ben Woodard wrote:
>>
>>
>> On 2/7/23 08:55, Nikolay Shustov via Libc-alpha wrote:
>>
>> There is no garbage collector thread or something similar in some
>> worker thread. But maybe something similar could be done in your
>> code.
>>
>>
>> No, there is nothing of the kind in the application.
>>
>> You might experiment with a tradeoff of speed vs. memory usage. The
>> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
>> see 'man mallopt' for other options.
>>
>>
>> MALLOC_ARENA_MAX=1 made a huge difference.
>>
>> I just wanted to point out that the right value isn't necessarily 1 or
>> the default. Setting it to 1 was most likely a simple test of a
>> hypothesis about what could be going wrong. This is a tunable knob, and
>> your application could have a sweet spot. For some of the applications
>> that I help support, we have empirically found that a good number is
>> slightly lower than the number of processors the system has; e.g. if
>> there are 16 cores, giving it 12 arenas doesn't impact speed but makes
>> the memory footprint more compact.
>>
>> The initial memory allocations went down by an order of magnitude.
>> In fact, I do not see much of an application slowdown, but this will
>> need more profiling.
>> The stable allocation growth is still ~2MB/second.
>>
>> I am going to investigate your idea of long-lived objects causing
>> contention/memory fragmentation.
>>
>> This sounds very probable, even though I do not see real memory leaks
>> even after all the aux threads have died.
>> I have TLS instances in use; maybe those really get in the way.
>>
>> Thanks a lot for your help.
>> If/when I find something new or interesting, I will send an update -
>> hope it will help someone else, too.
>>
>> Regards,
>> - Nikolay
>>
>> On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
>>
>> On Tue, Feb 7, 2023 at 12:07, Nikolay Shustov via Libc-alpha wrote:
>>
>> Hi,
>> I have a question about the malloc() behavior which I observe.
>> The synopsis is that during a stress load, the application
>> aggressively allocates virtual memory without any upper limit.
>> Just to note, after the application has been loaded with the peak of
>> activity and goes idle, its virtual memory doesn't scale back (I do
>> not expect much of that, though - should I?).
>>
>> There is no garbage collector thread or something similar in some
>> worker thread. But maybe something similar could be done in your
>> code.
>>
>> The application is heavily multithreaded; at the peak of its activity
>> it creates new threads and destroys them at a pace of approx. 100/second.
>> After a long and tedious investigation I dare to say that there are no
>> memory leaks involved.
>> (Well, there were memory leaks and I first went after those; found and
>> fixed - but the result did not change much.)
>>
>> You might experiment with a tradeoff of speed vs. memory usage. The
>> minimum memory usage should be achieved with MALLOC_ARENA_MAX=1;
>> see 'man mallopt' for other options.
>>
>> The application is cross-platform and runs on Windows and some other
>> platforms too.
>> There is an OS abstraction layer that provides a unified thread and
>> memory allocation API for the business logic, but the business logic
>> that triggers memory allocations is platform-independent.
>> There are no progressive memory allocations in the OS abstraction layer
>> which could be blamed for the memory growth.
>>
>> The thing is, on Windows, for the same activity there is no such
>> application memory growth at all.
>> It allocates memory moderately and scales back after a peak of activity.
>> This makes me think it is not the business logic that is to blame (to
>> the extent that it does not leak memory).
>>
>> I used valgrind to profile for memory leaks and heap usage.
>> Please see the massif outputs attached (some call stacks had to be
>> trimmed out).
>> I am also attaching the memory map for the application (run without
>> valgrind); the snapshot is taken after all the threads but main were
>> destroyed and the application is idle.
>>
>> The pace of the virtual memory growth is not quite linear.
>>
>> Most likely there are long-lived objects causing contention and also
>> probably memory fragmentation, preventing memory from being returned
>> to the system after a free call.
>>
>> From my observation, it allocates a big chunk at the beginning of the
>> peak load, then after some time starts to grow in steps of ~80MB / 10
>> seconds, then after some time starts to grow steadily at a pace of
>> ~2MB/second.
>>
>> Some stats from the host:
>>
>> OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
>>
>> ldd --version
>>
>> ldd (GNU libc) 2.17
>> Copyright (C) 2012 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There
>> is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>> PARTICULAR PURPOSE.
>> Written by Roland McGrath and Ulrich Drepper.
>>
>> uname -a
>>
>> Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16
>> 10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> At peak load, the number of application threads is ~180.
>> If the application is left running, I did not observe it stabilize at
>> any maximum virtual memory threshold; it eventually ends up hitting
>> the ulimit.
>>
>> My questions are:
>>
>> - Is this memory growth an expected behavior?
>>
>> It should eventually stabilize. But it is possible that some allocation
>> pattern is causing both fragmentation and long-lived objects preventing
>> consolidation of memory chunks.
>>
>> - What can be done to prevent it from happening?
>>
>> The first approach is MALLOC_ARENA_MAX.
>> After that, some coding patterns might help; for example, have large
>> long-lived objects allocated from the same thread, preferably at startup.
>> You can also attempt to cache some memory, but note that caching is
>> also an easy way to get contention. To avoid this, you could use
>> buffers obtained from mmap.
>>
>> Depending on your code, you can also experiment with jemalloc or
>> tcmalloc. I would suggest tcmalloc, as its main feature is to work
>> well in multithreaded environments:
>>
>> https://gperftools.github.io/gperftools/tcmalloc.html
>>
>> Glibc newer than 2.17 has a per-thread cache, but the issue you
>> are experiencing is not malloc being slow, but memory usage. AFAIK
>> tcmalloc has a kind of garbage collector, but it should not be much
>> different from glibc's consolidation logic; it should only run during
>> free, and if there is some contention, it might not be able to release
>> memory.
>>
>> Thanks in advance,
>> - Nikolay
>>
>> Thanks!
>>
>> Paulo