From: Nikolay Shustov
Reply-To: Nikolay.Shustov@gmail.com
Date: Tue, 7 Feb 2023 13:01:40 -0500
Subject: Re: GLIBC malloc behavior question
To: Ben Woodard, Paulo César Pereira de Andrade
Cc: libc-alpha@sourceware.org
Message-ID: <1ccd66cd-7d6e-1825-95e8-38b49320737f@gmail.com>
Got it, thanks.
For now, I am using this tunable merely to help surface whatever I
might have missed in terms of long-lived objects and heap
fragmentation. But I am definitely going to play with it once I am
reasonably sure that it is what makes the major impact.

On 2/7/23 12:26, Ben Woodard wrote:
>
> On 2/7/23 08:55, Nikolay Shustov via Libc-alpha wrote:
>>>   There is no garbage collector thread or anything similar running
>>> in some worker thread. But maybe something similar could be done
>>> in your code.
>>
>> No, there is nothing of the kind in the application.
>>
>>> You might experiment with a tradeoff of speed vs. memory usage.
>>> The minimum memory usage should be achieved with
>>> MALLOC_ARENA_MAX=1; see 'man mallopt' for other options.
>>
>> MALLOC_ARENA_MAX=1 made a huge difference.
> I just wanted to point out that the choice isn't only 1 or the
> default. That was most likely a simple test of a hypothesis about
> what could be going wrong. This is a tunable knob, and your
> application could have a sweet spot. For some of the applications
> that I help support, we have empirically found that a good number is
> slightly lower than the number of processors the system has; e.g.,
> if there are 16 cores, giving it 12 arenas doesn't impact speed but
> makes the memory footprint more compact.
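That matches what I see so far. For the experiments I am simply
setting the environment variable, e.g. "MALLOC_ARENA_MAX=12 ./app".
If it turns out the application should pin the cap regardless of the
environment, my understanding is that the same limit can be set
programmatically with mallopt() early in main(), before any threads
are created. An untested sketch - the value 12 is only a placeholder,
not a recommendation:

    #include <malloc.h>   /* mallopt, M_ARENA_MAX (glibc-specific) */

    int main(void)
    {
        /* Cap the number of malloc arenas. This must run before any
           threads are spawned so that later arena creation respects
           the limit. mallopt() returns 1 on success, 0 on error. */
        if (mallopt(M_ARENA_MAX, 12) != 1) {
            /* Fall back to the default arena policy. */
        }

        /* ... start worker threads, run the application ... */
        return 0;
    }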
>> The initial memory allocations went down by an order of magnitude.
>> In fact, I do not see much of an application slowdown, but this
>> will need more profiling.
>> The steady allocation growth is still ~2 MB/second.
>>
>> I am going to investigate your idea about long-lived object
>> contention/memory fragmentation.
>> This sounds very probable, even though I do not see real memory
>> leaks even after all the aux threads have died.
>> I have TLS instances in use; maybe those really get in the way.
>>
>> Thanks a lot for your help.
>> If/when I find something new or interesting, I will send an update -
>> hope it will help someone else, too.
>>
>> Regards,
>> - Nikolay
>>
>> On 2/7/23 11:16, Paulo César Pereira de Andrade wrote:
>>> On Tue, Feb 7, 2023 at 12:07, Nikolay Shustov via Libc-alpha wrote:
>>>> Hi,
>>>> I have a question about the malloc() behavior that I observe.
>>>> The synopsis is that during stress load, the application
>>>> aggressively allocates virtual memory without any upper limit.
>>>> Just to note: after the application has gone through the peak of
>>>> activity and goes idle, its virtual memory doesn't scale back
>>>> (I do not expect much of that, though - should I?).
>>>    There is no garbage collector thread or anything similar running
>>> in some worker thread. But maybe something similar could be done
>>> in your code.
>>>
>>>> The application is heavily multithreaded; at the peak of its
>>>> activity it creates new threads and destroys them at a pace of
>>>> approx. 100/second.
>>>> After a long and tedious investigation I dare say that there are
>>>> no memory leaks involved.
>>>> (Well, there were memory leaks and I first went after those; found
>>>> and fixed - but the result did not change much.)
>>>    You might experiment with a tradeoff of speed vs. memory usage.
>>> The minimum memory usage should be achieved with
>>> MALLOC_ARENA_MAX=1; see 'man mallopt' for other options.
>>>
>>>> The application is cross-platform and runs on Windows and some
>>>> other platforms too.
>>>> There is an OS abstraction layer that provides a unified thread
>>>> and memory allocation API for the business logic, but the business
>>>> logic that triggers memory allocations is platform-independent.
>>>> There are no progressive memory allocations in the OS abstraction
>>>> layer which could be blamed for the memory growth.
>>>>
>>>> The thing is, on Windows, there is no such application memory
>>>> growth at all for the same activity.
>>>> It allocates memory moderately and scales back after the peak of
>>>> activity.
>>>> This makes me think the business logic is not to blame (to the
>>>> extent that it does not leak memory).
>>>>
>>>> I used valgrind to profile for memory leaks and heap usage.
>>>> Please see the massif outputs attached (some call stacks had to
>>>> be trimmed out).
>>>> I am also attaching the memory map for the application (run
>>>> without valgrind); the snapshot was taken after all threads but
>>>> main were destroyed and the application went idle.
>>>>
>>>> The pace of the virtual memory growth is not quite linear.
>>>    Most likely there are long-lived objects causing contention and
>>> probably also memory fragmentation, preventing memory from being
>>> returned to the system after a free call.
>>>
>>>> From my observation, it allocates a big chunk at the beginning of
>>>> the peak load, then after some time starts to grow in steps of
>>>> ~80 MB / 10 seconds, and then after some more time grows steadily
>>>> at a pace of ~2 MB/second.
>>>>
>>>> Some stats from the host:
>>>>
>>>>     OS: Red Hat Enterprise Linux Server release 7.9 (Maipo)
>>>>
>>>> ldd --version
>>>>
>>>>     ldd (GNU libc) 2.17
>>>>     Copyright (C) 2012 Free Software Foundation, Inc.
>>>>     This is free software; see the source for copying conditions.
>>>>     There is NO warranty; not even for MERCHANTABILITY or FITNESS
>>>>     FOR A PARTICULAR PURPOSE.
>>>>     Written by Roland McGrath and Ulrich Drepper.
>>>>
>>>> uname -a
>>>>
>>>>     Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Thu Dec 16
>>>>     10:19:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> At peak load, the number of application threads is ~180.
>>>> If the application is left running, I did not observe it level off
>>>> at any maximum virtual-memory threshold; it eventually ends up
>>>> hitting the ulimit.
>>>>
>>>> My questions are:
>>>>
>>>> - Is this memory growth an expected behavior?
>>>    It should eventually stabilize. But it is possible that some
>>> allocation pattern is causing both fragmentation and long-lived
>>> objects that prevent consolidation of memory chunks.
>>>
>>>> - What can be done to prevent it from happening?
>>>    The first approach is MALLOC_ARENA_MAX. After that, some coding
>>> patterns might help; for example, have large long-lived objects
>>> allocated from the same thread, preferably at startup.
>>>    You can also attempt to cache some memory, but note that caching
>>> is also an easy way to get contention. To avoid this, you could use
>>> memory from buffers obtained with mmap.
>>>
>>>    Depending on your code, you can also experiment with jemalloc or
>>> tcmalloc. I would suggest tcmalloc, as its main feature is to work
>>> well in multithreaded environments:
>>>
>>> https://gperftools.github.io/gperftools/tcmalloc.html
>>>
>>>    Glibc newer than 2.17 has a per-thread cache, but the issue you
>>> are experiencing is not malloc being slow, but memory usage. AFAIK
>>> tcmalloc has a kind of garbage collector, but it should not be much
>>> different from glibc's consolidation logic; it should only run
>>> during free, and if there is some contention, it might not be able
>>> to release memory.
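Thanks - tcmalloc looks worth a quick experiment here. If I read the
gperftools page correctly, it can be swapped in without a rebuild via
LD_PRELOAD, or linked in permanently, e.g.:

    LD_PRELOAD=/usr/lib64/libtcmalloc.so ./app    # no rebuild needed
    gcc ... -ltcmalloc                            # or link it in

(The exact library path depends on the distribution and package, so
the path above is just an example.)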
>>>
>>>> Thanks in advance,
>>>> - Nikolay
>>> Thanks!
>>>
>>> Paulo
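P.S. Regarding the suggestion to take large long-lived objects out of
malloc: what I have in mind is a small mmap-backed pool, reserved once
at startup, so those objects never sit in (and fragment) the malloc
heap. A rough, untested sketch of the idea - the pool size and
alignment are placeholders, and this version assumes all pool
allocations happen from a single thread at startup (it is not
thread-safe as written):

    #include <stddef.h>
    #include <sys/mman.h>

    /* One-shot, mmap-backed bump allocator for long-lived objects.
       The memory comes straight from the kernel, bypassing malloc,
       so it cannot pin malloc's heap chunks or block consolidation. */
    static char  *pool;
    static size_t pool_used;
    static size_t pool_size = 64 * 1024 * 1024;  /* placeholder */

    static int pool_init(void)
    {
        pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return pool == MAP_FAILED ? -1 : 0;
    }

    static void *pool_alloc(size_t n)
    {
        n = (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
        if (pool_used + n > pool_size)
            return NULL;              /* pool exhausted */
        void *p = pool + pool_used;
        pool_used += n;
        return p;
    }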
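P.P.S. In case it helps anyone reproducing the profiling: the attached
massif data came from the standard valgrind workflow, roughly

    valgrind --tool=massif ./app
    ms_print massif.out.<pid>

(exact options may have differed; massif writes its snapshot file as
massif.out.<pid> by default).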