From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 25 Nov 2021 17:56:11 -0300
Subject: Re: Excessive memory consumption when using malloc()
From: Adhemerval Zanella
To: Carlos O'Donell, Konstantin Kharlamov, Christian Hoff, libc-help@sourceware.org
List-Id: Libc-help mailing list

On 25/11/2021 15:21, Carlos O'Donell via Libc-help wrote:
> On 11/25/21 13:12, Konstantin Kharlamov via Libc-help wrote:
>> So there you go, you 10G of unreleased memory is a Glibc feature, no complaints
>> ;-P
>
> Freeing memory back to the OS is a form of cache invalidation, and cache
> invalidation is hard and workload dependent.
>
> In this specific case, particularly with 50MiB, you are within the 64MiB
> 64-bit process heap size, and the 1024-byte frees do not trigger the
> performance expensive consolidation and heap reduction (which requires
> a munmap syscall to release the resources).
>
> In the case of 10GiB, and 512KiB allocations, we are talking different
> behaviour.
> I have responded here with my recommendations:
> https://sourceware.org/pipermail/libc-help/2021-November/006052.html

The BZ#27103 issue seems to be memory fragmentation caused by the usage of
sbrk() plus the deallocation being done in reverse order, which prevents
free() from coalescing the previous allocations automatically.  For instance,
with the testcase below:

$ gcc -Wall test.c -o test -DNTIMES=50000 -DCHUNK=1024
$ ./test
memory usage: 1036 Kb
allocate ...done
memory usage: 52812 Kb

If you force mmap() usage:

$ GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 1044 Kb
allocate ...done
memory usage: 2052 Kb

As Carlos has put it, this is a tradeoff: sbrk() is usually faster for
expanding the data segment than mmap(), and subsequent allocations will fill
the fragmented heap (so multiple allocations avoid further memory
fragmentation).

Just to give you a comparison, always using mmap() incurs more page faults
and considerably more CPU utilization:

$ perf stat ./test
memory usage: 964 Kb
allocate ...done
memory usage: 52796 Kb
memory usage: 52796 Kb
allocate ...done
memory usage: 52796 Kb

 Performance counter stats for './test':

             15.22 msec task-clock                #    0.983 CPUs utilized
                 0      context-switches          #    0.000 /sec
                 0      cpu-migrations            #    0.000 /sec
            12,853      page-faults               #  844.546 K/sec
        68,518,548      cycles                    #    4.502 GHz                    (73.73%)
           480,717      stalled-cycles-frontend   #    0.70% frontend cycles idle   (73.72%)
             2,333      stalled-cycles-backend    #    0.00% backend cycles idle    (73.72%)
       105,356,108      instructions              #    1.54  insn per cycle
                                                  #    0.00  stalled cycles per insn (91.81%)
        23,787,860      branches                  #    1.563 G/sec
            58,990      branch-misses             #    0.25% of all branches        (87.01%)

       0.015478114 seconds time elapsed

       0.010348000 seconds user
       0.005174000 seconds sys

$ perf stat env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 956 Kb
allocate ...done
memory usage: 2012 Kb
memory usage: 2012 Kb
allocate ...done
memory usage: 2012 Kb

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test':

            156.52 msec task-clock                #    0.998 CPUs utilized
                 1      context-switches          #    6.389 /sec
                 0      cpu-migrations            #    0.000 /sec
           100,228      page-faults               #  640.338 K/sec
       738,047,682      cycles                    #    4.715 GHz                    (82.11%)
         8,779,463      stalled-cycles-frontend   #    1.19% frontend cycles idle   (82.11%)
            34,195      stalled-cycles-backend    #    0.00% backend cycles idle    (82.97%)
     1,254,219,911      instructions              #    1.70  insn per cycle
                                                  #    0.01  stalled cycles per insn (84.68%)
       237,180,662      branches                  #    1.515 G/sec                  (84.67%)
           687,051      branch-misses             #    0.29% of all branches        (83.46%)

       0.156904324 seconds time elapsed

       0.024142000 seconds user
       0.132786000 seconds sys

That's why I think using mmap() by default might not be the best strategy.
What we might improve is to add a heuristic that calls malloc_trim() once a
certain level of fragmentation in the main_arena is detected.  The question
is which metric and threshold to use.  Trimming does have a cost, but I think
it is worth it to decrease fragmentation and memory utilization.
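Just to sketch the direction (this is only a rough illustration, not an
implementation proposal): from the application side the same kind of check
can be expressed with mallinfo2() and malloc_trim().  Both the helper name
and the 25% ratio below are placeholders I made up for discussion; inside
malloc itself we would use the arena bookkeeping directly instead of
mallinfo2():

#include <malloc.h>

/* Hypothetical helper: check how much memory is sitting free in the
   arena and trim if the fraction looks too large.  The 25% ratio is a
   placeholder, not a tuned value.  */
static void
maybe_trim (void)
{
  struct mallinfo2 mi = mallinfo2 ();   /* glibc 2.33 or later */

  /* mi.arena is the total space obtained with sbrk(); mi.fordblks is
     the total free space still held by the allocator.  */
  if (mi.arena > 0 && mi.fordblks > mi.arena / 4)
    malloc_trim (0);   /* release unused pages back to the kernel */
}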
---

$ cat test.c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static size_t pagesize;

/* Return the process resident set size, in bytes, read from
   /proc/self/statm.  */
static size_t
read_rss (void)
{
  int fd = open ("/proc/self/statm", O_RDONLY);
  assert (fd != -1);

  char line[256];
  ssize_t r = read (fd, line, sizeof (line) - 1);
  assert (r != -1);
  line[r] = '\0';

  size_t rss;
  sscanf (line, "%*u %zu %*u %*u 0 %*u 0\n", &rss);

  close (fd);

  return rss * pagesize;
}

/* Allocate NTIMES chunks of CHUNK bytes each, then free them in reverse
   order of allocation.  */
static void *
allocate (void *args)
{
  enum { chunk = CHUNK };
  enum { ntimes = NTIMES * chunk };

  void *chunks[NTIMES];

  for (int i = 0; i < sizeof (chunks) / sizeof (chunks[0]); i++)
    {
      chunks[i] = malloc (chunk);
      assert (chunks[i] != NULL);
      memset (chunks[i], 0, chunk);
    }

  for (int i = (sizeof (chunks) / sizeof (chunks[0])) - 1; i >= 0; i--)
    free (chunks[i]);

  return NULL;
}

int
main (int argc, char *argv[])
{
  pagesize = sysconf (_SC_PAGESIZE);
  assert (pagesize != -1);

  {
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
    printf ("allocate ...");
    allocate (NULL);
    printf ("done\n");
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  }

  return 0;
}
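For what it is worth, the testcase can also show the trimming effect from the
application side: calling malloc_trim() right after allocate() returns (a
hypothetical change, not part of any of the runs above) should bring the
reported RSS back down even without the mmap_threshold tunable:

  /* Hypothetical addition to main () in test.c; requires <malloc.h>.  */
  allocate (NULL);
  malloc_trim (0);   /* release the now-free heap pages back to the OS */
  printf ("memory usage: %zu Kb\n", read_rss () / 1024);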