From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 25 Nov 2021 17:56:11 -0300
Subject: Re: Excessive memory consumption when using malloc()
From: Adhemerval Zanella
To: Carlos O'Donell, Konstantin Kharlamov, Christian Hoff, libc-help@sourceware.org
List-Id: Libc-help mailing list

On 25/11/2021 15:21, Carlos O'Donell via Libc-help wrote:
> On 11/25/21 13:12, Konstantin Kharlamov via Libc-help wrote:
>> So there you go, you 10G of unreleased memory is a Glibc feature, no complaints
>> ;-P
>
> Freeing memory back to the OS is a form of cache invalidation, and cache
> invalidation is hard and workload dependent.
>
> In this specific case, particularly with 50MiB, you are within the 64MiB
> 64-bit process heap size, and the 1024-byte frees do not trigger the
> performance expensive consolidation and heap reduction (which requires
> a munmap syscall to release the resources).
>
> In the case of 10GiB, and 512KiB allocations, we are talking different
> behaviour.
> I have responded here with my recommendations:
> https://sourceware.org/pipermail/libc-help/2021-November/006052.html

The BZ#27103 issue seems to be memory fragmentation caused by the usage of
sbrk() plus the deallocation being done in reverse order, which prevents
free() from coalescing the previous allocations automatically.  For instance,
with the testcase below:

$ gcc -Wall test.c -o test -DNTIMES=50000 -DCHUNK=1024
$ ./test
memory usage: 1036 Kb
allocate ...done
memory usage: 52812 Kb

If you force mmap() usage:

$ GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 1044 Kb
allocate ...done
memory usage: 2052 Kb

As Carlos has put it, this is a tradeoff: sbrk() is usually faster for
expanding the data segment than mmap(), and subsequent allocations will fill
the fragmented heap (so multiple allocations avoid further memory
fragmentation).

Just to give you a comparison, always using mmap() incurs more page faults
and considerably more CPU utilization:

$ perf stat ./test
memory usage: 964 Kb
allocate ...done
memory usage: 52796 Kb
memory usage: 52796 Kb
allocate ...done
memory usage: 52796 Kb

 Performance counter stats for './test':

             15.22 msec task-clock                #    0.983 CPUs utilized
                 0      context-switches          #    0.000 /sec
                 0      cpu-migrations            #    0.000 /sec
            12,853      page-faults               #  844.546 K/sec
        68,518,548      cycles                    #    4.502 GHz                    (73.73%)
           480,717      stalled-cycles-frontend   #    0.70% frontend cycles idle   (73.72%)
             2,333      stalled-cycles-backend    #    0.00% backend cycles idle    (73.72%)
       105,356,108      instructions              #    1.54  insn per cycle
                                                  #    0.00  stalled cycles per insn (91.81%)
        23,787,860      branches                  #    1.563 G/sec
            58,990      branch-misses             #    0.25% of all branches        (87.01%)

       0.015478114 seconds time elapsed

       0.010348000 seconds user
       0.005174000 seconds sys

$ perf stat env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 956 Kb
allocate ...done
memory usage: 2012 Kb
memory usage: 2012 Kb
allocate ...done
memory usage: 2012 Kb

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test':

            156.52 msec task-clock                #    0.998 CPUs utilized
                 1      context-switches          #    6.389 /sec
                 0      cpu-migrations            #    0.000 /sec
           100,228      page-faults               #  640.338 K/sec
       738,047,682      cycles                    #    4.715 GHz                    (82.11%)
         8,779,463      stalled-cycles-frontend   #    1.19% frontend cycles idle   (82.11%)
            34,195      stalled-cycles-backend    #    0.00% backend cycles idle    (82.97%)
     1,254,219,911      instructions              #    1.70  insn per cycle
                                                  #    0.01  stalled cycles per insn (84.68%)
       237,180,662      branches                  #    1.515 G/sec                  (84.67%)
           687,051      branch-misses             #    0.29% of all branches        (83.46%)

       0.156904324 seconds time elapsed

       0.024142000 seconds user
       0.132786000 seconds sys

That's why I think using mmap() by default might not be the best strategy.
What we might improve is to add a heuristic that calls malloc_trim() once a
certain level of fragmentation in the main_arena is detected.  The question
is which metric and threshold to use.  Trimming does have a cost, but I think
it is worth it to decrease fragmentation and memory utilization.
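Just to sketch the direction (this is only a rough illustration, not an
implementation proposal): from the application side the same kind of check
can be expressed with mallinfo2() and malloc_trim().  Both the helper name
and the 25% ratio below are placeholders I made up for discussion; inside
malloc itself we would use the arena bookkeeping directly instead of
mallinfo2():

#include <malloc.h>

/* Hypothetical helper: check how much memory is sitting free in the
   arena and trim if the fraction looks too large.  The 25% ratio is a
   placeholder, not a tuned value.  */
static void
maybe_trim (void)
{
  struct mallinfo2 mi = mallinfo2 ();   /* glibc 2.33 or later */

  /* mi.arena is the total space obtained with sbrk(); mi.fordblks is
     the total free space still held by the allocator.  */
  if (mi.arena > 0 && mi.fordblks > mi.arena / 4)
    malloc_trim (0);   /* release unused pages back to the kernel */
}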
---

$ cat test.c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static size_t pagesize;

/* Return the process resident set size, in bytes, read from
   /proc/self/statm.  */
static size_t
read_rss (void)
{
  int fd = open ("/proc/self/statm", O_RDONLY);
  assert (fd != -1);

  char line[256];
  ssize_t r = read (fd, line, sizeof (line) - 1);
  assert (r != -1);
  line[r] = '\0';

  size_t rss;
  sscanf (line, "%*u %zu %*u %*u 0 %*u 0\n", &rss);

  close (fd);

  return rss * pagesize;
}

/* Allocate NTIMES chunks of CHUNK bytes each, then free them in reverse
   order of allocation.  */
static void *
allocate (void *args)
{
  enum { chunk = CHUNK };
  enum { ntimes = NTIMES * chunk };

  void *chunks[NTIMES];

  for (int i = 0; i < sizeof (chunks) / sizeof (chunks[0]); i++)
    {
      chunks[i] = malloc (chunk);
      assert (chunks[i] != NULL);
      memset (chunks[i], 0, chunk);
    }

  for (int i = (sizeof (chunks) / sizeof (chunks[0])) - 1; i >= 0; i--)
    free (chunks[i]);

  return NULL;
}

int
main (int argc, char *argv[])
{
  pagesize = sysconf (_SC_PAGESIZE);
  assert (pagesize != -1);

  {
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
    printf ("allocate ...");
    allocate (NULL);
    printf ("done\n");
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  }

  return 0;
}
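For what it is worth, the testcase can also show the trimming effect from the
application side: calling malloc_trim() right after allocate() returns (a
hypothetical change, not part of any of the runs above) should bring the
reported RSS back down even without the mmap_threshold tunable:

  /* Hypothetical addition to main () in test.c; requires <malloc.h>.  */
  allocate (NULL);
  malloc_trim (0);   /* release the now-free heap pages back to the OS */
  printf ("memory usage: %zu Kb\n", read_rss () / 1024);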