public inbox for libc-help@sourceware.org
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: Carlos O'Donell <carlos@redhat.com>,
	Konstantin Kharlamov <hi-angel@yandex.ru>,
	Christian Hoff <christian_hoff@gmx.net>,
	libc-help@sourceware.org
Subject: Re: Excessive memory consumption when using malloc()
Date: Thu, 25 Nov 2021 17:56:11 -0300	[thread overview]
Message-ID: <d8d5cd5c-afb8-b6aa-408b-60d5d55b4353@linaro.org> (raw)
In-Reply-To: <c20fabf9-3eb1-3f06-47a3-20da3ddc8e25@redhat.com>



On 25/11/2021 15:21, Carlos O'Donell via Libc-help wrote:
> On 11/25/21 13:12, Konstantin Kharlamov via Libc-help wrote:
>> So there you go, your 10G of unreleased memory is a Glibc feature, no complaints
>> ;-P
> 
> Freeing memory back to the OS is a form of cache invalidation, and cache
> invalidation is hard and workload dependent.
> 
> In this specific case, particularly with 50MiB, you are within the 64MiB
> 64-bit process heap size, and the 1024-byte frees do not trigger the
> performance expensive consolidation and heap reduction (which requires
> a munmap syscall to release the resources).
> 
> In the case of 10GiB, and 512KiB allocations, we are talking different
> behaviour. I have responded here with my recommendations:
> https://sourceware.org/pipermail/libc-help/2021-November/006052.html
> 
The BZ#27103 issue seems to be memory fragmentation due to the use of
sbrk() plus deallocation done in reverse order, which prevents free()
from coalescing the previous allocations automatically.

For instance with the testcase below:

$ gcc -Wall test.c -o test -DNTIMES=50000 -DCHUNK=1024
$ ./test
memory usage: 1036 Kb
allocate ...done
memory usage: 52812 Kb

If you force mmap() usage:

$ GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 1044 Kb
allocate ...done
memory usage: 2052 Kb
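
The tunable can also be set programmatically with mallopt() from
<malloc.h>; below is a minimal sketch of the equivalent in-process
setting (not part of the original testcase):

#include <malloc.h>	/* mallopt, M_MMAP_THRESHOLD */
#include <stdlib.h>

int
main (void)
{
  /* Equivalent to GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0: new
     requests that cannot be satisfied from already free chunks are
     serviced with mmap() instead of the sbrk() heap, so free() can hand
     each such block straight back to the kernel with munmap().  The
     M_MMAP_MAX limit (65536 mappings by default) still applies.  */
  if (mallopt (M_MMAP_THRESHOLD, 0) == 0)
    return EXIT_FAILURE;

  void *p = malloc (1024);	/* backed by an anonymous mmap()  */
  free (p);			/* returned to the kernel with munmap()  */
  return 0;
}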

As Carlos has noted, it is a tradeoff: sbrk() is usually faster at expanding
the data segment than mmap(), and subsequent allocations will fill the
fragmented heap (so multiple allocations avoid further memory
fragmentation).

Just to give a comparison, always using mmap() incurs more page faults
and considerably more CPU utilization:

$ perf stat ./test
memory usage: 964 Kb
allocate ...done
memory usage: 52796 Kb
memory usage: 52796 Kb
allocate ...done
memory usage: 52796 Kb

 Performance counter stats for './test':

             15.22 msec task-clock                #    0.983 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            12,853      page-faults               #  844.546 K/sec                  
        68,518,548      cycles                    #    4.502 GHz                      (73.73%)
           480,717      stalled-cycles-frontend   #    0.70% frontend cycles idle     (73.72%)
             2,333      stalled-cycles-backend    #    0.00% backend cycles idle      (73.72%)
       105,356,108      instructions              #    1.54  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (91.81%)
        23,787,860      branches                  #    1.563 G/sec                  
            58,990      branch-misses             #    0.25% of all branches          (87.01%)

       0.015478114 seconds time elapsed

       0.010348000 seconds user
       0.005174000 seconds sys


$ perf stat env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 956 Kb
allocate ...done
memory usage: 2012 Kb
memory usage: 2012 Kb
allocate ...done
memory usage: 2012 Kb

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test':

            156.52 msec task-clock                #    0.998 CPUs utilized          
                 1      context-switches          #    6.389 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
           100,228      page-faults               #  640.338 K/sec                  
       738,047,682      cycles                    #    4.715 GHz                      (82.11%)
         8,779,463      stalled-cycles-frontend   #    1.19% frontend cycles idle     (82.11%)
            34,195      stalled-cycles-backend    #    0.00% backend cycles idle      (82.97%)
     1,254,219,911      instructions              #    1.70  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (84.68%)
       237,180,662      branches                  #    1.515 G/sec                    (84.67%)
           687,051      branch-misses             #    0.29% of all branches          (83.46%)

       0.156904324 seconds time elapsed

       0.024142000 seconds user
       0.132786000 seconds sys

That's why I think it might not be the best strategy to use mmap() by
default. What I think we might improve is to add a heuristic that calls
malloc_trim() once a certain level of fragmentation is found in the
main_arena.  The question is which metric and threshold to use.  The
trimming does have a cost, but I think it is worth it to decrease
fragmentation and memory utilization.
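
In the meantime, an application can approximate such a heuristic itself.
Below is a minimal sketch (not a glibc change; the 25% free-space
threshold is an arbitrary choice for illustration) that uses mallinfo2()
from <malloc.h> (glibc 2.33 or later) to estimate how much of the main
arena is free and calls malloc_trim() when that fraction gets too high:

#include <malloc.h>	/* mallinfo2, malloc_trim */

/* Call at convenient points, e.g. after releasing a large batch of
   objects.  mi.arena is the total of non-mmapped bytes obtained from
   the kernel and mi.fordblks the total of free bytes inside it.  */
static void
maybe_trim (void)
{
  struct mallinfo2 mi = mallinfo2 ();
  if (mi.arena == 0)
    return;
  if (mi.fordblks * 4 > mi.arena)	/* more than ~25% of the heap is free  */
    malloc_trim (0);			/* consolidate and give pages back  */
}

In the testcase above, calling something like maybe_trim() right after
the free loop should bring the RSS back down, at the cost of the trim
work.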

---

$ cat test.c
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

static size_t pagesize;

static size_t
read_rss (void)
{
  int fd = open ("/proc/self/statm", O_RDONLY);
  assert (fd != -1);
  char line[256];
  ssize_t r = read (fd, line, sizeof (line) - 1);
  assert (r != -1);
  line[r] = '\0';
  size_t rss;
  /* Second field of /proc/self/statm is the resident set size in pages.  */
  int n = sscanf (line, "%*u %zu", &rss);
  assert (n == 1);
  close (fd);
  return rss * pagesize;
}

static void *
allocate (void *args)
{
  enum { chunk = CHUNK };
  enum { ntimes = NTIMES };

  void *chunks[ntimes];
  for (int i = 0; i < ntimes; i++)
    {
      chunks[i] = malloc (chunk);
      assert (chunks[i] != NULL);
      memset (chunks[i], 0, chunk);
    }

  /* Free in reverse order of allocation.  */
  for (int i = ntimes - 1; i >= 0; i--)
    free (chunks[i]);

  return NULL;
}

int main (int argc, char *argv[])
{
  long ps = sysconf (_SC_PAGESIZE);
  assert (ps != -1);
  pagesize = ps;
  {
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
    printf ("allocate ...");
    allocate (NULL);
    printf ("done\n");
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  }

  return 0;
} 

Thread overview: 10+ messages
2021-11-25 17:20 Christian Hoff
2021-11-25 17:46 ` Konstantin Kharlamov
2021-11-25 18:12   ` Konstantin Kharlamov
2021-11-25 18:21     ` Carlos O'Donell
2021-11-25 20:56       ` Adhemerval Zanella [this message]
2021-11-26 18:10         ` Christian Hoff
2021-11-29 17:06           ` Patrick McGehearty
2021-11-25 18:20 ` Carlos O'Donell
2021-11-26 17:58   ` Christian Hoff
2021-11-29 19:44     ` Christian Hoff
