Re: [RFT] malloc: reduce largebin granularity to reduce fragmentation

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Re: [RFT] malloc: reduce largebin granularity to reduce fragmentation
       [not found] <CAPBLoAfHRKGRjFQFKHXwK6vX7xxUqkoj?= =?ISO-8859-1?Q?-=5Fsttmfn4FgiEgondA@mail.gmail.com>
@ 2024-04-12 21:50 ` DJ Delorie
  0 siblings, 0 replies; 5+ messages in thread
From: DJ Delorie @ 2024-04-12 21:50 UTC (permalink / raw)
  To: Cristian Rodríguez; +Cc: Wilco.Dijkstra, e, libc-alpha


Cristian Rodrguez <cristian@rodriguez.im> writes:
> On Fri, Apr 12, 2024 at 11:19 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
>> My general feeling is that this allocator is way too ancient and hacked up to
>> compete with modern allocators like jemalloc or mimalloc.
>
> both are licensed as free software so glibc could adopt any of them.
> will that happen is the actual question.

This comes up occasionally, so I'll just link

https://lwn.net/ml/fedora-devel/xny3dw8l5o.fsf@greed.delorie.com/
https://lwn.net/Articles/761502/


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFT] malloc: reduce largebin granularity to reduce fragmentation
  2024-04-12 15:19 Wilco Dijkstra
  2024-04-12 15:35 ` Cristian Rodríguez
@ 2024-04-12 17:02 ` Eric Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2024-04-12 17:02 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha

Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Hi Eric,
> 
> I agree that limiting size classes will help with fragmentation. However
> you need more than this: when it finds a block that is larger than requested
> it will always split off the end even if it is a much smaller block. As a result
> large blocks get smaller over time. Additionally allocations that can't be
> satisfied by the free lists are taken from the large remainder of the arena,
> thus interleaving allocations of all sizes rather than grouping similar sizes
> together in separate regions like modern allocators.

Thanks for taking a look.

I think relying on a smaller (non-sliding) mmap threshold will
still be necessary atm for my use case.

Right now that's MALLOC_MMAP_THRESHOLD_=8388608 which is
analogous to the opt.oversize_threshold default in jemalloc.

jemalloc has a dedicated arena for those allocations instead of
doing mmap each time.  This may be another idea we can steal :>

> The tcache and smallbin freelist also keeps freed block listed as 'allocated',
> thus blocking merging of free blocks into larger ones - this is only done
> every now and again during a consolidation pass. Threads that free blocks
> but never allocate a large block again may never really free their freed
> blocks.

Yeah, I'm only focusing on single-threaded behavior ATM.

My wfcqueue idea from 2018 should make things much nicer to
reduce fragmentation w/ threads.  I'm mainly working on
single-threaded codebases nowadays, so not yet motivated to work
on it.

> So to reduce fragmentation you also have to address all these issues too.

Agreed, I'm trying to take small incremental steps in the right
direction.  Perfect is the enemy of good, of course.

> My general feeling is that this allocator is way too ancient and hacked up to
> compete with modern allocators like jemalloc or mimalloc.

I remember there were compatibility reasons to stick with the
current one.  That's way above my paygrade, though.  That said,
I do believe our old design can be improved to be competitive
with newer mallocs.

Getting users to run LD_PRELOAD or recompile Perl against a
different malloc is a huge ask; and I don't like having
redundant software on my systems for auditability and space
reasons.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFT] malloc: reduce largebin granularity to reduce fragmentation
  2024-04-12 15:19 Wilco Dijkstra
@ 2024-04-12 15:35 ` Cristian Rodríguez
  2024-04-12 17:02 ` Eric Wong
  1 sibling, 0 replies; 5+ messages in thread
From: Cristian Rodríguez @ 2024-04-12 15:35 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: e, GNU C Library

On Fri, Apr 12, 2024 at 11:19 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> My general feeling is that this allocator is way too ancient and hacked up to
> compete with modern allocators like jemalloc or mimalloc.

both are licensed as free software so glibc could adopt any of them.
will that happen is the actual question.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFT] malloc: reduce largebin granularity to reduce fragmentation
@ 2024-04-12 15:19 Wilco Dijkstra
  2024-04-12 15:35 ` Cristian Rodríguez
  2024-04-12 17:02 ` Eric Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Wilco Dijkstra @ 2024-04-12 15:19 UTC (permalink / raw)
  To: e; +Cc: 'GNU C Library'

Hi Eric,

I agree that limiting size classes will help with fragmentation. However
you need more than this: when it finds a block that is larger than requested
it will always split off the end even if it is a much smaller block. As a result
large blocks get smaller over time. Additionally allocations that can't be
satisfied by the free lists are taken from the large remainder of the arena,
thus interleaving allocations of all sizes rather than grouping similar sizes
together in separate regions like modern allocators.

The tcache and smallbin freelist also keeps freed block listed as 'allocated',
thus blocking merging of free blocks into larger ones - this is only done
every now and again during a consolidation pass. Threads that free blocks
but never allocate a large block again may never really free their freed
blocks.

So to reduce fragmentation you also have to address all these issues too.
My general feeling is that this allocator is way too ancient and hacked up to
compete with modern allocators like jemalloc or mimalloc.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFT] malloc: reduce largebin granularity to reduce fragmentation
@ 2024-04-09  9:33 Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2024-04-09  9:33 UTC (permalink / raw)
  To: libc-alpha

Anybody volunteers to help test and get reproducible results for
this?  Thanks in advance.

Testing and getting reproducible results is proving extremely
expensive due to trace sizes and CPU usage required to handle
my existing web/IMAP/NNTP-facing traffic.

But the theory works in my mind and could be a solution to a
problem I've noticed for decades at this point across long-lived
Ruby and Perl web daemons.

I'm also considering having this as a tunable and mallopt(3).

And perhaps limiting the alignment to pagesize can work, because
20% of a large sliding mmap_threshold on 64-bit is several
megabytes...

------8<------
From: Eric Wong <normalperson@yhbt.net>
Subject: [PATCH] malloc: reduce largebin granularity to reduce fragmentation

TL;DR: trade initial best fits for better fits over long process
lifetimes

Size classes in modern versions of jemalloc add a 0-20% overhead
in an attempt to get better fits over time when dealing with
unpredictably-sized allocations and frees.  Initially, this can
be more wasteful than the best (initial) fit strategy used by
dlmalloc, but less wasteful than buddy allocators (which can
have 0-99% overhead).

While the dlmalloc "best fit" strategy is ideal for one-off
permanent allocations and applications that only deal in uniform
allocation sizes; "best fit" gets bogged down over time when
dealing with variably-sized allocations with mixed and
interleaving lifetimes commonly seen in high-level language
runtimes.

Such allocation patterns are common in long-lived web app, NNTP,
and IMAP servers doing text processing with concurrent ("C10K").
Unpredictable lifetimes are often a result of resource sharing
between network clients of disparate lifetimes, but some of it
is outside the application developers control.

Fragmentation is further exacerbated by long-lived allocations
happening late in process life of high-level language runtimes
(e.g. Perl and Ruby).  These runtimes and their standard
libraries can trigger long-lived allocations late in process
life due to the use lazy loading, caching, and internal slab
allocators using malloc to create slabs.

This change only affects largebin allocations which are too
small for mmap, and too small for smallbins and fastbins.
---
 malloc/malloc.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/malloc/malloc.c b/malloc/malloc.c
index bcb6e5b83c..9a1057f5a7 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3288,6 +3288,56 @@ tcache_thread_shutdown (void)

 #endif /* !USE_TCACHE  */

+/*
+ * Use jemalloc-inspired size classes for largebin allocations to
+ * minimize fragmentation.  This means we pay a 0-20% overhead on
+ * allocations to improve the likelyhood of reuse across many
+ * odd-sized allocations and frees.  This should reduce fragmentation
+ * in long-lived applications processing various text fragments
+ * (e.g. mail, news, and web services)
+ */
+static inline void
+size_class_align (size_t *nb, size_t *req)
+{
+  // TODO: make this a mallopt + tunable?
+
+  if (in_smallbin_range (*nb))
+    return;
+
+  size_t mm_thresh = MAX (DEFAULT_MMAP_THRESHOLD_MAX, mp_.mmap_threshold);
+
+  if (*nb >= mm_thresh && mp_.n_mmaps < mp_.n_mmaps_max)
+    return;
+
+  size_t n = *req - 1;
+  for (size_t i = 1; i < sizeof (size_t) * CHAR_BIT; i <<= 1)
+    n |= n >> i;
+
+  size_t next_power_of_two = n + 1;
+
+  /*
+   * This size alignment causes up to 20% overhead which can be several
+   * megabytes on 64-bit systems with high mmap_threshold.  Perhaps we
+   * can clamp alignment to pagesize or similar to save space.
+   */
+  size_t align = next_power_of_two >> 3;
+  size_t areq = ALIGN_UP (*req, align);
+  size_t anb = checked_request2size (areq);
+
+  if (anb == 0)
+    return; // aligned size is too big, but unaligned wasn't
+
+  if (anb < mm_thresh || mp_.n_mmaps >= mp_.n_mmaps_max)
+    {
+      *nb = anb;
+      *req = areq;
+    }
+  else // too big for largebins, force it to mmap
+    {
+      *nb = mm_thresh;
+    }
+}
+
 #if IS_IN (libc)
 void *
 __libc_malloc (size_t bytes)
@@ -3308,6 +3358,7 @@ __libc_malloc (size_t bytes)
       __set_errno (ENOMEM);
       return NULL;
     }
+  size_class_align (&tbytes, &bytes);
   size_t tc_idx = csize2tidx (tbytes);

   MAYBE_INIT_TCACHE ();
@@ -3503,6 +3554,8 @@ __libc_realloc (void *oldmem, size_t bytes)
       return newmem;
     }

+  size_class_align (&nb, &bytes);
+
   if (SINGLE_THREAD_P)
     {
       newp = _int_realloc (ar_ptr, oldp, oldsize, nb);
@@ -3713,6 +3766,13 @@ __libc_calloc (size_t n, size_t elem_size)
     }

   sz = bytes;
+  size_t nb = checked_request2size (sz);
+  if (nb == 0)
+    {
+      __set_errno (ENOMEM);
+      return NULL;
+    }
+  size_class_align (&nb, &sz);

   if (!__malloc_initialized)
     ptmalloc_init ();

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-04-12 21:50 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAPBLoAfHRKGRjFQFKHXwK6vX7xxUqkoj?= =?ISO-8859-1?Q?-=5Fsttmfn4FgiEgondA@mail.gmail.com>
2024-04-12 21:50 ` [RFT] malloc: reduce largebin granularity to reduce fragmentation DJ Delorie
2024-04-12 15:19 Wilco Dijkstra
2024-04-12 15:35 ` Cristian Rodríguez
2024-04-12 17:02 ` Eric Wong
  -- strict thread matches above, loose matches on Subject: below --
2024-04-09  9:33 Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).