public inbox for libc-alpha@sourceware.org
* malloc: performance improvements and bugfixes
@ 2016-01-26  0:26 Joern Engel
From: Joern Engel @ 2016-01-26  0:26 UTC
  To: GNU C. Library; +Cc: Siddhesh Poyarekar, Joern Engel

From: Joern Engel <joern@purestorage.org>

Short version:
We have forked libc malloc and added a bunch of patches on top.  Some
patches help performance, some fix bugs, many just change the code to
my personal liking.  Here is a braindump that is _not_ intended to be
merged, at least not as-is.  But individual bits could and should get
extracted.

Long version:
When upgrading glibc from 2.13 to a newer version, we started hitting
OOM bugs.  These were caused by enabling PER_THREAD on newer versions.
We split malloc-2.13 from our previous libc and used that instead.
The beginning of a fork.

Later we found various other problems.  Since we now owned the code,
we made use of it.  Overall our version is roughly on par with
jemalloc, while libc malloc gets replaced by most projects that care
about performance and use multithreading.

Some of our changes may be completely unpalatable to libc.  I made no
distinction and am giving you the entire list, if only so you can see
what some people might care about.


Rough list:

Use Lindent and unifdef.
I happen to prefer the kernel coding style over GNU coding style.
These only helped me read the code and make changes, but are
absolutely no upstream material.  Sorry about the noise.

Revert PER_THREAD.
Per-thread arenas are an exquisitely bad idea.  If a thread uses a lot
of memory, then frees most, malloc will hang on to the free memory and
neither return it to the system nor use it for other threads.
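
Current glibc offers a knob that bounds this problem instead of
removing per-thread arenas: mallopt(M_ARENA_MAX, n), or the
MALLOC_ARENA_MAX environment variable, caps the number of arenas.  A
minimal sketch, assuming a current glibc; cap_arenas() is a
hypothetical helper name.  It limits the bloat rather than curing it,
since a capped set of arenas can still hold on to freed memory.

```c
#include <assert.h>
#include <malloc.h>

/*
 * Cap the number of malloc arenas.  Returns mallopt()'s result:
 * 1 on success, 0 on error.  Fewer arenas make it more likely that
 * memory freed by one thread is reused by another, at the price of
 * more contention on the remaining arena locks.
 */
static int cap_arenas(int max_arenas)
{
	assert(max_arenas > 0);
	return mallopt(M_ARENA_MAX, max_arenas);
}
```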

Remove mprotect
While I admit that some people might care about commit charge, I wager
that most people don't and in particular we don't.  The way malloc
uses mprotect turned the mmap_sem into the single worst lock inside
the Linux kernel.  Removing mprotect mostly fixed that.
Mprotect also triggers bad behaviour in the kernel VM.  Far more VMAs
get created, and after reaching 64k VMAs the kernel refused further
mmap() calls from our process.  We effectively ran out of memory with
gigabytes of free memory available to the system.
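
The VMA growth is easy to reproduce outside malloc.  The sketch below
is a hypothetical demo, not code from the patch series, and shows the
mechanism in exaggerated form: it reserves a PROT_NONE region (malloc
heaps are reserved PROT_NONE and made writable with mprotect as they
grow), then mprotect()s every other page read-write.  Adjacent pages
now differ in protection, so the kernel must split the reservation
into roughly one VMA per page.  Linux-specific; count_vmas() and
fragment_region() are illustrative names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count this process's VMAs: one line per mapping in /proc/self/maps. */
static int count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	int c, lines = 0;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			lines++;
	fclose(f);
	return lines;
}

/*
 * Reserve npages of PROT_NONE address space, then make every other
 * page read-write.  Each protection change splits a VMA, so the single
 * reservation ends up as about npages VMAs.
 * Returns the base address, or NULL on failure.
 */
static char *fragment_region(size_t npages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *base;

	assert(npages > 0);
	base = mmap(NULL, npages * page, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;
	for (size_t i = 0; i < npages; i += 2) {
		if (mprotect(base + i * page, page, PROT_READ | PROT_WRITE)) {
			munmap(base, npages * page);
			return NULL;
		}
	}
	return base;
}
```

Scale the same splitting up to a large, long-running heap and the
kernel's per-process limit (vm.max_map_count, 65530 by default) comes
within reach.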

Use hugepages
In our project hugepages have become a necessity for low latency.
Transparent hugepages aren't good enough, so we have to deal with them
explicitly.  Probably not upstream-material.
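
Dealing with hugepages explicitly usually means asking for them and
falling back when the reserved pool is empty.  A minimal sketch of that
pattern, not the patch itself; it assumes 2 MiB huge pages (the x86-64
default), and map_maybe_huge() and HUGE_PAGE_SIZE are hypothetical
names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, the x86-64 default. */
#define HUGE_PAGE_SIZE (2UL << 20)

/*
 * Map *len bytes of anonymous memory, preferring explicit huge pages.
 * *len is rounded up to a huge-page multiple (munmap() of a hugetlb
 * mapping requires that anyway) and *huge reports which kind we got.
 * Returns NULL if both attempts fail.
 */
static void *map_maybe_huge(size_t *len, bool *huge)
{
	size_t rounded;
	void *p;

	assert(len && huge);
	rounded = (*len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
	*len = rounded;

	p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED) {
		*huge = true;
		return p;
	}

	/* No huge pages reserved (vm.nr_hugepages == 0) or the pool is
	 * exhausted: fall back to ordinary pages. */
	p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	*huge = false;
	return p == MAP_FAILED ? NULL : p;
}
```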

Cleanup of arena_get macros
Removes duplicate (and buggy) code and simplifies the logic.  Existing
code outgrew the size where macros may have made sense.

NUMA support
Once you have a NUMA system, this helps a lot.  Currently does a
syscall for every allocation.  Surprisingly the syscall hardly shows
up in profiles and the benefits clearly dominate.  If libc exposed the
vdso-version of getcpu(), that would be much nicer.
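
A minimal sketch of node-local arena selection with one getcpu system
call per lookup, assuming Linux; node_arenas, MAX_NUMA_NODES and the
helper names are hypothetical and only illustrate the idea, they are
not the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_NUMA_NODES 64	/* hypothetical table size */

struct arena;			/* opaque stand-in for malloc's arena state */

/* One arena per NUMA node, filled in at init time (left empty here). */
static struct arena *node_arenas[MAX_NUMA_NODES];

/*
 * Ask the kernel which NUMA node the calling thread is running on.
 * This enters the kernel on every call; a vDSO getcpu() would not.
 */
static unsigned current_numa_node(void)
{
	unsigned cpu, node;

	if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
		return 0;	/* be conservative: fall back to node 0 */
	return node;
}

/* Pick the arena that is local to the current node. */
static struct arena *arena_for_this_node(void)
{
	return node_arenas[current_numa_node() % MAX_NUMA_NODES];
}
```

The thread can migrate right after the call, so the answer is only a
hint; for arena selection that is good enough.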

Remove __builtin_expect
I benchmarked the effect.  Even if I reversed the logic and marked
unlikely branches likely and vice versa, there was absolutely no
measurable effect.  Filed under cargo cult and removed.
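
For reference, the pattern in question: __builtin_expect(expr, c) is a
GCC builtin (Clang supports it too) that evaluates to expr and records
c as the expected value.  The compiler may use it for block layout, but
it never changes what the code computes, which is why swapping the
hints is a pure performance experiment.  likely()/unlikely() are the
conventional kernel-style wrapper names (glibc spells them
__glibc_likely/__glibc_unlikely); request_too_large() is a made-up
example.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional wrappers; !!(x) normalises any scalar to 0 or 1. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical check on an allocation request that is almost never true. */
static int request_too_large(size_t bytes, size_t limit)
{
	if (unlikely(bytes > limit))
		return 1;
	return 0;
}
```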

Revert 1d05c2fb9c6f (Val and Arjan's change to dynamically grow heaps)
I couldn't figure out how the logic actually worked.  While I might
not be the best programmer in the world, I find that disturbing for
what is conceptually such a simple change.  Hence,...

Removed ATOMIC_FASTBINS
Not sure if this was a good change, but the atomic_free_list
(below) recovered the performance, covers more than just fastbins and
is simpler code.

Added a thread cache
A per-thread cache gives most of the performance benefits of
per-thread arenas without the drawback of memory bloat.  The cache
holds at most 128k per thread, which is less than most threads' stack
consumption, so the cost should be acceptable.
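
A minimal sketch of such a cache: per-thread singly linked free lists
per 16-byte size class, capped at a 128k byte budget, backed here by
plain malloc()/free() in place of an arena.  All names are
hypothetical, and it assumes a sized free (the caller passes the size
back), whereas real malloc reads the size from the chunk header.  It
also does not drain the cache on thread exit; the series has a
separate patch for that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TCACHE_MAX_BYTES	(128 * 1024)	/* per-thread budget */
#define TCACHE_GRANULE		16
#define TCACHE_CLASSES		64		/* sizes up to 1 KiB */

/* A cached block stores the list link in its own (free) payload. */
struct tc_block {
	struct tc_block *next;
};

struct tcache {
	struct tc_block *bins[TCACHE_CLASSES];
	size_t bytes;			/* total bytes currently cached */
};

/* Each thread gets its own copy, so no locking is needed. */
static __thread struct tcache tc;

/* Size class c holds blocks of exactly (c + 1) * TCACHE_GRANULE bytes. */
static size_t size_class(size_t n)
{
	return n ? (n - 1) / TCACHE_GRANULE : 0;
}

static void *tc_malloc(size_t n)
{
	size_t c = size_class(n);

	if (c >= TCACHE_CLASSES)
		return malloc(n);		/* too big to cache */
	if (tc.bins[c]) {			/* hit: pop, no lock taken */
		struct tc_block *b = tc.bins[c];

		tc.bins[c] = b->next;
		tc.bytes -= (c + 1) * TCACHE_GRANULE;
		return b;
	}
	/* Miss: round up so the block can serve any size in its class. */
	return malloc((c + 1) * TCACHE_GRANULE);
}

/* Sized free: n must equal the size passed to tc_malloc(). */
static void tc_free(void *p, size_t n)
{
	size_t c = size_class(n);
	size_t sz = (c + 1) * TCACHE_GRANULE;

	if (!p)
		return;
	if (c < TCACHE_CLASSES && tc.bytes + sz <= TCACHE_MAX_BYTES) {
		struct tc_block *b = p;

		b->next = tc.bins[c];
		tc.bins[c] = b;
		tc.bytes += sz;
		return;
	}
	free(p);				/* over budget: give it back */
}
```

The byte budget is what separates this from per-thread arenas: a
thread that frees a lot keeps at most 128k, and everything beyond that
goes back to the shared allocator where other threads can reuse it.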

Added atomic_free_list
Makes free() lockless.  If the arena is locked, we just push memory to
the atomic_free_list and let someone else do the actual free later.
Before this change we had an average of three (3) threads blocked on
an arena lock in the stable state.
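
A minimal sketch of the idea using C11 atomics: free() tries the arena
lock, and if it is busy, pushes the block onto a lock-free list that
the next lock holder drains.  The struct layout and function names are
hypothetical, and the real bin manipulation is replaced by a counter.
The same property covers a free() from a signal handler that
interrupted the lock holder: the trylock fails and the block is
deferred instead of deadlocking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* A deferred block reuses its own payload for the list link. */
struct free_node {
	struct free_node *next;
};

struct arena {
	pthread_mutex_t lock;
	_Atomic(struct free_node *) atomic_free_list;
	size_t nfreed;		/* stand-in for the real bin bookkeeping */
};

/* The actual free work; caller must hold a->lock. */
static void free_locked(struct arena *a, struct free_node *n)
{
	(void)n;
	a->nfreed++;
}

/*
 * Lock-free push (a Treiber stack).  The consumer only ever detaches
 * the whole list with an exchange, never pops single nodes, so the
 * classic ABA problem of lock-free pops does not arise.
 */
static void push_deferred(struct arena *a, struct free_node *n)
{
	struct free_node *head =
		atomic_load_explicit(&a->atomic_free_list, memory_order_relaxed);

	do {
		n->next = head;
	} while (!atomic_compare_exchange_weak_explicit(
			&a->atomic_free_list, &head, n,
			memory_order_release, memory_order_relaxed));
}

/* Caller holds a->lock: detach everything deferred so far and free it. */
static void drain_deferred(struct arena *a)
{
	struct free_node *n = atomic_exchange_explicit(
		&a->atomic_free_list, NULL, memory_order_acquire);

	while (n) {
		struct free_node *next = n->next;

		free_locked(a, n);
		n = next;
	}
}

/* free() never blocks: if the arena is busy, leave the block for later. */
static void arena_free(struct arena *a, void *p)
{
	if (pthread_mutex_trylock(&a->lock) != 0) {
		push_deferred(a, p);
		return;
	}
	drain_deferred(a);
	free_locked(a, p);
	pthread_mutex_unlock(&a->lock);
}
```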

Fix startup race
I suppose no one ever hit this because main_arena initialized so
darn fast that the initializing thread always won the race.  I changed
the timings, mostly with NUMA code, and started losing.
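
The message does not say how the race was fixed.  One standard way to
make one-time initialization race-free is pthread_once(); the sketch
below shows that, with hypothetical names, and is not the patch.  One
caveat: pthread_once() cannot safely be re-entered from its own init
routine, so if initialization itself calls malloc (the series has a
patch allowing recursion from ptmalloc_init to malloc), a real fix
needs an escape hatch this sketch lacks.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t malloc_once = PTHREAD_ONCE_INIT;
static int init_calls;		/* how often the init routine ran */

/* Hypothetical one-time setup: main_arena, NUMA node table, etc. */
static void malloc_init_routine(void)
{
	init_calls++;
}

/*
 * Called at the top of every allocation entry point.  pthread_once()
 * runs the routine exactly once and makes every other caller wait until
 * it has finished, so no thread can see a half-initialized arena.
 */
static void ensure_malloc_initialized(void)
{
	int err = pthread_once(&malloc_once, malloc_init_routine);

	assert(err == 0);
	(void)err;
}
```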

Simplify calloc
I believe the same also happened upstream and later got reverted.  I
couldn't find the rationale for the revert and find it dodgy.
Technically the existing version of calloc can be faster early on, but
not for long-running processes in the stable state.  And once I found
bugs in calloc I couldn't be arsed to debug them and just removed most
of the code.
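
A minimal sketch of calloc() reduced to overflow check, malloc and
memset; simple_calloc() is a hypothetical name and the actual patch
may differ.  It uses the GCC/Clang builtin __builtin_mul_overflow.
What it gives up is the early-on advantage mentioned above: stock glibc
calloc skips the memset for chunks it got fresh from mmap, which the
kernel has already zeroed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* calloc as overflow check + allocate + clear, with no special cases. */
static void *simple_calloc(size_t nmemb, size_t size)
{
	size_t bytes;
	void *p;

	if (__builtin_mul_overflow(nmemb, size, &bytes)) {
		errno = ENOMEM;
		return NULL;
	}
	p = malloc(bytes);
	if (p)
		memset(p, 0, bytes);
	return p;
}
```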

Made malloc signal-safe
I think malloc() was always signal-safe, but free() wasn't.  It isn't
hard to trigger this in a testcase.  Our version survives such a test,
mostly because of the atomic_free_list.

Fix calculation of aligned heaps
Looks like this was always buggy.  Is that correct or was I misreading
the code?

Remove hooks
I don't understand what problem they were supposed to solve.  Our
project doesn't seem to need them and I have testcases that break
because of the hooks.


If any of this looks interesting for upstream and you have questions,
feel free to pester me.

And maybe as a closing note, I believe there are some applications that
have deeper knowledge about malloc-internal data structures than they
should (*cough*emacs).  As a result it has become impossible to change
the internals of malloc without breaking said applications and libc
malloc has ossified.

At this point, either a handful of applications need to ship the
ossified version of malloc or Everything Else(tm) has to switch to a
better version of malloc.  The reality we live in has everything else
ship tcmalloc, jemalloc or somesuch and libc malloc is slowly becoming
irrelevant and the butt of hallway jokes.  I don't find this reality
very desirable, and yet here we are.

Jörn

Thread overview: 119+ messages
2016-01-26  0:26 malloc: performance improvements and bugfixes Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: remove dead code Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: unobfuscate an assert Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: Lindent new_heap Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: use MAP_HUGETLB when possible Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: turn arena_get() into a function Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: push down the memset for huge pages Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: kill mprotect Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: unifdef -m -Ulibc_hidden_def Joern Engel
2016-01-26  0:26 ` [PATCH] malloc: remove emacs style guards Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: limit free_atomic_list() latency Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: simplify and fix calloc Joern Engel
2016-01-26 10:32   ` Will Newton
2016-01-26  0:27 ` [PATCH] malloc: avoid main_arena Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: better inline documentation Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: Revert glibc 1d05c2fb9c6f Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: create a useful assert Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: tune thread cache Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: quenche last compiler warnings Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: fix startup races Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: make numa_node_count more robust Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: Lindent before functional changes Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: Lindent public_fREe() Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: add documentation Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: s/max_node/num_nodes/ Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: remove __builtin_expect Joern Engel
2016-01-26  7:56   ` Yury Gribov
2016-01-26  9:00     ` Jörn Engel
2016-01-26  9:37       ` Yury Gribov
2016-01-26 15:46       ` Jeff Law
2016-01-26 20:43     ` Steven Munroe
2016-01-26 21:08       ` Florian Weimer
2016-01-26 21:35         ` Steven Munroe
2016-01-26 21:42           ` Jeff Law
2016-01-27  0:37             ` Steven Munroe
2016-01-27  3:16               ` Jeff Law
2016-01-26 21:45           ` Florian Weimer
2016-01-26  0:27 ` [PATCH] malloc: remove mstate typedef Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: introduce get_backup_arena() Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: unifdef -m -DUSE_ARENAS -DHAVE_MMAP Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: fix hard-coded constant Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: prefetch for tcache_malloc Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: destroy thread cache on thread exit Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: initial numa support Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: unifdef -m -UPER_THREAD -U_LIBC Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: Lindent users of arena_get2 Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: always free objects locklessly Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: document and fix linked list handling Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: hide THREAD_STATS Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: use atomic free list Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: add locking to thread cache Joern Engel
2016-01-26 12:45   ` Szabolcs Nagy
2016-01-26 13:14     ` Yury Gribov
2016-01-26 13:14     ` Florian Weimer
2016-01-26 13:23       ` Yury Gribov
2016-01-26 13:40         ` Szabolcs Nagy
2016-01-26 18:00           ` Mike Frysinger
     [not found]             ` <56A8966D.9080000@arm.com>
2016-01-27 17:45               ` Jörn Engel
2016-01-27 19:19                 ` Torvald Riegel
2016-01-27 19:43                   ` Jörn Engel
2016-01-27 21:36                 ` Carlos O'Donell
2016-01-26 17:41     ` Jörn Engel
2016-01-26  0:27 ` [PATCH] malloc: use tsd_getspecific for arena_get Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: fix local_next handling Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: unifdef -m -UATOMIC_FASTBINS Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: fix perturb_byte handling for tcache Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: unifdef -D__STD_C Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: use mbind() Joern Engel
2016-01-26  0:27 ` [PATCH] malloc: only free half the objects on tcache_gc Joern Engel
2016-01-26  0:28 ` [PATCH] malloc: plug thread-cache memory leak Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: use bitmap to conserve hot bins Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove stale condition Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove tcache prefetching Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: fix mbind on old kernels Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: create aliases for malloc, free, Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: allow recursion from ptmalloc_init to malloc Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove all remaining hooks Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove atfork hooks Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: Don't call tsd_setspecific before tsd_key_create Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: speed up mmap Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: fix calculation of aligned heaps Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove hooks from malloc() and free() Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: brain-dead thread cache Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: rename *.ch to *.h Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: move out perturb_byte conditionals Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: define __libc_memalign Joern Engel
2016-01-26  0:32 ` [PATCH] malloc: remove get_backup_arena() from tcache_malloc() Joern Engel
2016-01-26  0:50 ` malloc: performance improvements and bugfixes Paul Eggert
2016-01-26  1:01   ` Jörn Engel
2016-01-26  1:52     ` Joseph Myers
2016-01-26  1:52     ` Siddhesh Poyarekar
2016-01-26  2:45       ` Jörn Engel
2016-01-26  3:22         ` Jörn Engel
2016-01-26  6:22           ` Mike Frysinger
2016-01-26  7:54             ` Jörn Engel
2016-01-26  9:53               ` Florian Weimer
2016-01-26 17:05                 ` Jörn Engel
2016-01-26 17:31                   ` Paul Eggert
2016-01-26 17:48                     ` Adhemerval Zanella
2016-01-26 17:49                   ` Joseph Myers
2016-01-26 17:57                   ` Mike Frysinger
2016-01-27 20:46                     ` Manuel López-Ibáñez
2016-01-26 12:37           ` Torvald Riegel
2016-01-26 13:23           ` Florian Weimer
2016-01-26  7:40         ` Paul Eggert
2016-01-26  9:54         ` Florian Weimer
2016-01-26 20:50   ` Steven Munroe
2016-01-26 21:40     ` Florian Weimer
2016-01-26 21:48       ` Steven Munroe
2016-01-26 21:51         ` Florian Weimer
2016-01-26 21:51       ` Paul Eggert
2016-01-26 21:57         ` Florian Weimer
2016-01-26 22:18           ` Paul Eggert
2016-01-26 22:24             ` Florian Weimer
2016-01-27  1:31               ` Paul Eggert
2016-01-26 22:00       ` Jörn Engel
2016-01-26 22:02         ` Florian Weimer
2016-01-27 21:45   ` Carlos O'Donell
2016-01-28 13:51 ` Carlos O'Donell