From: Joern Engel
To: "GNU C. Library"
Cc: Siddhesh Poyarekar, Joern Engel
Subject: [PATCH] malloc: add documentation
Date: Tue, 26 Jan 2016 00:27:00 -0000
Message-Id: <1453767942-19369-35-git-send-email-joern@purestorage.com>
In-Reply-To: <1453767942-19369-1-git-send-email-joern@purestorage.com>
References: <1453767942-19369-1-git-send-email-joern@purestorage.com>

JIRA: PURE-27597
---
 tpc/malloc2.13/design | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 tpc/malloc2.13/design

diff --git a/tpc/malloc2.13/design b/tpc/malloc2.13/design
new file mode 100644
index 000000000000..1c3018093cdd
--- /dev/null
+++ b/tpc/malloc2.13/design
@@ -0,0 +1,90 @@
+This malloc is based on glibc 2.13 malloc. Doug Lea started it all,
+Wolfram Gloger extended it for multithreading, Ulrich Drepper
+maintained it as part of glibc, then "improved" it so much we had to
+fork it.
+
+For an introduction, please read http://gee.cs.oswego.edu/dl/html/malloc.html
+
+Nomenclature:
+dlmalloc: Doug Lea's version
+ptmalloc: Wolfram Gloger's version
+libcmalloc: Libc version as of 2.13
+puremalloc: Pure Storage's version
+
+
+Arenas:
+Large allocations are done by mmap. All small allocations come from
+an arena, which is split into suitable chunks. Dlmalloc had a single
+arena, which became a locking hotspot. The arena was enlarged by
+sbrk(2). Ptmalloc introduced multiple arenas, creating them on the
+fly when observing lock contention.
+
+Libcmalloc made arenas per-thread, which further reduced lock
+contention, but significantly increased memory consumption. Glibc
+2.13 was the last version without per-thread arenas enabled. Costa
+gave Pure a private copy of 2.13 malloc to avoid the regression.
+
+Arenas are on a singly-linked list with a pointer kept in thread-local
+storage. If the last arena used by a thread is locked, the thread
+tries the next arena, and so on. If all arenas are locked, a new
+arena is created.
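+
+The following is an illustration only, not the actual puremalloc
+code; all names here are made up. The selection logic is roughly:
+
+	#include <pthread.h>
+
+	struct arena {
+		pthread_mutex_t lock;
+		struct arena *next;	/* next arena on this node's list */
+		/* ... bins, top chunk, statistics ... */
+	};
+
+	/* hypothetical helper that creates and links a new arena */
+	struct arena *new_arena(struct arena **list);
+
+	static __thread struct arena *cached_arena;	/* last arena this thread used */
+
+	static struct arena *pick_arena(struct arena **list)
+	{
+		struct arena *a = cached_arena ? cached_arena : *list;
+		struct arena *start = a;
+
+		do {
+			if (pthread_mutex_trylock(&a->lock) == 0) {
+				cached_arena = a;	/* uncontended arena: use it */
+				return a;
+			}
+			a = a->next ? a->next : *list;	/* try the next arena */
+		} while (a != start);
+
+		a = new_arena(list);	/* all arenas locked: create a new one */
+		pthread_mutex_lock(&a->lock);
+		cached_arena = a;
+		return a;
+	}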
+
+Jörn changed this into a singly-linked list per NUMA node. Threads
+always allocate from a NUMA-local arena.
+
+
+NUMA locality:
+All arenas use mbind() to preferentially get memory from just one
+NUMA node. In case of memory shortage the kernel is allowed to go
+cross-NUMA. As always, memory shortage should be avoided.
+
+The getcpu() syscall is used to detect the current NUMA node when
+allocating memory. If the scheduler moves threads to different NUMA
+nodes, performance will suffer. No surprise there. Syscall overhead
+could be a performance problem. We have plans to create a shared
+memory page containing information like the thread's current NUMA
+node to solve that. Surprisingly, the syscall overhead doesn't seem
+that high, so it may take a while.
+
+
+Hugepages:
+Using hugepages instead of small pages makes a significant difference
+in process exit time. We have repeatedly observed >20s spent in
+exit_mm(), freeing all process memory. Going from small pages to huge
+pages solves that problem. Puremalloc uses huge pages for all mmap
+allocations.
+
+The main_arena still uses sbrk() to allocate system memory, i.e. it
+uses small pages. To solve this, the main_arena was taken off the
+per-node lists. It is still used in special cases, e.g. when creating
+a new thread. Changing that is a lot of work and not worth it yet.
+
+
+Thread caching:
+tcmalloc and jemalloc demonstrated that a per-thread cache for malloc
+can be beneficial, so we introduced one in puremalloc. Freed objects
+initially stay in the thread cache. Roughly half the time they get
+reused by an allocation shortly afterwards. If objects exist in the
+cache, malloc doesn't have to take the arena lock.
+
+When going to the arena, puremalloc pre-allocates a second object.
+Pre-allocation further reduces arena contention. Pre-allocating more
+than one object yields diminishing returns; the performance
+difference between the thread cache and the arena just isn't high
+enough.
+
+
+Binning:
+The binning strategies of dlmalloc and jemalloc are pretty ad-hoc and
+discontinuous. Jemalloc is extremely fine-grained up to 4k, then
+jumps from 4k to 8k to 16k, etc. As a result, allocations slightly
+above 4k, 8k, etc. suffer nearly 100% overhead. Dlmalloc is less bad,
+but similar.
+
+For the thread cache, puremalloc uses 16 bins per power of two. This
+requires an implementation of fls(), which is not standardized in C.
+Fls() done in inline assembly is slower than a predicted jump, but
+faster than a mispredicted jump, so overall performance is about a
+wash. Not having corner cases with 100% memory overhead is a real
+benefit, though. Worst-case overhead is 6%, and that is hard to hit
+even when trying.
-- 
2.7.0.rc3
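For illustration, here is a sketch of how an arena region might be
mapped with huge pages and bound to the current NUMA node, as
described in the "NUMA locality" and "Hugepages" sections of the
patch above. This is not the actual puremalloc code: the function
name, the fall-back to small pages, and the use of MPOL_PREFERRED are
assumptions.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <numaif.h>	/* mbind(), MPOL_PREFERRED; link with -lnuma */

	static void *map_arena_memory(size_t len)
	{
		unsigned cpu = 0, node = 0;
		unsigned long nodemask;
		void *p;

		/* Try huge pages first; fall back to small pages if none
		 * are available. */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED)
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;

		/* Ask which NUMA node we are currently running on ... */
		syscall(SYS_getcpu, &cpu, &node, NULL);

		/* ... and prefer that node for this mapping; under memory
		 * pressure the kernel may still go to other nodes. */
		nodemask = 1UL << node;
		mbind(p, len, MPOL_PREFERRED, &nodemask,
		      sizeof(nodemask) * 8, 0);

		return p;
	}

Note that mbind() is called before the region is touched, so the
preferred policy takes effect when the pages are first faulted in.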
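The 16-bins-per-power-of-two sizing from the "Binning" section can be
sketched as follows. This is only an illustration: fls() is emulated
with a GCC builtin here rather than inline assembly, and the exact
rounding in puremalloc may differ.

	#include <stddef.h>

	/* 1-based index of the highest set bit, 0 for x == 0 (fls() is
	 * not standard C, so emulate it with a GCC builtin). */
	static inline int fls_sz(size_t x)
	{
		return x ? 64 - __builtin_clzll((unsigned long long)x) : 0;
	}

	/* Map a request size to one of 16 sub-bins per power of two:
	 * the four bits below the most significant bit select the
	 * sub-bin within that power of two. */
	static inline unsigned bin_index(size_t size)
	{
		int power = fls_sz(size);
		int shift = power > 5 ? power - 5 : 0;

		return (unsigned)power * 16 + ((size >> shift) & 15);
	}

Rounding a request up to the top of its sub-bin wastes at most
roughly 1/16 of the size, which is where the ~6% worst-case overhead
quoted above comes from, compared with the nearly 100% possible in a
coarse power-of-two scheme.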