From: Joern Engel
To: "GNU C. Library"
Cc: Siddhesh Poyarekar, Joern Engel
Subject: [PATCH] malloc: add documentation
Date: Tue, 26 Jan 2016 00:27:00 -0000
Message-Id: <1453767942-19369-35-git-send-email-joern@purestorage.com>
In-Reply-To: <1453767942-19369-1-git-send-email-joern@purestorage.com>
References: <1453767942-19369-1-git-send-email-joern@purestorage.com>

JIRA: PURE-27597
---
 tpc/malloc2.13/design | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 tpc/malloc2.13/design

diff --git a/tpc/malloc2.13/design b/tpc/malloc2.13/design
new file mode 100644
index 000000000000..1c3018093cdd
--- /dev/null
+++ b/tpc/malloc2.13/design
@@ -0,0 +1,90 @@
+This malloc is based on glibc 2.13 malloc. Doug Lea started it all,
+Wolfram Gloger extended it for multithreading, Ulrich Drepper
+maintained it as part of glibc, then "improved" it so much we had to
+fork it.
+
+For an introduction, please read http://gee.cs.oswego.edu/dl/html/malloc.html
+
+Nomenclature:
+dlmalloc: Doug Lea's version
+ptmalloc: Wolfram Gloger's version
+libcmalloc: Libc version as of 2.13
+puremalloc: Pure Storage's version
+
+
+Arenas:
+Large allocations are done by mmap. All small allocations come from
+an arena, which is split into suitable chunks. Dlmalloc had a single
+arena, which became a locking hotspot. The arena was enlarged by
+sbrk(2). Ptmalloc introduced multiple arenas, creating them on the
+fly when observing lock contention.
+
+Libcmalloc made arenas per-thread, which further reduced lock
+contention, but significantly increased memory consumption. Glibc
+2.13 was the last version without per-thread arenas enabled. Costa
+gave Pure a private copy of 2.13 malloc to avoid the regression.
+
+Arenas are on a singly-linked list with a pointer kept in thread-local
+storage. If the last arena used by a thread is locked, the thread
+tries the next arena, and so on. If all arenas are locked, a new
+arena is created.
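+
+The following is an illustration only, not the actual puremalloc
+code; all names here are made up. The selection logic is roughly:
+
+	#include <pthread.h>
+
+	struct arena {
+		pthread_mutex_t lock;
+		struct arena *next;	/* next arena on this node's list */
+		/* ... bins, top chunk, statistics ... */
+	};
+
+	/* hypothetical helper that creates and links a new arena */
+	struct arena *new_arena(struct arena **list);
+
+	static __thread struct arena *cached_arena;	/* last arena this thread used */
+
+	static struct arena *pick_arena(struct arena **list)
+	{
+		struct arena *a = cached_arena ? cached_arena : *list;
+		struct arena *start = a;
+
+		do {
+			if (pthread_mutex_trylock(&a->lock) == 0) {
+				cached_arena = a;	/* uncontended arena: use it */
+				return a;
+			}
+			a = a->next ? a->next : *list;	/* try the next arena */
+		} while (a != start);
+
+		a = new_arena(list);	/* all arenas locked: create a new one */
+		pthread_mutex_lock(&a->lock);
+		cached_arena = a;
+		return a;
+	}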
+
+Jörn changed this into a singly-linked list per NUMA node. Threads
+always allocate from a NUMA-local arena.
+
+
+NUMA locality:
+All arenas use mbind() to preferentially get memory from just one
+NUMA node. In case of memory shortage the kernel is allowed to go
+cross-NUMA. As always, memory shortage should be avoided.
+
+The getcpu() syscall is used to detect the current NUMA node when
+allocating memory. If the scheduler moves threads to different NUMA
+nodes, performance will suffer. No surprise there. Syscall overhead
+could be a performance problem. We have plans to create a shared
+memory page containing information like the thread's current NUMA
+node to solve that. Surprisingly, the syscall overhead doesn't seem
+that high, so it may take a while.
+
+
+Hugepages:
+Using hugepages instead of small pages makes a significant difference
+in process exit time. We have repeatedly observed >20s spent in
+exit_mm(), freeing all process memory. Going from small pages to huge
+pages solves that problem. Puremalloc uses huge pages for all mmap
+allocations.
+
+The main_arena still uses sbrk() to allocate system memory, i.e. it
+uses small pages. To solve this, the main_arena was taken off the
+per-node lists. It is still used in special cases, e.g. when creating
+a new thread. Changing that is a lot of work and not worth it yet.
+
+
+Thread caching:
+tcmalloc and jemalloc demonstrated that a per-thread cache for malloc
+can be beneficial, so we introduced one in puremalloc. Freed objects
+initially stay in the thread cache. Roughly half the time they get
+reused by an allocation shortly afterwards. If objects exist in the
+cache, malloc doesn't have to take the arena lock.
+
+When going to the arena, puremalloc pre-allocates a second object.
+Pre-allocation further reduces arena contention. Pre-allocating more
+than one object yields diminishing returns; the performance
+difference between the thread cache and the arena just isn't high
+enough.
+
+
+Binning:
+The binning strategies of dlmalloc and jemalloc are pretty ad-hoc and
+discontinuous. Jemalloc is extremely fine-grained up to 4k, then
+jumps from 4k to 8k to 16k, etc. As a result, allocations slightly
+above 4k, 8k, etc. suffer nearly 100% overhead. Dlmalloc is less bad,
+but similar.
+
+For the thread cache, puremalloc uses 16 bins per power of two. This
+requires an implementation of fls(), which is not standardized in C.
+Fls() done in inline assembly is slower than a predicted jump, but
+faster than a mispredicted jump, so overall performance is about a
+wash. Not having corner cases with 100% memory overhead is a real
+benefit, though. Worst-case overhead is 6%, and that is hard to hit
+even when trying.
-- 
2.7.0.rc3
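For illustration, here is a sketch of how an arena region might be
mapped with huge pages and bound to the current NUMA node, as
described in the "NUMA locality" and "Hugepages" sections of the
patch above. This is not the actual puremalloc code: the function
name, the fall-back to small pages, and the use of MPOL_PREFERRED are
assumptions.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <numaif.h>	/* mbind(), MPOL_PREFERRED; link with -lnuma */

	static void *map_arena_memory(size_t len)
	{
		unsigned cpu = 0, node = 0;
		unsigned long nodemask;
		void *p;

		/* Try huge pages first; fall back to small pages if none
		 * are available. */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED)
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;

		/* Ask which NUMA node we are currently running on ... */
		syscall(SYS_getcpu, &cpu, &node, NULL);

		/* ... and prefer that node for this mapping; under memory
		 * pressure the kernel may still go to other nodes. */
		nodemask = 1UL << node;
		mbind(p, len, MPOL_PREFERRED, &nodemask,
		      sizeof(nodemask) * 8, 0);

		return p;
	}

Note that mbind() is called before the region is touched, so the
preferred policy takes effect when the pages are first faulted in.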
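The 16-bins-per-power-of-two sizing from the "Binning" section can be
sketched as follows. This is only an illustration: fls() is emulated
with a GCC builtin here rather than inline assembly, and the exact
rounding in puremalloc may differ.

	#include <stddef.h>

	/* 1-based index of the highest set bit, 0 for x == 0 (fls() is
	 * not standard C, so emulate it with a GCC builtin). */
	static inline int fls_sz(size_t x)
	{
		return x ? 64 - __builtin_clzll((unsigned long long)x) : 0;
	}

	/* Map a request size to one of 16 sub-bins per power of two:
	 * the four bits below the most significant bit select the
	 * sub-bin within that power of two. */
	static inline unsigned bin_index(size_t size)
	{
		int power = fls_sz(size);
		int shift = power > 5 ? power - 5 : 0;

		return (unsigned)power * 16 + ((size >> shift) & 15);
	}

Rounding a request up to the top of its sub-bin wastes at most
roughly 1/16 of the size, which is where the ~6% worst-case overhead
quoted above comes from, compared with the nearly 100% possible in a
coarse power-of-two scheme.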