* [PATCH] Add malloc micro benchmark @ 2017-12-01 13:51 Wilco Dijkstra 2017-12-01 16:13 ` Carlos O'Donell 0 siblings, 1 reply; 31+ messages in thread From: Wilco Dijkstra @ 2017-12-01 13:51 UTC (permalink / raw) To: libc-alpha; +Cc: nd Add a malloc micro benchmark to enable accurate testing of the various paths in malloc and free. The benchmark does a varying number of allocations of a given block size, then frees them again. It does so for several block sizes and number of allocated blocks. Although the test is single-threaded, it also tests what happens when you disable single-threaded fast paths (ie. SINGLE_THREAD_P is false). OK for commit? Typical output on an x64 box: { "timing_type": "hp_timing", "functions": { "malloc": { "malloc_block_size_0016": { "st_num_allocs_0025_time": 53.5486, "st_num_allocs_0100_time": 57.2553, "st_num_allocs_0400_time": 57.3204, "st_num_allocs_1000_time": 57.2059, "mt_num_allocs_0025_time": 87.7903, "mt_num_allocs_0100_time": 100.772, "mt_num_allocs_0400_time": 103.827, "mt_num_allocs_1000_time": 104.812 }, "malloc_block_size_0256": { "st_num_allocs_0025_time": 78.3056, "st_num_allocs_0100_time": 85.6392, "st_num_allocs_0400_time": 91.5187, "st_num_allocs_1000_time": 163.458, "mt_num_allocs_0025_time": 115.925, "mt_num_allocs_0100_time": 140.735, "mt_num_allocs_0400_time": 152.044, "mt_num_allocs_1000_time": 225.118 }, "malloc_block_size_1024": { "st_num_allocs_0025_time": 113.705, "st_num_allocs_0100_time": 103.79, "st_num_allocs_0400_time": 479.029, "st_num_allocs_1000_time": 634.228, "mt_num_allocs_0025_time": 145.807, "mt_num_allocs_0100_time": 151.157, "mt_num_allocs_0400_time": 526.499, "mt_num_allocs_1000_time": 687.357 }, "malloc_block_size_4096": { "st_num_allocs_0025_time": 105.101, "st_num_allocs_0100_time": 1640.23, "st_num_allocs_0400_time": 2411.26, "st_num_allocs_1000_time": 2641.56, "mt_num_allocs_0025_time": 156.323, "mt_num_allocs_0100_time": 1702.94, "mt_num_allocs_0400_time": 2453, "mt_num_allocs_1000_time": 2676.75 } } } } Note something very bad happens for the larger allocations, there is a 25x slowdown from 25 to 400 allocations of 4KB blocks... ChangeLog: 2017-12-01 Wilco Dijkstra <wdijkstr@arm.com> * benchtests/Makefile: Add malloc-simple benchmark. * benchtests/bench-malloc-simple.c: New benchmark. -- diff --git a/benchtests/Makefile b/benchtests/Makefile index d8681fce8cf399bc655f3f6a7717897eb9c30619..a4b2573cfa706bd6369063a995d512e0947c7bd5 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -67,8 +67,10 @@ stdio-common-benchset := sprintf math-benchset := math-inlines +malloc-benchset := malloc-simple + benchset := $(string-benchset-all) $(stdlib-benchset) $(stdio-common-benchset) \ - $(math-benchset) + $(math-benchset) $(malloc-benchset) CFLAGS-bench-ffs.c += -fno-builtin CFLAGS-bench-ffsll.c += -fno-builtin @@ -86,6 +88,7 @@ $(addprefix $(objpfx)bench-,$(bench-math)): $(libm) $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm) $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library) $(objpfx)bench-malloc-thread: $(shared-thread-library) +$(objpfx)bench-malloc-simple: $(shared-thread-library) \f diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c new file mode 100644 index 0000000000000000000000000000000000000000..e786ddd9635f835b2f01b00a80f3cf0d2de82d48 --- /dev/null +++ b/benchtests/bench-malloc-simple.c @@ -0,0 +1,152 @@ +/* Benchmark malloc and free functions. + Copyright (C) 2017 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <pthread.h> +#include <stdio.h> +#include <stdlib.h> +#include "bench-timing.h" +#include "json-lib.h" + +#define NUM_ITERS 1000000 +#define NUM_ALLOCS 4 +#define NUM_SIZES 4 +#define MAX_ALLOCS 1000 + +typedef struct +{ + size_t iters; + size_t size; + int n; + timing_t elapsed; +} malloc_args; + +static void +do_benchmark (malloc_args *args, int **arr) +{ + timing_t start, stop; + size_t iters = args->iters; + size_t size = args->size; + int n = args->n; + + TIMING_NOW (start); + + for (int j = 0; j < iters; j++) + { + for (int i = 0; i < n; i++) + arr[i] = malloc (size); + + for (int i = 0; i < n; i++) + free (arr[i]); + } + + TIMING_NOW (stop); + + TIMING_DIFF (args->elapsed, start, stop); +} + +static malloc_args tests[2][NUM_SIZES][NUM_ALLOCS]; +static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS }; +static size_t sizes[NUM_SIZES] = { 16, 256, 1024, 4096 }; + +static void * +dummy (void *p) +{ + return p; +} + +int +main (int argc, char **argv) +{ + size_t iters = NUM_ITERS; + int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*)); + unsigned long res; + + TIMING_INIT (res); + (void) res; + + for (int t = 0; t < 2; t++) + for (int j = 0; j < NUM_SIZES; j++) + for (int i = 0; i < NUM_ALLOCS; i++) + { + tests[t][j][i].n = allocs[i]; + tests[t][j][i].size = sizes[j]; + tests[t][j][i].iters = iters / allocs[i]; + + /* Do a quick warmup run. */ + if (t == 0) + do_benchmark (&tests[0][j][i], arr); + } + + /* Run benchmark single threaded. */ + for (int j = 0; j < NUM_SIZES; j++) + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[0][j][i], arr); + + /* Create an empty thread so SINGLE_THREAD_P becomes false. */ + pthread_t t; + pthread_create(&t, NULL, dummy, NULL); + pthread_join(t, NULL); + + /* Repeat benchmark with SINGLE_THREAD_P == false. */ + for (int j = 0; j < NUM_SIZES; j++) + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[1][j][i], arr); + + free (arr); + + json_ctx_t json_ctx; + + json_init (&json_ctx, 0, stdout); + + json_document_begin (&json_ctx); + + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); + + json_attr_object_begin (&json_ctx, "functions"); + + json_attr_object_begin (&json_ctx, "malloc"); + + for (int j = 0; j < NUM_SIZES; j++) + { + char s[100]; + double iters2 = iters; + sprintf (s, "malloc_block_size_%04ld", sizes[j]); + json_attr_object_begin (&json_ctx, s); + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "st_num_allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[0][j][i].elapsed / iters2); + } + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "mt_num_allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[1][j][i].elapsed / iters2); + } + + json_attr_object_end (&json_ctx); + } + + json_attr_object_end (&json_ctx); + + json_attr_object_end (&json_ctx); + + json_document_end (&json_ctx); + return 0; +} ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-01 13:51 [PATCH] Add malloc micro benchmark Wilco Dijkstra @ 2017-12-01 16:13 ` Carlos O'Donell 2017-12-18 15:18 ` Wilco Dijkstra 0 siblings, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2017-12-01 16:13 UTC (permalink / raw) To: Wilco Dijkstra, libc-alpha; +Cc: nd On 12/01/2017 05:51 AM, Wilco Dijkstra wrote: > Add a malloc micro benchmark to enable accurate testing of the > various paths in malloc and free. The benchmark does a varying > number of allocations of a given block size, then frees them again. > It does so for several block sizes and number of allocated blocks. > Although the test is single-threaded, it also tests what happens > when you disable single-threaded fast paths (ie. SINGLE_THREAD_P > is false). > > OK for commit? High level: This test is a long time coming and is a great idea. My "big" question here is: What are we trying to model? Do we want to prove that the single threaded optimizations you added are helping a given size class of allocations? You are currently modeling a workload that has increasing memory size requests and in some ways this is an odd workload that has high external fragmentation characteristics. For example after allocating lots of 256 byte blocks we move on to 1024 byte blocks, with the latter being unusable unless we coalesce. We need to discuss what we want to model here, and I mention that below in my larger comment. I have no objection to modeling what you have here, but if we do this then you need a big paragraph of text explaining why this particular case is important to you at ARM, or to yourself as a developer. Design: Overall this looks good. I like that you test with just one thread and then again after having created a thread. I *wish* we could test main_arena vs. threaded arena, since they have different code and behave differently e.g. sbrk vs. mmap'd heap. Implementation: You need to make this robust against env vars changing malloc behaviour. You should use mallopt to change some parameters. > Typical output on an x64 box: > { > "timing_type": "hp_timing", > "functions": { > "malloc": { > "malloc_block_size_0016": { > "st_num_allocs_0025_time": 53.5486, > "st_num_allocs_0100_time": 57.2553, > "st_num_allocs_0400_time": 57.3204, > "st_num_allocs_1000_time": 57.2059, > "mt_num_allocs_0025_time": 87.7903, > "mt_num_allocs_0100_time": 100.772, > "mt_num_allocs_0400_time": 103.827, > "mt_num_allocs_1000_time": 104.812 > }, > "malloc_block_size_0256": { > "st_num_allocs_0025_time": 78.3056, > "st_num_allocs_0100_time": 85.6392, > "st_num_allocs_0400_time": 91.5187, > "st_num_allocs_1000_time": 163.458, > "mt_num_allocs_0025_time": 115.925, > "mt_num_allocs_0100_time": 140.735, > "mt_num_allocs_0400_time": 152.044, > "mt_num_allocs_1000_time": 225.118 > }, > "malloc_block_size_1024": { > "st_num_allocs_0025_time": 113.705, > "st_num_allocs_0100_time": 103.79, > "st_num_allocs_0400_time": 479.029, > "st_num_allocs_1000_time": 634.228, > "mt_num_allocs_0025_time": 145.807, > "mt_num_allocs_0100_time": 151.157, > "mt_num_allocs_0400_time": 526.499, > "mt_num_allocs_1000_time": 687.357 > }, > "malloc_block_size_4096": { > "st_num_allocs_0025_time": 105.101, > "st_num_allocs_0100_time": 1640.23, > "st_num_allocs_0400_time": 2411.26, > "st_num_allocs_1000_time": 2641.56, > "mt_num_allocs_0025_time": 156.323, > "mt_num_allocs_0100_time": 1702.94, > "mt_num_allocs_0400_time": 2453, > "mt_num_allocs_1000_time": 2676.75 > } > } > } > } > > Note something very bad happens for the larger allocations, there > is a 25x slowdown from 25 to 400 allocations of 4KB blocks... Keep in mind you are testing the performance of sbrk here. In a threaded arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then draws allocations from it. So in some ways main_arena is expenseive, but both have to pay a page-touch cost... For each 4KiB block you touch the block to write the co-located metadata and that forces the kernel to give you a blank page, which you then do nothing with. Then you repeat the above again. For all other sizes you amortize the cost of the new page among several allocations. Do you have any other explanation? At some point you will hit the mmap threshold and the cost of the allocation will skyrocket as you have to call mmap. > ChangeLog: > 2017-12-01 Wilco Dijkstra <wdijkstr@arm.com> > > * benchtests/Makefile: Add malloc-simple benchmark. > * benchtests/bench-malloc-simple.c: New benchmark. > -- > > diff --git a/benchtests/Makefile b/benchtests/Makefile > index d8681fce8cf399bc655f3f6a7717897eb9c30619..a4b2573cfa706bd6369063a995d512e0947c7bd5 100644 > --- a/benchtests/Makefile > +++ b/benchtests/Makefile > @@ -67,8 +67,10 @@ stdio-common-benchset := sprintf > > math-benchset := math-inlines > > +malloc-benchset := malloc-simple > + OK. > benchset := $(string-benchset-all) $(stdlib-benchset) $(stdio-common-benchset) \ > - $(math-benchset) > + $(math-benchset) $(malloc-benchset) OK. > > CFLAGS-bench-ffs.c += -fno-builtin > CFLAGS-bench-ffsll.c += -fno-builtin > @@ -86,6 +88,7 @@ $(addprefix $(objpfx)bench-,$(bench-math)): $(libm) > $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm) > $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library) > $(objpfx)bench-malloc-thread: $(shared-thread-library) > +$(objpfx)bench-malloc-simple: $(shared-thread-library) OK. > > \f > > diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c > new file mode 100644 > index 0000000000000000000000000000000000000000..e786ddd9635f835b2f01b00a80f3cf0d2de82d48 > --- /dev/null > +++ b/benchtests/bench-malloc-simple.c > @@ -0,0 +1,152 @@ > +/* Benchmark malloc and free functions. > + Copyright (C) 2017 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +#include <pthread.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include "bench-timing.h" > +#include "json-lib.h" > + > +#define NUM_ITERS 1000000 > +#define NUM_ALLOCS 4 > +#define NUM_SIZES 4 > +#define MAX_ALLOCS 1000 > + > +typedef struct > +{ > + size_t iters; > + size_t size; > + int n; > + timing_t elapsed; > +} malloc_args; > + > +static void > +do_benchmark (malloc_args *args, int **arr) > +{ > + timing_t start, stop; > + size_t iters = args->iters; > + size_t size = args->size; > + int n = args->n; > + > + TIMING_NOW (start); > + > + for (int j = 0; j < iters; j++) > + { > + for (int i = 0; i < n; i++) > + arr[i] = malloc (size); > + > + for (int i = 0; i < n; i++) > + free (arr[i]); > + } > + > + TIMING_NOW (stop); > + > + TIMING_DIFF (args->elapsed, start, stop); > +} > + > +static malloc_args tests[2][NUM_SIZES][NUM_ALLOCS]; > +static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS }; > +static size_t sizes[NUM_SIZES] = { 16, 256, 1024, 4096 }; In glibc we have: tcache -> fastbins -> smallbins -> largbing -> unordered -> mmap If you proceed through from small allocations to larger allocations you will create chunks that cannot be used by future allocations. In many cases this is a worst case performance bottleneck. The heap will contain many 256 byte allocations but these cannot service the 1024 bytes, that is unless consolidation has been run. So this tests the consolidation as much as anything else, which might not trigger because of the free thresholds required. So what are we trying to model here? If we want to look at the cost of independent size class allocations then we need a clean process and allocate only a given size, and look at performance across the number of allocations. In which case we need to do: * Spawn new process with size as an argument. * Have the new process track performance at N allocations of the same size. * Record result. * Increase size. * Repeat. This way each new spawned subprocess is "clean" and we exercise a particular size class of allocations. I would also have much finer grained allocations by powers of 2. 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4092 etc. You want to see what happens for the allocations which are: * Less than a chunk size. * Fit in the tcache? * Fit in the fastbins? * Fit in the smallbins? * Fit in the fastbins? * Fit only in mmap? (allocation specifically larger than a MALLOC_MMAP_THRESHOLD_ you set). Would this serve better to show that your single threaded malloc changes were helpful for a given size class? > + > +static void * > +dummy (void *p) > +{ > + return p; > +} > + > +int > +main (int argc, char **argv) > +{ > + size_t iters = NUM_ITERS; > + int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*)); > + unsigned long res; > + You need to use mallopt to make sure the user's environment did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your maximum allocation size. Similarly any other mallopt parameter you think is important needs to be set. If you spawned a clean subprocess you could clean the env var, and that might actually be easier from a maintenance perspective. > + TIMING_INIT (res); > + (void) res; > + > + for (int t = 0; t < 2; t++) > + for (int j = 0; j < NUM_SIZES; j++) > + for (int i = 0; i < NUM_ALLOCS; i++) > + { > + tests[t][j][i].n = allocs[i]; > + tests[t][j][i].size = sizes[j]; > + tests[t][j][i].iters = iters / allocs[i]; > + > + /* Do a quick warmup run. */ > + if (t == 0) > + do_benchmark (&tests[0][j][i], arr); > + } > + > + /* Run benchmark single threaded. */ > + for (int j = 0; j < NUM_SIZES; j++) > + for (int i = 0; i < NUM_ALLOCS; i++) > + do_benchmark (&tests[0][j][i], arr); > + > + /* Create an empty thread so SINGLE_THREAD_P becomes false. */ > + pthread_t t; > + pthread_create(&t, NULL, dummy, NULL); > + pthread_join(t, NULL); > + > + /* Repeat benchmark with SINGLE_THREAD_P == false. */ > + for (int j = 0; j < NUM_SIZES; j++) > + for (int i = 0; i < NUM_ALLOCS; i++) > + do_benchmark (&tests[1][j][i], arr); > + > + free (arr); > + > + json_ctx_t json_ctx; > + > + json_init (&json_ctx, 0, stdout); > + > + json_document_begin (&json_ctx); > + > + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); > + > + json_attr_object_begin (&json_ctx, "functions"); > + > + json_attr_object_begin (&json_ctx, "malloc"); > + > + for (int j = 0; j < NUM_SIZES; j++) > + { > + char s[100]; > + double iters2 = iters; > + sprintf (s, "malloc_block_size_%04ld", sizes[j]); > + json_attr_object_begin (&json_ctx, s); > + > + for (int i = 0; i < NUM_ALLOCS; i++) > + { > + sprintf (s, "st_num_allocs_%04d_time", allocs[i]); > + json_attr_double (&json_ctx, s, tests[0][j][i].elapsed / iters2); > + } > + > + for (int i = 0; i < NUM_ALLOCS; i++) > + { > + sprintf (s, "mt_num_allocs_%04d_time", allocs[i]); > + json_attr_double (&json_ctx, s, tests[1][j][i].elapsed / iters2); > + } > + > + json_attr_object_end (&json_ctx); > + } > + > + json_attr_object_end (&json_ctx); > + > + json_attr_object_end (&json_ctx); > + > + json_document_end (&json_ctx); > + return 0; > +} > -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-01 16:13 ` Carlos O'Donell @ 2017-12-18 15:18 ` Wilco Dijkstra 2017-12-18 16:33 ` Carlos O'Donell 2017-12-18 23:02 ` [PATCH] " DJ Delorie 0 siblings, 2 replies; 31+ messages in thread From: Wilco Dijkstra @ 2017-12-18 15:18 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha; +Cc: nd Carlos O'Donell wrote: Thanks for the review! > This test is a long time coming and is a great idea. > > My "big" question here is: What are we trying to model? > > Do we want to prove that the single threaded optimizations you > added are helping a given size class of allocations? Yes that is the main goal of the benchmark. It models the allocation pattern of a few benchmarks which were reported as being slow despite the new tcache (which didn't show any gains). When the tcache was configured to be larger there was a major speedup, suggesting that the tcache doesn't work on patterns with a high number of (de)allocations of similar sized blocks. Since DJ didn't seem keen on increasing the tcache size despite it showing major gains across a wide range of benchmarks, I decided to fix the performance for the single-threaded case at least. It's now 2.5x faster on a few sever benchmarks (of course the next question is whether tcache is actually useful in its current form). > You are currently modeling a workload that has increasing > memory size requests and in some ways this is an odd workload > that has high external fragmentation characteristics. For example > after allocating lots of 256 byte blocks we move on to 1024 byte > blocks, with the latter being unusable unless we coalesce. I'm assuming coalescing works as expected. If it doesn't, it would be a nasty bug. > I *wish* we could test main_arena vs. threaded arena, since they > have different code and behave differently e.g. sbrk vs. mmap'd > heap. I'd have to check how easy it is to force it to use the thread arena. The whole thing is just crazily weird, with too many different code paths and possibilities. It seems much easier just to always use thread arenas, and perhaps use sbrk only if there is some serious advantage over mmap. Also it appears all the values are set to what was perhaps reasonable 10-20 years ago, not today. When a small server has 128GB, there is absolutely no reason to worry about returning 128KB to the OS as quickly as possible... > Implementation: > > You need to make this robust against env vars changing malloc > behaviour. You should use mallopt to change some parameters. You mean setting the tcache size explicitly (maybe even switching off)? >> Note something very bad happens for the larger allocations, there >> is a 25x slowdown from 25 to 400 allocations of 4KB blocks... > > Keep in mind you are testing the performance of sbrk here. In a threaded > arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then > draws allocations from it. So in some ways main_arena is expenseive, > but both have to pay a page-touch cost... > > For each 4KiB block you touch the block to write the co-located metadata > and that forces the kernel to give you a blank page, which you then do > nothing with. Then you repeat the above again. > > For all other sizes you amortize the cost of the new page among > several allocations. > > Do you have any other explanation? Well that looks like a reasonable explanation, but it shows a serious performance bug - I think we use MADV_DONTNEED which doesn't work on Linux and will cause all pages to be deallocated, reallocated and zero-filled... This is the sort of case where you need to be very careful to amortize over many allocations or long elapsed time, if at all (many other allocators never give pages back). > At some point you will hit the mmap threshold and the cost of the > allocation will skyrocket as you have to call mmap. That only happens on huge allocations (much larger than 4KB), or when you run out of sbrk space (unlikely). > In glibc we have: > > tcache -> fastbins -> smallbins -> largbing -> unordered -> mmap > > If you proceed through from small allocations to larger allocations > you will create chunks that cannot be used by future allocations. > In many cases this is a worst case performance bottleneck. The > heap will contain many 256 byte allocations but these cannot service > the 1024 bytes, that is unless consolidation has been run. So this > tests the consolidation as much as anything else, which might not > trigger because of the free thresholds required. If consolidation doesn't work that's a serious bug. However allocation performance should not be affected either way - in a real application those small blocks might still be allocated. As long as consolidation runs quickly (generally it's a small percentage in profiles), it won't affect the results. > So what are we trying to model here? > > If we want to look at the cost of independent size class allocations > then we need a clean process and allocate only a given size, and look > at performance across the number of allocations. That's certainly feasible if we keep the number of sizes small (less than the list below). It should be easy to reuse the bench-malloc-thread.c makefile magic to run the same binary with multiple sizes. > I would also have much finer grained allocations by powers of 2. > 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4092 etc. You want > to see what happens for the allocations which are: .. > Would this serve better to show that your single threaded malloc > changes were helpful for a given size class? Well I can easily add some of the above sizes, it's highly configurable. I don't think there will be much difference with the existing sizes though. > You need to use mallopt to make sure the user's environment > did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your > maximum allocation size. I don't think that is possible given the largest allocation size is 4KB. Wilco ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-18 15:18 ` Wilco Dijkstra @ 2017-12-18 16:33 ` Carlos O'Donell 2018-01-02 18:20 ` [PATCH v2] " Wilco Dijkstra 2017-12-18 23:02 ` [PATCH] " DJ Delorie 1 sibling, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2017-12-18 16:33 UTC (permalink / raw) To: Wilco Dijkstra, libc-alpha; +Cc: nd On 12/18/2017 07:18 AM, Wilco Dijkstra wrote: > Carlos O'Donell wrote: > > Thanks for the review! Thank you for the detailed follow up. >> This test is a long time coming and is a great idea. >> >> My "big" question here is: What are we trying to model? >> >> Do we want to prove that the single threaded optimizations you >> added are helping a given size class of allocations? > > Yes that is the main goal of the benchmark. It models the allocation > pattern of a few benchmarks which were reported as being slow > despite the new tcache (which didn't show any gains). OK. > When the tcache was configured to be larger there was a major > speedup, suggesting that the tcache doesn't work on patterns with > a high number of (de)allocations of similar sized blocks. Since DJ > didn't seem keen on increasing the tcache size despite it showing > major gains across a wide range of benchmarks, I decided to fix > the performance for the single-threaded case at least. It's now 2.5x > faster on a few sever benchmarks (of course the next question is > whether tcache is actually useful in its current form). If you have a pattern of malloc/free of *similar* sized blocks, then it overflows the sized bin in the tcache, with other size bins remaining empty. The cache itself does not dynamically reconfigure itself to consume X MiB or Y % of RSS, instead it uses a simple data structure to contain a fixed number of fixed size blocks. Therefore I agree, that enhancing the core data structure in tcache may result in better overall performance, particularly if we got rid of the fixed bin sizes and instead found a way to be performant *and* keep a running total of consumption. This is not a trivial goal though. Likewise *all* of malloc needs to be moved to a better data structure than just linked lists. I would like to see glibc's malloc offer a cacheing footprint of no more than Y % of RSS available, and let the user tweak that. Currently we just consume RSS without much regard for overhead. Though this is a different case than than what you are talking about, the changes are related via data-structure enhancements that would benefit both cases IMO. >> You are currently modeling a workload that has increasing >> memory size requests and in some ways this is an odd workload >> that has high external fragmentation characteristics. For example >> after allocating lots of 256 byte blocks we move on to 1024 byte >> blocks, with the latter being unusable unless we coalesce. > > I'm assuming coalescing works as expected. If it doesn't, it would > be a nasty bug. You are probably right. >> I *wish* we could test main_arena vs. threaded arena, since they >> have different code and behave differently e.g. sbrk vs. mmap'd >> heap. > > I'd have to check how easy it is to force it to use the thread arena. > The whole thing is just crazily weird, with too many different code > paths and possibilities. It seems much easier just to always use > thread arenas, and perhaps use sbrk only if there is some serious > advantage over mmap. Also it appears all the values are set to > what was perhaps reasonable 10-20 years ago, not today. When > a small server has 128GB, there is absolutely no reason to worry > about returning 128KB to the OS as quickly as possible... (a) Returning memory based on a limit of memory cached. The decision to return memory to the operating system should be based on a desire to run within the bounds of a certain amount of cached memory in the user process. This should be the goal IMO. We should not return 128KB to the OS unless we are within our bounds of Y % of RSS cache, or X MiB of RSS cache. This bounded behaviour is more and more important for (b). So I argue that this has nothing to do with how much memory the server has but how much the user wants as cache in the process. This gets back to your point about tcache size needing to be bigger; if you had Y % RSS allocated to tcache it would solve your needs. (b) Packing density matters, or rather consistent RSS usage matters. Yes, and no. We are facing a lot of downstream request for container, and VM packing efficiency. This means that your 128GB is split into 32 servers each with 4GB, or 64 servers each with 2GB running smaller services. In these cases we *do* care a lot about packing density. (b) Maintenance costs of the existing weird cases and harmonizing threaded and main_arena paths. As I suggested in bug 15321: https://sourceware.org/bugzilla/show_bug.cgi?id=15321 We need to merge the main_arena and threaded code together, and stop treating them as different things. Right now the main_arena, if you look at the code, is a *pretend* heap with a partial data structure layered in place. This needs to go away. We need to treat all heaps as identical, with identical code paths, with just different backing storage. I think people still expect that thread 0 allocates from the sbrk heap in a single-threaded application, and we can do that by ensuring sbrk is used to provide the backing store for the main thread. This way we can jump the pointer 64MB like we normally do for mmap'd heaps, but then on page touch there the kernel just extends the heap normally. No difference (except VMA usage). Once that is in place we can experiment with other strategies like never using sbrk. >> Implementation: >> >> You need to make this robust against env vars changing malloc >> behaviour. You should use mallopt to change some parameters. > > You mean setting the tcache size explicitly (maybe even switching off)? You have several options: * Add a wrapper script that clear all mallopt related env vars. * Adjust the Makefile to clear all mallopt related env vars before starting the test. * Set tcache sizes explicitly *if* that is what you want, but likely you don't want this and want to run the test with just the defaults to see how the defaults are performing. >>> Note something very bad happens for the larger allocations, there >>> is a 25x slowdown from 25 to 400 allocations of 4KB blocks... >> >> Keep in mind you are testing the performance of sbrk here. In a threaded >> arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then >> draws allocations from it. So in some ways main_arena is expenseive, >> but both have to pay a page-touch cost... >> >> For each 4KiB block you touch the block to write the co-located metadata >> and that forces the kernel to give you a blank page, which you then do >> nothing with. Then you repeat the above again. >> >> For all other sizes you amortize the cost of the new page among >> several allocations. >> >> Do you have any other explanation? > > Well that looks like a reasonable explanation, but it shows a serious > performance bug - I think we use MADV_DONTNEED which doesn't > work on Linux and will cause all pages to be deallocated, reallocated > and zero-filled... This is the sort of case where you need to be very > careful to amortize over many allocations or long elapsed time, if at > all (many other allocators never give pages back). We need to move to MADV_FREE, which was designed for memory allocators. The semantics of MADV_DONTNEED have the problem that one has to consider: * Is the data destructively lost in that page? * Is the data flushed to the underlying store before being not-needed? All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure that users don't corrupt the data in their backing stores. I think that detection of MADV_FREE, and usage, would help performance, but only on > Linux 4.5, and that might be OK for you. >> At some point you will hit the mmap threshold and the cost of the >> allocation will skyrocket as you have to call mmap. > > That only happens on huge allocations (much larger than 4KB), or when > you run out of sbrk space (unlikely). It happens at the mmap threshold, which is variable :-) Please consider the implementation as a fluid set of parameters that model application behaviour. We can run out of sbrk space *immediately* if you have an interposing low-address mmap that means sbrk can't grow (again see swbz#15321). Right now the mmap threshold is 128KiB though, so you're right, for the default. I don't know if that size is a good idea or not. >> In glibc we have: >> >> tcache -> fastbins -> smallbins -> largbing -> unordered -> mmap >> >> If you proceed through from small allocations to larger allocations >> you will create chunks that cannot be used by future allocations. >> In many cases this is a worst case performance bottleneck. The >> heap will contain many 256 byte allocations but these cannot service >> the 1024 bytes, that is unless consolidation has been run. So this >> tests the consolidation as much as anything else, which might not >> trigger because of the free thresholds required. > > If consolidation doesn't work that's a serious bug. However allocation > performance should not be affected either way - in a real application > those small blocks might still be allocated. As long as consolidation > runs quickly (generally it's a small percentage in profiles), it won't > affect the results. OK. >> So what are we trying to model here? >> >> If we want to look at the cost of independent size class allocations >> then we need a clean process and allocate only a given size, and look >> at performance across the number of allocations. > > That's certainly feasible if we keep the number of sizes small (less > than the list below). It should be easy to reuse the bench-malloc-thread.c > makefile magic to run the same binary with multiple sizes. OK. >> I would also have much finer grained allocations by powers of 2. >> 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4092 etc. You want >> to see what happens for the allocations which are: > .. >> Would this serve better to show that your single threaded malloc >> changes were helpful for a given size class? > > Well I can easily add some of the above sizes, it's highly configurable. > I don't think there will be much difference with the existing sizes though. Perhaps, but I don't know the answer to that. >> You need to use mallopt to make sure the user's environment >> did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your >> maximum allocation size. > > I don't think that is possible given the largest allocation size is 4KB. We carry out the allocation with mmap regardless, rounding up the size to that of a page. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2] Add malloc micro benchmark 2017-12-18 16:33 ` Carlos O'Donell @ 2018-01-02 18:20 ` Wilco Dijkstra 2018-01-02 18:45 ` DJ Delorie 2018-01-05 14:33 ` Carlos O'Donell 0 siblings, 2 replies; 31+ messages in thread From: Wilco Dijkstra @ 2018-01-02 18:20 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha; +Cc: nd Carlos O'Donell wrote: > If you have a pattern of malloc/free of *similar* sized blocks, then > it overflows the sized bin in the tcache, with other size bins remaining > empty. The cache itself does not dynamically reconfigure itself to consume > X MiB or Y % of RSS, instead it uses a simple data structure to contain > a fixed number of fixed size blocks. > > Therefore I agree, that enhancing the core data structure in tcache may > result in better overall performance, particularly if we got rid of the > fixed bin sizes and instead found a way to be performant *and* keep a > running total of consumption. Well it could keep track of sum of all block sizes and limit that. That would be better than a fixed limit on number of blocks in each bin. > Likewise *all* of malloc needs to be moved to a better data structure than > just linked lists. I would like to see glibc's malloc offer a cacheing > footprint of no more than Y % of RSS available, and let the user tweak that. > Currently we just consume RSS without much regard for overhead. Though this > is a different case than than what you are talking about, the changes are > related via data-structure enhancements that would benefit both cases IMO. What kind of datastructure do you have in mind? Small blocks could be allocated together in pages. This would avoid the per-block overhead and changes the linked list into a bitmap scan. However there aren't that many alternatives. I've written first-fit allocators using autobalancing trees, but walking the tree is expensive due to being not having good locality. >> I *wish* we could test main_arena vs. threaded arena, since they >> have different code and behave differently e.g. sbrk vs. mmap'd >> heap. I've added test of main and thread arenas in the latest version as well as a test with a larger tcache. > As I suggested in bug 15321: > https://sourceware.org/bugzilla/show_bug.cgi?id=15321 > > We need to merge the main_arena and threaded code together, and stop > treating them as different things. Right now the main_arena, if you > look at the code, is a *pretend* heap with a partial data structure > layered in place. This needs to go away. We need to treat all heaps > as identical, with identical code paths, with just different backing > storage. Yes that sounds like a good plan. > I think people still expect that thread 0 allocates from the sbrk > heap in a single-threaded application, and we can do that by ensuring > sbrk is used to provide the backing store for the main thread. This way > we can jump the pointer 64MB like we normally do for mmap'd heaps, but > then on page touch there the kernel just extends the heap normally. A quick benchmark shows sbrk is about ~25% faster than mmap, so it appears useful. Looking at the addresses, sbrk grows up, while mmap grows down the address space. I think other arenas could use sbrk as well as long as you can free sbrk memory via MADV_FREE or MADV_DONTUSE. >> Implementation: >> >> You need to make this robust against env vars changing malloc >> behaviour. You should use mallopt to change some parameters. I've added support to change the tcache count in mallopt, so we can benchmark different settings. > We need to move to MADV_FREE, which was designed for memory allocators. > > The semantics of MADV_DONTNEED have the problem that one has to consider: > * Is the data destructively lost in that page? > * Is the data flushed to the underlying store before being not-needed? > All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure > that users don't corrupt the data in their backing stores. > > I think that detection of MADV_FREE, and usage, would help performance, > but only on > Linux 4.5, and that might be OK for you. Well we should detect when MADV_FREE is supported and use that. If not, tweak the size of MADV_DONTNEED - perhaps based on a percentage of the RSS size so we limit the overhead. Anyway, here is the new version: Add a malloc micro benchmark to enable accurate testing of the various paths in malloc and free. The benchmark does a varying number of allocations of a given block size, then frees them again. It tests 4 different scenarios: single-threaded using main arena, multi-threaded using thread-arena, main arena with SINGLE_THREAD_P false and main arena with the tcache count set larger than the default. To enable this, add support for M_TCACHE_COUNT in mallopt. OK for commit? ChangeLog: 2018-01-02 Wilco Dijkstra <wdijkstr@arm.com> * benchtests/Makefile: Add malloc-simple benchmark. * benchtests/bench-malloc-simple.c: New benchmark. * malloc/malloc.h (M_TCACHE_COUNT): Add new define. * malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT. -- diff --git a/benchtests/Makefile b/benchtests/Makefile index 74b3821ccfea6912e68578ad2598d68a9e38223c..5052bbbfe79f6d5a0b16c427dfc4807271805e61 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -90,7 +90,7 @@ CFLAGS-bench-trunc.c += -fno-builtin CFLAGS-bench-truncf.c += -fno-builtin ifeq (${BENCHSET},) -bench-malloc := malloc-thread +bench-malloc := malloc-thread malloc-simple else bench-malloc := $(filter malloc-%,${BENCHSET}) endif @@ -98,7 +98,7 @@ endif $(addprefix $(objpfx)bench-,$(bench-math)): $(libm) $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm) $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library) -$(objpfx)bench-malloc-thread: $(shared-thread-library) +$(addprefix $(objpfx)bench-,$(bench-malloc)): $(shared-thread-library) \f @@ -165,7 +165,7 @@ bench-clean: ifneq ($(strip ${BENCHSET}),) VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \ wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \ - malloc-thread + malloc-thread malloc-simple INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET}) ifneq (${INVALIDBENCHSETNAMES},) $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES}) @@ -201,10 +201,18 @@ bench-set: $(binaries-benchset) bench-malloc: $(binaries-bench-malloc) for run in $^; do \ + echo "$${run}"; \ + if [ `basename $${run}` = "bench-malloc-thread" ]; then \ for thr in 1 8 16 32; do \ echo "Running $${run} $${thr}"; \ - $(run-bench) $${thr} > $${run}-$${thr}.out; \ - done;\ + $(run-bench) $${thr} > $${run}-$${thr}.out; \ + done;\ + else \ + for thr in 8 16 32 64 128 256 512 1024 2048 4096; do \ + echo "Running $${run} $${thr}"; \ + $(run-bench) $${thr} > $${run}-$${thr}.out; \ + done;\ + fi;\ done # Build and execute the benchmark functions. This target generates JSON diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c new file mode 100644 index 0000000000000000000000000000000000000000..151c38de50c5e747e05d69c717452241a47d7d22 --- /dev/null +++ b/benchtests/bench-malloc-simple.c @@ -0,0 +1,201 @@ +/* Benchmark malloc and free functions. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <pthread.h> +#include <stdio.h> +#include <stdlib.h> +#include <malloc.h> +#include <sys/resource.h> +#include "bench-timing.h" +#include "json-lib.h" + +#define NUM_ITERS 1000000 +#define NUM_ALLOCS 4 +#define MAX_ALLOCS 1000 + +typedef struct +{ + size_t iters; + size_t size; + int n; + timing_t elapsed; +} malloc_args; + +static void +do_benchmark (malloc_args *args, int **arr) +{ + timing_t start, stop; + size_t iters = args->iters; + size_t size = args->size; + int n = args->n; + + TIMING_NOW (start); + + for (int j = 0; j < iters; j++) + { + for (int i = 0; i < n; i++) + arr[i] = malloc (size); + + for (int i = 0; i < n; i++) + free (arr[i]); + } + + TIMING_NOW (stop); + + TIMING_DIFF (args->elapsed, start, stop); +} + +static malloc_args tests[4][NUM_ALLOCS]; +static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS }; + +static void * +thread_test (void *p) +{ + int **arr = (int**)p; + + /* Run benchmark multi-threaded. */ + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[2][i], arr); + + return p; +} + +void +bench (unsigned long size) +{ + size_t iters = NUM_ITERS; + int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*)); + unsigned long res; + + TIMING_INIT (res); + (void) res; + + /* Set tcache count to default. */ + mallopt (M_TCACHE_COUNT, -1); + + for (int t = 0; t <= 3; t++) + for (int i = 0; i < NUM_ALLOCS; i++) + { + tests[t][i].n = allocs[i]; + tests[t][i].size = size; + tests[t][i].iters = iters / allocs[i]; + + /* Do a quick warmup run. */ + if (t == 0) + do_benchmark (&tests[0][i], arr); + } + + /* Run benchmark single threaded in main_arena. */ + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[0][i], arr); + + /* Run benchmark in a thread_arena. */ + pthread_t t; + pthread_create (&t, NULL, thread_test, (void*)arr); + pthread_join (t, NULL); + + /* Repeat benchmark in main_arena with SINGLE_THREAD_P == false. */ + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[1][i], arr); + + /* Increase size of tcache. */ + mallopt (M_TCACHE_COUNT, 100); + + /* Run again but with larger tcache. */ + for (int i = 0; i < NUM_ALLOCS; i++) + do_benchmark (&tests[3][i], arr); + + mallopt (M_TCACHE_COUNT, -1); + + free (arr); + + json_ctx_t json_ctx; + + json_init (&json_ctx, 0, stdout); + + json_document_begin (&json_ctx); + + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); + + json_attr_object_begin (&json_ctx, "functions"); + + json_attr_object_begin (&json_ctx, "malloc"); + + char s[100]; + double iters2 = iters; + + json_attr_object_begin (&json_ctx, ""); + json_attr_double (&json_ctx, "malloc_block_size", size); + + struct rusage usage; + getrusage (RUSAGE_SELF, &usage); + json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss); + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "main_arena_st_allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[0][i].elapsed / iters2); + } + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "main_arena_mt_allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[1][i].elapsed / iters2); + } + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "big_tcache_mt_allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[3][i].elapsed / iters2); + } + + for (int i = 0; i < NUM_ALLOCS; i++) + { + sprintf (s, "thread_arena__allocs_%04d_time", allocs[i]); + json_attr_double (&json_ctx, s, tests[2][i].elapsed / iters2); + } + + json_attr_object_end (&json_ctx); + + json_attr_object_end (&json_ctx); + + json_attr_object_end (&json_ctx); + + json_document_end (&json_ctx); +} + +static void usage (const char *name) +{ + fprintf (stderr, "%s: <alloc_size>\n", name); + exit (1); +} + +int +main (int argc, char **argv) +{ + long val = 16; + if (argc == 2) + val = strtol (argv[1], NULL, 0); + + if (argc > 2 || val <= 0) + usage (argv[0]); + + bench (val); + + return 0; +} diff --git a/malloc/malloc.h b/malloc/malloc.h index 339ab64c7d336873211a9057a923d87e8c1e025d..a047385a4fc8d7b3bb3e120a94440193dba306ed 100644 --- a/malloc/malloc.h +++ b/malloc/malloc.h @@ -121,6 +121,7 @@ extern struct mallinfo mallinfo (void) __THROW; #define M_PERTURB -6 #define M_ARENA_TEST -7 #define M_ARENA_MAX -8 +#define M_TCACHE_COUNT -9 /* General SVID/XPG interface to tunable parameters. */ extern int mallopt (int __param, int __val) __THROW; diff --git a/malloc/malloc.c b/malloc/malloc.c index 0c9e0748b4c10988f6fe99ac2e5b21b8b7b603c3..a07438d276ff4c8177552e1c0d186ee7c8bd7692 100644 --- a/malloc/malloc.c +++ b/malloc/malloc.c @@ -5177,6 +5177,11 @@ __libc_mallopt (int param_number, int value) if (value > 0) do_set_arena_max (value); break; +#if USE_TCACHE + case M_TCACHE_COUNT: + do_set_tcache_count (value >= 0 ? value : TCACHE_FILL_COUNT); + break; +#endif } __libc_lock_unlock (av->mutex); return res; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-02 18:20 ` [PATCH v2] " Wilco Dijkstra @ 2018-01-02 18:45 ` DJ Delorie 2018-01-03 12:12 ` Wilco Dijkstra 2018-01-05 14:33 ` Carlos O'Donell 1 sibling, 1 reply; 31+ messages in thread From: DJ Delorie @ 2018-01-02 18:45 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: carlos, libc-alpha, nd Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > To enable this, add support for M_TCACHE_COUNT in mallopt. What other tests do is create a second test that just #include's the first test, and set an environment variable in the Makefile specific to that test. Adding an ABI just for a test is a big hammer, although we could discuss adding tcache to mallopt() as a seperate topic. I don't have any objection to adding tcache to mallopt (although please add all three tunables if you do), just saying we should discuss it as an ABI change separately. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-02 18:45 ` DJ Delorie @ 2018-01-03 12:12 ` Wilco Dijkstra 2018-01-03 15:07 ` Carlos O'Donell 0 siblings, 1 reply; 31+ messages in thread From: Wilco Dijkstra @ 2018-01-03 12:12 UTC (permalink / raw) To: DJ Delorie; +Cc: carlos, libc-alpha, nd DJ Delorie wrote: > What other tests do is create a second test that just #include's the > first test, and set an environment variable in the Makefile specific to > that test. Adding an ABI just for a test is a big hammer, although we > could discuss adding tcache to mallopt() as a seperate topic. Yeah but the makefiles are already insanely complex. Adding the new benchmark to the makefile took more than 10x as much time as writing the test itself... > I don't have any objection to adding tcache to mallopt (although please > add all three tunables if you do), just saying we should discuss it as > an ABI change separately. It doesn't have to be an external ABI, I'd suggest keeping this internal to GLIBC to make testing and benchmarking easier. Wilco ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-03 12:12 ` Wilco Dijkstra @ 2018-01-03 15:07 ` Carlos O'Donell 2018-01-04 13:48 ` Wilco Dijkstra 0 siblings, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-01-03 15:07 UTC (permalink / raw) To: Wilco Dijkstra, DJ Delorie; +Cc: libc-alpha, nd On 01/03/2018 04:12 AM, Wilco Dijkstra wrote: > DJ Delorie wrote: > >> What other tests do is create a second test that just #include's the >> first test, and set an environment variable in the Makefile specific to >> that test. Adding an ABI just for a test is a big hammer, although we >> could discuss adding tcache to mallopt() as a seperate topic. > > Yeah but the makefiles are already insanely complex. Adding the new > benchmark to the makefile took more than 10x as much time as writing > the test itself... > >> I don't have any objection to adding tcache to mallopt (although please >> add all three tunables if you do), just saying we should discuss it as >> an ABI change separately. > > It doesn't have to be an external ABI, I'd suggest keeping this internal to > GLIBC to make testing and benchmarking easier. Don't use mallopt, please make it a tunable then. The mallopt API already had 2 secret arena options which eventually became so well used they were baked into the API and had to be made public. At least with tunables we are allowed to deprecate them. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-03 15:07 ` Carlos O'Donell @ 2018-01-04 13:48 ` Wilco Dijkstra 2018-01-04 16:37 ` Adhemerval Zanella 2018-01-05 14:32 ` Carlos O'Donell 0 siblings, 2 replies; 31+ messages in thread From: Wilco Dijkstra @ 2018-01-04 13:48 UTC (permalink / raw) To: Carlos O'Donell, DJ Delorie; +Cc: libc-alpha, nd Carlos O'Donell wrote: > > Don't use mallopt, please make it a tunable then. > > The mallopt API already had 2 secret arena options which eventually became > so well used they were baked into the API and had to be made public. Unfortunately tunables are not exported so you can't use them outside of GLIBC: /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' collect2: error: ld returned 1 exit status Wilco ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-04 13:48 ` Wilco Dijkstra @ 2018-01-04 16:37 ` Adhemerval Zanella 2018-01-05 14:32 ` Carlos O'Donell 1 sibling, 0 replies; 31+ messages in thread From: Adhemerval Zanella @ 2018-01-04 16:37 UTC (permalink / raw) To: libc-alpha On 04/01/2018 11:48, Wilco Dijkstra wrote: > Carlos O'Donell wrote: >> >> Don't use mallopt, please make it a tunable then. >> >> The mallopt API already had 2 secret arena options which eventually became >> so well used they were baked into the API and had to be made public. > > Unfortunately tunables are not exported so you can't use them outside of GLIBC: > > /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': > bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' > collect2: error: ld returned 1 exit status > > Wilco > You will need to the environment variable to check outside GLIBC (as expected by normal programs). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-04 13:48 ` Wilco Dijkstra 2018-01-04 16:37 ` Adhemerval Zanella @ 2018-01-05 14:32 ` Carlos O'Donell 2018-01-05 15:50 ` Adhemerval Zanella 1 sibling, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-01-05 14:32 UTC (permalink / raw) To: Wilco Dijkstra, DJ Delorie; +Cc: libc-alpha, nd On 01/04/2018 05:48 AM, Wilco Dijkstra wrote: > Carlos O'Donell wrote: >> >> Don't use mallopt, please make it a tunable then. >> >> The mallopt API already had 2 secret arena options which eventually became >> so well used they were baked into the API and had to be made public. > > Unfortunately tunables are not exported so you can't use them outside of GLIBC: > > /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': > bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' > collect2: error: ld returned 1 exit status Correct, we only have a env-var frontend right now, and the internal API is not made accessible via GLIBC_PRIVATE. You have 3 options for tests: * Use the env vars to adjust test behaviour. Run the tests multiple times. * Add a new C API frontend, very valuable, but more time consuming. * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw it away later when we get a proper C API frontend. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 14:32 ` Carlos O'Donell @ 2018-01-05 15:50 ` Adhemerval Zanella 2018-01-05 16:17 ` Carlos O'Donell 0 siblings, 1 reply; 31+ messages in thread From: Adhemerval Zanella @ 2018-01-05 15:50 UTC (permalink / raw) To: libc-alpha On 05/01/2018 12:32, Carlos O'Donell wrote: > On 01/04/2018 05:48 AM, Wilco Dijkstra wrote: >> Carlos O'Donell wrote: >>> >>> Don't use mallopt, please make it a tunable then. >>> >>> The mallopt API already had 2 secret arena options which eventually became >>> so well used they were baked into the API and had to be made public. >> >> Unfortunately tunables are not exported so you can't use them outside of GLIBC: >> >> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': >> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' >> collect2: error: ld returned 1 exit status > > Correct, we only have a env-var frontend right now, and the internal API is not > made accessible via GLIBC_PRIVATE. > > You have 3 options for tests: > > * Use the env vars to adjust test behaviour. Run the tests multiple times. > * Add a new C API frontend, very valuable, but more time consuming. > * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw > it away later when we get a proper C API frontend. > Do we want a C API to tied the malloc implementation to some tunables? My understanding is the tunable api idea is not really enforce retro-compability (where a C api would enforce it). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 15:50 ` Adhemerval Zanella @ 2018-01-05 16:17 ` Carlos O'Donell 2018-01-05 16:46 ` Adhemerval Zanella 0 siblings, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-01-05 16:17 UTC (permalink / raw) To: Adhemerval Zanella, libc-alpha On 01/05/2018 07:50 AM, Adhemerval Zanella wrote: > > > On 05/01/2018 12:32, Carlos O'Donell wrote: >> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote: >>> Carlos O'Donell wrote: >>>> >>>> Don't use mallopt, please make it a tunable then. >>>> >>>> The mallopt API already had 2 secret arena options which eventually became >>>> so well used they were baked into the API and had to be made public. >>> >>> Unfortunately tunables are not exported so you can't use them outside of GLIBC: >>> >>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': >>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' >>> collect2: error: ld returned 1 exit status >> >> Correct, we only have a env-var frontend right now, and the internal API is not >> made accessible via GLIBC_PRIVATE. >> >> You have 3 options for tests: >> >> * Use the env vars to adjust test behaviour. Run the tests multiple times. >> * Add a new C API frontend, very valuable, but more time consuming. >> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw >> it away later when we get a proper C API frontend. >> > > Do we want a C API to tied the malloc implementation to some tunables? My > understanding is the tunable api idea is not really enforce retro-compability > (where a C api would enforce it). If we add a C API to the tunables, we would honour that API for tunables for all time, but the tunables themselves would not be stable. e.g. * get list of tunables supported * get the default value for a tunable * get the value of a tunable * set the value of a tunable So you would use this API in the tests to get the tunable list, assert the tcache tunable was accepted (or fail the test), and then set it to a special value for the part of the test that needs it. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 16:17 ` Carlos O'Donell @ 2018-01-05 16:46 ` Adhemerval Zanella 2018-01-05 17:28 ` Carlos O'Donell 0 siblings, 1 reply; 31+ messages in thread From: Adhemerval Zanella @ 2018-01-05 16:46 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha On 05/01/2018 14:17, Carlos O'Donell wrote: > On 01/05/2018 07:50 AM, Adhemerval Zanella wrote: >> >> >> On 05/01/2018 12:32, Carlos O'Donell wrote: >>> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote: >>>> Carlos O'Donell wrote: >>>>> >>>>> Don't use mallopt, please make it a tunable then. >>>>> >>>>> The mallopt API already had 2 secret arena options which eventually became >>>>> so well used they were baked into the API and had to be made public. >>>> >>>> Unfortunately tunables are not exported so you can't use them outside of GLIBC: >>>> >>>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': >>>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' >>>> collect2: error: ld returned 1 exit status >>> >>> Correct, we only have a env-var frontend right now, and the internal API is not >>> made accessible via GLIBC_PRIVATE. >>> >>> You have 3 options for tests: >>> >>> * Use the env vars to adjust test behaviour. Run the tests multiple times. >>> * Add a new C API frontend, very valuable, but more time consuming. >>> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw >>> it away later when we get a proper C API frontend. >>> >> >> Do we want a C API to tied the malloc implementation to some tunables? My >> understanding is the tunable api idea is not really enforce retro-compability >> (where a C api would enforce it). > > If we add a C API to the tunables, we would honour that API for tunables for > all time, but the tunables themselves would not be stable. > > e.g. > > * get list of tunables supported > * get the default value for a tunable > * get the value of a tunable > * set the value of a tunable > > So you would use this API in the tests to get the tunable list, assert the > tcache tunable was accepted (or fail the test), and then set it to a special > value for the part of the test that needs it. Right, this seems a reasonable approach (although I think out of the scope for this change). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 16:46 ` Adhemerval Zanella @ 2018-01-05 17:28 ` Carlos O'Donell 0 siblings, 0 replies; 31+ messages in thread From: Carlos O'Donell @ 2018-01-05 17:28 UTC (permalink / raw) To: Adhemerval Zanella, libc-alpha On 01/05/2018 08:46 AM, Adhemerval Zanella wrote: > > > On 05/01/2018 14:17, Carlos O'Donell wrote: >> On 01/05/2018 07:50 AM, Adhemerval Zanella wrote: >>> >>> >>> On 05/01/2018 12:32, Carlos O'Donell wrote: >>>> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote: >>>>> Carlos O'Donell wrote: >>>>>> >>>>>> Don't use mallopt, please make it a tunable then. >>>>>> >>>>>> The mallopt API already had 2 secret arena options which eventually became >>>>>> so well used they were baked into the API and had to be made public. >>>>> >>>>> Unfortunately tunables are not exported so you can't use them outside of GLIBC: >>>>> >>>>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench': >>>>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val' >>>>> collect2: error: ld returned 1 exit status >>>> >>>> Correct, we only have a env-var frontend right now, and the internal API is not >>>> made accessible via GLIBC_PRIVATE. >>>> >>>> You have 3 options for tests: >>>> >>>> * Use the env vars to adjust test behaviour. Run the tests multiple times. >>>> * Add a new C API frontend, very valuable, but more time consuming. >>>> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw >>>> it away later when we get a proper C API frontend. >>>> >>> >>> Do we want a C API to tied the malloc implementation to some tunables? My >>> understanding is the tunable api idea is not really enforce retro-compability >>> (where a C api would enforce it). >> >> If we add a C API to the tunables, we would honour that API for tunables for >> all time, but the tunables themselves would not be stable. >> >> e.g. >> >> * get list of tunables supported >> * get the default value for a tunable >> * get the value of a tunable >> * set the value of a tunable >> >> So you would use this API in the tests to get the tunable list, assert the >> tcache tunable was accepted (or fail the test), and then set it to a special >> value for the part of the test that needs it. > > Right, this seems a reasonable approach (although I think out of the scope for > this change). That is up to Wilco to decide, but in general I agree that he need not take on this work to get the current patch set merged, there are other solutions to the need to tweak the settings. I think the env var and multiple test run approach is going to be the simplest. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-02 18:20 ` [PATCH v2] " Wilco Dijkstra 2018-01-02 18:45 ` DJ Delorie @ 2018-01-05 14:33 ` Carlos O'Donell 2018-01-05 16:28 ` Joseph Myers 1 sibling, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-01-05 14:33 UTC (permalink / raw) To: Wilco Dijkstra, libc-alpha; +Cc: nd On 01/02/2018 10:20 AM, Wilco Dijkstra wrote: > Carlos O'Donell wrote: > >> If you have a pattern of malloc/free of *similar* sized blocks, then >> it overflows the sized bin in the tcache, with other size bins remaining >> empty. The cache itself does not dynamically reconfigure itself to consume >> X MiB or Y % of RSS, instead it uses a simple data structure to contain >> a fixed number of fixed size blocks. >> >> Therefore I agree, that enhancing the core data structure in tcache may >> result in better overall performance, particularly if we got rid of the >> fixed bin sizes and instead found a way to be performant *and* keep a >> running total of consumption. > > Well it could keep track of sum of all block sizes and limit that. That would > be better than a fixed limit on number of blocks in each bin. Yes, in practice what we want to enforce is % of total RSS, or fixed X MiB of RSS in thread caches, and we want that value to be a per-thread variable that can be changed for each thread depending on the thread's workload. e.g. Thread 1, 2, and 3 each need 5% of RSS for their workloads. Threads 4, and 5, need 25% of RSS for their workloads. Obviously, if RSS is very low, then our 32MiB/64MiB starting heaps may be unusable, but could also be tuned, it's a decision that can only be made once at startup because we use this embedded assumption for pointer arithmetic to find the arena structure address. It could be changed, but this would increase costs. >> Likewise *all* of malloc needs to be moved to a better data structure than >> just linked lists. I would like to see glibc's malloc offer a cacheing >> footprint of no more than Y % of RSS available, and let the user tweak that. >> Currently we just consume RSS without much regard for overhead. Though this >> is a different case than than what you are talking about, the changes are >> related via data-structure enhancements that would benefit both cases IMO. > > What kind of datastructure do you have in mind? Small blocks could be > allocated together in pages. This would avoid the per-block overhead and > changes the linked list into a bitmap scan. However there aren't that many > alternatives. I've written first-fit allocators using autobalancing trees, but > walking the tree is expensive due to being not having good locality. Keep in mind we are deviating now from the topic at hand, but I'll lay out two improvements, one of which you already touched upon. (1) Small blocks cost too much RSS. I have had customers hit terrible corner cases with C++ and small blocks which see ~50% waste due to colocated metadata e.g. new of 13-byte objects. Getting smaller allocations working together for the small blocks would be a big win, and using some kind of bitmap would be my suggested solution. This requires a data structure to track the parent allocation, bitmaps, and sub-allocations. We have some of these data structures in place, but no easy generic way to use them (I think Florian's pushing of more general data structure use in glibc is the right way to go e.g. dynarray). I think that for blocks smaller than the fundamental language types (which require malloc to have 16-byte alignment) we do not have to return sufficiently aligned memory. For example if you allocate a 3-byte block or a 13-byte block, you cannot possibly put a 16-byte long double there, nor can you use that for a stack block, so it's a waste to guarantee alignment. * Group small allocations. * Violate the ABI and use < MALLOC_ALIGNMENT sized alignments for sub-group members. (2) Track total memory used and free back based on more consistent heuristics. Again, we have customers who complain that glibc's malloc is bad for container or VM workloads with tight packing because it just keeps allocating more heaps. We could do better to track free page runs, and when we exceed some %RSS threshold, free back larger runs to the OS, something which malloc_trim() already does but in a more costly manner by walking the whole arena heaps from start to end. This isn't explicitly about performance. (3) Make arena's truly a per-cpu data structure. We do not want per-thread arenas. We want per-thread caches (avoids locks) We want per-cpu arenas (avoid NUMA issues, and get better locality) The per-thread cache does the job of caching the thread-local requests and avoiding locks. The per-cpu arenas do the job of containing cpu-local memory, and handling requests for the threads that reside on that CPU, either pinned, or there temporarily. Memory being returned should go back to the cpu that the memory came from. The best we have done today is to have the arenas scale with the number of CPUs, and let them fall where they may across the cores, and let the numa page scheduler move the memory around them to minimize cost. The best way would be to use something like restartable sequences to get the CPU #, select the arena, and get memory from there, and then choose another arena next time perhaps. (4) Distribute threads among arenas. We have no data structure for balancing arena usage across threads. We have seen workloads where the maximum number of arenas is allocated, but threads don't move smoothly across arenas, instead the first unlocked arena is chosen, and that might not be the best from a memory locality perspective (see (3)). > Add a malloc micro benchmark to enable accurate testing of the > various paths in malloc and free. The benchmark does a varying > number of allocations of a given block size, then frees them again. > > It tests 4 different scenarios: single-threaded using main arena, > multi-threaded using thread-arena, main arena with SINGLE_THREAD_P > false and main arena with the tcache count set larger than the > default. To enable this, add support for M_TCACHE_COUNT in mallopt. > > OK for commit? > > ChangeLog: > 2018-01-02 Wilco Dijkstra <wdijkstr@arm.com> > > * benchtests/Makefile: Add malloc-simple benchmark. > * benchtests/bench-malloc-simple.c: New benchmark. > * malloc/malloc.h (M_TCACHE_COUNT): Add new define. > * malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT. I don't like the creation of a new public ABI via M_TCACHE_COUNT through mallopt. I suggest splitting these tests apart and using the tunable env var to touch this. See my other email. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 14:33 ` Carlos O'Donell @ 2018-01-05 16:28 ` Joseph Myers 2018-01-05 17:26 ` Carlos O'Donell 0 siblings, 1 reply; 31+ messages in thread From: Joseph Myers @ 2018-01-05 16:28 UTC (permalink / raw) To: Carlos O'Donell; +Cc: Wilco Dijkstra, libc-alpha, nd On Fri, 5 Jan 2018, Carlos O'Donell wrote: > I think that for blocks smaller than the fundamental language types > (which require malloc to have 16-byte alignment) we do not have to > return sufficiently aligned memory. For example if you allocate a 3-byte > block or a 13-byte block, you cannot possibly put a 16-byte long double > there, nor can you use that for a stack block, so it's a waste to > guarantee alignment. As per DR#075, the memory needs to be aligned for any type of object (with a fundamental alignment requirement, in C11 and later), not just those that will fit in the block. (This in turn allows for applications using low bits for tagged pointers.) This does not of course rule out having another allocation API that supports smaller alignment requirements. -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 16:28 ` Joseph Myers @ 2018-01-05 17:26 ` Carlos O'Donell 2018-02-28 12:57 ` Florian Weimer 0 siblings, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-01-05 17:26 UTC (permalink / raw) To: Joseph Myers; +Cc: Wilco Dijkstra, libc-alpha, nd, Florian Weimer On 01/05/2018 08:28 AM, Joseph Myers wrote: > On Fri, 5 Jan 2018, Carlos O'Donell wrote: > >> I think that for blocks smaller than the fundamental language types >> (which require malloc to have 16-byte alignment) we do not have to >> return sufficiently aligned memory. For example if you allocate a 3-byte >> block or a 13-byte block, you cannot possibly put a 16-byte long double >> there, nor can you use that for a stack block, so it's a waste to >> guarantee alignment. > > As per DR#075, the memory needs to be aligned for any type of object (with > a fundamental alignment requirement, in C11 and later), not just those > that will fit in the block. (This in turn allows for applications using > low bits for tagged pointers.) Thanks for the reference to DR#075, I had not considered the cast equality issue. > This does not of course rule out having another allocation API that > supports smaller alignment requirements. Agreed. It would still be a win if we did not have co-located metadata (something Florian whispered into my ear years ago now) for small constant sized blocks. We would go from this: N * 1-byte allocations => N * (32-byte header + 1-byte allocation + 15-bytes alignment) [97% constant waste] To this: N * 1-byte allocations => N * (1-byte allocation + 15-bytes alignment) + (N/8)-bytes in-use-bit + 16-bytes header [96% waste for 1-byte] [94% waste for 100*1-byte] ... towards a 93.75% constant waste (limit of the alignment e.g. 15/16) This is a gain of 5% RSS efficiency for a structural change. For a 13-byte allocation: N * 1-byte allocations => N * (32-byte header + 13-byte allocation + 3-bytes alignment) [73% constant waste] To this: N * 1-byte allocations => N * (13-byte allocation + 3-bytes alignment) + (N/8)-bytes in-use-bit + 16-bytes header [60% waste for 13-bytes] [20% waste for 100*13-bytes] [19% waste for 1000*13-bytes] ... towards a 18.75% constant waste (limit of the alignment e.g. 3/16) Note: We never reach the constant limit because the in-use bit-array still grows quickly. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-01-05 17:26 ` Carlos O'Donell @ 2018-02-28 12:57 ` Florian Weimer 2018-02-28 15:16 ` Ondřej Bílka 0 siblings, 1 reply; 31+ messages in thread From: Florian Weimer @ 2018-02-28 12:57 UTC (permalink / raw) To: Carlos O'Donell, Joseph Myers; +Cc: Wilco Dijkstra, libc-alpha, nd On 01/05/2018 06:26 PM, Carlos O'Donell wrote: > It would still be a win if we did not have co-located metadata (something > Florian whispered into my ear years ago now) for small constant sized blocks. > > We would go from this: > > N * 1-byte allocations => N * (32-byte header > + 1-byte allocation > + 15-bytes alignment) > [97% constant waste] Actually, we have an 8-byte header, a 16-byte alignment requirement, and a 24-byte minimum allocation size. (i386: 4-byte header, 16-byte alignment, 12-byte minimum size.) The non-main arenas actually need only a four-byte chunk header (or even fewer bits) because there, the chunk size is very limited. The alignment is non-negotiable, considering our current stance regarding fundamental alignment, even for smaller allocations. If we introduce heap layout for all arenas (which needs some way to compensate for the variable sbrk offset, preferably without wasting a gap there), then we should bring down the minimum allocation size to 12 bytes on 64-bit as well because for the smallest bin, we can use 4-byte forward/backward links within a heap, and keep 8-byte pointers separate for each heap. None of this needs algorithm changes, but it's still difficult to identify all the places which need changing, due to the code duplication and some other issues with the code. But it only helps with oddly-sized allocations (such as 12 bytes or 28 bytes), so it is unclear whether this work is worthwhile. > To this: > > N * 1-byte allocations => N * (1-byte allocation > + 15-bytes alignment) > + (N/8)-bytes in-use-bit + 16-bytes header > [96% waste for 1-byte] > [94% waste for 100*1-byte] > ... towards a 93.75% constant waste (limit of the alignment e.g. 15/16) Another significant win is for allocation sizes such as 32, where we currently need to allocate 48 bytes. In fact, powers-of-two are quite bad for the current allocator. However, it is difficult to combine this with the existing allocator because you need to perform some table lookup during free to discover whether the pointer has adjacent metadata or not, or waste some address space and use bits inside the address for that. If we want to bring the existing allocator further along, we should perhaps try to port dlmalloc changes which get rid of the unsorted bins, using balanced binary trees. I have a feeling that this would allow us to consolidate far more often, and this should help us to avoid some of the anomalies we have seen. Thanks, Florian ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 12:57 ` Florian Weimer @ 2018-02-28 15:16 ` Ondřej Bílka 2018-02-28 16:16 ` Florian Weimer 0 siblings, 1 reply; 31+ messages in thread From: Ondřej Bílka @ 2018-02-28 15:16 UTC (permalink / raw) To: Florian Weimer Cc: Carlos O'Donell, Joseph Myers, Wilco Dijkstra, libc-alpha, nd On Wed, Feb 28, 2018 at 01:40:28PM +0100, Florian Weimer wrote: > On 01/05/2018 06:26 PM, Carlos O'Donell wrote: > >It would still be a win if we did not have co-located metadata (something > >Florian whispered into my ear years ago now) for small constant sized blocks. > > > >We would go from this: > > > >N * 1-byte allocations => N * (32-byte header > > + 1-byte allocation > > + 15-bytes alignment) > > [97% constant waste] > > Actually, we have an 8-byte header, a 16-byte alignment requirement, > and a 24-byte minimum allocation size. (i386: 4-byte header, > 16-byte alignment, 12-byte minimum size.) > > The non-main arenas actually need only a four-byte chunk header (or > even fewer bits) because there, the chunk size is very limited. > > The alignment is non-negotiable, considering our current stance > regarding fundamental alignment, even for smaller allocations. > Thats rather ineffective, it is easier to start fresh than try to maintain rather obsolete allocator. Most of other are faster and more space effective because of their layout. I have several ideas based on that all allocations on a page of size P must have same capacity(say upto 256 bytes), options where metadata for pointer x are due to possibilitity of bigger alignment requirements following 1. End of previous page, easiest to implement. formula P*(x/P)-16 2. Page map, on 64-bit systems its possible on mmap also allocate pages for mapping 16*(x/P). On 32-bit its possible add constant and rezerve pages to map entire address space with same formula. 3. Encode size to address to table lookup in free for say size<=256. This trick is probably applicable only for 64-bit systems as it could cause problems with virtual address space. First calculate size by formula 16*((x-start)/(1<<25)) if its less than 257 we got correct size, otherwise use option 1/2 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 15:16 ` Ondřej Bílka @ 2018-02-28 16:16 ` Florian Weimer 2018-02-28 18:21 ` Carlos O'Donell 2018-02-28 19:56 ` Ondřej Bílka 0 siblings, 2 replies; 31+ messages in thread From: Florian Weimer @ 2018-02-28 16:16 UTC (permalink / raw) To: Ondřej Bílka Cc: Carlos O'Donell, Joseph Myers, Wilco Dijkstra, libc-alpha, nd On 02/28/2018 03:11 PM, OndÅej BÃlka wrote: > Thats rather ineffective, it is easier to start fresh than try to > maintain rather obsolete allocator. Most of other are faster and more > space effective because of their layout. That's not quite true. Despite its limitations, glibc malloc still compares remarkably well to other allocators. Of course, there are workloads where it loses big, but those exist for other allocators, too. People simple don't write blog posts comparing *alloc with glibc malloc if glibc malloc provides comparable or better performance because it's quite boring. I think a heap-style allocator which does not segregate allocations of different sizes still has its place, and why not provide one in glibc? Thanks, Florian ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 16:16 ` Florian Weimer @ 2018-02-28 18:21 ` Carlos O'Donell 2018-02-28 23:17 ` Ondřej Bílka 2018-02-28 19:56 ` Ondřej Bílka 1 sibling, 1 reply; 31+ messages in thread From: Carlos O'Donell @ 2018-02-28 18:21 UTC (permalink / raw) To: Florian Weimer, Ondřej Bílka Cc: Joseph Myers, Wilco Dijkstra, libc-alpha, nd On 02/28/2018 06:16 AM, Florian Weimer wrote: > On 02/28/2018 03:11 PM, OndÅej BÃlka wrote: >> Thats rather ineffective, it is easier to start fresh than try to >> maintain rather obsolete allocator. Most of other are faster and >> more space effective because of their layout. > > That's not quite true. Despite its limitations, glibc malloc still > compares remarkably well to other allocators. Of course, there are > workloads where it loses big, but those exist for other allocators, > too. People simple don't write blog posts comparing *alloc with > glibc malloc if glibc malloc provides comparable or better > performance because it's quite boring. > > I think a heap-style allocator which does not segregate allocations > of different sizes still has its place, and why not provide one in > glibc? I agree. I think an incremental improvement would be to start with some further code cleanups, all with the goal of simplifying the allocator maintenance. Getting rid of unsorted bins would be interesting. One of the problems I see is that without changes in our data structures and algorithms it is hard for a heap-style allocator to keep a bound on the total amount of heap consumed. One of the big requirements that I see coming down the pipe is to provide an allocator that has bounded memory usage. To do that we'd have to track all of the unsorted blocks, free blocks, consolidate more regularly, and free against the heuristic which tells us how much memory to keep cached. If having got rid of unsorted makes this easier, if having a balanced binary tree makes this easier, then that's where we should be looking. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 18:21 ` Carlos O'Donell @ 2018-02-28 23:17 ` Ondřej Bílka 0 siblings, 0 replies; 31+ messages in thread From: Ondřej Bílka @ 2018-02-28 23:17 UTC (permalink / raw) To: Carlos O'Donell Cc: Florian Weimer, Joseph Myers, Wilco Dijkstra, libc-alpha, nd On Wed, Feb 28, 2018 at 08:16:13AM -0800, Carlos O'Donell wrote: > On 02/28/2018 06:16 AM, Florian Weimer wrote: > > On 02/28/2018 03:11 PM, OndÅej BÃlka wrote: > >> Thats rather ineffective, it is easier to start fresh than try to > >> maintain rather obsolete allocator. Most of other are faster and > >> more space effective because of their layout. > > > > That's not quite true. Despite its limitations, glibc malloc still > > compares remarkably well to other allocators. Of course, there are > > workloads where it loses big, but those exist for other allocators, > > too. People simple don't write blog posts comparing *alloc with > > glibc malloc if glibc malloc provides comparable or better > > performance because it's quite boring. > > > > I think a heap-style allocator which does not segregate allocations > > of different sizes still has its place, and why not provide one in > > glibc? > > I agree. > > I think an incremental improvement would be to start with some further > code cleanups, all with the goal of simplifying the allocator maintenance. > You should like I did try to decruft implementation. I decided that starting again is simpler after looking lot on existing how to do it. I send some patches with decrufting but I found that with changing algorithm, data structures mmap logic and basically everything else its just unnecessary overhead. You couldn't decrease data structure overhead without changing data structure. And for RSS size problem is in design that you couldn't return memory to system. Large areas get pined by small allocations with return data and you couldn't sbrk. It is needed to redesign it to unmap pages individually. Alternative would be to add something like malloca with separate arenas for which existing logic works. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 16:16 ` Florian Weimer 2018-02-28 18:21 ` Carlos O'Donell @ 2018-02-28 19:56 ` Ondřej Bílka 2018-02-28 20:17 ` Wilco Dijkstra 1 sibling, 1 reply; 31+ messages in thread From: Ondřej Bílka @ 2018-02-28 19:56 UTC (permalink / raw) To: Florian Weimer Cc: Carlos O'Donell, Joseph Myers, Wilco Dijkstra, libc-alpha, nd On Wed, Feb 28, 2018 at 03:16:09PM +0100, Florian Weimer wrote: > On 02/28/2018 03:11 PM, OndÅej BÃlka wrote: > >Thats rather ineffective, it is easier to start fresh than try to > >maintain rather obsolete allocator. Most of other are faster and more > >space effective because of their layout. > > That's not quite true. Despite its limitations, glibc malloc still > compares remarkably well to other allocators. That confuses cause and effect. People spent lot of effort to fix workloads where this allocator is bad (usually starts with saying that malloc is slow) with trick like static buffers, merging allocations, memory pools etc. With better allocator it wouldn't be neccessary as it would handle simpler code with same performance. > > I think a heap-style allocator which does not segregate allocations > of different sizes still has its place, and why not provide one in > glibc? > That isn't case for any allocator and it is asking for trouble. You want to avoid sitation where two big chunks couldn't be merged because of tiny chunk between them. Also you will likely spend more space on header overhead than could you save. For small sizes merging chunks doesn't happen so we lose nothing by requiring uniform capacity. For larger size this representation is still problematic and you could improve performance with another representation that also avoids alignment problem by placing metadata elsewhere(for mine only 4 bytes are needed). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 19:56 ` Ondřej Bílka @ 2018-02-28 20:17 ` Wilco Dijkstra 2018-02-28 21:56 ` Carlos O'Donell 2018-02-28 21:56 ` Ondřej Bílka 0 siblings, 2 replies; 31+ messages in thread From: Wilco Dijkstra @ 2018-02-28 20:17 UTC (permalink / raw) To: Ondřej Bílka, Florian Weimer Cc: Carlos O'Donell, Joseph Myers, libc-alpha, nd Ondřej Bílka wrote: >> I think a heap-style allocator which does not segregate allocations >> of different sizes still has its place, and why not provide one in >> glibc? >> > That isn't case for any allocator and it is asking for trouble. You want > to avoid sitation where two big chunks couldn't be merged because of > tiny chunk between them. Agreed, you always want to special case small blocks. I don't believe there is any advantage in using a single big heap. > For larger size this representation is still problematic and you could > improve performance with another representation that also avoids > alignment problem by placing metadata elsewhere(for mine only 4 bytes are needed). Larger sizes would be helped a lot once small blocks are dealt with separately. So I don't think we need complicated balanced binary trees when dealing with a small number of large blocks. You won't need an unsorted list either, large blocks can be merged in O(1) time. There may be an advantage to place meta data elsewhere, for example it could make adding/removing/walking free lists much faster (spatial locality) or to make heap overflow attacks almost impossible, Wilco ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 20:17 ` Wilco Dijkstra @ 2018-02-28 21:56 ` Carlos O'Donell 2018-02-28 21:56 ` Ondřej Bílka 1 sibling, 0 replies; 31+ messages in thread From: Carlos O'Donell @ 2018-02-28 21:56 UTC (permalink / raw) To: Wilco Dijkstra, Ondřej Bílka, Florian Weimer Cc: Joseph Myers, libc-alpha, nd On 02/28/2018 09:01 AM, Wilco Dijkstra wrote: > Ondøej BĂlka wrote: >  >>> I think a heap-style allocator which does not segregate allocations >>> of different sizes still has its place, and why not provide one in >>> glibc? >>> >> That isn't case for any allocator and it is asking for trouble. You want >> to avoid sitation where two big chunks couldn't be merged because of >> tiny chunk between them. > > Agreed, you always want to special case small blocks. I don't believe there is > any advantage in using a single big heap. > >> For larger size this representation is still problematic and you could >> improve performance with another representation that also avoids >> alignment problem by placing metadata elsewhere(for mine only 4 bytes are needed). > > Larger sizes would be helped a lot once small blocks are dealt with separately. > So I don't think we need complicated balanced binary trees when dealing with a > small number of large blocks. You won't need an unsorted list either, large blocks > can be merged in O(1) time. > > There may be an advantage to place meta data elsewhere, for example it could make > adding/removing/walking free lists much faster (spatial locality) or to make heap > overflow attacks almost impossible, I agree with many of the things you and Ondrej are proposing. The outcome of our current discussions should be an incremental plan to go from where we are today, to where we want to be tomorrow. However, I do *not* believe it is a good plan to simply throw away the present allocator and claim it should be replaced from scratch. We do not have that luxury as a core project, we must remain answerable to our users. A high-level concrete problem today with glibc's malloc, and the only problem being reported by our users is that it consumes too much RSS. Solving that problem in the abstract is what we should be looking at. If we think that having multiple heaps for different sized objects is the way to do this, then we should think about how to go down that path with an experiment. Any cleanup we do before that is a win. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2] Add malloc micro benchmark 2018-02-28 20:17 ` Wilco Dijkstra 2018-02-28 21:56 ` Carlos O'Donell @ 2018-02-28 21:56 ` Ondřej Bílka [not found] ` <xnmuzssrqr.fsf@greed.delorie.com> 1 sibling, 1 reply; 31+ messages in thread From: Ondřej Bílka @ 2018-02-28 21:56 UTC (permalink / raw) To: Wilco Dijkstra Cc: Florian Weimer, Carlos O'Donell, Joseph Myers, libc-alpha, nd On Wed, Feb 28, 2018 at 05:01:57PM +0000, Wilco Dijkstra wrote: > OndÅej BÃlka wrote: > Â > >> I think a heap-style allocator which does not segregate allocations > >> of different sizes still has its place, and why not provide one in > >> glibc? > >> > > That isn't case for any allocator and it is asking for trouble. You want > > to avoid sitation where two big chunks couldn't be merged because of > > tiny chunk between them. > > Agreed, you always want to special case small blocks. I don't believe there is > any advantage in using a single big heap. > > > For larger size this representation is still problematic and you could > > improve performance with another representation that also avoids > > alignment problem by placing metadata elsewhere(for mine only 4 bytes are needed). > > Larger sizes would be helped a lot once small blocks are dealt with separately. > So I don't think we need complicated balanced binary trees when dealing with a > small number of large blocks. You won't need an unsorted list either, large blocks > can be merged in O(1) time. > > There may be an advantage to place meta data elsewhere, for example it could make > adding/removing/walking free lists much faster (spatial locality) or to make heap > overflow attacks almost impossible, > I will answer now what I plan for larger blocks, I have new data structure mostly in head so I won't put concrete example. First for small sizes allocation would be just poping element from thread local single linked list, or calling function to refill lists with enough elements when empty. I plan to add inline version to make performance of constant small allocations same as memory pool. By using pointer to list refill could do best-fit by making multiple buckets point to same stack. This is pretty generic interface, question is for which sizes it should be used. For larger I could do best fit in O(1) with merging on free. It needs a condition like that we are rounding up size/alignment to multiple of 32 for 256-2048 range and 256 for 2048-16384 as example. Data structure would be 64 double-linked lists and 64bit integer where i-th bit says if i-th list is nonempty. Last bucket could be special to hold larger elements. Finally for larger allocations I would use page-based logic as mmaping/remapping/unmapping is about only way to actually decrease memory footprint, I didn't try that much yet. Code for allocation would be something like this if (size < 256) { bucket = (size + 15) / 16; return small_list_pop (small_list[bucket]); } else if (size < 32 * 64) { bucket = (size + 31) / 32 uint64_t t = bitmap & ((-1) << bucket); if (t) bucket = __builtin_ctzl (t); else bucket = allocate (size); return list_pop(bucket); } else if (size < 256 * 64) bucket = (size + 255) / 256; /* ditto with bigger buckets */ else /* mmap */ As free for small sizes I didn't decided yet how reclaim that to cache. For inlining it could be something simple like create single linked list of 32 elements, then call mass free for that list. For medium elements it would first determine free areas before and after free chunk, remove them from their double linked lists and unset bit if necessary. Then sum these sizes and put it into appropriate bucket. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <xnmuzssrqr.fsf@greed.delorie.com>]
* Re: [PATCH v2] Add malloc micro benchmark [not found] ` <xnmuzssrqr.fsf@greed.delorie.com> @ 2018-03-01 11:24 ` Ondřej Bílka 0 siblings, 0 replies; 31+ messages in thread From: Ondřej Bílka @ 2018-03-01 11:24 UTC (permalink / raw) To: DJ Delorie; +Cc: libc-alpha On Wed, Feb 28, 2018 at 04:56:28PM -0500, DJ Delorie wrote: > > Ond?ej B?lka <neleai@seznam.cz> writes: > > First for small sizes allocation would be just popping element from > > thread local single linked list, or calling function to refill lists > > with enough elements when empty. > > This is basically what tcache does. Although I tested a few ways of > pre-filling the cache, there's room for more research there beyond the > few algorithms I used. Some folks have been experimenting with "ideal" > values for tcache_count and tcache_max too. > That isn't case, tcache counters slow things down and are unneccessary. One could establish limits by using tricks in free. I had different idea for representation now, so I will send it in separate thread. > > Finally for larger allocations I would use page-based logic as > > mmaping/remapping/unmapping is about only way to actually decrease > > memory footprint, I didn't try that much yet. > > Note that mmap() itself is expensive (slow) and there may be a limit on > how many discrete mapped regions a kernel[*] can support. > For larger its more that one should be aware of paging as to handle memory allocated from system. One such issue is that often big allocation doesn't use pages at end. For returning memory to system I think that best strategy is something like hitting pages that weren't used last second by madvise(dont_need) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-18 15:18 ` Wilco Dijkstra 2017-12-18 16:33 ` Carlos O'Donell @ 2017-12-18 23:02 ` DJ Delorie 2017-12-28 14:09 ` Wilco Dijkstra 1 sibling, 1 reply; 31+ messages in thread From: DJ Delorie @ 2017-12-18 23:02 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: carlos, libc-alpha, nd Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > Since DJ didn't seem keen on increasing the tcache size despite it > showing major gains across a wide range of benchmarks, It's not that I'm not keen on increasing the size, it's that there are drawbacks to doing so and I don't want to base such a change on a guess (even a good guess). If you have benchmarks, let's collect them and add them to the trace corpus. I can send you my corpus. (We don't have a good solution for centrally storing such a corpus, yet) Let's run all the tests against all the options and make an informed decision, that's all. If it shows gains for synthetic benchmarks, but makes qemu slower, we need to know that. Also, as Carlos noted, there are some downstream uses where a larger cache may be detrimental. Sometimes there are no universally "better" defaults, and we provide tunables for those cases. And, as always, I can be out-voted if the consensus disagrees with me ;-) > I decided to fix the performance for the single-threaded case at > least. It's now 2.5x faster on a few sever benchmarks (of course the > next question is whether tcache is actually useful in its current > form). Again, tcache is intended to help the multi-threaded case. Your patches help the single-threaded case. If you recall, I ran your patch against my corpus of multi-threaded tests, and saw no regressions, which is good. So our paranoia here is twofold... 1. Make sure that when someone says "some benchmarks" we have those benchmarks available to us, either as a microbenchmark in glibc or as a trace we can simulate and benchmark. No more random benchmarks! :-) 2. When we say a patch "is faster", let's run all our benchmarks and make sure that we don't mean "on some benchmarks." The whole point of the trace/sim stuff is to make sure key downstream users aren't left out of the optimization work, and end up with worse performance. We probably should add "on all major architectures" too but that assumes we have machines on which we can run the benchmarks. So we should be able to answer your question, not just wonder... > I'd have to check how easy it is to force it to use the thread arena. I'm guessing we could have a glibc-internal API to tag the heap as "corrupt" which would preclude using it. > If consolidation doesn't work that's a serious bug. Sometimes it's not a case of "doesn't work" as a case of "not attempted for performance reasons". If we can show that a different design choice is universally better[*], we should change it. [*] or at least, universally-enough for a "system" allocator like glibc must provide. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-18 23:02 ` [PATCH] " DJ Delorie @ 2017-12-28 14:09 ` Wilco Dijkstra 2017-12-28 19:01 ` DJ Delorie 0 siblings, 1 reply; 31+ messages in thread From: Wilco Dijkstra @ 2017-12-28 14:09 UTC (permalink / raw) To: DJ Delorie; +Cc: carlos, libc-alpha, nd DJ Delorie wrote: > Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > > Since DJ didn't seem keen on increasing the tcache size despite it > > showing major gains across a wide range of benchmarks, > > It's not that I'm not keen on increasing the size, it's that there are > drawbacks to doing so and I don't want to base such a change on a guess > (even a good guess). If you have benchmarks, let's collect them and add > them to the trace corpus. I can send you my corpus. (We don't have a > good solution for centrally storing such a corpus, yet) Let's run all > the tests against all the options and make an informed decision, that's > all. If it shows gains for synthetic benchmarks, but makes qemu slower, > we need to know that. Yes I'd be interested in the traces. I presume they are ISA independent and can just be replayed? > Also, as Carlos noted, there are some downstream uses where a larger > cache may be detrimental. Sometimes there are no universally "better" > defaults, and we provide tunables for those cases. It depends. I've seen cases where returning pages to the OS too quickly causes a huge performance loss. I think in many of these cases we can be far smarter and use adaptive algorithms. If say 50% of your memory ends up in the tcache and you can't allocate a new block, it seems a good idea to consolidate first. If it's less than 1%, why worry about it? So short term there may be simple ways to tune tcache, eg. allow a larger number of small blocks (trivial change), or limit total bytes in the tcache (which could be dynamically increased as more memory is allocated). Longer term we need to make arena's per-thread - see below. > Again, tcache is intended to help the multi-threaded case. Your patches > help the single-threaded case. If you recall, I ran your patch against > my corpus of multi-threaded tests, and saw no regressions, which is > good. Arenas are already mostly per-thread. My observation was that the gains from tcache are due to bypassing completely uncontended locks. If an arena could be marked as owned by a thread, the fast single-threaded paths could be used all of the time (you'd have to handle frees from other threads of course but those could go in a separate bin for consolidation). > So our paranoia here is twofold... > > 1. Make sure that when someone says "some benchmarks" we have those > benchmarks available to us, either as a microbenchmark in glibc or as > a trace we can simulate and benchmark. No more random benchmarks! :-) Agreed, it's quite feasible to create more traces and more microbenchmarks. > 2. When we say a patch "is faster", let's run all our benchmarks and > make sure that we don't mean "on some benchmarks." The whole point > of the trace/sim stuff is to make sure key downstream users aren't > left out of the optimization work, and end up with worse performance. Well you can't expect gains on all benchmarks or have a "never regress anything ever" rule. Minor changes in alignment of a heap block or allocation of pages from the OS can have a large performance impact that's hard to control. The smallest possible RSS isn't always better. The goal should be to improve average performance across a wide range of applications. > We probably should add "on all major architectures" too but that assumes > we have machines on which we can run the benchmarks. Szabolcs or I would be happy to run the traces on AArch64. Wilco ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Add malloc micro benchmark 2017-12-28 14:09 ` Wilco Dijkstra @ 2017-12-28 19:01 ` DJ Delorie 0 siblings, 0 replies; 31+ messages in thread From: DJ Delorie @ 2017-12-28 19:01 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: carlos, libc-alpha, nd Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > Yes I'd be interested in the traces. I presume they are ISA independent and > can just be replayed? Yup, they're ISA-agnostic compressed pseudo-codes that run a state machine. I put them here (except the two biggest, which take days to transfer): http://www.delorie.com/malloc/ Be kind, that's my house's bandwidth ;-) The simulator is in the dj/malloc branch, in malloc/trace_run.c > I think in many of these cases we can be far smarter and use adaptive > algorithms. I'm wary of "smart" algorithms because it's so hard to generalize them to work sufficiently well in all cases, without accidentally causing super-stupid behavior in some unusual app. But yeah :-) > Longer term we need to make arena's per-thread - see below. They're sort of per-thread now, but a lot of apps pass malloc'd memory between threads. A strict one-per-thread isn't optimal either. > My observation was that the gains from tcache are due to bypassing > completely uncontended locks. Yup. > If an arena could be marked as owned by > a thread, That mark would need to be an uncontended lock... >> 2. When we say a patch "is faster", let's run all our benchmarks and >> make sure that we don't mean "on some benchmarks." The whole point >> of the trace/sim stuff is to make sure key downstream users aren't >> left out of the optimization work, and end up with worse performance. > > Well you can't expect gains on all benchmarks or have a "never regress > anything ever" rule. Sure, I just don't want surprises. Also, benchmarks from the Real World can be prioritized - a patch which improves a synthetic benchmark but makes chrome worse should be rejected. A generally good patch that makes a few apps worse should at least be investigated to find out why. Etc. > Szabolcs or I would be happy to run the traces on AArch64. That would be helpful :-) ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2018-03-01 11:24 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-12-01 13:51 [PATCH] Add malloc micro benchmark Wilco Dijkstra 2017-12-01 16:13 ` Carlos O'Donell 2017-12-18 15:18 ` Wilco Dijkstra 2017-12-18 16:33 ` Carlos O'Donell 2018-01-02 18:20 ` [PATCH v2] " Wilco Dijkstra 2018-01-02 18:45 ` DJ Delorie 2018-01-03 12:12 ` Wilco Dijkstra 2018-01-03 15:07 ` Carlos O'Donell 2018-01-04 13:48 ` Wilco Dijkstra 2018-01-04 16:37 ` Adhemerval Zanella 2018-01-05 14:32 ` Carlos O'Donell 2018-01-05 15:50 ` Adhemerval Zanella 2018-01-05 16:17 ` Carlos O'Donell 2018-01-05 16:46 ` Adhemerval Zanella 2018-01-05 17:28 ` Carlos O'Donell 2018-01-05 14:33 ` Carlos O'Donell 2018-01-05 16:28 ` Joseph Myers 2018-01-05 17:26 ` Carlos O'Donell 2018-02-28 12:57 ` Florian Weimer 2018-02-28 15:16 ` Ondřej Bílka 2018-02-28 16:16 ` Florian Weimer 2018-02-28 18:21 ` Carlos O'Donell 2018-02-28 23:17 ` Ondřej Bílka 2018-02-28 19:56 ` Ondřej Bílka 2018-02-28 20:17 ` Wilco Dijkstra 2018-02-28 21:56 ` Carlos O'Donell 2018-02-28 21:56 ` Ondřej Bílka [not found] ` <xnmuzssrqr.fsf@greed.delorie.com> 2018-03-01 11:24 ` Ondřej Bílka 2017-12-18 23:02 ` [PATCH] " DJ Delorie 2017-12-28 14:09 ` Wilco Dijkstra 2017-12-28 19:01 ` DJ Delorie
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).