From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) by sourceware.org (Postfix) with ESMTPS id 0A5D23858C60 for ; Fri, 15 Oct 2021 16:40:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0A5D23858C60 Received: by mail-pg1-x530.google.com with SMTP id 75so9105579pga.3 for ; Fri, 15 Oct 2021 09:40:00 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=LJfW+UtndfmTSfePEfjA4/sudRyGLw7UvvOJVS/TZVk=; b=ZtHaFHdXyhdVJdu2vvJw+aGLlYavaQnfQmQVCwO/aBSPplx+yGFZV+FVDrOrq6rWeS pUY/SXiESyIFPUJBuw4e+dDUhXuyKCaRnj/21h8kkse5ODDlGsxjtznTpoHFh5H/y4pX SWlXnXd9668RjkZOiKPnkHSFq2Mtnzg336oE5ijI8EFWyuDkF8AOsfTnRNKkWXs/zXIE 6cxiJGVZlof0m3lJkqWmpZv/BGosdeRne70OFCcu7xXsGzK32SGAfZsV0gn14wIfMQK2 k/ATNGW+j+8zgNZWKlhw5MMgEePbzwv+ekEr43Qn+Xvv5Vkl6t2TMk5Rq8acneQjE5Kv 61sg== X-Gm-Message-State: AOAM531QohjY2SSWyAT+D0qU+HSlF+s+wSQhxxEBtm2oDvj/8Spn7jWr DfSePDO1YIFf54+u/fcEYTNy4AEeFiCAwQNhIUc= X-Google-Smtp-Source: ABdhPJxJIOYZROacD3FWlcgTwHDXsZG48ShtbOQ/pz26tp6pzgm1tl/76aYAyoojdouIZgULYPLIVBr6ob4BahJhiyM= X-Received: by 2002:a63:5b09:: with SMTP id p9mr9813001pgb.321.1634315999773; Fri, 15 Oct 2021 09:39:59 -0700 (PDT) MIME-Version: 1.0 References: <20210903171144.952737-1-adhemerval.zanella@linaro.org> <20210903171144.952737-2-adhemerval.zanella@linaro.org> In-Reply-To: From: Noah Goldstein Date: Fri, 15 Oct 2021 11:39:48 -0500 Message-ID: Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort To: Adhemerval Zanella Cc: GNU C Library Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Oct 2021 16:40:03 -0000 On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella wrote: > > > > On 13/10/2021 00:19, Noah Goldstein wrote: > > On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha > > wrote: > >> > >> This patch adds a qsort benchmark. I tried to model the benchmark > >> taking in consideration the possible input variation in both internal > >> element size, element numbers, and internal state for 1. real word cases > >> and 2. possible scenarios based on hardware characteristics. > >> > >> For 1. I tracked qsort usage (using a simple preload library to dump its > >> usage and a script to pos-process it) on both GCC bootstrap and Firefox. > >> GCC 8 bootstrap build shows 51786641 call to qsort with the following > >> characterics: > >> > >> Key: number of elements: > >> key=2 : 39.74 > >> key=3 : 19.23 > >> key=4 : 9.77 > >> key=1 : 8.44 > >> key=0 : 6.60 > >> key=5 : 4.40 > >> key=7 : 2.37 > >> key=6 : 2.25 > >> key=9 : 1.48 > >> key=8 : 0.97 > >> > >> Key: element size in bytes: > >> key=8 : 91.74 > >> key=32 : 3.62 > >> key=4 : 2.42 > >> key=40 : 1.20 > >> key=16 : 0.67 > >> key=24 : 0.30 > >> key=48 : 0.05 > >> key=56 : 0.00 > >> key=1 : 0.00 > >> > >> Key: total size (number of elements * element size): > >> key=16 : 35.98 > >> key=24 : 18.67 > >> key=32 : 9.79 > >> key=8 : 8.28 > >> key=0 : 6.60 > >> key=40 : 4.21 > >> key=64 : 3.15 > >> key=48 : 2.24 > >> key=56 : 2.15 > >> key=80 : 1.45 > >> > >> So for GCC: > >> > >> - 80% of total qsort usage are done with 10 elements of less. > >> - All usages are done element size of maximum of 56 bytes. > >> - 90% of calls are done with array of maximum size of 80 bytes or > >> less. > >> > >> (GCC on recent version now uses a internal qsort implementation instead > >> of the libc one). > >> > >> The Firefox usage was done with 2 hours of usage, with first 10 minutes > >> activelly openning and closing different types of sites. It resulted in > >> 21042 calls with following characteristics: > >> > >> Key: number of elements: > >> key=7 : 24.40 > >> key=1 : 10.44 > >> key=3 : 6.33 > >> key=4 : 5.81 > >> key=2 : 5.46 > >> key=6 : 4.80 > >> key=17 : 4.54 > >> key=0 : 3.07 > >> key=5 : 3.05 > >> key=9 : 2.51 > >> key=12 : 2.06 > >> > >> Key: element size in bytes: > >> key=8 : 94.49 > >> key=28 : 4.40 > >> key=2 : 0.70 > >> key=16 : 0.19 > >> key=36 : 0.07 > >> key=12 : 0.07 > >> key=40 : 0.07 > >> key=24 : 0.03 > >> > >> Key: total size (number of elements * element size): > >> key=56 : 24.20 > >> key=8 : 10.27 > >> key=24 : 6.36 > >> key=32 : 5.86 > >> key=16 : 5.46 > >> key=48 : 4.80 > >> key=476 : 3.75 > >> key=0 : 3.07 > >> key=40 : 3.05 > >> key=72 : 2.50 > >> > >> So for Firefox: > >> > >> - 72% of total qsort usage are done with 18 elements of less. > >> - All usages are done element size of maximum of 40 bytes. > >> - 70% of calls are done with array of maximum size of 476 bytes or > >> less. > >> > >> For 2. I used the idea of a machine with 3 levels of cache with sizes > >> L1 32kb, L2 256kb, and L3 4096Kb. > >> > >> It resulted in a benchmark with following traits: > >> > >> * It checks four types of input arrays: sorted, mostly sorted, random, > >> repeated, and bitonic: > >> > >> - 'sorted': the array is already sorted. > >> - 'mostly sorted': the array will have a certain number of random > >> position with random values (current ratio used is 20%). > >> - 'random' the array will contain random elements. > >> - 'repeated' the array will have random elements with a certain > >> number (current ratio is 20%) of a repeated element distributed > >> randomly. > >> - 'bitonic': elements will be strictly increasing to up middle of the > >> array then strictly decreasing. This is to trigger the > >> pattern-defeting input for the median of 3 pivot selection used > >> in quicksort implementation. > >> > >> * Three elements sizes are checked: uint32_t, uint64_t, and an > >> element with 32 bytes (but using the uint32_t comparison checks). > >> These element sizes are used to 1. to avoid include the comparison > >> function itself and/or memory copy in sort benchmark itself, and 2. > >> because key of size_t are the most used for both GCC and Firefox. > >> > >> * Five different element numbers: 64 (which cover mostly of used > >> elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB > >> of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and > >> 24288/1048576 (L3 with 4 MB). The sizes are configurable by > >> --nelem option. > >> > >> Checked on x86_64-linux-gnu > >> --- > >> benchtests/Makefile | 2 +- > >> benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++ > >> 2 files changed, 196 insertions(+), 1 deletion(-) > >> create mode 100644 benchtests/bench-qsort.c > >> > >> diff --git a/benchtests/Makefile b/benchtests/Makefile > >> index 1530939a8c..781cbca8af 100644 > >> --- a/benchtests/Makefile > >> +++ b/benchtests/Makefile > >> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \ > >> include ../gen-locales.mk > >> endif > >> > >> -stdlib-benchset := strtod > >> +stdlib-benchset := strtod qsort > >> > >> stdio-common-benchset := sprintf > >> > >> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c > >> new file mode 100644 > >> index 0000000000..905aa6d7b6 > >> --- /dev/null > >> +++ b/benchtests/bench-qsort.c > >> @@ -0,0 +1,195 @@ > >> +/* Measure qsort function. > >> + Copyright (C) 2018 Free Software Foundation, Inc. > >> + This file is part of the GNU C Library. > >> + > >> + The GNU C Library is free software; you can redistribute it and/or > >> + modify it under the terms of the GNU Lesser General Public > >> + License as published by the Free Software Foundation; either > >> + version 2.1 of the License, or (at your option) any later version. > >> + > >> + The GNU C Library is distributed in the hope that it will be useful, > >> + but WITHOUT ANY WARRANTY; without even the implied warranty of > >> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > >> + Lesser General Public License for more details. > >> + > >> + You should have received a copy of the GNU Lesser General Public > >> + License along with the GNU C Library; if not, see > >> + . */ > >> + > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> +#include > >> + > >> +#include > >> + > >> +/* Number of elements of determined type to be measured. */ > >> +static const size_t default_nelems[] = > >> +{ > >> + 256/sizeof(size_t), /* 64/128 which covers mostly used element number > >> + on GCC build. */ > >> + 32768/sizeof(size_t), /* 4096/8192 to fit on a L1 with 32 KB. */ > >> + 262144/sizeof(size_t), /* 32768/65536 to fit on a L2 with 256 KB. */ > >> + 4194304/sizeof(size_t), /* 524288/1048576 to fix on a L3 with 4 MB. */ > >> +}; > >> + > >> +#define OPT_NELEM 10000 > >> +#define OPT_SEED 10001 > >> +#define CMDLINE_OPTIONS \ > >> + { "nelem", required_argument, NULL, OPT_NELEM }, \ > >> + { "seed", required_argument, NULL, OPT_SEED }, > >> + > >> +static const size_t max_nelem = 16; > >> +static size_t *elems = NULL; > >> +static size_t nelem = 0; > >> +static uint64_t seed = 0; > >> +static bool seed_set = false; > >> + > >> +static void > >> +cmdline_process_function (int c) > >> +{ > >> + switch (c) > >> + { > >> + /* Handle the --nelem option to run different sizes than DEFAULT_ELEM. > >> + The elements sizes as passed with a ':' as the delimiter, for > >> + instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements. */ > >> + case OPT_NELEM: > >> + { > >> + elems = xmalloc (max_nelem * sizeof (size_t)); > >> + nelem = 0; > >> + > >> + char *saveptr; > >> + char *token; > >> + token = strtok_r (optarg, ":", &saveptr); > >> + if (token == NULL) > >> + { > >> + printf ("error: invalid --nelem value\n"); > >> + exit (EXIT_FAILURE); > >> + } > >> + do > >> + { > >> + if (nelem == max_nelem) > >> + { > >> + printf ("error: invalid --nelem value (max elem)\n"); > >> + exit (EXIT_FAILURE); > >> + } > >> + elems[nelem++] = atol (token); > >> + token = strtok_r (saveptr, ":", &saveptr); > >> + } while (token != NULL); > >> + } > >> + break; > >> + > >> + /* handle the --seed option to use a different seed than a random one. > >> + The SEED used should be a uint64_t number. */ > >> + case OPT_SEED: > >> + { > >> + uint64_t value = strtoull (optarg, NULL, 0); > >> + if (errno == ERANGE || value > UINT64_MAX) > >> + { > >> + printf ("error: seed should be a value in range of " > >> + "[0, UINT64_MAX]\n"); > >> + exit (EXIT_FAILURE); > >> + } > >> + seed = value; > >> + seed_set = true; > >> + } > >> + } > >> +} > >> + > >> +#define CMDLINE_PROCESS cmdline_process_function > >> + > >> +static int > >> +do_test (void) > >> +{ > >> + random_init (); > >> + > >> + > >> + json_ctx_t json_ctx; > >> + > >> + json_init (&json_ctx, 0, stdout); > >> + > >> + json_document_begin (&json_ctx); > >> + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); > >> + > >> + json_attr_object_begin (&json_ctx, "functions"); > >> + json_attr_object_begin (&json_ctx, "qsort"); > >> + json_attr_uint (&json_ctx, "seed", seed); > >> + > >> + json_array_begin (&json_ctx, "results"); > >> + > >> + const size_t *welem = elems == NULL ? default_nelems : elems; > >> + const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem; > >> + > >> + struct test_t > >> + { > >> + size_t type_size; > >> + cmpfunc_t cmpfunc; > >> + }; > >> + static const struct test_t tests[] = > >> + { > >> + { sizeof (uint32_t), uint32_t_cmp }, > >> + { sizeof (uint64_t), uint64_t_cmp }, > >> + /* Test swap with large elements. */ > >> + { 32, uint32_t_cmp }, > >> + }; > >> + > >> + const size_t inner_loop_iters = 16; > >> + for (const struct test_t *test = tests; test < array_end (tests); ++test) > >> + { > >> + for (const struct array_t *arraytype = arraytypes; > >> + arraytype < array_end (arraytypes); > >> + ++arraytype) > >> + { > >> + for (int k = 0; k < wnelem; k++) > >> + { > >> + json_element_object_begin (&json_ctx); > >> + > >> + size_t arraysize = welem[k] * test->type_size; > >> + > >> + json_attr_uint (&json_ctx, "nmemb", welem[k]); > >> + json_attr_uint (&json_ctx, "type_size", test->type_size); > >> + json_attr_string (&json_ctx, "property", arraytype->name); > >> + > >> + void *base = create_array (welem[k], test->type_size, > >> + arraytype->type); > >> + void *work = xmalloc (arraysize); > >> + > >> + timing_t total = 0; > >> + > >> + for (int n = 0; n < inner_loop_iters; n++) > >> + { > >> + memcpy (work, base, arraysize); > >> + > >> + timing_t start, end, diff; > >> + TIMING_NOW (start); > >> + qsort (work, welem[k], test->type_size, test->cmpfunc); > >> + TIMING_NOW (end); > >> + > >> + TIMING_DIFF (diff, start, end); > >> + TIMING_ACCUM (total, diff); > >> + } > >> + > > To avoid overhead from timing. Maybe you could allocate / initialize > > multiple buffers to test a given sort on. Then instead of having to > > re-memcpy / time you just loop through the N initialized buffers > > with one timing call. For the smaller sorts this may help decrease > > the timing overhead. > > I think this make sense, although it would require more memory. I changed > to: > > --- > for (const struct test_t *test = tests; test < array_end (tests); ++test) > { > for (size_t at = 0; at < array_length (arraytypes); at++) > { > for (int ne = 0; ne < wnelem; ne++) > { > void *inputs[INNER_LOOP_ITERS]; > for (int i = 0; i < INNER_LOOP_ITERS; i++) > inputs[i] = create_array (welem[ne], test->type_size, > arraytypes[at].type); I think it may pay to use a 1-dimensional array. i.e size_t qsort_bytes = welem[ne] * test->type_size; void * inputs = malloc(qsort_bytes * NTESTS); // initialize [inputs, inputs + qsort_bytes] for (int i = 1; i < NTESTS; ++i) memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes) then increment inputs by 'qsort_bytes' in the test loop. NTESTS may be fine being INNER_LOOP_ITERS but if we are concerned about using too much memory we could add another loop through INNER_LOOP_ITERS / NTESTS and accumulate the time diffs. > > timing_t start, end, diff; > TIMING_NOW (start); > for (int i = 0; i < INNER_LOOP_ITERS; i++) > qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc); > TIMING_NOW (end); > > TIMING_DIFF (diff, start, end); > > json_element_object_begin (&json_ctx); > json_attr_uint (&json_ctx, "nmemb", welem[ne]); > json_attr_uint (&json_ctx, "type_size", test->type_size); > json_attr_string (&json_ctx, "property", arraytypes[at].name); > uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS; > json_attr_uint (&json_ctx, "timings", timing); > json_element_object_end (&json_ctx); > > for (int i = 0; i < INNER_LOOP_ITERS; i++) > free (inputs[i]); > } > } > } > ---