From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <goldstein.w.n@gmail.com>
Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com
 [IPv6:2607:f8b0:4864:20::530])
 by sourceware.org (Postfix) with ESMTPS id 0A5D23858C60
 for <libc-alpha@sourceware.org>; Fri, 15 Oct 2021 16:40:01 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0A5D23858C60
Received: by mail-pg1-x530.google.com with SMTP id 75so9105579pga.3
 for <libc-alpha@sourceware.org>; Fri, 15 Oct 2021 09:40:00 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=LJfW+UtndfmTSfePEfjA4/sudRyGLw7UvvOJVS/TZVk=;
 b=ZtHaFHdXyhdVJdu2vvJw+aGLlYavaQnfQmQVCwO/aBSPplx+yGFZV+FVDrOrq6rWeS
 pUY/SXiESyIFPUJBuw4e+dDUhXuyKCaRnj/21h8kkse5ODDlGsxjtznTpoHFh5H/y4pX
 SWlXnXd9668RjkZOiKPnkHSFq2Mtnzg336oE5ijI8EFWyuDkF8AOsfTnRNKkWXs/zXIE
 6cxiJGVZlof0m3lJkqWmpZv/BGosdeRne70OFCcu7xXsGzK32SGAfZsV0gn14wIfMQK2
 k/ATNGW+j+8zgNZWKlhw5MMgEePbzwv+ekEr43Qn+Xvv5Vkl6t2TMk5Rq8acneQjE5Kv
 61sg==
X-Gm-Message-State: AOAM531QohjY2SSWyAT+D0qU+HSlF+s+wSQhxxEBtm2oDvj/8Spn7jWr
 DfSePDO1YIFf54+u/fcEYTNy4AEeFiCAwQNhIUc=
X-Google-Smtp-Source: ABdhPJxJIOYZROacD3FWlcgTwHDXsZG48ShtbOQ/pz26tp6pzgm1tl/76aYAyoojdouIZgULYPLIVBr6ob4BahJhiyM=
X-Received: by 2002:a63:5b09:: with SMTP id p9mr9813001pgb.321.1634315999773; 
 Fri, 15 Oct 2021 09:39:59 -0700 (PDT)
MIME-Version: 1.0
References: <20210903171144.952737-1-adhemerval.zanella@linaro.org>
 <20210903171144.952737-2-adhemerval.zanella@linaro.org>
 <CAFUsyfKw97xiQmcaRGP0QjVq2sq-KN9HbEv_BDzAFAt3UZfoMA@mail.gmail.com>
 <b8dadcca-a668-aeda-529b-b01266a16d88@linaro.org>
In-Reply-To: <b8dadcca-a668-aeda-529b-b01266a16d88@linaro.org>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Fri, 15 Oct 2021 11:39:48 -0500
Message-ID: <CAFUsyfK8oNc5=wsCZTDeaZ4_bpFnNrgzxr-vvfy=LqF22++mfg@mail.gmail.com>
Subject: Re: [PATCH v3 1/7] benchtests: Add bench-qsort
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: GNU C Library <libc-alpha@sourceware.org>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-9.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Oct 2021 16:40:03 -0000

On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/10/2021 00:19, Noah Goldstein wrote:
> > On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> This patch adds a qsort benchmark.  I tried to model the benchmark
> >> taking in consideration the possible input variation in both internal
> >> element size, element numbers, and internal state for 1. real word cases
> >> and 2. possible scenarios based on hardware characteristics.
> >>
> >> For 1. I tracked qsort usage (using a simple preload library to dump its
> >> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
> >> GCC 8 bootstrap build shows 51786641 call to qsort with the following
> >> characterics:
> >>
> >> Key: number of elements:
> >> key=2 : 39.74
> >> key=3 : 19.23
> >> key=4 : 9.77
> >> key=1 : 8.44
> >> key=0 : 6.60
> >> key=5 : 4.40
> >> key=7 : 2.37
> >> key=6 : 2.25
> >> key=9 : 1.48
> >> key=8 : 0.97
> >>
> >> Key: element size in bytes:
> >> key=8 : 91.74
> >> key=32 : 3.62
> >> key=4 : 2.42
> >> key=40 : 1.20
> >> key=16 : 0.67
> >> key=24 : 0.30
> >> key=48 : 0.05
> >> key=56 : 0.00
> >> key=1 : 0.00
> >>
> >> Key: total size (number of elements * element size):
> >> key=16 : 35.98
> >> key=24 : 18.67
> >> key=32 : 9.79
> >> key=8 : 8.28
> >> key=0 : 6.60
> >> key=40 : 4.21
> >> key=64 : 3.15
> >> key=48 : 2.24
> >> key=56 : 2.15
> >> key=80 : 1.45
> >>
> >> So for GCC:
> >>
> >>   - 80% of total qsort usage are done with 10 elements of less.
> >>   - All usages are done element size of maximum of 56 bytes.
> >>   - 90% of calls are done with array of maximum size of 80 bytes or
> >>     less.
> >>
> >> (GCC on recent version now uses a internal qsort implementation instead
> >> of the libc one).
> >>
> >> The Firefox usage was done with 2 hours of usage, with first 10 minutes
> >> activelly openning and closing different types of sites.  It resulted in
> >> 21042 calls with following characteristics:
> >>
> >> Key: number of elements:
> >> key=7 : 24.40
> >> key=1 : 10.44
> >> key=3 : 6.33
> >> key=4 : 5.81
> >> key=2 : 5.46
> >> key=6 : 4.80
> >> key=17 : 4.54
> >> key=0 : 3.07
> >> key=5 : 3.05
> >> key=9 : 2.51
> >> key=12 : 2.06
> >>
> >> Key: element size in bytes:
> >> key=8 : 94.49
> >> key=28 : 4.40
> >> key=2 : 0.70
> >> key=16 : 0.19
> >> key=36 : 0.07
> >> key=12 : 0.07
> >> key=40 : 0.07
> >> key=24 : 0.03
> >>
> >> Key: total size (number of elements * element size):
> >> key=56 : 24.20
> >> key=8 : 10.27
> >> key=24 : 6.36
> >> key=32 : 5.86
> >> key=16 : 5.46
> >> key=48 : 4.80
> >> key=476 : 3.75
> >> key=0 : 3.07
> >> key=40 : 3.05
> >> key=72 : 2.50
> >>
> >> So for Firefox:
> >>
> >>   - 72% of total qsort usage are done with 18 elements of less.
> >>   - All usages are done element size of maximum of 40 bytes.
> >>   - 70% of calls are done with array of maximum size of 476 bytes or
> >>     less.
> >>
> >> For 2. I used the idea of a machine with 3 levels of cache with sizes
> >> L1 32kb, L2 256kb, and L3 4096Kb.
> >>
> >> It resulted in a benchmark with following traits:
> >>
> >>   * It checks four types of input arrays: sorted, mostly sorted, random,
> >>     repeated, and bitonic:
> >>
> >>     - 'sorted': the array is already sorted.
> >>     - 'mostly sorted': the array will have a certain number of random
> >>        position with random values (current ratio used is 20%).
> >>     - 'random' the array will contain random elements.
> >>     - 'repeated' the array will have random elements with a certain
> >>        number (current ratio is 20%) of a repeated element distributed
> >>        randomly.
> >>     - 'bitonic': elements will be strictly increasing to up middle of the
> >>        array then strictly decreasing.  This is to trigger the
> >>        pattern-defeting input for the median of 3 pivot selection used
> >>        in quicksort implementation.
> >>
> >>   * Three elements sizes are checked: uint32_t, uint64_t, and an
> >>     element with 32 bytes (but using the uint32_t comparison checks).
> >>     These element sizes are used to 1. to avoid include the comparison
> >>     function itself and/or memory copy in sort benchmark itself, and 2.
> >>     because key of size_t are the most used for both GCC and Firefox.
> >>
> >>   * Five different element numbers: 64 (which cover mostly of used
> >>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
> >>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
> >>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
> >>     --nelem option.
> >>
> >> Checked on x86_64-linux-gnu
> >> ---
> >>  benchtests/Makefile      |   2 +-
> >>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 196 insertions(+), 1 deletion(-)
> >>  create mode 100644 benchtests/bench-qsort.c
> >>
> >> diff --git a/benchtests/Makefile b/benchtests/Makefile
> >> index 1530939a8c..781cbca8af 100644
> >> --- a/benchtests/Makefile
> >> +++ b/benchtests/Makefile
> >> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
> >>  include ../gen-locales.mk
> >>  endif
> >>
> >> -stdlib-benchset := strtod
> >> +stdlib-benchset := strtod qsort
> >>
> >>  stdio-common-benchset := sprintf
> >>
> >> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
> >> new file mode 100644
> >> index 0000000000..905aa6d7b6
> >> --- /dev/null
> >> +++ b/benchtests/bench-qsort.c
> >> @@ -0,0 +1,195 @@
> >> +/* Measure qsort function.
> >> +   Copyright (C) 2018 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include <array_length.h>
> >> +#include <bench-timing.h>
> >> +#include <errno.h>
> >> +#include <getopt.h>
> >> +#include <json-lib.h>
> >> +#include <stdio.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <stdlib/tst-qsort-common.c>
> >> +
> >> +/* Number of elements of determined type to be measured.  */
> >> +static const size_t default_nelems[] =
> >> +{
> >> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
> >> +                              on GCC build.  */
> >> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
> >> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
> >> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
> >> +};
> >> +
> >> +#define OPT_NELEM 10000
> >> +#define OPT_SEED  10001
> >> +#define CMDLINE_OPTIONS \
> >> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
> >> +  { "seed", required_argument, NULL, OPT_SEED },
> >> +
> >> +static const size_t max_nelem = 16;
> >> +static size_t *elems = NULL;
> >> +static size_t nelem = 0;
> >> +static uint64_t seed = 0;
> >> +static bool seed_set = false;
> >> +
> >> +static void
> >> +cmdline_process_function (int c)
> >> +{
> >> +  switch (c)
> >> +    {
> >> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
> >> +        The elements sizes as passed with a ':' as the delimiter, for
> >> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
> >> +      case OPT_NELEM:
> >> +        {
> >> +         elems = xmalloc (max_nelem * sizeof (size_t));
> >> +         nelem = 0;
> >> +
> >> +         char *saveptr;
> >> +         char *token;
> >> +         token = strtok_r (optarg, ":", &saveptr);
> >> +         if (token == NULL)
> >> +           {
> >> +             printf ("error: invalid --nelem value\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         do
> >> +           {
> >> +             if (nelem == max_nelem)
> >> +               {
> >> +                 printf ("error: invalid --nelem value (max elem)\n");
> >> +                 exit (EXIT_FAILURE);
> >> +               }
> >> +             elems[nelem++] = atol (token);
> >> +             token = strtok_r (saveptr, ":", &saveptr);
> >> +           } while (token != NULL);
> >> +        }
> >> +      break;
> >> +
> >> +      /* handle the --seed option to use a different seed than a random one.
> >> +        The SEED used should be a uint64_t number.  */
> >> +      case OPT_SEED:
> >> +       {
> >> +         uint64_t value = strtoull (optarg, NULL, 0);
> >> +         if (errno == ERANGE || value > UINT64_MAX)
> >> +           {
> >> +             printf ("error: seed should be a value in range of "
> >> +                     "[0, UINT64_MAX]\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         seed = value;
> >> +         seed_set = true;
> >> +       }
> >> +    }
> >> +}
> >> +
> >> +#define CMDLINE_PROCESS cmdline_process_function
> >> +
> >> +static int
> >> +do_test (void)
> >> +{
> >> +  random_init ();
> >> +
> >> +
> >> +  json_ctx_t json_ctx;
> >> +
> >> +  json_init (&json_ctx, 0, stdout);
> >> +
> >> +  json_document_begin (&json_ctx);
> >> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> >> +
> >> +  json_attr_object_begin (&json_ctx, "functions");
> >> +  json_attr_object_begin (&json_ctx, "qsort");
> >> +  json_attr_uint (&json_ctx, "seed", seed);
> >> +
> >> +  json_array_begin (&json_ctx, "results");
> >> +
> >> +  const size_t *welem = elems == NULL ? default_nelems : elems;
> >> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
> >> +
> >> +  struct test_t
> >> +  {
> >> +    size_t type_size;
> >> +    cmpfunc_t cmpfunc;
> >> +  };
> >> +  static const struct test_t tests[] =
> >> +  {
> >> +    { sizeof (uint32_t), uint32_t_cmp },
> >> +    { sizeof (uint64_t), uint64_t_cmp },
> >> +    /* Test swap with large elements.  */
> >> +    { 32,                uint32_t_cmp },
> >> +  };
> >> +
> >> +  const size_t inner_loop_iters = 16;
> >> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
> >> +    {
> >> +      for (const struct array_t *arraytype = arraytypes;
> >> +          arraytype < array_end (arraytypes);
> >> +          ++arraytype)
> >> +       {
> >> +         for (int k = 0; k < wnelem; k++)
> >> +           {
> >> +             json_element_object_begin (&json_ctx);
> >> +
> >> +             size_t arraysize = welem[k] * test->type_size;
> >> +
> >> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
> >> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
> >> +             json_attr_string (&json_ctx, "property", arraytype->name);
> >> +
> >> +             void *base = create_array (welem[k], test->type_size,
> >> +                                        arraytype->type);
> >> +             void *work = xmalloc (arraysize);
> >> +
> >> +             timing_t total = 0;
> >> +
> >> +             for (int n = 0; n < inner_loop_iters; n++)
> >> +               {
> >> +                 memcpy (work, base, arraysize);
> >> +
> >> +                 timing_t start, end, diff;
> >> +                 TIMING_NOW (start);
> >> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
> >> +                 TIMING_NOW (end);
> >> +
> >> +                 TIMING_DIFF (diff, start, end);
> >> +                 TIMING_ACCUM (total, diff);
> >> +               }
> >> +
> > To avoid overhead from timing. Maybe you could allocate / initialize
> > multiple buffers to test a given sort on. Then instead of having to
> > re-memcpy / time you just loop through the N initialized buffers
> > with one timing call. For the smaller sorts this may help decrease
> > the timing overhead.
>
> I think this make sense, although it would require more memory. I changed
> to:
>
> ---
>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>     {
>       for (size_t at = 0; at < array_length (arraytypes); at++)
>         {
>           for (int ne = 0; ne < wnelem; ne++)
>             {
>               void *inputs[INNER_LOOP_ITERS];
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 inputs[i] = create_array (welem[ne], test->type_size,
>                                           arraytypes[at].type);

I think it may pay to use a 1-dimensional array. i.e

size_t qsort_bytes = welem[ne] * test->type_size;
void * inputs = malloc(qsort_bytes * NTESTS);
// initialize [inputs, inputs + qsort_bytes]
for (int i = 1; i < NTESTS; ++i)
    memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes)

then increment inputs by 'qsort_bytes' in the test loop.

NTESTS may be fine being INNER_LOOP_ITERS but
if we are concerned about using too much memory
we could add another loop through INNER_LOOP_ITERS / NTESTS
and accumulate the time diffs.


>
>               timing_t start, end, diff;
>               TIMING_NOW (start);
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
>               TIMING_NOW (end);
>
>               TIMING_DIFF (diff, start, end);
>
>               json_element_object_begin (&json_ctx);
>               json_attr_uint (&json_ctx, "nmemb", welem[ne]);
>               json_attr_uint (&json_ctx, "type_size", test->type_size);
>               json_attr_string (&json_ctx, "property", arraytypes[at].name);
>               uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
>               json_attr_uint (&json_ctx, "timings", timing);
>               json_element_object_end (&json_ctx);
>
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 free (inputs[i]);
>             }
>         }
>     }
> ---