From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <goldstein.w.n@gmail.com>
Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com
 [IPv6:2607:f8b0:4864:20::102b])
 by sourceware.org (Postfix) with ESMTPS id AD5E03858414
 for <libc-alpha@sourceware.org>; Mon, 26 Jul 2021 17:22:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AD5E03858414
Received: by mail-pj1-x102b.google.com with SMTP id
 m2-20020a17090a71c2b0290175cf22899cso1032878pjs.2
 for <libc-alpha@sourceware.org>; Mon, 26 Jul 2021 10:22:23 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=cNwGfqnsjA38doIrpOQuI3nUQ37/6bMqO4vc0umpBsI=;
 b=Mm5sVU/y7jyRD9O32KDQ6ae2skxHJlZ5lHBW/j3NacPYdLqKlrawl0wNYdBKKvr0rp
 /WthkSqbNt9mG3pIf4FWklNnBaeMVaIa5y3i4+DG/fTZDFc9+nLhRpoR7UH2HmpZgu0R
 p3QRY7zvs0lbBQ1mEHTBvMxdDw7Fx0qr/BjNHPBDEdccE1SrEd2McG2dgpHRY2Hiob7h
 rk5TA1y7XHT6QJvKSlq26ls4DjOiBisVMGMWnA997nGU9CvDXW8wdjxWiEiP6E3bNng7
 IaW5Ld6OsfQ32tEZ5n9g+P3YW7anmS/gicYGG5XTi/oYZtC3mL8dm6I3W57pkV4Cy+dh
 rKdw==
X-Gm-Message-State: AOAM530F5ouO7MrDcN/7m0210qMlZPNoUyZ7SwW6NiFVXtX5pkIz2WqQ
 MTYjAJF4OoPVq1c0VA3ylPPNg/Sprd7+sQZAuFQ=
X-Google-Smtp-Source: ABdhPJxRftlYzZ3djQJ601vLqoCsHNLfeZife5jniMVUzTPXPpiUBIyJgh+knlEU9E16CqjNYK2hssDBBUrN5esIrhs=
X-Received: by 2002:a17:90a:7e13:: with SMTP id
 i19mr11730762pjl.177.1627320142838; 
 Mon, 26 Jul 2021 10:22:22 -0700 (PDT)
MIME-Version: 1.0
References: <20210713082214.307529-1-naohirot@fujitsu.com>
 <20210720063500.362313-1-naohirot@fujitsu.com>
 <CAFUsyf+ofuxmBj_jpMrgeQb48BB=1A43iSVCgTRN3x6p22e6cQ@mail.gmail.com>
 <TYAPR01MB6025720E6684DFC3BBEC7474DFE89@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <TYAPR01MB6025720E6684DFC3BBEC7474DFE89@TYAPR01MB6025.jpnprd01.prod.outlook.com>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon, 26 Jul 2021 13:22:11 -0400
Message-ID: <CAFUsyfJgPUNBK0tW6LRJbUz+ewVLyzALcMGdE2LoXZ5=+2s0Ww@mail.gmail.com>
Subject: Re: [PATCH v2 2/5] benchtests: Add memset zero fill benchtest
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
 "Lucas A. M. Magalhaes" <lamm@linux.ibm.com>, 
 GNU C Library <libc-alpha@sourceware.org>
X-Spam-Status: No, score=-3.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, HTML_MESSAGE,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
Content-Type: text/plain; charset="UTF-8"
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jul 2021 17:22:25 -0000

On Mon, Jul 26, 2021 at 4:39 AM naohirot@fujitsu.com <naohirot@fujitsu.com>
wrote:

> Hi Noah,
>
> > I see. I think 16 for the inner loop makes sense. From the x86_64
> > perspective this will keep the loop from running out of the LSD, which
> > is necessary for accurate benchmarking. I guess then somewhere between
> > [2, 8] is reasonable for the outer loop?
> >
> >
> > > #define START_SIZE (16 * 1024)
> > > ...
> > > static void
> > > __attribute__((noinline, noclone))
> > > do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
> > >              int c1 __attribute ((unused)), int c2 __attribute ((unused)),
> > >              size_t n)
> > > {
> > >   size_t i, j, iters = INNER_LOOP_ITERS; // 32;
> > >   timing_t start, stop, cur, latency = 0;
> > >
> > >   for (i = 0; i < 512; i++) // for (i = 0; i < 2; i++)
> > >     {
> > >
> > >       CALL (impl, s, c1, n * 16);
> > >       TIMING_NOW (start);
> > >       for (j = 0; j < 16; j++)
> > >         CALL (impl, s + n * j, c2, n);
> > >       TIMING_NOW (stop);
> > >       TIMING_DIFF (cur, start, stop);
> > >       TIMING_ACCUM (latency, cur);
> > >     }
> > >
> > This looks good. But as you said, a much smaller value for outer loop.
>
> I made one improvement: I replaced
>   CALL (impl, s, c1, n * 16);
> with
>   __builtin_memset (s, c1, n * 16);
> and tentatively set the outer loop to two iterations, as follows:
>
> -----
> static void
> __attribute__((noinline, noclone))
> do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
>              int c1 __attribute ((unused)), int c2 __attribute ((unused)),
>              size_t n)
> {
>   size_t i, j, iters = 32;
>   timing_t start, stop, cur, latency = 0;
>
>   for (i = 0; i < 2; i++)
>     {
>       __builtin_memset (s, c1, n * 16);
>       TIMING_NOW (start);
>       for (j = 0; j < 16; j++)
>         CALL (impl, s + n * j, c2, n);
>       TIMING_NOW (stop);
>       TIMING_DIFF (cur, start, stop);
>       TIMING_ACCUM (latency, cur);
>     }
>
>   json_element_double (json_ctx, (double) latency / (double) iters);
> }
>

Looks good!

> -----
>
> In the case of __memset_generic on A64FX, running the outer loop 8 times
> and 2 times took the following:
>
> 8times
> real    0m26.236s
> user    0m18.806s
> sys     0m6.562s
>
> 2times
> real    0m12.956s
> user    0m5.081s
> sys     0m6.594s
>
> The performance difference is shown in a comparison graph [1]; the two
> runs differ at 16KB.
> This difference should not be critical if we use the performance data
> mainly to compare "before" with "after", such as the master version of
> memset against the patched version.
>
>
> This graph [1] can be drawn as follows:
>
> $ cat 2times/bench-memset-zerofill.out 8times/bench-memset-zerofill.out | \
> > merge_strings4graph.sh __memset_generic 2times 8times | \
> > plot_strings.py -l -p thru -v -
>
>
> In order to use __builtin_memset() and create the comparison graph [1],
> I submitted two groundwork patches [2][3].
>
> [1]
> https://drive.google.com/file/d/1vD1VE3pdHLoYdaAMWXtImvDlGFDHYkyx/view?usp=sharing
> [2] https://sourceware.org/pipermail/libc-alpha/2021-July/129459.html
> [3] https://sourceware.org/pipermail/libc-alpha/2021-July/129460.html
>
> Thanks.
> Naohiro
>
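
For anyone who wants to try the loop structure outside of benchtests, here
is a rough standalone sketch of the same pattern: an untimed refill with c1,
then 16 timed calls over consecutive n-byte chunks, repeated twice. It is
only an illustration, not the benchtest itself. The names now_ns,
memset_fn, bench_zerofill, INNER_ITERS and OUTER_ITERS are made up for this
example, and clock_gettime stands in for the TIMING_NOW / TIMING_DIFF
macros, so absolute numbers will not be comparable with bench-memset output.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define INNER_ITERS 16 /* timed calls per outer iteration, as in the loop above */
#define OUTER_ITERS 2  /* outer repetitions, the tentative value from the thread */

/* Hypothetical stand-in for TIMING_NOW: monotonic clock in nanoseconds.  */
static uint64_t
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
}

/* Hypothetical type for the implementation under test.  */
typedef void *(*memset_fn) (void *, int, size_t);

/* Mean nanoseconds per timed call.  S must hold at least
   N * INNER_ITERS bytes.  */
static double
bench_zerofill (memset_fn impl, unsigned char *s, int c1, int c2, size_t n)
{
  uint64_t total = 0;

  assert (s != NULL);
  for (size_t i = 0; i < OUTER_ITERS; i++)
    {
      /* Untimed: refill the whole region with c1 so that every timed
         call below overwrites data that differs from c2.  */
      memset (s, c1, n * INNER_ITERS);

      uint64_t start = now_ns ();
      for (size_t j = 0; j < INNER_ITERS; j++)
        impl (s + n * j, c2, n);
      total += now_ns () - start;
    }

  return (double) total / (double) (OUTER_ITERS * INNER_ITERS);
}
```

A caller would allocate n * INNER_ITERS bytes and invoke something like
bench_zerofill (memset, buf, 1, 0, n) for each size of interest. Dividing
by OUTER_ITERS * INNER_ITERS plays the same role as iters = 32 in the
quoted do_one_test.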