From: Noah Goldstein
Date: Mon, 26 Jul 2021 13:22:11 -0400
Subject: Re: [PATCH v2 2/5] benchtests: Add memset zero fill benchtest
To: "naohirot@fujitsu.com"
Cc: Wilco Dijkstra, "Lucas A. M. Magalhaes", GNU C Library
References: <20210713082214.307529-1-naohirot@fujitsu.com> <20210720063500.362313-1-naohirot@fujitsu.com>

On Mon, Jul 26, 2021 at 4:39 AM naohirot@fujitsu.com wrote:

> Hi Noah,
>
> > I see. I think 16 for the inner loop makes sense. From the x86_64
> > perspective this will keep the loop from running out of the LSD which
> > is necessary for accurate benchmarking. I guess then somewhere between
> > [2, 8] is reasonable for the outer loop?
> >
> >
> > > #define START_SIZE (16 * 1024)
> > > ...
> > > static void
> > > __attribute__((noinline, noclone))
> > > do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
> > >              int c1 __attribute ((unused)), int c2 __attribute ((unused)),
> > >              size_t n)
> > > {
> > >   size_t i, j, iters = INNER_LOOP_ITERS; // 32;
> > >   timing_t start, stop, cur, latency = 0;
> > >
> > >   for (i = 0; i < 512; i++) // for (i = 0; i < 2; i++)
> > >   {
> > >
> > >     CALL (impl, s, c1, n * 16);
> > >     TIMING_NOW (start);
> > >     for (j = 0; j < 16; j++)
> > >       CALL (impl, s + n * j, c2, n);
> > >     TIMING_NOW (stop);
> > >     TIMING_DIFF (cur, start, stop);
> > >     TIMING_ACCUM (latency, cur);
> > >   }
> > >
> > This looks good. But as you said, a much smaller value for outer loop.
>
> I made one improvement that replaced
>   CALL (impl, s, c1, n * 16);
> to
>   __builtin_memset (s, c1, n * 16);
> and tentatively chose outer loop two times such as the followings:
>
> -----
> static void
> __attribute__((noinline, noclone))
> do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
>              int c1 __attribute ((unused)), int c2 __attribute ((unused)),
>              size_t n)
> {
>   size_t i, j, iters = 32;
>   timing_t start, stop, cur, latency = 0;
>
>   for (i = 0; i < 2; i++)
>   {
>     __builtin_memset (s, c1, n * 16);
>     TIMING_NOW (start);
>     for (j = 0; j < 16; j++)
>       CALL (impl, s + n * j, c2, n);
>     TIMING_NOW (stop);
>     TIMING_DIFF (cur, start, stop);
>     TIMING_ACCUM (latency, cur);
>   }
>
>   json_element_double (json_ctx, (double) latency / (double) iters);
> }
>

Looks good!

> -----
> In case of __memset_generic on a64fx, execution of outer loop 8times
> and 2times took as follows:
>
> 8times
> real 0m26.236s
> user 0m18.806s
> sys 0m6.562s
>
> 2times
> real 0m12.956s
> user 0m5.081s
> sys 0m6.594s
>
> The performance difference is shown in a comparison graph [1],
> there is a difference at 16KB.
> This difference would not be critical if we use the performance data
> mainly to compare "before" with "after" such as master version of
> memset with patched version of memset.
>
>
> This graph[1] can be drawn as the following:
> $ cat 2times/bench-memset-zerofill.out 8times/bench-memset-zerofill.out | \
> > merge_strings4graph.sh __memset_generic 2times 8times | \
> > plot_strings.py -l -p thru -v -
>
>
> In order to use __builtin_memset() and create the comparison graph [1],
> I submitted two ground work patches [2][3].
>
> [1] https://drive.google.com/file/d/1vD1VE3pdHLoYdaAMWXtImvDlGFDHYkyx/view?usp=sharing
> [2] https://sourceware.org/pipermail/libc-alpha/2021-July/129459.html
> [3] https://sourceware.org/pipermail/libc-alpha/2021-July/129460.html
>
> Thanks.
> Naohiro
>
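
For anyone who wants to try the loop structure outside the glibc
benchtests tree, here is a minimal standalone sketch of the same idea:
refill the buffer untimed, then time 16 back-to-back calls on adjacent
n-byte blocks, twice over. It is not the benchtest itself. The names
now_ns, memset_fn and bench_zerofill are hypothetical, and
clock_gettime stands in for the TIMING_NOW / TIMING_DIFF /
TIMING_ACCUM macros, so absolute numbers will not match what
bench-memset-zerofill reports.

```c
#ifndef _POSIX_C_SOURCE
# define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std= */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define OUTER_ITERS 2   /* outer-loop count tentatively chosen in the thread */
#define INNER_ITERS 16  /* inner-loop count agreed on in the thread */

/* Hypothetical stand-in for TIMING_NOW: monotonic clock in nanoseconds.  */
static uint64_t
now_ns (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
}

/* Any function with memset's signature can be benchmarked.  */
typedef void *(*memset_fn) (void *, int, size_t);

/* Return the mean time in nanoseconds of one IMPL call filling N bytes
   with C2.  S must point to at least N * INNER_ITERS bytes.  noinline
   mirrors the attribute used on do_one_test in the thread.  */
static double __attribute__ ((noinline))
bench_zerofill (memset_fn impl, unsigned char *s, int c1, int c2, size_t n)
{
  uint64_t total = 0;

  assert (impl != NULL && s != NULL);

  for (size_t i = 0; i < OUTER_ITERS; i++)
    {
      /* Untimed: overwrite the whole buffer with C1 using plain memset,
         so the implementation under test is not called here.  */
      memset (s, c1, n * INNER_ITERS);

      uint64_t start = now_ns ();
      for (size_t j = 0; j < INNER_ITERS; j++)
        impl (s + n * j, c2, n);
      uint64_t stop = now_ns ();

      total += stop - start;
    }

  /* Divide by the number of timed calls, OUTER_ITERS * INNER_ITERS = 32,
     which matches iters = 32 in the version quoted above.  */
  return (double) total / (double) (OUTER_ITERS * INNER_ITERS);
}
```

A call such as bench_zerofill (memset, buf, 1, 0, 16 * 1024) needs buf
to be at least 16 * 16 KiB. Deriving the divisor from the two loop
counts keeps the reported mean correct if the outer count is later
moved anywhere in the [2, 8] range.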