From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <hjl.tools@gmail.com>
Received: from mail-pj1-x1032.google.com (mail-pj1-x1032.google.com
 [IPv6:2607:f8b0:4864:20::1032])
 by sourceware.org (Postfix) with ESMTPS id A3BCC3858403
 for <libc-alpha@sourceware.org>; Sat, 13 Nov 2021 19:48:29 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A3BCC3858403
Received: by mail-pj1-x1032.google.com with SMTP id iq11so9490466pjb.3
 for <libc-alpha@sourceware.org>; Sat, 13 Nov 2021 11:48:29 -0800 (PST)
MIME-Version: 1.0
References: <CAMAf5_fSS_dtMz-z-0edT6vgOxMg9dz4CUY+xCXRjPX5NhhURw@mail.gmail.com>
 <20211112191800.790574-1-skpgkp2@gmail.com>
 <20211112191800.790574-2-skpgkp2@gmail.com>
 <CAFUsyf+syaV9Vk6WhPdJ+OSbDtCaU3LwWvXB9JqKoAjkg1u4HA@mail.gmail.com>
 <CAMAf5_dBK1msQ+tUcJiNE45n7ZzOR8C53y=E9iLK1NrVrSCFsw@mail.gmail.com>
In-Reply-To: <CAMAf5_dBK1msQ+tUcJiNE45n7ZzOR8C53y=E9iLK1NrVrSCFsw@mail.gmail.com>
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sat, 13 Nov 2021 11:47:52 -0800
Message-ID: <CAMe9rOqVF4-ocCw6KYeiaAmM67FWKDEuYHiRgHO+uvchfFPyDg@mail.gmail.com>
Subject: Re: [PATCH v2 1/6] x86-64: Create microbenchmark infrastructure for
 libmvec
To: Sunil Pandey <skpgkp2@gmail.com>
Cc: Noah Goldstein <goldstein.w.n@gmail.com>,
 GNU C Library <libc-alpha@sourceware.org>
Content-Type: text/plain; charset="UTF-8"
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Sat, 13 Nov 2021 19:48:31 -0000

On Fri, Nov 12, 2021 at 2:51 PM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Fri, Nov 12, 2021 at 1:02 PM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
>
> > On Fri, Nov 12, 2021 at 1:19 PM Sunil K Pandey via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> > >
> > > Add python script to generate libmvec microbenchmark from the input
> > > values for each libmvec function using skeleton benchmark template.
> > >
> > > Creates double and float benchmarks with vector length 1, 2, 4, 8,
> > > and 16 for each libmvec function.  Vector length 1 corresponds to
> > > scalar version of function and is included for vector function perf
> > > comparison.
> > > ---
> > >  sysdeps/x86_64/fpu/Makeconfig               |  35 ++
> > >  sysdeps/x86_64/fpu/Makefile                 |  40 ++
> > >  sysdeps/x86_64/fpu/bench-libmvec-skeleton.c | 104 +++++
> > >  sysdeps/x86_64/fpu/scripts/bench_libmvec.py | 464 ++++++++++++++++++++
> > >  4 files changed, 643 insertions(+)
> > >  create mode 100644 sysdeps/x86_64/fpu/bench-libmvec-skeleton.c
> > >  create mode 100755 sysdeps/x86_64/fpu/scripts/bench_libmvec.py
> > >
> > > diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig
> > > index 24aaee1a43..503e9b5ffa 100644
> > > --- a/sysdeps/x86_64/fpu/Makeconfig
> > > +++ b/sysdeps/x86_64/fpu/Makeconfig
> > > @@ -29,6 +29,23 @@ libmvec-funcs = \
> > >    sin \
> > >    sincos \
> > >
> > > +# Define libmvec function for benchtests directory.
> > > +libmvec-bench-funcs = \
> > > +
> > > +bench-libmvec-double = \
> > > +  $(addprefix double-vlen1-, $(libmvec-bench-funcs)) \
> > > +  $(addprefix double-vlen2-, $(libmvec-bench-funcs)) \
> > > +  $(addprefix double-vlen4-, $(libmvec-bench-funcs)) \
> > > +  $(addprefix double-vlen4-avx2-, $(libmvec-bench-funcs)) \
> > > +  $(addprefix double-vlen8-, $(libmvec-bench-funcs)) \
> > > +
> > > +bench-libmvec-float = \
> > > +  $(addsuffix f, $(addprefix float-vlen1-, $(libmvec-bench-funcs))) \
> > > +  $(addsuffix f, $(addprefix float-vlen4-, $(libmvec-bench-funcs))) \
> > > +  $(addsuffix f, $(addprefix float-vlen8-, $(libmvec-bench-funcs))) \
> > > +  $(addsuffix f, $(addprefix float-vlen8-avx2-, $(libmvec-bench-funcs))) \
> > > +  $(addsuffix f, $(addprefix float-vlen16-, $(libmvec-bench-funcs))) \
> > > +
> > >  # The base libmvec ABI tests.
> > >  libmvec-abi-func-tests = \
> > >    $(addprefix test-double-libmvec-,$(libmvec-funcs)) \
> > > @@ -83,5 +100,23 @@ $(common-objpfx)libmvec.mk: $(common-objpfx)config.make
> > >            echo "  \$$(float-vlen16-arch-ext-cflags)"; \
> > >            echo; \
> > >          done; \
> > > +        echo "endif"; \
> > > +        echo "ifeq (\$$(subdir),benchtests)"; \
> > > +        for t in $(libmvec-bench-funcs); do \
> > > +          echo "CFLAGS-bench-double-vlen4-$$t.c = \\"; \
> > > +          echo "  \$$(double-vlen4-arch-ext-cflags)"; \
> > > +          echo "CFLAGS-bench-double-vlen4-avx2-$$t.c = \\"; \
> > > +          echo "  \$$(double-vlen4-arch-ext2-cflags)"; \
> > > +          echo "CFLAGS-bench-double-vlen8-$$t.c = \\"; \
> > > +          echo "  \$$(double-vlen8-arch-ext-cflags)"; \
> > > +          echo; \
> > > +          echo "CFLAGS-bench-float-vlen8-$${t}f.c = \\"; \
> > > +          echo "  \$$(float-vlen8-arch-ext-cflags)"; \
> > > +          echo "CFLAGS-bench-float-vlen8-avx2-$${t}f.c = \\"; \
> > > +          echo "  \$$(float-vlen8-arch-ext2-cflags)"; \
> > > +          echo "CFLAGS-bench-float-vlen16-$${t}f.c = \\"; \
> > > +          echo "  \$$(float-vlen16-arch-ext-cflags)"; \
> > > +          echo; \
> > > +        done; \
> > >          echo "endif") > $@T
> > >         mv -f $@T $@
> > > diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
> > > index d172ae815d..9fb587cf8f 100644
> > > --- a/sysdeps/x86_64/fpu/Makefile
> > > +++ b/sysdeps/x86_64/fpu/Makefile
> > > @@ -72,3 +72,43 @@ ifeq ($(subdir)$(config-cflags-mprefer-vector-width),mathyes)
> > >  # performance of sin and cos by more than 40% on Skylake.
> > >  CFLAGS-branred.c = -mprefer-vector-width=128
> > >  endif
> > > +
> > > +ifeq ($(subdir),benchtests)
> > > +double-vlen4-arch-ext-cflags = -mavx
> > > +double-vlen4-arch-ext2-cflags = -mavx2
> > > +double-vlen8-arch-ext-cflags = -mavx512f
> > > +
> > > +float-vlen8-arch-ext-cflags = -mavx
> > > +float-vlen8-arch-ext2-cflags = -mavx2
> > > +float-vlen16-arch-ext-cflags = -mavx512f
> > > +
> > > +bench-libmvec := $(bench-libmvec-double) $(bench-libmvec-float)
> > > +
> > > +ifeq (${BENCHSET},)
> > > +bench += $(bench-libmvec)
> > > +endif
> > > +
> > > +ifeq (${STATIC-BENCHTESTS},yes)
> > > +libmvec-benchtests = $(common-objpfx)mathvec/libmvec.a $(common-objpfx)math/libm.a
> > > +else
> > > +libmvec-benchtests = $(libmvec) $(libm)
> > > +endif
> > > +
> > > +$(addprefix $(objpfx)bench-,$(bench-libmvec-double)): $(libmvec-benchtests)
> > > +$(addprefix $(objpfx)bench-,$(bench-libmvec-float)): $(libmvec-benchtests)
> > > +bench-libmvec-deps = $(..)sysdeps/x86_64/fpu/bench-libmvec-skeleton.c bench-timing.h Makefile
> > > +
> > > +$(objpfx)bench-float-%.c: $(bench-libmvec-deps)
> > > +       { if [ -n "$($*-INCLUDE)" ]; then \
> > > +         cat $($*-INCLUDE); \
> > > +       fi; \
> > > +       $(PYTHON) $(..)sysdeps/x86_64/fpu/scripts/bench_libmvec.py $(basename $(@F)); } > $@-tmp
> > > +       mv -f $@-tmp $@
> > > +
> > > +$(objpfx)bench-double-%.c: $(bench-libmvec-deps)
> > > +       { if [ -n "$($*-INCLUDE)" ]; then \
> > > +         cat $($*-INCLUDE); \
> > > +       fi; \
> > > +       $(PYTHON) $(..)sysdeps/x86_64/fpu/scripts/bench_libmvec.py $(basename $(@F)); } > $@-tmp
> > > +       mv -f $@-tmp $@
> > > +endif
> > > diff --git a/sysdeps/x86_64/fpu/bench-libmvec-skeleton.c b/sysdeps/x86_64/fpu/bench-libmvec-skeleton.c
> > > new file mode 100644
> > > index 0000000000..d56a0c4462
> > > --- /dev/null
> > > +++ b/sysdeps/x86_64/fpu/bench-libmvec-skeleton.c
> > > @@ -0,0 +1,104 @@
> > > +/* Skeleton for libmvec benchmark programs.
> > > +   Copyright (C) 2021 Free Software Foundation, Inc.
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library; if not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +
> > > +#include <string.h>
> > > +#include <stdint.h>
> > > +#include <stdbool.h>
> > > +#include <stdio.h>
> > > +#include <time.h>
> > > +#include <inttypes.h>
> > > +#include <bench-timing.h>
> > > +#include <json-lib.h>
> > > +#include <bench-util.h>
> > > +
> > > +#include <bench-util.c>
> > > +#include <math-tests-arch.h>
> > > +#define D_ITERS 10000
> > > +
> > > +int
> > > +main (int argc, char **argv)
> > > +{
> > > +  unsigned long i, k;
> > > +  timing_t start, end;
> > > +  json_ctx_t json_ctx;
> > > +
> > > +#if defined REQUIRE_AVX
> > > +  if (!CPU_FEATURE_ACTIVE (AVX))
> > > +    {
> > > +      printf ("AVX not supported.\n");
> > > +      return 0;
> > > +    }
> > > +#elif defined REQUIRE_AVX2
> > > +  if (!CPU_FEATURE_ACTIVE (AVX2))
> > > +    {
> > > +      printf ("AVX2 not supported.\n");
> > > +      return 0;
> > > +    }
> > > +#elif defined REQUIRE_AVX512F
> > > +  if (!CPU_FEATURE_ACTIVE (AVX512F))
> > > +    {
> > > +      printf ("AVX512F not supported.\n");
> > > +      return 0;
> > > +    }
> > > +#endif
> > > +
> > > +  bench_start ();
> > > +
> > > +#ifdef BENCH_INIT
> > > +  BENCH_INIT ();
> > > +#endif
> > > +
> > > +  json_init (&json_ctx, 2, stdout);
> > > +
> > > +  /* Begin function.  */
> > > +  json_attr_object_begin (&json_ctx, FUNCNAME);
> > > +
> > > +  for (int v = 0; v < NUM_VARIANTS; v++)
> > > +    {
> > > +      double d_total_time = 0;
> > > +      uint64_t cur;
> >
> > Think these should also be type `timing_t`
> >
>
> I do not see a difference whether I use timing_t or uint64_t. In any
> case, the variable cur stores the difference between the start and end
> times, not a time.
>
>
> >
> > > +      for (k = 0; k < D_ITERS; k++)
> > > +       {
> > > +         TIMING_NOW (start);
> > > +         for (i = 0; i < NUM_SAMPLES (v); i++)
> >
> > What is the rationale for both `D_ITERS` and `NUM_SAMPLES (v)`? Why not
> > one loop that iterates for `D_ITERS * NUM_SAMPLES (v)`?
> >
>
> D_ITERS defines how many times each variant's full data set is run.
> NUM_SAMPLES (v) is the number of data sets in variant v. Indices v and i
> select the i'th data set from variant v, and the vector function is
> called on it. Having two loops simplifies the logic.
>
>
> > > +           BENCH_FUNC (v, i);
> > > +         TIMING_NOW (end);
> > > +
> > > +         TIMING_DIFF (cur, start, end);
> > > +
> > > +         d_total_time += cur;
> >
> > Think this should be `TIMING_ACCUM(d_total_time, cur)`.
> >
>
> There is not much difference whether I use TIMING_ACCUM or simply add
> cur to d_total_time.
>

Please use TIMING_ACCUM (d_total_time, cur) to be consistent with
TIMING_DIFF (cur, start, end).
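
As an illustration only, the timed loop could look roughly like the
sketch below, with cur declared as timing_t and the total accumulated
through TIMING_ACCUM.  The macros here are simplified local stand-ins so
that the sketch compiles outside the glibc tree; the real definitions
come from bench-timing.h and may differ.  The names sketch_now_ns,
sketch_run_variant, sketch_workload and sketch_calls are hypothetical
and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumption: simplified stand-ins for the bench-timing.h interface.  */
typedef uint64_t timing_t;

/* Read the current time in nanoseconds using C11 timespec_get.  */
static inline uint64_t
sketch_now_ns (void)
{
  struct timespec ts;
  timespec_get (&ts, TIME_UTC);
  return (uint64_t) ts.tv_sec * UINT64_C (1000000000) + (uint64_t) ts.tv_nsec;
}

#define TIMING_NOW(var) ((var) = sketch_now_ns ())
#define TIMING_DIFF(diff, start, end) ((diff) = (end) - (start))
#define TIMING_ACCUM(sum, diff) ((sum) += (diff))

/* Hypothetical callback type standing in for BENCH_FUNC (v, i).  */
typedef void (*bench_fn) (size_t sample);

/* Hypothetical workload that only counts how often it is called.  */
unsigned long sketch_calls;

void
sketch_workload (size_t sample)
{
  (void) sample;
  sketch_calls++;
}

/* Run the full data set ITERS times and return the accumulated time.
   The outer loop plays the role of D_ITERS, the inner loop the role of
   NUM_SAMPLES (v).  */
double
sketch_run_variant (bench_fn fn, size_t num_samples, unsigned long iters)
{
  double d_total_time = 0;
  timing_t start, end, cur;

  assert (fn != NULL);

  for (unsigned long k = 0; k < iters; k++)
    {
      TIMING_NOW (start);
      for (size_t i = 0; i < num_samples; i++)
        fn (i);
      TIMING_NOW (end);

      /* Both steps go through the timing macros, so the loop does not
         depend on how timing_t is represented.  */
      TIMING_DIFF (cur, start, end);
      TIMING_ACCUM (d_total_time, cur);
    }

  return d_total_time;
}
```

Calling sketch_run_variant (sketch_workload, 4, 10) would invoke the
workload 40 times and return the summed interval in nanoseconds.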

Thanks.


-- 
H.J.