From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.guenther@gmail.com>
Received: from mail-ed1-x536.google.com (mail-ed1-x536.google.com [IPv6:2a00:1450:4864:20::536])
	by sourceware.org (Postfix) with ESMTPS id 8A1E53857815
	for <gcc-patches@gcc.gnu.org>; Tue, 25 Oct 2022 11:54:05 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8A1E53857815
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-ed1-x536.google.com with SMTP id y12so15521195edc.9
        for <gcc-patches@gcc.gnu.org>; Tue, 25 Oct 2022 04:54:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=qYL6KNou0YaAqxprzQqNBYnqhfSpP36weTLO9p0b0cQ=;
        b=FfFGBKoJoA3gz6INNKd9m9nEWEjdH7Er1Dere5l7CQnkvpCv3W/ugWfDZkmaIf9ony
         FYfS95QlN44sqdhhJDBC6jhdEfcQvnMiqlQv9CxexNN4N72dWMFFFEQwwOPp18+C7qn9
         m3jSr5nTGMAU18tH+Faa9MwO2SMOxw0lYr4FPbKNx0VIsfThifLvkWjSpJcyVqjG2Dya
         ZA26+t2blEuaD12DF4KOiiNshz7kmumAbnsSHjFduHuLmWgkmP3oKHwCwtOPv6A6YfOr
         /V63+qZWCbMd2m2XHJ4dbdLDpCh0ejFZPxH2MPFWmsKpMxzRiwCAY+BFVkh/newiaC+M
         RmUg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=qYL6KNou0YaAqxprzQqNBYnqhfSpP36weTLO9p0b0cQ=;
        b=OkvsQn/petVGRy5vJcYbvrY1cDg9lgJMDx067ggh3WLIwEF0OALpF+XHjqLC6tIWbj
         GchHlqWTWzj5jjgEKP/2mACoJ4szTCGiOpAUa18poPvpU8hkvH8sTa9kMnCwh+xKHJtC
         S5bHjwzAJep+hWSr9DUoqVy9+pP/bbMBhz82i8R1idi2hZwrhNkFIuOXxpg26TttAphX
         53tQ3ume5JrwAfzyJd0IR/LV9rAwwo0fJa8mA0VUCOyLrF2ThKdP9014J+1DAUxG9HRT
         7ktP/rTGbNpmllRlcICQD+/A/oQbvHF387dh0p65yUYq76cp6l34D4vFJRWm2SZKyoI9
         9ncw==
X-Gm-Message-State: ACrzQf0a3Qy1FxeouhTE7ycyVeOnoVENyT3QYO4sGVEBwGzE/pGRUkH2
	JZatzbzhfhVhk4AbDAoF9tVLxN2tlGJD0HKJ1uc=
X-Google-Smtp-Source: AMsMyM5BnAzT8vPNxe3EkoqHYdiB/D1/P0nyTok+GNflAgpMZ7o4uCYupEsPQOhyemCrVi+p94Fle01AwZBqEthvO7U=
X-Received: by 2002:a05:6402:550c:b0:443:7d15:d57f with SMTP id
 fi12-20020a056402550c00b004437d15d57fmr35700445edb.147.1666698844088; Tue, 25
 Oct 2022 04:54:04 -0700 (PDT)
MIME-Version: 1.0
References: <20221024024604.18324-1-lili.cui@intel.com> <CAMZc-bxuN27fbPRFU9X253ZUJ_Km4POb6q-583XgSixcD7LNZA@mail.gmail.com>
In-Reply-To: <CAMZc-bxuN27fbPRFU9X253ZUJ_Km4POb6q-583XgSixcD7LNZA@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 25 Oct 2022 13:53:51 +0200
Message-ID: <CAFiYyc0TnyJPZevD5+=nW7XUU6SZ8_nLACG0ONV6guYj=a3vXA@mail.gmail.com>
Subject: Re: [PATCH] ix86: Suggest unroll factor for loop vectorization
To: Hongtao Liu <crazylht@gmail.com>
Cc: =?UTF-8?Q?Martin_Li=C5=A1ka?= <mliska@suse.cz>, gcc-patches@gcc.gnu.org, 
	hongtao.liu@intel.com, "Cui,Lili" <lili.cui@intel.com>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-8.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, Oct 25, 2022 at 7:46 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Any comments?

+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether unroll
+it. This parameter is used to limit the mininum of loads and stores in the main
+loop.

It's odd to "limit" the minimum number of something.  I think this
warrants clarification
that for some (unknow to me ;)) reason we think that when we have many loads
and (or?) stores it is beneficial to unroll to get even more loads and
stores in a
single iteration.  Btw, does the parameter limit the number of loads and stores
_after_ unrolling or before?

+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether unroll
+it. This threshold is used to limit the max size of loop body after unrolling.
+The default value is 200.

it should probably say not "size" but "number of instructions".  Note that 200
is quite large given we are talking about vector instructions here which have
larger encodings than scalar instructions.  Optimistically assuming
4 byte encoding (quite optimistic give we're looking at loops with many
loads/stores) that would be an 800 byte loop body which would be 25 cache lines.
ISTR that at least the loop discovery is limited to a lot smaller cases (but
we are likely not targeting that).  The limit probably still works to
fit the loop
body in the u-op caches though.

That said, the heuristic made me think "what the heck".  Can we explain
in u-arch terms why the unrolling is beneficial instead of just defering to
SPEC CPU 2017 fotonik?

> On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi Hongtao,
> >
> > This patch introduces function finish_cost and
> > determine_suggested_unroll_factor for x86 backend, to make it be
> > able to suggest the unroll factor for a given loop being vectorized.
> > Referring to aarch64, RS6000 backends and basing on the analysis on
> > SPEC2017 performance evaluation results.
> >
> > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > OK for trunk?
> >
> >
> >
> > With this patch, SPEC2017 performance evaluation results on
> > ICX/CLX/ADL/Znver3 are listed below:
> >
> > For single copy:
> >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> >
> > For multi-copy:
> >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/i386.cc (class ix86_vector_costs): Add new members
> >          m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor.
> >         (ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads
> >         and m_nstores.
> >         (ix86_vector_costs::determine_suggested_unroll_factor): New function.
> >         (ix86_vector_costs::finish_cost): Diito.
> >         * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> >         (x86-vect-unroll-min-ldst-threshold): Likewise.
> >         (x86-vect-unroll-max-loop-size): Likewise.
> >         * doc/invoke.texi: Document new parameter.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> >         * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> >         * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> >         * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> >         * gcc.target/i386/vect-reduc-1.c: Ditto.
> > ---
> >  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
> >  gcc/config/i386/i386.opt                      |  15 +++
> >  gcc/doc/invoke.texi                           |  17 +++
> >  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
> >  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
> >  .../i386/vect-alignment-peeling-1.c           |   2 +-
> >  .../i386/vect-alignment-peeling-2.c           |   2 +-
> >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
> >  8 files changed, 143 insertions(+), 5 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index aeea26ef4be..a939354e55e 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> >                               stmt_vec_info stmt_info, slp_tree node,
> >                               tree vectype, int misalign,
> >                               vect_cost_model_location where) override;
> > +
> > +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> > +
> > +  void finish_cost (const vector_costs *) override;
> > +
> > +  /* Total number of vectorized stmts (loop only).  */
> > +  unsigned m_nstmts = 0;
> > +  /* Total number of loads (loop only).  */
> > +  unsigned m_nloads = 0;
> > +  /* Total number of stores (loop only).  */
> > +  unsigned m_nstores = 0;
> >  };
> >
> >  /* Implement targetm.vectorize.create_costs.  */
> > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >         retval = (retval * 17) / 10;
> >      }
> >
> > +  if (!m_costing_for_scalar
> > +      && is_a<loop_vec_info> (m_vinfo)
> > +      && where == vect_body)
> > +    {
> > +      m_nstmts += count;
> > +      if (kind == scalar_load || kind == vector_load
> > +         || kind == unaligned_load || kind == vector_gather_load)
> > +       m_nloads += count;
> > +      else if (kind == scalar_store || kind == vector_store
> > +              || kind == unaligned_store || kind == vector_scatter_store)
> > +       m_nstores += count;
> > +    }
> > +
> >    m_costs[where] += retval;
> >
> >    return retval;
> > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> >    return nunroll;
> >  }
> >
> > +unsigned int
> > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> > +{
> > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > +
> > +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> > +  if (loop->unroll == 1
> > +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> > +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> > +    return 1;
> > +
> > +  /* Don't unroll if there is no vectorized stmt.  */
> > +  if (m_nstmts == 0)
> > +    return 1;
> > +
> > +  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
> > +     sizes.  */
> > +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > +    return 1;
> > +
> > +  /* Calc the total number of loads and stores in the loop body.  */
> > +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> > +
> > +  /* Don't unroll if loop body size big than threshold, the threshold
> > +     is a heuristic value inspired by param_max_unrolled_insns.  */
> > +  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
> > +                   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> > +                   : 1;
> > +  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> > +  uf = 1 << ceil_log2 (uf);
> > +
> > +  /* Early return if don't need to unroll.  */
> > +  if (uf == 1)
> > +    return 1;
> > +
> > +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
> > +     if the number of loads and stores exceeds the threshold, unroll + software
> > +     schedule will reduce cache miss rate.  */
> > +  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> > +    return uf;
> > +
> > +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> > +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> > +  unsigned int unrolled_vf = vf * uf;
> > +  if (est_niter == -1 || est_niter < unrolled_vf)
> > +    /* When the estimated iteration of this loop is unknown, it's possible
> > +       that we are able to vectorize this loop with the original VF but fail
> > +       to vectorize it with the unrolled VF any more if the actual iteration
> > +       count is in between.  */
> > +    return 1;
> > +  else
> > +    {
> > +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> > +      unsigned int epil_niter = est_niter % vf;
> > +      /* Even if we have partial vector support, it can be still inefficent
> > +       to calculate the length when the iteration count is unknown, so
> > +       only expect it's good to unroll when the epilogue iteration count
> > +       is not bigger than VF (only one time length calculation).  */
> > +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +         && epil_niter_unr <= vf)
> > +       return uf;
> > +      /* Without partial vector support, conservatively unroll this when
> > +       the epilogue iteration count is less than the original one
> > +       (epilogue execution time wouldn't be longer than before).  */
> > +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +              && epil_niter_unr <= epil_niter)
> > +       return uf;
> > +    }
> > +
> > +  return 1;
> > +}
> > +
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > +
> > +{
> > +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > +    {
> > +      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
> > +    }
> > +  vector_costs::finish_cost (scalar_costs);
> > +}
> >
> >  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
> >
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index 53d534f6392..8e49b406aa5 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -1224,3 +1224,18 @@ mavxvnniint8
> >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> >  AVXVNNIINT8 built-in functions and code generation.
> > +
> > +-param=x86-vect-unroll-limit=
> > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
> > +Used to limit unroll factor which indicates how much the autovectorizer may
> > +unroll a loop.  The default value is 4.
> > +
> > +-param=x86-vect-unroll-min-ldst-threshold=
> > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
> > +Used to limit the mininum of loads and stores in the main loop.  The default
> > +value is 25.
> > +
> > +-param=x86-vect-unroll-max-loop-size=
> > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
> > +This threshold is used to limit the maxnum size of loop body after unrolling.
> > +The default value is 200.
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 09548c4528c..c86d686f2cd 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> >  @item x86-stlf-window-ninsns
> >  Instructions number above which STFL stall penalty can be compensated.
> >
> > +@item x86-vect-unroll-limit
> > +The vectorizer will check with target information to determine whether it
> > +would be beneficial to unroll the main vectorized loop and by how much.  This
> > +parameter sets the upper bound of how much the vectorizer will unroll the main
> > +loop.  The default value is four.
> > +
> > +@item x86-vect-unroll-min-ldst-threshold
> > +The vectorizer will check with target information to determine whether unroll
> > +it. This parameter is used to limit the mininum of loads and stores in the main
> > +loop.
> > +
> > +@item x86-vect-unroll-max-loop-size
> > +The vectorizer will check with target information to determine whether unroll
> > +it. This threshold is used to limit the max size of loop body after unrolling.
> > +The default value is 200.
> > +
> > +
> >  @end table
> >
> >  @end table
> > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > index 78c6600f83b..3bf1fb1b12d 100644
> > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
> >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> >  /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > index 117179f2109..ba41fd64386 100644
> > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
> >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> >  /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > index 4aa536ba86c..fd2f054af4a 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > @@ -2,7 +2,7 @@
> >  /* This is a test exercising peeling for alignment for a negative step
> >     vector loop.  We're forcing atom tuning here because that has a higher
> >     unaligned vs aligned cost unlike most other archs.  */
> > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> >
> >  float a[1024], b[1024];
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > index 834bf0f770d..62c0db2bb9a 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > @@ -2,7 +2,7 @@
> >  /* This is a test exercising peeling for alignment for a positive step
> >     vector loop.  We're forcing atom tuning here because that has a higher
> >     unaligned vs aligned cost unlike most other archs.  */
> > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> >
> >  float a[1024], b[1024];
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > index 9ee9ba4e736..1ba4be01bea 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
> >
> >  #define N 32
> >  int foo (int *a, int n)
> > --
> > 2.17.1
> >
> > Thanks,
> > Lili.
>
>
>
> --
> BR,
> Hongtao