From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Biener
Date: Fri, 28 Oct 2022 09:12:46 +0200
Subject: Re: [PATCH] ix86: Suggest unroll factor for loop vectorization
To: "Cui, Lili"
Cc: Hongtao Liu, Martin Liška, "gcc-patches@gcc.gnu.org", "Liu, Hongtao"
References: <20221024024604.18324-1-lili.cui@intel.com>

On Wed, Oct 26, 2022 at 1:38 PM Cui, Lili wrote:
>
> Hi Richard,
>
> > +@item x86-vect-unroll-min-ldst-threshold
> > +The vectorizer will check with target information to determine whether
> > +unroll it.  This parameter is used to limit the minimum of loads and
> > +stores in the main loop.
> >
> > It's odd to "limit" the minimum number of something.  I think this warrants
> > clarification that for some (unknown to me ;)) reason we think that when we
> > have many loads and (or?) stores it is beneficial to unroll to get even more
> > loads and stores in a single iteration.
> > Btw, does the parameter limit the number of loads and stores _after_
> > unrolling or before?
> >
> When the number of loads/stores exceeds the threshold, the loads/stores are
> more likely to conflict with the loop itself in the L1 cache (assuming the
> load addresses are scattered).
> Unroll + software scheduling will bring 2 or 4 address-contiguous
> loads/stores closer together, which will reduce the cache miss rate.

Ah, nice.  Can we express the default as a function of L1 data cache size,
L1 cache line size and, more importantly, the size of the vector memory
access?

Btw, I was looking into making a more meaningful cost modeling for loop
distribution.  Similar reasoning might apply there - try to _reduce_ the
number of memory streams so L1 cache utilization allows re-use of a cache
line in the next [next N] iteration[s]?

OTOH given L1D is quite large I'd expect the loops affected to be either
quite huge or bottlenecked by load/store bandwidth (there are 1024 L1D
cache lines in zen2 for example) - what's the effective L1D load you are
keying off?  Btw, how does L1D allocation on stores play a role here?

> > +@item x86-vect-unroll-max-loop-size
> > +The vectorizer will check with target information to determine whether
> > +unroll it.  This threshold is used to limit the max size of loop body after
> > +unrolling.
> > +The default value is 200.
> >
> > it should probably say not "size" but "number of instructions".  Note that 200
> > is quite large given we are talking about vector instructions here which have
> > larger encodings than scalar instructions.  Optimistically assuming
> > 4 byte encoding (quite optimistic given we're looking at loops with many
> > loads/stores) that would be an 800 byte loop body which would be 25 cache
> > lines.
> > ISTR that at least the loop discovery is limited to a lot smaller cases (but we
> > are likely not targeting that).  The limit probably still works to fit the loop
> > body in the u-op caches though.
> >
> Agree with you, it should be "x86-vect-unroll-max-loop-insns".  Thanks for the
> reminder about larger encodings: I checked the Skylake uop cache, it can hold
> 1.5K uOPs, and 200 * 2 (1~3 uops/instruction) = 400 uops, so I think 200 still
> works.
>
> > That said, the heuristic made me think "what the heck".  Can we explain in
> > u-arch terms why the unrolling is beneficial instead of just deferring to
> > SPEC CPU 2017 fotonik?
> >
> Regarding the benefits, as I explained in the first answer: I checked the 5
> hottest functions in 549, and they all benefit from it; it improves the cache
> hit ratio.
>
> Thanks,
> Lili.
>
> > > On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches wrote:
> > > >
> > > > Hi Hongtao,
> > > >
> > > > This patch introduces function finish_cost and
> > > > determine_suggested_unroll_factor for the x86 backend, to make it
> > > > able to suggest the unroll factor for a given loop being vectorized.
> > > > It follows the aarch64 and RS6000 backends and is based on the
> > > > analysis of SPEC2017 performance evaluation results.
> > > >
> > > > Successfully bootstrapped & regtested on x86_64-pc-linux-gnu.
> > > >
> > > > OK for trunk?
> > > >
> > > > With this patch, SPEC2017 performance evaluation results on
> > > > ICX/CLX/ADL/Znver3 are listed below:
> > > >
> > > > For single copy:
> > > >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> > > >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> > > >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> > > >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> > > >
> > > > For multi-copy:
> > > >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >   * config/i386/i386.cc (class ix86_vector_costs): Add new members
> > > >   m_nstmts, m_nloads, m_nstores and determine_suggested_unroll_factor.
> > > >   (ix86_vector_costs::add_stmt_cost): Update for m_nstmts, m_nloads
> > > >   and m_nstores.
> > > >   (ix86_vector_costs::determine_suggested_unroll_factor): New function.
> > > >   (ix86_vector_costs::finish_cost): Ditto.
> > > >   * config/i386/i386.opt (x86-vect-unroll-limit): New parameter.
> > > >   (x86-vect-unroll-min-ldst-threshold): Likewise.
> > > >   (x86-vect-unroll-max-loop-size): Likewise.
> > > >   * doc/invoke.texi: Document new parameters.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >   * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> > > >   * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> > > >   * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> > > >   * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> > > >   * gcc.target/i386/vect-reduc-1.c: Ditto.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
> > > >  gcc/config/i386/i386.opt                      |  15 +++
> > > >  gcc/doc/invoke.texi                           |  17 +++
> > > >  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
> > > >  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
> > > >  .../i386/vect-alignment-peeling-1.c           |   2 +-
> > > >  .../i386/vect-alignment-peeling-2.c           |   2 +-
> > > >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
> > > >  8 files changed, 143 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index aeea26ef4be..a939354e55e 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> > > >                            stmt_vec_info stmt_info, slp_tree node,
> > > >                            tree vectype, int misalign,
> > > >                            vect_cost_model_location where) override;
> > > > +
> > > > +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> > > > +
> > > > +  void finish_cost (const vector_costs *) override;
> > > > +
> > > > +  /* Total number of vectorized stmts (loop only).  */
> > > > +  unsigned m_nstmts = 0;
> > > > +  /* Total number of loads (loop only).  */
> > > > +  unsigned m_nloads = 0;
> > > > +  /* Total number of stores (loop only).  */
> > > > +  unsigned m_nstores = 0;
> > > >  };
> > > >
> > > >  /* Implement targetm.vectorize.create_costs.  */
> > > > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > >        retval = (retval * 17) / 10;
> > > >      }
> > > >
> > > > +  if (!m_costing_for_scalar
> > > > +      && is_a <loop_vec_info> (m_vinfo)
> > > > +      && where == vect_body)
> > > > +    {
> > > > +      m_nstmts += count;
> > > > +      if (kind == scalar_load || kind == vector_load
> > > > +          || kind == unaligned_load || kind == vector_gather_load)
> > > > +        m_nloads += count;
> > > > +      else if (kind == scalar_store || kind == vector_store
> > > > +               || kind == unaligned_store || kind == vector_scatter_store)
> > > > +        m_nstores += count;
> > > > +    }
> > > > +
> > > >    m_costs[where] += retval;
> > > >
> > > >    return retval;
> > > > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> > > >    return nunroll;
> > > >  }
> > > >
> > > > +unsigned int
> > > > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> > > > +{
> > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > +
> > > > +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> > > > +  if (loop->unroll == 1
> > > > +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> > > > +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> > > > +    return 1;
> > > > +
> > > > +  /* Don't unroll if there is no vectorized stmt.  */
> > > > +  if (m_nstmts == 0)
> > > > +    return 1;
> > > > +
> > > > +  /* Don't unroll if vector size is zmm, since zmm throughput is lower
> > > > +     than other sizes.  */
> > > > +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > > > +    return 1;
> > > > +
> > > > +  /* Calc the total number of loads and stores in the loop body.  */
> > > > +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> > > > +
> > > > +  /* Don't unroll if the loop body size is bigger than the threshold; the
> > > > +     threshold is a heuristic value inspired by param_max_unrolled_insns.  */
> > > > +  unsigned int uf = m_nstmts < (unsigned int) x86_vect_unroll_max_loop_size
> > > > +                    ? ((unsigned int) x86_vect_unroll_max_loop_size / m_nstmts)
> > > > +                    : 1;
> > > > +  uf = MIN ((unsigned int) x86_vect_unroll_limit, uf);
> > > > +  uf = 1 << ceil_log2 (uf);
> > > > +
> > > > +  /* Early return if we don't need to unroll.  */
> > > > +  if (uf == 1)
> > > > +    return 1;
> > > > +
> > > > +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the
> > > > +     loop if the number of loads and stores exceeds the threshold; unroll +
> > > > +     software scheduling will reduce the cache miss rate.  */
> > > > +  if (nstmts_ldst >= (unsigned int) x86_vect_unroll_min_ldst_threshold)
> > > > +    return uf;
> > > > +
> > > > +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> > > > +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> > > > +  unsigned int unrolled_vf = vf * uf;
> > > > +  if (est_niter == -1 || est_niter < unrolled_vf)
> > > > +    /* When the estimated iteration count of this loop is unknown, it's
> > > > +       possible that we are able to vectorize this loop with the original
> > > > +       VF but fail to vectorize it with the unrolled VF any more if the
> > > > +       actual iteration count is in between.  */
> > > > +    return 1;
> > > > +  else
> > > > +    {
> > > > +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> > > > +      unsigned int epil_niter = est_niter % vf;
> > > > +      /* Even if we have partial vector support, it can still be inefficient
> > > > +         to calculate the length when the iteration count is unknown, so
> > > > +         only expect it's good to unroll when the epilogue iteration count
> > > > +         is not bigger than VF (only one time length calculation).  */
> > > > +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > > +          && epil_niter_unr <= vf)
> > > > +        return uf;
> > > > +      /* Without partial vector support, conservatively unroll this when
> > > > +         the epilogue iteration count is less than the original one
> > > > +         (epilogue execution time wouldn't be longer than before).  */
> > > > +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > > +               && epil_niter_unr <= epil_niter)
> > > > +        return uf;
> > > > +    }
> > > > +
> > > > +  return 1;
> > > > +}
> > > > +
> > > > +void
> > > > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > > +
> > > > +{
> > > > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (m_vinfo))
> > > > +    {
> > > > +      m_suggested_unroll_factor
> > > > +        = determine_suggested_unroll_factor (loop_vinfo);
> > > > +    }
> > > > +  vector_costs::finish_cost (scalar_costs);
> > > > +}
> > > >
> > > >  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
> > > >
> > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > index 53d534f6392..8e49b406aa5 100644
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -1224,3 +1224,18 @@ mavxvnniint8
> > > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > > >  AVXVNNIINT8 built-in functions and code generation.
> > > > +
> > > > +-param=x86-vect-unroll-limit=
> > > > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
> > > > +Used to limit the unroll factor which indicates how much the autovectorizer may
> > > > +unroll a loop.  The default value is 4.
> > > > +
> > > > +-param=x86-vect-unroll-min-ldst-threshold=
> > > > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
> > > > +Used to limit the minimum of loads and stores in the main loop.  The default
> > > > +value is 25.
> > > > +
> > > > +-param=x86-vect-unroll-max-loop-size=
> > > > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
> > > > +This threshold is used to limit the maximum size of the loop body after
> > > > +unrolling.  The default value is 200.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 09548c4528c..c86d686f2cd 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> > > >  @item x86-stlf-window-ninsns
> > > >  Instructions number above which STFL stall penalty can be compensated.
> > > >
> > > > +@item x86-vect-unroll-limit
> > > > +The vectorizer will check with target information to determine whether it
> > > > +would be beneficial to unroll the main vectorized loop and by how much.  This
> > > > +parameter sets the upper bound of how much the vectorizer will unroll the main
> > > > +loop.  The default value is four.
> > > > +
> > > > +@item x86-vect-unroll-min-ldst-threshold
> > > > +The vectorizer will check with target information to determine whether unroll
> > > > +it.  This parameter is used to limit the minimum of loads and stores in the
> > > > +main loop.
> > > > +
> > > > +@item x86-vect-unroll-max-loop-size
> > > > +The vectorizer will check with target information to determine whether unroll
> > > > +it.  This threshold is used to limit the max size of loop body after unrolling.
> > > > +The default value is 200.
> > > > +
> > > > +
> > > >  @end table
> > > >
> > > >  @end table
> > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > index 78c6600f83b..3bf1fb1b12d 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > > >  /* { dg-final { scan-assembler-times "vpmaxsb" 1 } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > index 117179f2109..ba41fd64386 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > > >  /* { dg-final { scan-assembler-times "vpmaxub" 1 } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > index 4aa536ba86c..fd2f054af4a 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > @@ -2,7 +2,7 @@
> > > >  /* This is a test exercising peeling for alignment for a negative step
> > > >     vector loop.  We're forcing atom tuning here because that has a higher
> > > >     unaligned vs aligned cost unlike most other archs.  */
> > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> > > >
> > > >  float a[1024], b[1024];
> > > >
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > index 834bf0f770d..62c0db2bb9a 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > @@ -2,7 +2,7 @@
> > > >  /* This is a test exercising peeling for alignment for a positive step
> > > >     vector loop.
> > > >     We're forcing atom tuning here because that has a higher
> > > >     unaligned vs aligned cost unlike most other archs.  */
> > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> > > >
> > > >  float a[1024], b[1024];
> > > >
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > index 9ee9ba4e736..1ba4be01bea 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> > > > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
> > > >
> > > >  #define N 32
> > > >  int foo (int *a, int n)
> > > > --
> > > > 2.17.1
> > > >
> > > > Thanks,
> > > > Lili.
> > > >
> > > >
> > >
> > > --
> > > BR,
> > > Hongtao
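
The unroll heuristic discussed in this thread can also be read without the
GCC internals.  The sketch below is a simplified restatement of the logic in
determine_suggested_unroll_factor from the quoted patch: the zmm, partial-vector
and estimated-iteration checks are omitted, and the parameter comments refer to
the new --params.  It is an illustration only, not the implementation.

/* Standalone sketch of the suggested-unroll-factor heuristic from the
   patch above, with the GCC-specific checks (zmm vector size, partial
   vectors, estimated iteration counts) omitted.  Illustration only.  */
#include <stdio.h>

static unsigned int
suggested_unroll_factor (unsigned int nstmts, unsigned int nloads,
                         unsigned int nstores,
                         unsigned int unroll_limit,       /* x86-vect-unroll-limit, default 4 */
                         unsigned int max_loop_insns,     /* x86-vect-unroll-max-loop-size, default 200 */
                         unsigned int min_ldst_threshold) /* x86-vect-unroll-min-ldst-threshold, default 25 */
{
  if (nstmts == 0)
    return 1;

  /* Cap the factor so the unrolled body stays below max_loop_insns,
     clamp it to the limit, then round up to a power of two (the patch
     uses 1 << ceil_log2 (uf) for the last step).  */
  unsigned int uf = nstmts < max_loop_insns ? max_loop_insns / nstmts : 1;
  if (uf > unroll_limit)
    uf = unroll_limit;
  unsigned int pow2 = 1;
  while (pow2 < uf)
    pow2 <<= 1;
  uf = pow2;

  if (uf == 1)
    return 1;

  /* Unroll aggressively only when the loop has many memory accesses;
     grouping contiguous loads/stores is what improves the L1 hit rate.  */
  if (nloads + nstores >= min_ldst_threshold)
    return uf;

  return 1;
}

int
main (void)
{
  /* 40 vector stmts per iteration, 30 of them loads/stores -> factor 4.  */
  printf ("unroll factor = %u\n",
          suggested_unroll_factor (40, 20, 10, 4, 200, 25));
  return 0;
}

With the default parameters, a 40-statement body with 30 loads/stores gets a
factor of 4; a loop below the load/store threshold only unrolls if the
iteration-count checks in the real implementation allow it.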
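
The uop-cache argument for keeping the 200-instruction cap can be written out
the same way.  The 1.5K-uop capacity and the ~2 uops/instruction figure are the
numbers quoted in Lili's reply; the rest is an assumed back-of-the-envelope
check, not a model of the hardware.

/* Back-of-the-envelope check that a 200-instruction vector loop body
   still fits the decoded-uop cache, using the figures quoted in the
   thread (Skylake-class cache of ~1.5K uops, ~2 uops per instruction).  */
#include <stdio.h>

int
main (void)
{
  unsigned int max_loop_insns = 200;      /* x86-vect-unroll-max-loop-size */
  unsigned int uops_per_insn = 2;         /* quoted range is 1~3 uops/instruction */
  unsigned int uop_cache_capacity = 1536; /* "1.5K uOPs" */

  unsigned int estimated_uops = max_loop_insns * uops_per_insn;
  printf ("estimated uops: %u of %u -> %s\n", estimated_uops,
          uop_cache_capacity,
          estimated_uops <= uop_cache_capacity ? "fits" : "does not fit");
  return 0;
}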