From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29])
 by sourceware.org (Postfix) with ESMTPS id 33EC43857C67
 for <gcc-patches@gcc.gnu.org>; Tue, 22 Feb 2022 09:51:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 33EC43857C67
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
 by smtp-out2.suse.de (Postfix) with ESMTP id E33DA1F39A;
 Tue, 22 Feb 2022 09:51:45 +0000 (UTC)
Received: from murzim.suse.de (murzim.suse.de [10.160.4.192])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by relay2.suse.de (Postfix) with ESMTPS id BD1AFA3B8D;
 Tue, 22 Feb 2022 09:51:45 +0000 (UTC)
Date: Tue, 22 Feb 2022 10:51:45 +0100 (CET)
From: Richard Biener <rguenther@suse.de>
To: Hongtao Liu <crazylht@gmail.com>
cc: GCC Patches <gcc-patches@gcc.gnu.org>, 
 Richard Sandiford <richard.sandiford@arm.com>, 
 "Liu, Hongtao" <hongtao.liu@intel.com>
Subject: Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer
 regs
In-Reply-To: <s977n7s9-8oon-922q-oqn-462q6q355so@fhfr.qr>
Message-ID: <6n58qs6s-ono5-42s-o0q-s7ss22p8866s@fhfr.qr>
References: <20220218140102.13F4E13C91@imap2.suse-dmz.suse.de>
 <CAMZc-bwo_tSoqOrRXEQ9HgOgfL8_axSxRUp3WocTH_zfQjzj8g@mail.gmail.com>
 <p793rq4-9n31-2460-855s-856374p45243@fhfr.qr>
 <CAMZc-byico_c=x_TTNMAtQJ-_maQpZOLF2+0xdisog3KzEvdGQ@mail.gmail.com>
 <s977n7s9-8oon-922q-oqn-462q6q355so@fhfr.qr>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 22 Feb 2022 09:51:50 -0000

On Tue, 22 Feb 2022, Richard Biener wrote:

> On Tue, 22 Feb 2022, Hongtao Liu wrote:
> 
> > On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Mon, 21 Feb 2022, Hongtao Liu wrote:
> > >
> > > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > This uses the now passed SLP node to the vectorizer costing hook
> > > > > to adjust vector construction costs for the cost of moving an
> > > > > integer component from a GPR to a vector register when that's
> > > > > required for building a vector from components.  A cruical difference
> > > > > here is whether the component is loaded from memory or extracted
> > > > > from a vector register as in those cases no intermediate GPR is involved.
> > > > >
> > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > > > pr91446.c testcase now produces scalar code which looks superior
> > > > > to me so I've adjusted it as well.
> > > > >
> > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > > > after adding the BIT_FIELD_REF vector extracting special casing.
> > > > Does the patch handle PR101929?
> > >
> > > The patch will regress the testcase posted in PR101929 again:
> > >
> > >  _255 1 times scalar_store costs 12 in body
> > >  _261 1 times scalar_store costs 12 in body
> > >  _258 1 times scalar_store costs 12 in body
> > >  _264 1 times scalar_store costs 12 in body
> > >  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
> > >  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
> > >  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
> > >  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> > > -node 0x4182f48 1 times vec_construct costs 16 in prologue
> > > -node 0x41882b0 1 times vec_construct costs 16 in prologue
> > > +node 0x4182f48 1 times vec_construct costs 28 in prologue
> > > +node 0x41882b0 1 times vec_construct costs 28 in prologue
> > >  t0_406 + t2_451 1 times vector_stmt costs 4 in body
> > >  t1_472 - t3_444 1 times vector_stmt costs 4 in body
> > >  node 0x41829f8 1 times vec_perm costs 4 in body
> > >  _436 1 times vector_store costs 16 in body
> > >  t.c:37:9: note: Cost model analysis for part in loop 0:
> > > -  Vector cost: 60
> > > +  Vector cost: 84
> > >    Scalar cost: 64
> > > +t.c:37:9: missed: not vectorized: vectorization is not profitable.
> > >
> > > We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> > > where the components are results of integer adds (so the result is
> > > definitely in a GPR).  We are costing the construction as
> > > 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> > > 4 * COSTS_N_INSNS (1) + 2 * 6.
> > >
> > > Whether the vectorization itself is profitable is likely questionable
> > > but then it's true that the construction of V4SI is more costly
> > > in terms of uops than a construction of V4SF.
> > >
> > > Now, we can - for the first time - now see the actual construction
> > > pattern and ideal construction might be two GPR->xmm moves
> > > + two splats + one unpack or maybe two GPR->xmm moves + one
> > > unpack + splat of DI (or other means of duplicating the lowpart).
> > Yes, the patch is technically right. I'm ok with the patch.
> 
> Thanks, I've pushed it now.  I've also tested the suggested adjustment
> doing
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum 
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * 
> ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
> 
> successfully (with no effect on the PR101929 case as expected), I
> will queue that for stage1 since it isn't known to fix any
> regression (but I will keep it as option in case something pops up).
> 
> I'll also have a more detailed look into the x264_r case to see
> if there's something we can do about the regression that will now
> show up (and I'll watch autotesters).

I found a way to get back the vectorization doing

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 9188d727e33..7f1f12fb6c6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2374,7 +2375,7 @@ fail:
                n_vector_builds++;
            }
        }
-      if (all_uniform_p
+      if ((all_uniform_p && !two_operators)
          || n_vector_builds > 1
          || (n_vector_builds == children.length ()
              && is_a <gphi *> (stmt_info->stmt)))

which in itself is reasonable since the result of the operation
is a non-uniform vector.  That causes us to do

  _37 = {a3_245_6(D), a3_245_6(D), a3_245_6(D), a3_245_6(D)};
  _36 = {a2_232_5(D), a2_232_5(D), a2_232_5(D), a2_232_5(D)};
  vect__252_8.7_39 = _36 - _37;
  vect__250_7.6_38 = _36 + _37;
  _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 
}>;

and then vector lowering does what vect_optimize_slp should do
as well (but doesn't):

  _26 = a2_232_5(D) - a3_245_6(D);
  vect__252_8.7_39 = {_26, _26, _26, _26};
  _25 = a2_232_5(D) + a3_245_6(D);
  vect__250_7.6_38 = {_25, _25, _25, _25};
  _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 
}>;

and later stmt folding will get us back the non-uniform vector
construction by folding the VEC_PERM_EXPR where we simply assume
that we can optimally expand the constructor.

But the important part is that this way we evade vec_construct
costing and instead get

node 0x40e11a8 1 times scalar_to_vec costs 4 in prologue

where of course it's obvious that we miss to account for
the GPR -> XMM penalty here in the backend.  When we are not fixing
that we get the kernel for x264_r optimized like before.

A self-contained testcase for this opportunity is

void foo (unsigned * __restrict tmp, unsigned a0_206, unsigned a1_219,
          unsigned a2_232, unsigned a3_245)
{
  unsigned _246 = a0_206 + a1_219;
  unsigned _248 = a0_206 - a1_219;
  unsigned _250 = a2_232 + a3_245;
  unsigned _252 = a2_232 - a3_245;
  int t0_247 = (int) _246;
  int t1_249 = (int) _248;
  int t2_251 = (int) _250;
  int t3_253 = (int) _252;
  int _254 = t0_247 + t2_251;
  int _260 = t1_249 + t3_253;
  int _257 = t0_247 - t2_251;
  int _263 = t1_249 - t3_253;
  unsigned _255 = (unsigned int) _254;
  unsigned _261 = (unsigned int) _260;
  unsigned _258 = (unsigned int) _257;
  unsigned _264 = (unsigned int) _263;
  tmp[0] = _255;
  tmp[1] = _261;
  tmp[2] = _258;
  tmp[3] = _264;
}

which does show that the vectorization is not profitable since we
get

0000000000000000 <foo>:
   0:   89 c8                   mov    %ecx,%eax
   2:   44 01 c1                add    %r8d,%ecx
   5:   44 29 c0                sub    %r8d,%eax
   8:   c5 f9 6e d9             vmovd  %ecx,%xmm3
   c:   c4 e3 61 22 c8 01       vpinsrd $0x1,%eax,%xmm3,%xmm1
  12:   89 f0                   mov    %esi,%eax
  14:   01 d6                   add    %edx,%esi
  16:   29 d0                   sub    %edx,%eax
  18:   c5 f9 6e e6             vmovd  %esi,%xmm4
  1c:   c4 e3 59 22 c0 01       vpinsrd $0x1,%eax,%xmm4,%xmm0
  22:   c5 f1 6c c9             vpunpcklqdq %xmm1,%xmm1,%xmm1
  26:   c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
  2a:   c5 f9 fe d1             vpaddd %xmm1,%xmm0,%xmm2
  2e:   c5 f9 fa c1             vpsubd %xmm1,%xmm0,%xmm0
  32:   c5 e8 c6 c0 e4          vshufps $0xe4,%xmm0,%xmm2,%xmm0
  37:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
  3b:   c3                      ret    

vs the scalar code

0000000000000000 <foo>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   8d 3c 16                lea    (%rsi,%rdx,1),%edi
   6:   29 d6                   sub    %edx,%esi
   8:   42 8d 14 01             lea    (%rcx,%r8,1),%edx
   c:   44 29 c1                sub    %r8d,%ecx
   f:   44 8d 04 17             lea    (%rdi,%rdx,1),%r8d
  13:   44 89 00                mov    %r8d,(%rax)
  16:   29 d7                   sub    %edx,%edi
  18:   44 8d 04 0e             lea    (%rsi,%rcx,1),%r8d
  1c:   29 ce                   sub    %ecx,%esi
  1e:   44 89 40 04             mov    %r8d,0x4(%rax)
  22:   89 78 08                mov    %edi,0x8(%rax)
  25:   89 70 0c                mov    %esi,0xc(%rax)
  28:   c3                      ret    

but we could choose to ignore this for the sake of avoiding
the x264 regression (shall it appear which I think it will)
and try to address this in GCC 12.

Richard.

> Thanks,
> Richard.
> 
> 
> > > It still wouldn't help here of course, and we would not have RTL
> > > expansion adhere to this scheme.
> > >
> > > Trying to capture secondary effects (the FRE opportunity unleashed)
> > > in costing of this particular SLP subgraph is difficult and probably
> > > undesirable though.  Adjusting any of the target specific costs
> > > is likely not going to work because of the original narrow profitability
> > > gap (60 vs. 64).
> > >
> > > For the particular kernel the overall vectorization strathegy needs
> > > to improve (PR98138 is about this I think).  I know we've reverted
> > > the previous change that attempted to cost for GPR -> XMM.  It did
> > >
> > >        case vec_construct:
> > >         {
> > >           /* N element inserts into SSE vectors.  */
> > > +         int cost
> > > +           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> > > +                                               ix86_cost->sse_op
> > > +                                               :
> > > ix86_cost->integer_to_sse);
> > > +
> > > -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > >
> > > thus treated integer_to_sse as GPR insert into vector cost for
> > > integer components in general while the now proposed change
> > > _adds_ integer to XMM move costs (but only once for duplicate
> > > elements and not for the cases we know are from memory / vector regs).
> > > With the proposed patch we can probably be more "correct" above
> > > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> > > for FP the first one is free and for int we're costing it separately.
> > > Again it won't help here.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > >
> > > > > I suppose we can let autotesters look for SPEC performance fallout.
> > > > >
> > > > > OK if testing succeeds?
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > > 2022-02-18  Richard Biener  <rguenther@suse.de>
> > > > >
> > > > >         PR tree-optimization/104582
> > > > >         PR target/99881
> > > > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > > >         Cost GPR to vector register moves for integer vector construction.
> > > > >
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > > > >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> > > > >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> > > > >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> > > > >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> > > > >  7 files changed, 102 insertions(+), 3 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index 0830dbd7dca..b2bf90576d5 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> > > > >
> > > > >  unsigned
> > > > >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > > > -                                 stmt_vec_info stmt_info, slp_tree,
> > > > > +                                 stmt_vec_info stmt_info, slp_tree node,
> > > > >                                   tree vectype, int misalign,
> > > > >                                   vect_cost_model_location where)
> > > > >  {
> > > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > > > >      }
> > > > > +  else if (kind == vec_construct
> > > > > +          && node
> > > > > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > > > > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > > > > +    {
> > > > > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > > +      unsigned i;
> > > > > +      tree op;
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > > +         TREE_VISITED (op) = 0;
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       {
> > > > > +         if (TREE_CODE (op) != SSA_NAME
> > > > > +             || TREE_VISITED (op))
> > > > > +           continue;
> > > > > +         TREE_VISITED (op) = 1;
> > > > > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > > > > +         tree tem;
> > > > > +         if (is_gimple_assign (def)
> > > > > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > > > > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > > > > +             && TREE_CODE (tem) == SSA_NAME
> > > > > +             /* A sign-change expands to nothing.  */
> > > > > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > > > > +                                       TREE_TYPE (tem)))
> > > > > +           def = SSA_NAME_DEF_STMT (tem);
> > > > > +         /* When the component is loaded from memory we can directly
> > > > > +            move it to a vector register, otherwise we have to go
> > > > > +            via a GPR or via vpinsr which involves similar cost.
> > > > > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > > > > +            register we can hope to avoid using a GPR.  */
> > > > > +         if (!is_gimple_assign (def)
> > > > > +             || (!gimple_assign_load_p (def)
> > > > > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > > > > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > > > > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > > > > +           stmt_cost += ix86_cost->sse_to_integer;
> > > > > +       }
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > > +         TREE_VISITED (op) = 0;
> > > > > +    }
> > > > >    if (stmt_cost == -1)
> > > > >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > > new file mode 100644
> > > > > index 00000000000..992a845ad7a
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > > @@ -0,0 +1,15 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (unsigned long *a, unsigned long *b)
> > > > > +{
> > > > > +  unsigned long a_ = *a;
> > > > > +  unsigned long b_ = *b;
> > > > > +  s.a = a_;
> > > > > +  s.b = b_;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > > new file mode 100644
> > > > > index 00000000000..7637cdb4a97
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > > @@ -0,0 +1,13 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (unsigned long a, unsigned long b)
> > > > > +{
> > > > > +  s.a = a;
> > > > > +  s.b = b;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > > new file mode 100644
> > > > > index 00000000000..999c4905708
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > > @@ -0,0 +1,13 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { double a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (double a, double b)
> > > > > +{
> > > > > +  s.a = a;
> > > > > +  s.b = b;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..cc471e1ed73
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > > @@ -0,0 +1,15 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (signed long *a, unsigned long *b)
> > > > > +{
> > > > > +  unsigned long a_ = *a;
> > > > > +  unsigned long b_ = *b;
> > > > > +  s.a = a_;
> > > > > +  s.b = b_;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > index 0243ca3ea68..067bf43f698 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> > > > >    bar (&t);
> > > > >  }
> > > > >
> > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > @@ -1,7 +1,7 @@
> > > > >  /* PR target/99881.  */
> > > > >  /* { dg-do compile { target { ! ia32 } } } */
> > > > >  /* { dg-options "-Ofast -march=skylake" } */
> > > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> > > > >
> > > > >  void
> > > > >  foo (int* __restrict a, int n, int c)
> > > > > --
> > > > > 2.34.1
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
> > 
> > 
> > 
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)