Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <rguenther@suse.de>
To: Hongtao Liu <crazylht@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	 Richard Sandiford <richard.sandiford@arm.com>,
	 "Liu, Hongtao" <hongtao.liu@intel.com>
Subject: Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs
Date: Mon, 21 Feb 2022 10:10:41 +0100 (CET)	[thread overview]
Message-ID: <p793rq4-9n31-2460-855s-856374p45243@fhfr.qr> (raw)
In-Reply-To: <CAMZc-bwo_tSoqOrRXEQ9HgOgfL8_axSxRUp3WocTH_zfQjzj8g@mail.gmail.com>

On Mon, 21 Feb 2022, Hongtao Liu wrote:

> On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > This uses the now passed SLP node to the vectorizer costing hook
> > to adjust vector construction costs for the cost of moving an
> > integer component from a GPR to a vector register when that's
> > required for building a vector from components.  A cruical difference
> > here is whether the component is loaded from memory or extracted
> > from a vector register as in those cases no intermediate GPR is involved.
> >
> > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > pr91446.c testcase now produces scalar code which looks superior
> > to me so I've adjusted it as well.
> >
> > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > after adding the BIT_FIELD_REF vector extracting special casing.
> Does the patch handle PR101929?

The patch will regress the testcase posted in PR101929 again:

 _255 1 times scalar_store costs 12 in body
 _261 1 times scalar_store costs 12 in body
 _258 1 times scalar_store costs 12 in body
 _264 1 times scalar_store costs 12 in body
 t0_247 + t2_251 1 times scalar_stmt costs 4 in body
 t1_472 + t3_444 1 times scalar_stmt costs 4 in body
 t0_406 - t2_451 1 times scalar_stmt costs 4 in body
 t1_472 - t3_444 1 times scalar_stmt costs 4 in body
-node 0x4182f48 1 times vec_construct costs 16 in prologue
-node 0x41882b0 1 times vec_construct costs 16 in prologue
+node 0x4182f48 1 times vec_construct costs 28 in prologue
+node 0x41882b0 1 times vec_construct costs 28 in prologue
 t0_406 + t2_451 1 times vector_stmt costs 4 in body
 t1_472 - t3_444 1 times vector_stmt costs 4 in body
 node 0x41829f8 1 times vec_perm costs 4 in body
 _436 1 times vector_store costs 16 in body
 t.c:37:9: note: Cost model analysis for part in loop 0:
-  Vector cost: 60
+  Vector cost: 84
   Scalar cost: 64
+t.c:37:9: missed: not vectorized: vectorization is not profitable.

We're constructing V4SI from patterns like { _407, _480, _407, _480 }
where the components are results of integer adds (so the result is
definitely in a GPR).  We are costing the construction as
4 * sse_op + 2 * sse_to_integer which with skylake cost is
4 * COSTS_N_INSNS (1) + 2 * 6.

Whether the vectorization itself is profitable is likely questionable
but then it's true that the construction of V4SI is more costly
in terms of uops than a construction of V4SF.

Now, we can - for the first time - now see the actual construction
pattern and ideal construction might be two GPR->xmm moves
+ two splats + one unpack or maybe two GPR->xmm moves + one
unpack + splat of DI (or other means of duplicating the lowpart).
It still wouldn't help here of course, and we would not have RTL
expansion adhere to this scheme.

Trying to capture secondary effects (the FRE opportunity unleashed)
in costing of this particular SLP subgraph is difficult and probably
undesirable though.  Adjusting any of the target specific costs
is likely not going to work because of the original narrow profitability
gap (60 vs. 64).

For the particular kernel the overall vectorization strathegy needs
to improve (PR98138 is about this I think).  I know we've reverted
the previous change that attempted to cost for GPR -> XMM.  It did

       case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
+         int cost
+           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
+                                               ix86_cost->sse_op
+                                               : 
ix86_cost->integer_to_sse);
+
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;

thus treated integer_to_sse as GPR insert into vector cost for
integer components in general while the now proposed change
_adds_ integer to XMM move costs (but only once for duplicate
elements and not for the cases we know are from memory / vector regs).
With the proposed patch we can probably be more "correct" above
and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
for FP the first one is free and for int we're costing it separately.
Again it won't help here.

Thanks,
Richard.

> >
> > I suppose we can let autotesters look for SPEC performance fallout.
> >
> > OK if testing succeeds?
> >
> > Thanks,
> > Richard.
> >
> > 2022-02-18  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/104582
> >         PR target/99881
> >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> >         Cost GPR to vector register moves for integer vector construction.
> >
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > ---
> >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> >  7 files changed, 102 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 0830dbd7dca..b2bf90576d5 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> >
> >  unsigned
> >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > -                                 stmt_vec_info stmt_info, slp_tree,
> > +                                 stmt_vec_info stmt_info, slp_tree node,
> >                                   tree vectype, int misalign,
> >                                   vect_cost_model_location where)
> >  {
> > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >      }
> > +  else if (kind == vec_construct
> > +          && node
> > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > +    {
> > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > +      unsigned i;
> > +      tree op;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       if (TREE_CODE (op) == SSA_NAME)
> > +         TREE_VISITED (op) = 0;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       {
> > +         if (TREE_CODE (op) != SSA_NAME
> > +             || TREE_VISITED (op))
> > +           continue;
> > +         TREE_VISITED (op) = 1;
> > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > +         tree tem;
> > +         if (is_gimple_assign (def)
> > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > +             && TREE_CODE (tem) == SSA_NAME
> > +             /* A sign-change expands to nothing.  */
> > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > +                                       TREE_TYPE (tem)))
> > +           def = SSA_NAME_DEF_STMT (tem);
> > +         /* When the component is loaded from memory we can directly
> > +            move it to a vector register, otherwise we have to go
> > +            via a GPR or via vpinsr which involves similar cost.
> > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > +            register we can hope to avoid using a GPR.  */
> > +         if (!is_gimple_assign (def)
> > +             || (!gimple_assign_load_p (def)
> > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > +           stmt_cost += ix86_cost->sse_to_integer;
> > +       }
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       if (TREE_CODE (op) == SSA_NAME)
> > +         TREE_VISITED (op) = 0;
> > +    }
> >    if (stmt_cost == -1)
> >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > new file mode 100644
> > index 00000000000..992a845ad7a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (unsigned long *a, unsigned long *b)
> > +{
> > +  unsigned long a_ = *a;
> > +  unsigned long b_ = *b;
> > +  s.a = a_;
> > +  s.b = b_;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > new file mode 100644
> > index 00000000000..7637cdb4a97
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (unsigned long a, unsigned long b)
> > +{
> > +  s.a = a;
> > +  s.b = b;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > new file mode 100644
> > index 00000000000..999c4905708
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { double a, b; } s;
> > +
> > +void
> > +foo (double a, double b)
> > +{
> > +  s.a = a;
> > +  s.b = b;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > new file mode 100644
> > index 00000000000..cc471e1ed73
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (signed long *a, unsigned long *b)
> > +{
> > +  unsigned long a_ = *a;
> > +  unsigned long b_ = *b;
> > +  s.a = a_;
> > +  s.b = b_;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > index 0243ca3ea68..067bf43f698 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> >    bar (&t);
> >  }
> >
> > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > @@ -1,7 +1,7 @@
> >  /* PR target/99881.  */
> >  /* { dg-do compile { target { ! ia32 } } } */
> >  /* { dg-options "-Ofast -march=skylake" } */
> > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> >
> >  void
> >  foo (int* __restrict a, int n, int c)
> > --
> > 2.34.1
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)

next prev parent reply	other threads:[~2022-02-21  9:10 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-18 14:01 Richard Biener
2022-02-21  1:35 ` Hongtao Liu
2022-02-21  9:10   ` Richard Biener [this message]
2022-02-22  4:08     ` Hongtao Liu
2022-02-22  7:58       ` Richard Biener
2022-02-22  9:51         ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=p793rq4-9n31-2460-855s-856374p45243@fhfr.qr \
    --to=rguenther@suse.de \
    --cc=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hongtao.liu@intel.com \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).