Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Hongtao Liu <crazylht@gmail.com>
To: Richard Biener <rguenther@suse.de>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	Richard Sandiford <richard.sandiford@arm.com>,
	 "Liu, Hongtao" <hongtao.liu@intel.com>
Subject: Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs
Date: Tue, 22 Feb 2022 12:08:33 +0800	[thread overview]
Message-ID: <CAMZc-byico_c=x_TTNMAtQJ-_maQpZOLF2+0xdisog3KzEvdGQ@mail.gmail.com> (raw)
In-Reply-To: <p793rq4-9n31-2460-855s-856374p45243@fhfr.qr>

On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 21 Feb 2022, Hongtao Liu wrote:
>
> > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This uses the now passed SLP node to the vectorizer costing hook
> > > to adjust vector construction costs for the cost of moving an
> > > integer component from a GPR to a vector register when that's
> > > required for building a vector from components.  A cruical difference
> > > here is whether the component is loaded from memory or extracted
> > > from a vector register as in those cases no intermediate GPR is involved.
> > >
> > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > pr91446.c testcase now produces scalar code which looks superior
> > > to me so I've adjusted it as well.
> > >
> > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > after adding the BIT_FIELD_REF vector extracting special casing.
> > Does the patch handle PR101929?
>
> The patch will regress the testcase posted in PR101929 again:
>
>  _255 1 times scalar_store costs 12 in body
>  _261 1 times scalar_store costs 12 in body
>  _258 1 times scalar_store costs 12 in body
>  _264 1 times scalar_store costs 12 in body
>  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
>  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
>  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
>  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> -node 0x4182f48 1 times vec_construct costs 16 in prologue
> -node 0x41882b0 1 times vec_construct costs 16 in prologue
> +node 0x4182f48 1 times vec_construct costs 28 in prologue
> +node 0x41882b0 1 times vec_construct costs 28 in prologue
>  t0_406 + t2_451 1 times vector_stmt costs 4 in body
>  t1_472 - t3_444 1 times vector_stmt costs 4 in body
>  node 0x41829f8 1 times vec_perm costs 4 in body
>  _436 1 times vector_store costs 16 in body
>  t.c:37:9: note: Cost model analysis for part in loop 0:
> -  Vector cost: 60
> +  Vector cost: 84
>    Scalar cost: 64
> +t.c:37:9: missed: not vectorized: vectorization is not profitable.
>
> We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> where the components are results of integer adds (so the result is
> definitely in a GPR).  We are costing the construction as
> 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> 4 * COSTS_N_INSNS (1) + 2 * 6.
>
> Whether the vectorization itself is profitable is likely questionable
> but then it's true that the construction of V4SI is more costly
> in terms of uops than a construction of V4SF.
>
> Now, we can - for the first time - now see the actual construction
> pattern and ideal construction might be two GPR->xmm moves
> + two splats + one unpack or maybe two GPR->xmm moves + one
> unpack + splat of DI (or other means of duplicating the lowpart).
Yes, the patch is technically right. I'm ok with the patch.
> It still wouldn't help here of course, and we would not have RTL
> expansion adhere to this scheme.
>
> Trying to capture secondary effects (the FRE opportunity unleashed)
> in costing of this particular SLP subgraph is difficult and probably
> undesirable though.  Adjusting any of the target specific costs
> is likely not going to work because of the original narrow profitability
> gap (60 vs. 64).
>
> For the particular kernel the overall vectorization strathegy needs
> to improve (PR98138 is about this I think).  I know we've reverted
> the previous change that attempted to cost for GPR -> XMM.  It did
>
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> +         int cost
> +           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> +                                               ix86_cost->sse_op
> +                                               :
> ix86_cost->integer_to_sse);
> +
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>
> thus treated integer_to_sse as GPR insert into vector cost for
> integer components in general while the now proposed change
> _adds_ integer to XMM move costs (but only once for duplicate
> elements and not for the cases we know are from memory / vector regs).
> With the proposed patch we can probably be more "correct" above
> and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> for FP the first one is free and for int we're costing it separately.
> Again it won't help here.
>
> Thanks,
> Richard.
>
> > >
> > > I suppose we can let autotesters look for SPEC performance fallout.
> > >
> > > OK if testing succeeds?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2022-02-18  Richard Biener  <rguenther@suse.de>
> > >
> > >         PR tree-optimization/104582
> > >         PR target/99881
> > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > >         Cost GPR to vector register moves for integer vector construction.
> > >
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> > >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > > ---
> > >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> > >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> > >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> > >  7 files changed, 102 insertions(+), 3 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 0830dbd7dca..b2bf90576d5 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> > >
> > >  unsigned
> > >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > -                                 stmt_vec_info stmt_info, slp_tree,
> > > +                                 stmt_vec_info stmt_info, slp_tree node,
> > >                                   tree vectype, int misalign,
> > >                                   vect_cost_model_location where)
> > >  {
> > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > >      }
> > > +  else if (kind == vec_construct
> > > +          && node
> > > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > > +    {
> > > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > +      unsigned i;
> > > +      tree op;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       if (TREE_CODE (op) == SSA_NAME)
> > > +         TREE_VISITED (op) = 0;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       {
> > > +         if (TREE_CODE (op) != SSA_NAME
> > > +             || TREE_VISITED (op))
> > > +           continue;
> > > +         TREE_VISITED (op) = 1;
> > > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > > +         tree tem;
> > > +         if (is_gimple_assign (def)
> > > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > > +             && TREE_CODE (tem) == SSA_NAME
> > > +             /* A sign-change expands to nothing.  */
> > > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > > +                                       TREE_TYPE (tem)))
> > > +           def = SSA_NAME_DEF_STMT (tem);
> > > +         /* When the component is loaded from memory we can directly
> > > +            move it to a vector register, otherwise we have to go
> > > +            via a GPR or via vpinsr which involves similar cost.
> > > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > > +            register we can hope to avoid using a GPR.  */
> > > +         if (!is_gimple_assign (def)
> > > +             || (!gimple_assign_load_p (def)
> > > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > > +           stmt_cost += ix86_cost->sse_to_integer;
> > > +       }
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       if (TREE_CODE (op) == SSA_NAME)
> > > +         TREE_VISITED (op) = 0;
> > > +    }
> > >    if (stmt_cost == -1)
> > >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > new file mode 100644
> > > index 00000000000..992a845ad7a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (unsigned long *a, unsigned long *b)
> > > +{
> > > +  unsigned long a_ = *a;
> > > +  unsigned long b_ = *b;
> > > +  s.a = a_;
> > > +  s.b = b_;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > new file mode 100644
> > > index 00000000000..7637cdb4a97
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (unsigned long a, unsigned long b)
> > > +{
> > > +  s.a = a;
> > > +  s.b = b;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > new file mode 100644
> > > index 00000000000..999c4905708
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { double a, b; } s;
> > > +
> > > +void
> > > +foo (double a, double b)
> > > +{
> > > +  s.a = a;
> > > +  s.b = b;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > new file mode 100644
> > > index 00000000000..cc471e1ed73
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (signed long *a, unsigned long *b)
> > > +{
> > > +  unsigned long a_ = *a;
> > > +  unsigned long b_ = *b;
> > > +  s.a = a_;
> > > +  s.b = b_;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > index 0243ca3ea68..067bf43f698 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> > >    bar (&t);
> > >  }
> > >
> > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > @@ -1,7 +1,7 @@
> > >  /* PR target/99881.  */
> > >  /* { dg-do compile { target { ! ia32 } } } */
> > >  /* { dg-options "-Ofast -march=skylake" } */
> > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> > >
> > >  void
> > >  foo (int* __restrict a, int n, int c)
> > > --
> > > 2.34.1
> >
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)



-- 
BR,
Hongtao

next prev parent reply	other threads:[~2022-02-22  4:08 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-18 14:01 Richard Biener
2022-02-21  1:35 ` Hongtao Liu
2022-02-21  9:10   ` Richard Biener
2022-02-22  4:08     ` Hongtao Liu [this message]
2022-02-22  7:58       ` Richard Biener
2022-02-22  9:51         ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMZc-byico_c=x_TTNMAtQJ-_maQpZOLF2+0xdisog3KzEvdGQ@mail.gmail.com' \
    --to=crazylht@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hongtao.liu@intel.com \
    --cc=rguenther@suse.de \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).