public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
       [not found] <77b5514c-dab1-4803-b906-0c490e59f838@DBAEUR03FT058.eop-EUR03.prod.protection.outlook.com>
@ 2023-07-12 14:15 ` Richard Sandiford
  2023-07-13  6:39   ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2023-07-12 14:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.

Yeah, the current RTL codes probably overlap a bit too much.

Maybe we should have a rule that a vec_merge with a constant
third operand should be canonicalised to a vec_select?  And maybe
change the first operand of vec_select to be an rtvec, so that
no separate vec_concat (and thus wider mode) is needed for two-input
permutes?  Would be a lot of work though...

> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).

Yeah, extraction of the low element (0 for LE, N-1 for BE) is special
in RTL, in that it is now folded to a subreg.  But IMO it's reasonable
for even that case to go through TARGET_VECTORIZE_VEC_PERM_CONST,
maybe with a target-independent helper function to match permute
vectors that are equivalent to extract-and-insert.

On AArch64, extract-and-insert is a single operation for other
elements too, e.g.:

	ins	v0.s[2], v1.s[1]

is a thing.  But if the helper returns the index of the extracted
element, targets can decide for themselves whether the index is
supported or not.

Agree that this is the right thing for gimple to do FWIW.

Thanks,
Richard

> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>
>   <bb 4> [local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
>
> is equal to
>
>   <bb 2> [local count: 1073741824]:
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 3>; [50.00%]
>
>   <bb 3> [local count: 536870912]:
>   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
>
>   <bb 4> [local count: 1073741824]:
>   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
>
> and that no longer produces the desired maxsd operation at the RTL
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
> 	PR tree-optimization/94864
> 	PR tree-optimization/94865
> 	* match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> 	for vector insertion from vector extraction.
>
> 	* gcc.target/i386/pr94864.c: New testcase.
> 	* gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd                            | 25 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  		      wi::to_wide (@ipos) + isize))
>      (BIT_FIELD_REF @0 @rsize @rpos)))))
>  
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +      && types_match (@0, @1)
> +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> +     unsigned HOST_WIDE_INT elsz
> +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> +     vec_perm_builder builder;
> +     builder.new_vector (nunits, nunits, 1);
> +     for (unsigned i = 0; i < nunits; ++i)
> +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> +     vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +	|| can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> +                        (build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>    (simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 00000000000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[0] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> new file mode 100644
> index 00000000000..84065ac2467
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[1] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
  2023-07-12 14:15 ` [PATCH] tree-optimization/94864 - vector insert of vector extract simplification Richard Sandiford
@ 2023-07-13  6:39   ` Richard Biener
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2023-07-13  6:39 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 12 Jul 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > The PRs ask for optimizing of
> >
> >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> >
> > to a vector permutation.  The following implements this as
> > match.pd pattern, improving code generation on x86_64.
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> 
> Yeah, the current RTL codes probably overlap a bit too much.
> 
> Maybe we should have a rule that a vec_merge with a constant
> third operand should be canonicalised to a vec_select?

But vec_merge always has a constant third operand:

@findex vec_merge
@item (vec_merge:@var{m} @var{vec1} @var{vec2} @var{items})
This describes a merge operation between two vectors.  The result is a
vector of mode @var{m}; its elements are selected from either @var{vec1}
or @var{vec2}.  Which elements are selected is described by @var{items},
which is a bit mask represented by a @code{const_int}; a zero bit
indicates the corresponding element in the result vector is taken from
@var{vec2} while a set bit indicates it is taken from @var{vec1}.

the "advantage" of vec_merge over vec_concat + vec_select is
that you don't need the 2x wider vector mode, but that's the
only one.  I guess we could allow a mode-less (VOIDmode) vec_concat as
the first operand of a vec_select since that mode isn't really
used for anything.

That said, we could work around the issue by having combine
also try to match vec_merge when the vec_select + vec_concat
combination is a blend.  But I fear that doesn't resonate well
with Segher.

>  And maybe
> change the first operand of vec_select to be an rtvec, so that
> no separate vec_concat (and thus wider mode) is needed for two-input
> permutes?  Would be a lot of work though...
> 
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> 
> Yeah, extraction of the low element (0 for LE, N-1 for BE) is special
> in RTL, in that it is now folded to a subreg.  But IMO it's reasonable
> for even that case to go through TARGET_VECTORIZE_VEC_PERM_CONST,
> maybe with a target-independent helper function to match permute
> vectors that are equivalent to extract-and-insert.
> 
> On AArch64, extract-and-insert is a single operation for other
> elements too, e.g.:
> 
> 	ins	v0.s[2], v1.s[1]
> 
> is a thing.  But if the helper returns the index of the extracted
> element, targets can decide for themselves whether the index is
> supported or not.
> 
> Agree that this is the right thing for gimple to do FWIW.

I think so as well.  Btw, I think only proper re-association and
merging will handle a full sequence of select, merge and permute
optimally.  In principle we have the bswap pass facility for this.

Richard.

> Thanks,
> Richard
> 
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> >
> >   <bb 3> [local count: 536870913]:
> >
> >   <bb 4> [local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> >
> > is equal to
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 3>; [50.00%]
> >
> >   <bb 3> [local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> >
> >   <bb 4> [local count: 1073741824]:
> >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> >
> > and that no longer produces the desired maxsd operation at the RTL
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> > 	PR tree-optimization/94864
> > 	PR tree-optimization/94865
> > 	* match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > 	for vector insertion from vector extraction.
> >
> > 	* gcc.target/i386/pr94864.c: New testcase.
> > 	* gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >  		      wi::to_wide (@ipos) + isize))
> >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> >  
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +      && types_match (@0, @1)
> > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > +     unsigned HOST_WIDE_INT elsz
> > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > +     vec_perm_builder builder;
> > +     builder.new_vector (nunits, nunits, 1);
> > +     for (unsigned i = 0; i < nunits; ++i)
> > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > +     vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > +	|| can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > +                        (build_vector_type (ssizetype, nunits), sel); })))))
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >    (simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> > new file mode 100644
> > index 00000000000..69cb481fcfe
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[0] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> > new file mode 100644
> > index 00000000000..84065ac2467
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[1] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
       [not found] <20230822090522.F0D7B385C6D9@sourceware.org>
@ 2023-08-22  9:21 ` Hongtao Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Hongtao Liu @ 2023-08-22  9:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, hongtao.liu

On Tue, Aug 22, 2023 at 5:05 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> The patch FAILs one case in gcc.target/i386/avx512fp16-vmovsh-1a.c
> where we now expand from
>
>  __A_28 = VEC_PERM_EXPR <x2.8_9, x1.9_10, { 0, 9, 10, 11, 12, 13, 14, 15 }>;
>
> instead of
>
>  _28 = BIT_FIELD_REF <x2.8_9, 16, 0>;
>  __A_29 = BIT_INSERT_EXPR <x1.9_10, _28, 0>;
>
> producing a vpblendw instruction instead of the expected vmovsh.  That's
> either a missed vec_perm_const expansion optimization or even better,
> an improvement - Zen4 for example has 4 ports to execute vpblendw
> but only 3 for executing vmovsh and both instructions have the same size.
Looks like Sapphire Rapids only has 2 ports for executing vpblendw
but 3 for vmovsh, so I guess we may need micro-architecture tuning for
this specific permutation.
For vmovss/vpblendd they're equivalent on SPR: both have 3 ports.
The change for the testcase is OK; I'll handle it with an incremental patch.
>
> The patch XFAILs the sub-testcase - is that OK or should I update
> the expected instruction to a vpblend?
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
>         PR tree-optimization/94864
>         PR tree-optimization/94865
>         PR tree-optimization/93080
>         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
>         for vector insertion from vector extraction.
>
>         * gcc.target/i386/pr94864.c: New testcase.
>         * gcc.target/i386/pr94865.c: Likewise.
>         * gcc.target/i386/avx512fp16-vmovsh-1a.c: XFAIL.
>         * gcc.dg/tree-ssa/forwprop-40.c: Likewise.
>         * gcc.dg/tree-ssa/forwprop-41.c: Likewise.
> ---
>  gcc/match.pd                                  | 25 +++++++++++++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c   | 14 +++++++++++
>  gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c   | 16 ++++++++++++
>  .../gcc.target/i386/avx512fp16-vmovsh-1a.c    |  2 +-
>  gcc/testsuite/gcc.target/i386/pr94864.c       | 13 ++++++++++
>  gcc/testsuite/gcc.target/i386/pr94865.c       | 13 ++++++++++
>  6 files changed, 82 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 86fdc606a79..6e083021b27 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8006,6 +8006,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                       wi::to_wide (@ipos) + isize))
>      (BIT_FIELD_REF @0 @rsize @rpos)))))
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +      && types_match (@0, @1)
> +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> +     unsigned HOST_WIDE_INT elsz
> +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> +     vec_perm_builder builder;
> +     builder.new_vector (nunits, nunits, 1);
> +     for (unsigned i = 0; i < nunits; ++i)
> +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> +     vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> +                        (build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>    (simplify
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
> new file mode 100644
> index 00000000000..7513497f552
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized -Wno-psabi -w" } */
> +
> +#define vector __attribute__((__vector_size__(16) ))
> +
> +vector int g(vector int a)
> +{
> +  int b = a[0];
> +  a[0] = b;
> +  return a;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
> new file mode 100644
> index 00000000000..b1e75797a90
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized -Wno-psabi -w" } */
> +
> +#define vector __attribute__((__vector_size__(16) ))
> +
> +vector int g(vector int a, int c)
> +{
> +  int b = a[2];
> +  a[2] = b;
> +  a[1] = c;
> +  return a;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 0 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c b/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
> index ba10096aa20..38bf5cc0395 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
> @@ -3,7 +3,7 @@
>  /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*\{%k\[0-9\]\}(?:\n|\[ \\t\]+#)" 1 } } */
>  /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}(?:\n|\[ \\t\]+#)" 1 } } */
>  /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
> -/* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\[^z\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
>  /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 00000000000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[0] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> new file mode 100644
> index 00000000000..84065ac2467
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[1] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> --
> 2.35.3



-- 
BR,
Hongtao


* [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
@ 2023-08-22  9:05 Richard Biener
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2023-08-22  9:05 UTC (permalink / raw)
  To: gcc-patches; +Cc: hongtao.liu

The PRs ask for optimizing of

  _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
  result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;

to a vector permutation.  The following implements this as
match.pd pattern, improving code generation on x86_64.

On the RTL level we face the issue that backend patterns inconsistently
use vec_merge and vec_select of vec_concat to represent permutes.

I think using a (supported) permute is almost always better
than an extract plus insert, maybe excluding the case we extract
element zero and that's aliased to a register that can be used
directly for insertion (not sure how to query that).

The patch FAILs one case in gcc.target/i386/avx512fp16-vmovsh-1a.c
where we now expand from

 __A_28 = VEC_PERM_EXPR <x2.8_9, x1.9_10, { 0, 9, 10, 11, 12, 13, 14, 15 }>;

instead of

 _28 = BIT_FIELD_REF <x2.8_9, 16, 0>;
 __A_29 = BIT_INSERT_EXPR <x1.9_10, _28, 0>;

producing a vpblendw instruction instead of the expected vmovsh.  That's
either a missed vec_perm_const expansion optimization or even better,
an improvement - Zen4 for example has 4 ports to execute vpblendw
but only 3 for executing vmovsh and both instructions have the same size.

The patch XFAILs the sub-testcase - is that OK or should I update
the expected instruction to a vpblend?

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Thanks,
Richard.

	PR tree-optimization/94864
	PR tree-optimization/94865
	PR tree-optimization/93080
	* match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
	for vector insertion from vector extraction.

	* gcc.target/i386/pr94864.c: New testcase.
	* gcc.target/i386/pr94865.c: Likewise.
	* gcc.target/i386/avx512fp16-vmovsh-1a.c: XFAIL.
	* gcc.dg/tree-ssa/forwprop-40.c: Likewise.
	* gcc.dg/tree-ssa/forwprop-41.c: Likewise.
---
 gcc/match.pd                                  | 25 +++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c   | 14 +++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c   | 16 ++++++++++++
 .../gcc.target/i386/avx512fp16-vmovsh-1a.c    |  2 +-
 gcc/testsuite/gcc.target/i386/pr94864.c       | 13 ++++++++++
 gcc/testsuite/gcc.target/i386/pr94865.c       | 13 ++++++++++
 6 files changed, 82 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 86fdc606a79..6e083021b27 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8006,6 +8006,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 		      wi::to_wide (@ipos) + isize))
     (BIT_FIELD_REF @0 @rsize @rpos)))))
 
+/* Simplify vector inserts of other vector extracts to a permute.  */
+(simplify
+ (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
+ (if (VECTOR_TYPE_P (type)
+      && types_match (@0, @1)
+      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
+      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
+  (with
+   {
+     unsigned HOST_WIDE_INT elsz
+       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
+     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
+     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
+     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+     vec_perm_builder builder;
+     builder.new_vector (nunits, nunits, 1);
+     for (unsigned i = 0; i < nunits; ++i)
+       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
+     vec_perm_indices sel (builder, 2, nunits);
+   }
+   (if (!VECTOR_MODE_P (TYPE_MODE (type))
+	|| can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
+    (vec_perm @0 @1 { vec_perm_indices_to_tree
+                        (build_vector_type (ssizetype, nunits), sel); })))))
+
 (if (canonicalize_math_after_vectorization_p ())
  (for fmas (FMA)
   (simplify
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
new file mode 100644
index 00000000000..7513497f552
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-40.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized -Wno-psabi -w" } */
+
+#define vector __attribute__((__vector_size__(16) ))
+
+vector int g(vector int a)
+{
+  int b = a[0];
+  a[0] = b;
+  return a;
+}
+
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
new file mode 100644
index 00000000000..b1e75797a90
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-41.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized -Wno-psabi -w" } */
+
+#define vector __attribute__((__vector_size__(16) ))
+
+vector int g(vector int a, int c)
+{
+  int b = a[2];
+  a[2] = b;
+  a[1] = c;
+  return a;
+}
+
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c b/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
index ba10096aa20..38bf5cc0395 100644
--- a/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-vmovsh-1a.c
@@ -3,7 +3,7 @@
 /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*\{%k\[0-9\]\}(?:\n|\[ \\t\]+#)" 1 } } */
 /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}(?:\n|\[ \\t\]+#)" 1 } } */
 /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+\[^\n\r\]*%\[er\]\[ad]x+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
-/* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\[^z\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
 /* { dg-final { scan-assembler-times "vmovsh\[ \\t\]+%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\{%k\[0-9\]\}\{z\}\[^\n\r]*(?:\n|\[ \\t\]+#)" 1 } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
new file mode 100644
index 00000000000..69cb481fcfe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr94864.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df move_sd(v2df a, v2df b)
+{
+    v2df result = a;
+    result[0] = b[1];
+    return result;
+}
+
+/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
new file mode 100644
index 00000000000..84065ac2467
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr94865.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df move_sd(v2df a, v2df b)
+{
+    v2df result = a;
+    result[1] = b[1];
+    return result;
+}
+
+/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
-- 
2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
  2023-07-15 17:31 ` Andrew Pinski
@ 2023-07-15 17:34   ` Andrew Pinski
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Pinski @ 2023-07-15 17:34 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford

On Sat, Jul 15, 2023 at 10:31 AM Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 6:37 AM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > The PRs ask for optimization of
> >
> >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> >
> > to a vector permutation.  The following implements this as a
> > match.pd pattern, improving code generation on x86_64.
>
> This is a more general case of PR 93080, really, where we had:
>
>   _1 = BIT_FIELD_REF <_2, 64, 64>;
>   result_4 = BIT_INSERT_EXPR <_2, _1, 64>;

I should mention that the i386 failures show up even with the limited
patch for PR 93080, which is why I didn't move forward with my patch
there.

>
> Thanks,
> Andrew Pinski
>
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> >
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> >
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> >
> >   <bb 3> [local count: 536870913]:
> >
> >   <bb 4> [local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> >
> > is equal to
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 3>; [50.00%]
> >
> >   <bb 3> [local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> >
> >   <bb 4> [local count: 1073741824]:
> >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> >
> > and that no longer produces the desired maxsd operation at the RTL
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> >         PR tree-optimization/94864
> >         PR tree-optimization/94865
> >         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> >         for vector insertion from vector extraction.
> >
> >         * gcc.target/i386/pr94864.c: New testcase.
> >         * gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >                       wi::to_wide (@ipos) + isize))
> >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> >
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +      && types_match (@0, @1)
> > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > +     unsigned HOST_WIDE_INT elsz
> > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > +     vec_perm_builder builder;
> > +     builder.new_vector (nunits, nunits, 1);
> > +     for (unsigned i = 0; i < nunits; ++i)
> > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > +     vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > +                        (build_vector_type (ssizetype, nunits), sel); })))))
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >    (simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> > new file mode 100644
> > index 00000000000..69cb481fcfe
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[0] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> > new file mode 100644
> > index 00000000000..84065ac2467
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[1] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> > --
> > 2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
       [not found] <20230712133739.1E4E23858C62@sourceware.org>
@ 2023-07-15 17:31 ` Andrew Pinski
  2023-07-15 17:34   ` Andrew Pinski
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Pinski @ 2023-07-15 17:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford

On Wed, Jul 12, 2023 at 6:37 AM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The PRs ask for optimization of
>
>   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
>
> to a vector permutation.  The following implements this as a
> match.pd pattern, improving code generation on x86_64.

This is a more general case of PR 93080, really, where we had:

  _1 = BIT_FIELD_REF <_2, 64, 64>;
  result_4 = BIT_INSERT_EXPR <_2, _1, 64>;

Thanks,
Andrew Pinski

>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>
>   <bb 4> [local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
>
> is equal to
>
>   <bb 2> [local count: 1073741824]:
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 3>; [50.00%]
>
>   <bb 3> [local count: 536870912]:
>   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
>
>   <bb 4> [local count: 1073741824]:
>   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
>
> and that no longer produces the desired maxsd operation at the RTL
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
>         PR tree-optimization/94864
>         PR tree-optimization/94865
>         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
>         for vector insertion from vector extraction.
>
>         * gcc.target/i386/pr94864.c: New testcase.
>         * gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd                            | 25 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                       wi::to_wide (@ipos) + isize))
>      (BIT_FIELD_REF @0 @rsize @rpos)))))
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +      && types_match (@0, @1)
> +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> +     unsigned HOST_WIDE_INT elsz
> +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> +     vec_perm_builder builder;
> +     builder.new_vector (nunits, nunits, 1);
> +     for (unsigned i = 0; i < nunits; ++i)
> +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> +     vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> +                        (build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>    (simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 00000000000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[0] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> new file mode 100644
> index 00000000000..84065ac2467
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[1] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> --
> 2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
  2023-07-13  6:32     ` Richard Biener
@ 2023-07-13  8:29       ` Hongtao Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Hongtao Liu @ 2023-07-13  8:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford

On Thu, Jul 13, 2023 at 2:32 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 13 Jul 2023, Hongtao Liu wrote:
>
> > On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > The PRs ask for optimization of
> > > >
> > > >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> > > >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> > > >
> > > > to a vector permutation.  The following implements this as a
> > > > match.pd pattern, improving code generation on x86_64.
> > > >
> > > > On the RTL level we face the issue that backend patterns inconsistently
> > > > use vec_merge and vec_select of vec_concat to represent permutes.
> > > >
> > > > I think using a (supported) permute is almost always better
> > > > than an extract plus insert, maybe excluding the case we extract
> > > > element zero and that's aliased to a register that can be used
> > > > directly for insertion (not sure how to query that).
> > > >
> > > > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > > > now realizes that
> > > >
> > > >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> > > >   if (_1 > a_4(D))
> > > >     goto <bb 3>; [50.00%]
> > > >   else
> > > >     goto <bb 4>; [50.00%]
> > > >
> > > >   <bb 3> [local count: 536870913]:
> > > >
> > > >   <bb 4> [local count: 1073741824]:
> > > >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> > > >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> > > >
> > > > is equal to
> > > >
> > > >   <bb 2> [local count: 1073741824]:
> > > >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> > > >   if (_1 > a_4(D))
> > > >     goto <bb 4>; [50.00%]
> > > >   else
> > > >     goto <bb 3>; [50.00%]
> > > >
> > > >   <bb 3> [local count: 536870912]:
> > > >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> > > >
> > > >   <bb 4> [local count: 1073741824]:
> > > >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> > > >
> > > > and that no longer produces the desired maxsd operation at the RTL
> > > The comparison is in scalar mode, but the operations in then_bb are in
> > > a vector mode, so if_convert can't eliminate the condition any more (and
> > > won't go into the backend's ix86_expand_sse_fp_minmax).
> > > I think for ordered comparisons like _1 > a_4, it doesn't match
> > > fmin/fmax, but does match SSE MINSS/MAXSS, since that always returns
> > > the second operand (not the other operand) when there's NONE.
> > I mean NANs.
>
> Btw, I once tried to recognize MAX here at the GIMPLE level, but
> while the x86 (vector) max insns are fine for x > y ? x : y, we
> have no tree code or optab for exactly that: we have MAX_EXPR,
> which behaves differently for NaN, and .FMAX, which is exactly
> IEEE, which the x86 ISA isn't.
>
> I wonder if we should thus if-convert this on the GIMPLE level,
> but to x > y ? x : y, i.e. a COND_EXPR?
COND_EXPR maps to mov<mode>cc; for x86 it's expanded by
ix86_expand_fp_movcc, which will try to detect fp min/max.
It's probably ok.
>
> Richard.
>
> > > > level (we fail to match .FMAX at the GIMPLE level earlier).
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> > > >
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> > > >
> > > > I think this is also PR88540 (the lack of min/max detection, not
> > > > sure if the SSE min/max are suitable here)
> > > >
> > > >         PR tree-optimization/94864
> > > >         PR tree-optimization/94865
> > > >         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > > >         for vector insertion from vector extraction.
> > > >
> > > >         * gcc.target/i386/pr94864.c: New testcase.
> > > >         * gcc.target/i386/pr94865.c: Likewise.
> > > > ---
> > > >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> > > >  3 files changed, 51 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 8543f777a28..8cc106049c4 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >                       wi::to_wide (@ipos) + isize))
> > > >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> > > >
> > > > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > > > +(simplify
> > > > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > > > + (if (VECTOR_TYPE_P (type)
> > > > +      && types_match (@0, @1)
> > > > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > > > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > > > +  (with
> > > > +   {
> > > > +     unsigned HOST_WIDE_INT elsz
> > > > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > > > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > > > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > > > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > +     vec_perm_builder builder;
> > > > +     builder.new_vector (nunits, nunits, 1);
> > > > +     for (unsigned i = 0; i < nunits; ++i)
> > > > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > > > +     vec_perm_indices sel (builder, 2, nunits);
> > > > +   }
> > > > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > > > +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > > > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > > > +                        (build_vector_type (ssizetype, nunits), sel); })))))
> > > > +
> > > >  (if (canonicalize_math_after_vectorization_p ())
> > > >   (for fmas (FMA)
> > > >    (simplify
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> > > > new file mode 100644
> > > > index 00000000000..69cb481fcfe
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > > > @@ -0,0 +1,13 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > > > +
> > > > +typedef double v2df __attribute__((vector_size(16)));
> > > > +
> > > > +v2df move_sd(v2df a, v2df b)
> > > > +{
> > > > +    v2df result = a;
> > > > +    result[0] = b[1];
> > > > +    return result;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> > > > new file mode 100644
> > > > index 00000000000..84065ac2467
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > > > @@ -0,0 +1,13 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > > > +
> > > > +typedef double v2df __attribute__((vector_size(16)));
> > > > +
> > > > +v2df move_sd(v2df a, v2df b)
> > > > +{
> > > > +    v2df result = a;
> > > > +    result[1] = b[1];
> > > > +    return result;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> > > > --
> > > > 2.35.3
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
  2023-07-13  2:57   ` Hongtao Liu
@ 2023-07-13  6:32     ` Richard Biener
  2023-07-13  8:29       ` Hongtao Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2023-07-13  6:32 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: gcc-patches, richard.sandiford

On Thu, 13 Jul 2023, Hongtao Liu wrote:

> On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > The PRs ask for optimization of
> > >
> > >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> > >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> > >
> > > to a vector permutation.  The following implements this as a
> > > match.pd pattern, improving code generation on x86_64.
> > >
> > > On the RTL level we face the issue that backend patterns inconsistently
> > > use vec_merge and vec_select of vec_concat to represent permutes.
> > >
> > > I think using a (supported) permute is almost always better
> > > than an extract plus insert, maybe excluding the case we extract
> > > element zero and that's aliased to a register that can be used
> > > directly for insertion (not sure how to query that).
> > >
> > > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > > now realizes that
> > >
> > >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> > >   if (_1 > a_4(D))
> > >     goto <bb 3>; [50.00%]
> > >   else
> > >     goto <bb 4>; [50.00%]
> > >
> > >   <bb 3> [local count: 536870913]:
> > >
> > >   <bb 4> [local count: 1073741824]:
> > >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> > >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> > >
> > > is equal to
> > >
> > >   <bb 2> [local count: 1073741824]:
> > >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> > >   if (_1 > a_4(D))
> > >     goto <bb 4>; [50.00%]
> > >   else
> > >     goto <bb 3>; [50.00%]
> > >
> > >   <bb 3> [local count: 536870912]:
> > >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> > >
> > >   <bb 4> [local count: 1073741824]:
> > >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> > >
> > > and that no longer produces the desired maxsd operation at the RTL
> > The comparison is in scalar mode, but the operations in then_bb are in
> > a vector mode, so if_convert can't eliminate the condition any more (and
> > won't go into the backend's ix86_expand_sse_fp_minmax).
> > I think for ordered comparisons like _1 > a_4, it doesn't match
> > fmin/fmax, but does match SSE MINSS/MAXSS, since that always returns
> > the second operand (not the other operand) when there's NONE.
> I mean NANs.

Btw, I once tried to recognize MAX here at the GIMPLE level, but
while the x86 (vector) max insns are fine for x > y ? x : y, we
have no tree code or optab for exactly that: we have MAX_EXPR,
which behaves differently for NaN, and .FMAX, which is exactly
IEEE, which the x86 ISA isn't.

I wonder if we should thus if-convert this on the GIMPLE level,
but to x > y ? x : y, i.e. a COND_EXPR?

Richard.

> > > level (we fail to match .FMAX at the GIMPLE level earlier).
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> > >
> > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> > >
> > > I think this is also PR88540 (the lack of min/max detection, not
> > > sure if the SSE min/max are suitable here)
> > >
> > >         PR tree-optimization/94864
> > >         PR tree-optimization/94865
> > >         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > >         for vector insertion from vector extraction.
> > >
> > >         * gcc.target/i386/pr94864.c: New testcase.
> > >         * gcc.target/i386/pr94865.c: Likewise.
> > > ---
> > >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> > >  3 files changed, 51 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 8543f777a28..8cc106049c4 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >                       wi::to_wide (@ipos) + isize))
> > >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> > >
> > > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > > +(simplify
> > > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > > + (if (VECTOR_TYPE_P (type)
> > > +      && types_match (@0, @1)
> > > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > > +  (with
> > > +   {
> > > +     unsigned HOST_WIDE_INT elsz
> > > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > +     vec_perm_builder builder;
> > > +     builder.new_vector (nunits, nunits, 1);
> > > +     for (unsigned i = 0; i < nunits; ++i)
> > > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > > +     vec_perm_indices sel (builder, 2, nunits);
> > > +   }
> > > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > > +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > > +                        (build_vector_type (ssizetype, nunits), sel); })))))
> > > +
> > >  (if (canonicalize_math_after_vectorization_p ())
> > >   (for fmas (FMA)
> > >    (simplify
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> > > new file mode 100644
> > > index 00000000000..69cb481fcfe
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > > +
> > > +typedef double v2df __attribute__((vector_size(16)));
> > > +
> > > +v2df move_sd(v2df a, v2df b)
> > > +{
> > > +    v2df result = a;
> > > +    result[0] = b[1];
> > > +    return result;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> > > new file mode 100644
> > > index 00000000000..84065ac2467
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > > +
> > > +typedef double v2df __attribute__((vector_size(16)));
> > > +
> > > +v2df move_sd(v2df a, v2df b)
> > > +{
> > > +    v2df result = a;
> > > +    result[1] = b[1];
> > > +    return result;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> > > --
> > > 2.35.3
> >
> >
> >
> > --
> > BR,
> > Hongtao
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
  2023-07-13  2:47 ` Hongtao Liu
@ 2023-07-13  2:57   ` Hongtao Liu
  2023-07-13  6:32     ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Hongtao Liu @ 2023-07-13  2:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford

On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > The PRs ask for optimization of
> >
> >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> >
> > to a vector permutation.  The following implements this as a
> > match.pd pattern, improving code generation on x86_64.
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> >
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> >
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> >
> >   <bb 3> [local count: 536870913]:
> >
> >   <bb 4> [local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> >
> > is equal to
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 3>; [50.00%]
> >
> >   <bb 3> [local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> >
> >   <bb 4> [local count: 1073741824]:
> >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> >
> > and that no longer produces the desired maxsd operation at the RTL
> The comparison is in scalar mode, but the operations in then_bb are in
> a vector mode, so if_convert can't eliminate the condition any more (and
> won't go into the backend's ix86_expand_sse_fp_minmax).
> I think for ordered comparisons like _1 > a_4, it doesn't match
> fmin/fmax, but does match SSE MINSS/MAXSS, since that always returns
> the second operand (not the other operand) when there's NONE.
I mean NaNs.
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> >         PR tree-optimization/94864
> >         PR tree-optimization/94865
> >         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> >         for vector insertion from vector extraction.
> >
> >         * gcc.target/i386/pr94864.c: New testcase.
> >         * gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >                       wi::to_wide (@ipos) + isize))
> >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> >
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +      && types_match (@0, @1)
> > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > +     unsigned HOST_WIDE_INT elsz
> > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > +     vec_perm_builder builder;
> > +     builder.new_vector (nunits, nunits, 1);
> > +     for (unsigned i = 0; i < nunits; ++i)
> > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > +     vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > +                        (build_vector_type (ssizetype, nunits), sel); })))))
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >    (simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> > new file mode 100644
> > index 00000000000..69cb481fcfe
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[0] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> > new file mode 100644
> > index 00000000000..84065ac2467
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[1] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> > --
> > 2.35.3
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
       [not found] <20230712133715.EDB563858C62@sourceware.org>
@ 2023-07-13  2:47 ` Hongtao Liu
  2023-07-13  2:57   ` Hongtao Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Hongtao Liu @ 2023-07-13  2:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford

On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The PRs ask for optimization of
>
>   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 3> [local count: 536870913]:
>
>   <bb 4> [local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
>
> is equal to
>
>   <bb 2> [local count: 1073741824]:
>   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
>   if (_1 > a_4(D))
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 3>; [50.00%]
>
>   <bb 3> [local count: 536870912]:
>   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
>
>   <bb 4> [local count: 1073741824]:
>   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
>
> and that no longer produces the desired maxsd operation at the RTL
The comparison is in scalar mode, but the operations in then_bb are in
vector mode, so if_convert can no longer eliminate the condition (and
never reaches the backend's ix86_expand_sse_fp_minmax).
I think an ordered comparison like _1 > a_4 doesn't match fmin/fmax,
but does match SSE MINSS/MAXSS, since those always return the second
operand (not the other operand) when there's NONE.
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
>         PR tree-optimization/94864
>         PR tree-optimization/94865
>         * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
>         for vector insertion from vector extraction.
>
>         * gcc.target/i386/pr94864.c: New testcase.
>         * gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd                            | 25 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                       wi::to_wide (@ipos) + isize))
>      (BIT_FIELD_REF @0 @rsize @rpos)))))
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +      && types_match (@0, @1)
> +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> +     unsigned HOST_WIDE_INT elsz
> +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> +     vec_perm_builder builder;
> +     builder.new_vector (nunits, nunits, 1);
> +     for (unsigned i = 0; i < nunits; ++i)
> +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> +     vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +       || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
> +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> +                        (build_vector_type (ssizetype, nunits), sel); })))))
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>    (simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 00000000000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[0] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
> new file mode 100644
> index 00000000000..84065ac2467
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df move_sd(v2df a, v2df b)
> +{
> +    v2df result = a;
> +    result[1] = b[1];
> +    return result;
> +}
> +
> +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> --
> 2.35.3



-- 
BR,
Hongtao


* Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
       [not found] <20230712133711.DEF46385842C@sourceware.org>
@ 2023-07-12 14:05 ` Jeff Law
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Law @ 2023-07-12 14:05 UTC (permalink / raw)
  To: Richard Biener, gcc-patches; +Cc: richard.sandiford



On 7/12/23 07:36, Richard Biener via Gcc-patches wrote:
> The PRs ask for optimization of
> 
>    _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
>    result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> 
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
> 
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
> 
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
So for a target with aliases at the register level, I'd bet they're 
already aware of the aliasing and are prepared to deal with it in the 
target (and are probably already trying to take advantage of that quirk 
when possible).

So I'd just punt that problem to the targets.  If it turns out to be 
common, then we can try to address it, probably at the gimple->rtl border.

jeff



* [PATCH] tree-optimization/94864 - vector insert of vector extract simplification
@ 2023-07-12 13:36 Richard Biener
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2023-07-12 13:36 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

The PRs ask for optimization of

  _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
  result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;

to a vector permutation.  The following implements this as a
match.pd pattern, improving code generation on x86_64.

On the RTL level we face the issue that backend patterns inconsistently
use vec_merge and vec_select of vec_concat to represent permutes.

I think using a (supported) permute is almost always better
than an extract plus insert, maybe excluding the case where we extract
element zero and that element is aliased to a register that can be used
directly for insertion (not sure how to query that).

But this regresses for example gcc.target/i386/pr54855-8.c because PRE
now realizes that

  _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
  if (_1 > a_4(D))
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:

  <bb 4> [local count: 1073741824]:
  # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
  x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;

is equal to

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
  if (_1 > a_4(D))
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 536870912]:
  _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;

  <bb 4> [local count: 1073741824]:
  # prephitmp_8 = PHI <x_3(D)(2), _7(3)>

and that no longer produces the desired maxsd operation at the RTL
level (we fail to match .FMAX at the GIMPLE level earlier).

Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:

FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss

I think this is also PR88540 (the lack of min/max detection, not
sure if the SSE min/max are suitable here)

	PR tree-optimization/94864
	PR tree-optimization/94865
	* match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
	for vector insertion from vector extraction.

	* gcc.target/i386/pr94864.c: New testcase.
	* gcc.target/i386/pr94865.c: Likewise.
---
 gcc/match.pd                            | 25 +++++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
 3 files changed, 51 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 8543f777a28..8cc106049c4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 		      wi::to_wide (@ipos) + isize))
     (BIT_FIELD_REF @0 @rsize @rpos)))))
 
+/* Simplify vector inserts of other vector extracts to a permute.  */
+(simplify
+ (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
+ (if (VECTOR_TYPE_P (type)
+      && types_match (@0, @1)
+      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
+      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
+  (with
+   {
+     unsigned HOST_WIDE_INT elsz
+       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
+     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
+     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
+     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+     vec_perm_builder builder;
+     builder.new_vector (nunits, nunits, 1);
+     for (unsigned i = 0; i < nunits; ++i)
+       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
+     vec_perm_indices sel (builder, 2, nunits);
+   }
+   (if (!VECTOR_MODE_P (TYPE_MODE (type))
+	|| can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, false))
+    (vec_perm @0 @1 { vec_perm_indices_to_tree
+                        (build_vector_type (ssizetype, nunits), sel); })))))
+
 (if (canonicalize_math_after_vectorization_p ())
  (for fmas (FMA)
   (simplify
diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c b/gcc/testsuite/gcc.target/i386/pr94864.c
new file mode 100644
index 00000000000..69cb481fcfe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr94864.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df move_sd(v2df a, v2df b)
+{
+    v2df result = a;
+    result[0] = b[1];
+    return result;
+}
+
+/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c b/gcc/testsuite/gcc.target/i386/pr94865.c
new file mode 100644
index 00000000000..84065ac2467
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr94865.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df move_sd(v2df a, v2df b)
+{
+    v2df result = a;
+    result[1] = b[1];
+    return result;
+}
+
+/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
-- 
2.35.3
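The selector the new match.pd pattern builds can be sketched in a few lines of Python (an editorial mimic of the `builder.quick_push` loop above, not GCC code): lane i of the result is lane sel[i] of the 2*nunits-lane concatenation {op0, op1}.

```python
def insert_as_permute(nunits, ielt, relt):
    """Mimic the pattern's selector: every lane keeps its op0 index i,
    except the insert lane ielt, which takes op1 lane relt (offset by
    nunits because op1's lanes follow op0's in the concatenation)."""
    return [nunits + relt if i == ielt else i for i in range(nunits)]

# pr94864: v2df, result[0] = b[1]  ->  VEC_PERM <a, b, { 3, 1 }>  (unpckhpd)
assert insert_as_permute(2, 0, 1) == [3, 1]
# pr94865: v2df, result[1] = b[1]  ->  VEC_PERM <a, b, { 0, 3 }>  (shufpd)
assert insert_as_permute(2, 1, 1) == [0, 3]
```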


end of thread, other threads:[~2023-08-22  9:21 UTC | newest]

Thread overview: 12+ messages
     [not found] <77b5514c-dab1-4803-b906-0c490e59f838@DBAEUR03FT058.eop-EUR03.prod.protection.outlook.com>
2023-07-12 14:15 ` [PATCH] tree-optimization/94864 - vector insert of vector extract simplification Richard Sandiford
2023-07-13  6:39   ` Richard Biener
     [not found] <20230822090522.F0D7B385C6D9@sourceware.org>
2023-08-22  9:21 ` Hongtao Liu
2023-08-22  9:05 Richard Biener
     [not found] <20230712133739.1E4E23858C62@sourceware.org>
2023-07-15 17:31 ` Andrew Pinski
2023-07-15 17:34   ` Andrew Pinski
     [not found] <20230712133715.EDB563858C62@sourceware.org>
2023-07-13  2:47 ` Hongtao Liu
2023-07-13  2:57   ` Hongtao Liu
2023-07-13  6:32     ` Richard Biener
2023-07-13  8:29       ` Hongtao Liu
     [not found] <20230712133711.DEF46385842C@sourceware.org>
2023-07-12 14:05 ` Jeff Law
  -- strict thread matches above, loose matches on Subject: below --
2023-07-12 13:36 Richard Biener
