public inbox for gcc-patches@gcc.gnu.org
* VEC_COND_EXPR optimizations
@ 2020-07-30  7:49 Marc Glisse
  2020-07-31 11:18 ` Richard Sandiford
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Marc Glisse @ 2020-07-30  7:49 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1690 bytes --]

When vector comparisons were forced to use vec_cond_expr, we lost a number 
of optimizations (my fault for not adding enough testcases to prevent 
that). This patch tries to unwrap vec_cond_expr a bit so some 
optimizations can still happen.

I wasn't planning to add all those transformations together, but adding 
one caused a regression, whose fix introduced a second regression, etc.

Using a simple fold_binary internally looks like an ok compromise to me. 
It remains cheap enough (not recursive, and vector instructions are not 
that frequent), while still allowing more than const_binop (X|0 or X&X for 
instance). The transformations are quite conservative with :s and folding 
only if everything simplifies; we may want to relax this later. And of 
course we are going to miss things like a?b:c + a?c:b -> b+c.

In terms of number of operations, some transformations turning 2 
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not 
look like a gain... I expect the bit_not disappears in most cases, and 
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4) 
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not 
-1?), but that doesn't matter for this patch.

Regtest+bootstrap on x86_64-pc-linux-gnu

2020-07-30  Marc Glisse  <marc.glisse@inria.fr>

 	PR tree-optimization/95906
 	PR target/70314
 	* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
 	(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.

 	* gcc.dg/tree-ssa/andnot-2.c: New file.
 	* gcc.dg/tree-ssa/pr95906.c: Likewise.
 	* gcc.target/i386/pr70314.c: Likewise.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/x-diff; name=vec5.patch, Size: 4581 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index c6ae7a7db7a..af52d56162b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3451,6 +3451,77 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (if (cst1 && cst2)
     (vec_cond @0 { cst1; } { cst2; })))))
 
+/* Sink binary operation to branches, but only if we can fold it.  */
+#if GIMPLE
+(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
+	 rdiv trunc_div ceil_div floor_div round_div
+	 trunc_mod ceil_mod floor_mod round_mod min max)
+/* (c ? a : b) op d  -->  c ? (a op d) : (b op d) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) @3)
+  (with
+   {
+     tree rhs1, rhs2 = NULL;
+     rhs1 = fold_binary (op, type, @1, @3);
+     if (rhs1 && is_gimple_val (rhs1))
+       rhs2 = fold_binary (op, type, @2, @3);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; }))))
+ (simplify
+  (op @3 (vec_cond:s @0 @1 @2))
+  (with
+   {
+     tree rhs1, rhs2 = NULL;
+     rhs1 = fold_binary (op, type, @3, @1);
+     if (rhs1 && is_gimple_val (rhs1))
+       rhs2 = fold_binary (op, type, @3, @2);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; }))))
+
+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (with
+   {
+     tree rhs1, rhs2 = NULL;
+     rhs1 = fold_binary (op, type, @1, @3);
+     if (rhs1 && is_gimple_val (rhs1))
+       rhs2 = fold_binary (op, type, @2, @4);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; })))))
+#endif
+
+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (vec_cond (bit_and @0 @3) @1 @2))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2)
+ (vec_cond (bit_ior @0 @3) @1 @2))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2)
+ (vec_cond (bit_ior @0 (bit_not @3)) @2 @1))
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2)
+ (vec_cond (bit_and @0 (bit_not @3)) @2 @1))
+
+/* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
+ (vec_cond (bit_and @0 @1) @2 @3))
+(simplify
+ (vec_cond @0 @2 (vec_cond:s @1 @2 @3))
+ (vec_cond (bit_ior @0 @1) @2 @3))
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @2)
+ (vec_cond (bit_ior (bit_not @0) @1) @2 @3))
+(simplify
+ (vec_cond @0 @3 (vec_cond:s @1 @2 @3))
+ (vec_cond (bit_and (bit_not @0) @1) @2 @3))
+
 /* Simplification moved from fold_cond_expr_with_comparison.  It may also
    be extended.  */
 /* This pattern implements two kinds simplification:
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
new file mode 100644
index 00000000000..e0955ce3ffd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x){
+  vec y = x < 10;
+  return y & (y == 0);
+}
+
+/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c
new file mode 100644
index 00000000000..3d820a58e93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+// FIXME: this should further optimize to a MAX_EXPR
+typedef signed char v16i8 __attribute__((vector_size(16)));
+v16i8 f(v16i8 a, v16i8 b)
+{
+    v16i8 cmp = (a > b);
+    return (cmp & a) | (~cmp & b);
+}
+
+/* { dg-final { scan-tree-dump-not "bit_(and|ior)_expr" "forwprop3" } } */
+/* { dg-final { scan-tree-dump-times "vec_cond_expr" 1 "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70314.c b/gcc/testsuite/gcc.target/i386/pr70314.c
new file mode 100644
index 00000000000..aad8dd9b57e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70314.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-march=skylake-avx512 -O2" } */
+/* { dg-final { scan-assembler-times "cmp" 2 } } */
+/* { dg-final { scan-assembler-not "and" } } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x, vec y){
+  return (x < 5) & (y < 8);
+}
+
+/* On x86_64, currently
+	vpcmpq	$2, .LC1(%rip), %xmm1, %k1
+	vpcmpq	$2, .LC0(%rip), %xmm0, %k0{%k1}
+	vpmovm2q	%k0, %xmm0
+*/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations
  2020-07-30  7:49 VEC_COND_EXPR optimizations Marc Glisse
@ 2020-07-31 11:18 ` Richard Sandiford
  2020-07-31 11:38   ` Marc Glisse
  2020-07-31 11:35 ` Richard Biener
  2020-08-05 13:32 ` VEC_COND_EXPR optimizations v2 Marc Glisse
  2 siblings, 1 reply; 29+ messages in thread
From: Richard Sandiford @ 2020-07-31 11:18 UTC (permalink / raw)
  To: Marc Glisse; +Cc: gcc-patches

Marc Glisse <marc.glisse@inria.fr> writes:
> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
> + (simplify
> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
> +  (with
> +   {
> +     tree rhs1, rhs2 = NULL;
> +     rhs1 = fold_binary (op, type, @1, @3);
> +     if (rhs1 && is_gimple_val (rhs1))
> +       rhs2 = fold_binary (op, type, @2, @4);
> +   }
> +   (if (rhs2 && is_gimple_val (rhs2))
> +    (vec_cond @0 { rhs1; } { rhs2; })))))
> +#endif

This one looks dangerous for potentially-trapping ops.

> +/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
> +(simplify
> + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
> + (vec_cond (bit_and @0 @3) @1 @2))

Does something check automatically that @0 and @3 have compatible types?
Same question for the later folds.

Thanks,
Richard


* Re: VEC_COND_EXPR optimizations
  2020-07-30  7:49 VEC_COND_EXPR optimizations Marc Glisse
  2020-07-31 11:18 ` Richard Sandiford
@ 2020-07-31 11:35 ` Richard Biener
  2020-07-31 11:39   ` Richard Biener
  2020-08-05 13:32 ` VEC_COND_EXPR optimizations v2 Marc Glisse
  2 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-07-31 11:35 UTC (permalink / raw)
  To: Marc Glisse; +Cc: GCC Patches

On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> When vector comparisons were forced to use vec_cond_expr, we lost a number
> of optimizations (my fault for not adding enough testcases to prevent
> that). This patch tries to unwrap vec_cond_expr a bit so some
> optimizations can still happen.
>
> I wasn't planning to add all those transformations together, but adding
> one caused a regression, whose fix introduced a second regression, etc.
>
> Using a simple fold_binary internally looks like an ok compromise to me.
> It remains cheap enough (not recursive, and vector instructions are not
> that frequent), while still allowing more than const_binop (X|0 or X&X for
> instance). The transformations are quite conservative with :s and folding
> only if everything simplifies, we may want to relax this later. And of
> course we are going to miss things like a?b:c + a?c:b -> b+c.
>
> In terms of number of operations, some transformations turning 2
> VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not
> look like a gain... I expect the bit_not disappears in most cases, and
> VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
>
> I am a bit confused that with avx512 we get types like "vector(4)
> <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
> -1?), but that doesn't matter for this patch.
>
> Regtest+bootstrap on x86_64-pc-linux-gnu

+  (with
+   {
+     tree rhs1, rhs2 = NULL;
+     rhs1 = fold_binary (op, type, @1, @3);
+     if (rhs1 && is_gimple_val (rhs1))
+       rhs2 = fold_binary (op, type, @2, @3);

ICK.  I guess a more match-and-simplify way would be

   (with
    {
      tree rhs1, rhs2;
      gimple_match_op op (gimple_match_cond::UNCOND, op,
                                      type, @1, @3);
      if (op.resimplify (NULL, valueize)
          && gimple_simplified_result_is_gimple_val (op))
       {
         rhs1 = op.ops[0];
         ... other operand ...
       }

now in theory we could invent some new syntax for this, like

 (simplify
  (op (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))

and pick something better instead of :x (:s is taken, or it would be
'simplified'; :c is taken, or it would be 'constexpr'; ...).

_Maybe_ just

 (simplify
  (op (vec_cond:s @0 @1 @2) @3)
  (vec_cond:x @0 (op @1 @3) (op @2 @3)))

which would have the same practical meaning as passing
NULL for the seq argument to simplification - do not allow
any intermediate stmt to be generated.

The other "simple" patterns look good, you can commit
them separately if you like.

Richard.

> 2020-07-30  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR tree-optimization/95906
>         PR target/70314
>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
>
>         * gcc.dg/tree-ssa/andnot-2.c: New file.
>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
>         * gcc.target/i386/pr70314.c: Likewise.
>
> --
> Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:18 ` Richard Sandiford
@ 2020-07-31 11:38   ` Marc Glisse
  2020-07-31 11:43     ` Richard Biener
  2020-07-31 12:50     ` Richard Sandiford
  0 siblings, 2 replies; 29+ messages in thread
From: Marc Glisse @ 2020-07-31 11:38 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Fri, 31 Jul 2020, Richard Sandiford wrote:

> Marc Glisse <marc.glisse@inria.fr> writes:
>> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
>> + (simplify
>> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
>> +  (with
>> +   {
>> +     tree rhs1, rhs2 = NULL;
>> +     rhs1 = fold_binary (op, type, @1, @3);
>> +     if (rhs1 && is_gimple_val (rhs1))
>> +       rhs2 = fold_binary (op, type, @2, @4);
>> +   }
>> +   (if (rhs2 && is_gimple_val (rhs2))
>> +    (vec_cond @0 { rhs1; } { rhs2; })))))
>> +#endif
>
> This one looks dangerous for potentially-trapping ops.

I would expect the operation not to be folded if it can trap. Is that too 
optimistic?

>> +/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
>> +(simplify
>> + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
>> + (vec_cond (bit_and @0 @3) @1 @2))
>
> Does something check automatically that @0 and @3 have compatible types?

My memory of this dates from before the avx512-related changes, but at 
least at the time, the type of the condition in vec_cond_expr had to have 
the same size and number of elements as the result, and have signed 
integral elements. Now the size constraint has changed, but it still looks 
like for a given number of elements and size (this last one ignored for 
avx512), there is essentially a single type that can appear as condition. 
Is this wrong for x86? For SVE?

I could add some types_match conditions if that seems too unsafe...

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:35 ` Richard Biener
@ 2020-07-31 11:39   ` Richard Biener
  2020-07-31 11:47     ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-07-31 11:39 UTC (permalink / raw)
  To: Marc Glisse; +Cc: GCC Patches

On Fri, Jul 31, 2020 at 1:35 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse <marc.glisse@inria.fr> wrote:
> >
> > [...]
>
> +  (with
> +   {
> +     tree rhs1, rhs2 = NULL;
> +     rhs1 = fold_binary (op, type, @1, @3);
> +     if (rhs1 && is_gimple_val (rhs1))
> +       rhs2 = fold_binary (op, type, @2, @3);
>
> ICK.  I guess a more match-and-simplify way would be
>
>    (with
>     {
>       tree rhs1, rhs2;
>       gimple_match_op op (gimple_match_cond::UNCOND, op,
>                                       type, @1, @3);
>       if (op.resimplify (NULL, valueize)
>           && gimple_simplified_result_is_gimple_val (op))
>        {
>          rhs1 = op.ops[0];
>          ... other operand ...
>        }
>
> now in theory we could invent some new syntax for this, like
>
>  (simplify
>   (op (vec_cond:s @0 @1 @2) @3)
>   (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))
>
> and pick something better instead of :x (:s is taken,
> would be 'simplified', :c is taken would be 'constexpr', ...).
>
> _Maybe_ just
>
>  (simplify
>   (op (vec_cond:s @0 @1 @2) @3)
>   (vec_cond:x @0 (op @1 @3) (op @2 @3)))
>
> which would have the same practical meaning as passing
> NULL for the seq argument to simplification - do not allow
> any intermediate stmt to be generated.

Note I specifically do not like those if (it-simplifies) checks
because we already would code-generate those anyway.  For

(simplify
  (plus (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (plus @1 @3) (plus @2 @3)))

we get

                    res_op->set_op (VEC_COND_EXPR, type, 3);
                    res_op->ops[0] = captures[1];
                    res_op->ops[0] = unshare_expr (res_op->ops[0]);
                    {
                      tree _o1[2], _r1;
                      _o1[0] = captures[2];
                      _o1[1] = captures[4];
                      gimple_match_op tem_op (res_op->cond.any_else
(), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
                      tem_op.resimplify (lseq, valueize);
                      _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (****)
                      if (!_r1) return false;
                      res_op->ops[1] = _r1;
                    }
                    {
                      tree _o1[2], _r1;
                      _o1[0] = captures[3];
                      _o1[1] = captures[4];
                      gimple_match_op tem_op (res_op->cond.any_else
(), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
                      tem_op.resimplify (lseq, valueize);
                      _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
                      if (!_r1) return false;
                      res_op->ops[2] = _r1;
                    }
                    res_op->resimplify (lseq, valueize);
                    return true;

and the only change required would be to pass NULL to maybe_push_res_to_seq
here instead of lseq at the (***) marked points.

Richard.

> The other "simple" patterns look good, you can commit
> them separately if you like.
>
> Richard.
>
> > 2020-07-30  Marc Glisse  <marc.glisse@inria.fr>
> >
> >         PR tree-optimization/95906
> >         PR target/70314
> >         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> >         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> >
> >         * gcc.dg/tree-ssa/andnot-2.c: New file.
> >         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> >         * gcc.target/i386/pr70314.c: Likewise.
> >
> > --
> > Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:38   ` Marc Glisse
@ 2020-07-31 11:43     ` Richard Biener
  2020-07-31 11:57       ` Marc Glisse
  2020-07-31 12:50     ` Richard Sandiford
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-07-31 11:43 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Richard Sandiford, GCC Patches

On Fri, Jul 31, 2020 at 1:39 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Fri, 31 Jul 2020, Richard Sandiford wrote:
>
> > Marc Glisse <marc.glisse@inria.fr> writes:
> >> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
> >> + (simplify
> >> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
> >> +  (with
> >> +   {
> >> +     tree rhs1, rhs2 = NULL;
> >> +     rhs1 = fold_binary (op, type, @1, @3);
> >> +     if (rhs1 && is_gimple_val (rhs1))
> >> +       rhs2 = fold_binary (op, type, @2, @4);
> >> +   }
> >> +   (if (rhs2 && is_gimple_val (rhs2))
> >> +    (vec_cond @0 { rhs1; } { rhs2; })))))
> >> +#endif
> >
> > This one looks dangerous for potentially-trapping ops.
>
> I would expect the operation not to be folded if it can trap. Is that too
> optimistic?
>
> >> +/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
> >> +(simplify
> >> + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
> >> + (vec_cond (bit_and @0 @3) @1 @2))
> >
> > Does something check automatically that @0 and @3 have compatible types?

@0 should always have a vector boolean type and thus will not be generally
compatible with @3.  But OTOH, when you see (vec_cond (vec_cond ...)),
then @3 must be vector boolean as well...

But in theory with AVX512 the inner vec_cond could have a SImode
condition @0 producing a regular V4SImode vector mask for an outer
AVX512 SSE-style vec-cond and you then would get a mismatch.

So indeed better add a type compatibility check.

> My memory of this dates from before the avx512-related changes, but at
> least at the time, the type of the condition in vec_cond_expr had to have
> the same size and number of elements as the result, and have signed
> integral elements. Now the size constraint has changed, but it still looks
> like for a given number of elements and size (this last one ignored for
> avx512), there is essentially a single type that can appear as condition.
> Is this wrong for x86? For SVE?
>
> I could add some types_match conditions if that seems too unsafe...
>
> --
> Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:39   ` Richard Biener
@ 2020-07-31 11:47     ` Richard Biener
  2020-07-31 12:08       ` Richard Biener
  2020-07-31 12:12       ` Marc Glisse
  0 siblings, 2 replies; 29+ messages in thread
From: Richard Biener @ 2020-07-31 11:47 UTC (permalink / raw)
  To: Marc Glisse; +Cc: GCC Patches

On Fri, Jul 31, 2020 at 1:39 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Jul 31, 2020 at 1:35 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse <marc.glisse@inria.fr> wrote:
> > >
> > > [...]
> >
> > +  (with
> > +   {
> > +     tree rhs1, rhs2 = NULL;
> > +     rhs1 = fold_binary (op, type, @1, @3);
> > +     if (rhs1 && is_gimple_val (rhs1))
> > +       rhs2 = fold_binary (op, type, @2, @3);
> >
> > ICK.  I guess a more match-and-simplify way would be
> >
> >    (with
> >     {
> >       tree rhs1, rhs2;
> >       gimple_match_op op (gimple_match_cond::UNCOND, op,
> >                                       type, @1, @3);
> >       if (op.resimplify (NULL, valueize)
> >           && gimple_simplified_result_is_gimple_val (op))
> >        {
> >          rhs1 = op.ops[0];
> >          ... other operand ...
> >        }
> >
> > now in theory we could invent some new syntax for this, like
> >
> >  (simplify
> >   (op (vec_cond:s @0 @1 @2) @3)
> >   (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))
> >
> > and pick something better instead of :x (:s is taken,
> > would be 'simplified', :c is taken would be 'constexpr', ...).
> >
> > _Maybe_ just
> >
> >  (simplify
> >   (op (vec_cond:s @0 @1 @2) @3)
> >   (vec_cond:x @0 (op @1 @3) (op @2 @3)))
> >
> > which would have the same practical meaning as passing
> > NULL for the seq argument to simplification - do not allow
> > any intermediate stmt to be generated.
>
> Note I specifically do not like those if (it-simplifies) checks
> because we already would code-generate those anyway.  For
>
> (simplify
>   (plus (vec_cond:s @0 @1 @2) @3)
>   (vec_cond @0 (plus @1 @3) (plus @2 @3)))
>
> we get
>
>                     res_op->set_op (VEC_COND_EXPR, type, 3);
>                     res_op->ops[0] = captures[1];
>                     res_op->ops[0] = unshare_expr (res_op->ops[0]);
>                     {
>                       tree _o1[2], _r1;
>                       _o1[0] = captures[2];
>                       _o1[1] = captures[4];
>                       gimple_match_op tem_op (res_op->cond.any_else
> (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>                       tem_op.resimplify (lseq, valueize);
>                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (****)
>                       if (!_r1) return false;
>                       res_op->ops[1] = _r1;
>                     }
>                     {
>                       tree _o1[2], _r1;
>                       _o1[0] = captures[3];
>                       _o1[1] = captures[4];
>                       gimple_match_op tem_op (res_op->cond.any_else
> (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>                       tem_op.resimplify (lseq, valueize);
>                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
>                       if (!_r1) return false;
>                       res_op->ops[2] = _r1;
>                     }
>                     res_op->resimplify (lseq, valueize);
>                     return true;
>
> and the only change required would be to pass NULL to maybe_push_res_to_seq
> here instead of lseq at the (***) marked points.

(simplify
  (plus (vec_cond:s @0 @1 @2) @3)
  (vec_cond:l @0 (plus @1 @3) (plus @2 @3)))

'l' for 'force leaf'.  I'll see if I can quickly come up with a patch.

Richard.



> Richard.
>
> > The other "simple" patterns look good, you can commit
> > them separately if you like.
> >
> > Richard.
> >
> > > 2020-07-30  Marc Glisse  <marc.glisse@inria.fr>
> > >
> > >         PR tree-optimization/95906
> > >         PR target/70314
> > >         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> > >         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> > >
> > >         * gcc.dg/tree-ssa/andnot-2.c: New file.
> > >         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> > >         * gcc.target/i386/pr70314.c: Likewise.
> > >
> > > --
> > > Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:43     ` Richard Biener
@ 2020-07-31 11:57       ` Marc Glisse
  0 siblings, 0 replies; 29+ messages in thread
From: Marc Glisse @ 2020-07-31 11:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, GCC Patches

On Fri, 31 Jul 2020, Richard Biener wrote:

>>>> +/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
>>>> +(simplify
>>>> + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
>>>> + (vec_cond (bit_and @0 @3) @1 @2))
>>>
>>> Does something check automatically that @0 and @3 have compatible types?
>
> @0 should always have a vector boolean type and thus will not be generally
> compatible with @3.  But OTOH then when you see (vec_cond (vec_cond ...
> then @3 must be vector boolean as well...
>
> But in theory with AVX512 the inner vec_cond could have a SImode
> condition @0 producing a regular V4SImode vector mask for an outer
> AVX512 SSE-style vec-cond and you then would get a mismatch.

Ah, I thought the SSE-style vec_cond was impossible in AVX512 mode, at 
least I couldn't generate one in a few tests, but I didn't try very hard.

> So indeed better add a type compatibility check.

Ok, it can't hurt.

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:47     ` Richard Biener
@ 2020-07-31 12:08       ` Richard Biener
  2020-07-31 12:12       ` Marc Glisse
  1 sibling, 0 replies; 29+ messages in thread
From: Richard Biener @ 2020-07-31 12:08 UTC (permalink / raw)
  To: Marc Glisse; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 6377 bytes --]

On Fri, Jul 31, 2020 at 1:47 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, Jul 31, 2020 at 1:39 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 1:35 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse <marc.glisse@inria.fr> wrote:
> > > >
> > > > [...]
> > >
> > > +  (with
> > > +   {
> > > +     tree rhs1, rhs2 = NULL;
> > > +     rhs1 = fold_binary (op, type, @1, @3);
> > > +     if (rhs1 && is_gimple_val (rhs1))
> > > +       rhs2 = fold_binary (op, type, @2, @3);
> > >
> > > ICK.  I guess a more match-and-simplify way would be
> > >
> > >    (with
> > >     {
> > >       tree rhs1, rhs2;
> > >       gimple_match_op op (gimple_match_cond::UNCOND, op,
> > >                                       type, @1, @3);
> > >       if (op.resimplify (NULL, valueize)
> > >           && gimple_simplified_result_is_gimple_val (op))
> > >        {
> > >          rhs1 = op.ops[0];
> > >          ... other operand ...
> > >        }
> > >
> > > now in theory we could invent some new syntax for this, like
> > >
> > >  (simplify
> > >   (op (vec_cond:s @0 @1 @2) @3)
> > >   (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))
> > >
> > > and pick something better instead of :x (:s is taken,
> > > would be 'simplified', :c is taken would be 'constexpr', ...).
> > >
> > > _Maybe_ just
> > >
> > >  (simplify
> > >   (op (vec_cond:s @0 @1 @2) @3)
> > >   (vec_cond:x @0 (op @1 @3) (op @2 @3)))
> > >
> > > which would have the same practical meaning as passing
> > > NULL for the seq argument to simplification - do not allow
> > > any intermediate stmt to be generated.
> >
> > Note I specifically do not like those if (it-simplifies) checks
> > because we already would code-generate those anyway.  For
> >
> > (simplify
> >   (plus (vec_cond:s @0 @1 @2) @3)
> >   (vec_cond @0 (plus @1 @3) (plus @2 @3)))
> >
> > we get
> >
> >                     res_op->set_op (VEC_COND_EXPR, type, 3);
> >                     res_op->ops[0] = captures[1];
> >                     res_op->ops[0] = unshare_expr (res_op->ops[0]);
> >                     {
> >                       tree _o1[2], _r1;
> >                       _o1[0] = captures[2];
> >                       _o1[1] = captures[4];
> >                       gimple_match_op tem_op (res_op->cond.any_else
> > (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
> >                       tem_op.resimplify (lseq, valueize);
> >                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (****)
> >                       if (!_r1) return false;
> >                       res_op->ops[1] = _r1;
> >                     }
> >                     {
> >                       tree _o1[2], _r1;
> >                       _o1[0] = captures[3];
> >                       _o1[1] = captures[4];
> >                       gimple_match_op tem_op (res_op->cond.any_else
> > (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
> >                       tem_op.resimplify (lseq, valueize);
> >                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
> >                       if (!_r1) return false;
> >                       res_op->ops[2] = _r1;
> >                     }
> >                     res_op->resimplify (lseq, valueize);
> >                     return true;
> >
> > and the only change required would be to pass NULL to maybe_push_res_to_seq
> > here instead of lseq at the (***) marked points.
>
> (simplify
>   (plus (vec_cond:s @0 @1 @2) @3)
>   (vec_cond:l @0 (plus @1 @3) (plus @2 @3)))
>
> 'l' for 'force leaf'.  I'll see if I can quickly come up with a patch.

The attached prototype works for

(simplify
  (plus (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (plus:l @1 @3) (plus:l @2 @3)))

but ':...' is already taken for an explicitly specified type so I have
to think about sth better.  As you see I've also moved it to
the actual ops that should simplify.  It doesn't work on the
outermost expression but I guess it doesn't make sense there
(adding support would be possible).

Now I need some unambiguous syntax...  it currently
is id[:type][@cid] so maybe id[!][:type][@cid].  I guess
unambiguous is good enough?
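
With that tentative '!' marker, the earlier example would presumably read
as follows (a sketch of the proposed syntax, not the final committed form):

 (simplify
  (plus (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (plus! @1 @3) (plus! @2 @3)))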

Richard.

> Richard.
>
>
>
> > Richard.
> >
> > > The other "simple" patterns look good, you can commit
> > > them separately if you like.
> > >
> > > Richard.
> > >
> > > > 2020-07-30  Marc Glisse  <marc.glisse@inria.fr>
> > > >
> > > >         PR tree-optimization/95906
> > > >         PR target/70314
> > > >         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> > > >         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> > > >
> > > >         * gcc.dg/tree-ssa/andnot-2.c: New file.
> > > >         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> > > >         * gcc.target/i386/pr70314.c: Likewise.
> > > >
> > > > --
> > > > Marc Glisse

[-- Attachment #2: p --]
[-- Type: application/octet-stream, Size: 2219 bytes --]

diff --git a/gcc/genmatch.c b/gcc/genmatch.c
index 0a8cba62e0c..9a34fe71e78 100644
--- a/gcc/genmatch.c
+++ b/gcc/genmatch.c
@@ -697,12 +697,13 @@ public:
   expr (id_base *operation_, location_t loc, bool is_commutative_ = false)
     : operand (OP_EXPR, loc), operation (operation_),
       ops (vNULL), expr_type (NULL), is_commutative (is_commutative_),
-      is_generic (false), force_single_use (false), opt_grp (0) {}
+      is_generic (false), force_single_use (false), force_leaf (false),
+      opt_grp (0) {}
   expr (expr *e)
     : operand (OP_EXPR, e->location), operation (e->operation),
       ops (vNULL), expr_type (e->expr_type), is_commutative (e->is_commutative),
       is_generic (e->is_generic), force_single_use (e->force_single_use),
-      opt_grp (e->opt_grp) {}
+      force_leaf (e->force_leaf), opt_grp (e->opt_grp) {}
   void append_op (operand *op) { ops.safe_push (op); }
   /* The operator and its operands.  */
   id_base *operation;
@@ -717,6 +718,9 @@ public:
   /* Whether pushing any stmt to the sequence should be conditional
      on this expression having a single-use.  */
   bool force_single_use;
+  /* Whether in the result expression this should be a leaf node
+     with any children simplified down to simple operands.  */
+  bool force_leaf;
   /* If non-zero, the group for optional handling.  */
   unsigned char opt_grp;
   virtual void gen_transform (FILE *f, int, const char *, bool, int,
@@ -2520,7 +2524,8 @@ expr::gen_transform (FILE *f, int indent, const char *dest, bool gimple,
       fprintf (f, ");\n");
       fprintf_indent (f, indent, "tem_op.resimplify (lseq, valueize);\n");
       fprintf_indent (f, indent,
-		      "_r%d = maybe_push_res_to_seq (&tem_op, lseq);\n", depth);
+		      "_r%d = maybe_push_res_to_seq (&tem_op, %s);\n", depth,
+		      !force_leaf ? "lseq" : "NULL");
       fprintf_indent (f, indent,
 		      "if (!_r%d) return false;\n",
 		      depth);
@@ -4250,7 +4255,12 @@ parser::parse_expr ()
 	{
 	  const char *s = get_ident ();
 	  if (!parsing_match_operand)
-	    expr_type = s;
+	    {
+	      if (*s == 'l')
+		e->force_leaf = true;
+	      else
+		expr_type = s;
+	    }
 	  else
 	    {
 	      const char *sp = s;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:47     ` Richard Biener
  2020-07-31 12:08       ` Richard Biener
@ 2020-07-31 12:12       ` Marc Glisse
  1 sibling, 0 replies; 29+ messages in thread
From: Marc Glisse @ 2020-07-31 12:12 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Fri, 31 Jul 2020, Richard Biener wrote:

> On Fri, Jul 31, 2020 at 1:39 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
>>
>> On Fri, Jul 31, 2020 at 1:35 PM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse <marc.glisse@inria.fr> wrote:
>>>>
>>>> When vector comparisons were forced to use vec_cond_expr, we lost a number
>>>> of optimizations (my fault for not adding enough testcases to prevent
>>>> that). This patch tries to unwrap vec_cond_expr a bit so some
>>>> optimizations can still happen.
>>>>
>>>> I wasn't planning to add all those transformations together, but adding
>>>> one caused a regression, whose fix introduced a second regression, etc.
>>>>
>>>> Using a simple fold_binary internally looks like an ok compromise to me.
>>>> It remains cheap enough (not recursive, and vector instructions are not
>>>> that frequent), while still allowing more than const_binop (X|0 or X&X for
>>>> instance). The transformations are quite conservative with :s and folding
>>>> only if everything simplifies, we may want to relax this later. And of
>>>> course we are going to miss things like a?b:c + a?c:b -> b+c.
>>>>
>>>> In terms of number of operations, some transformations turning 2
>>>> VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not
>>>> look like a gain... I expect the bit_not disappears in most cases, and
>>>> VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
>>>>
>>>> I am a bit confused that with avx512 we get types like "vector(4)
>>>> <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
>>>> -1?), but that doesn't matter for this patch.
>>>>
>>>> Regtest+bootstrap on x86_64-pc-linux-gnu
>>>
>>> +  (with
>>> +   {
>>> +     tree rhs1, rhs2 = NULL;
>>> +     rhs1 = fold_binary (op, type, @1, @3);
>>> +     if (rhs1 && is_gimple_val (rhs1))
>>> +       rhs2 = fold_binary (op, type, @2, @3);
>>>
>>> ICK.  I guess a more match-and-simplify way would be

I was focused on avoiding recursion, but indeed that's independent; I 
could have set a trivial valueize function for that.

>>>    (with
>>>     {
>>>       tree rhs1, rhs2;
>>>       gimple_match_op op (gimple_match_cond::UNCOND, op,
>>>                                       type, @1, @3);
>>>       if (op.resimplify (NULL, valueize)

Oh, so you would be OK with something that recurses without limit? I 
thought we were avoiding this because it may allow compile-time 
explosion on pathological examples.

>>>           && gimple_simplified_result_is_gimple_val (op))
>>>        {
>>>          rhs1 = op.ops[0];
>>>          ... other operand ...
>>>        }
>>>
>>> now in theory we could invent some new syntax for this, like
>>>
>>>  (simplify
>>>   (op (vec_cond:s @0 @1 @2) @3)
>>>   (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))
>>>
>>> and pick something better instead of :x (:s is taken,
>>> would be 'simplified', :c is taken would be 'constexpr', ...).
>>>
>>> _Maybe_ just
>>>
>>>  (simplify
>>>   (op (vec_cond:s @0 @1 @2) @3)
>>>   (vec_cond:x @0 (op @1 @3) (op @2 @3)))
>>>
>>> which would have the same practical meaning as passing
>>> NULL for the seq argument to simplification - do not allow
>>> any intermediate stmt to be generated.
>>
>> Note I specifically do not like those if (it-simplifies) checks
>> because we already would code-generate those anyway.  For
>>
>> (simplify
>>   (plus (vec_cond:s @0 @1 @2) @3)
>>   (vec_cond @0 (plus @1 @3) (plus @2 @3)))
>>
>> we get
>>
>>                     res_op->set_op (VEC_COND_EXPR, type, 3);
>>                     res_op->ops[0] = captures[1];
>>                     res_op->ops[0] = unshare_expr (res_op->ops[0]);
>>                     {
>>                       tree _o1[2], _r1;
>>                       _o1[0] = captures[2];
>>                       _o1[1] = captures[4];
>>                       gimple_match_op tem_op (res_op->cond.any_else
>> (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>>                       tem_op.resimplify (lseq, valueize);
>>                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (****)
>>                       if (!_r1) return false;
>>                       res_op->ops[1] = _r1;
>>                     }
>>                     {
>>                       tree _o1[2], _r1;
>>                       _o1[0] = captures[3];
>>                       _o1[1] = captures[4];
>>                       gimple_match_op tem_op (res_op->cond.any_else
>> (), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>>                       tem_op.resimplify (lseq, valueize);
>>                       _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
>>                       if (!_r1) return false;
>>                       res_op->ops[2] = _r1;
>>                     }
>>                     res_op->resimplify (lseq, valueize);
>>                     return true;
>>
>> and the only change required would be to pass NULL to maybe_push_res_to_seq
>> here instead of lseq at the (***) marked points.
>
> (simplify
>  (plus (vec_cond:s @0 @1 @2) @3)
>  (vec_cond:l @0 (plus @1 @3) (plus @2 @3)))
>
> 'l' for 'force leaf'.  I'll see if I can quickly come up with a patch.

Does it have a clear meaning for GENERIC? I guess that's probably not such 
a big problem.

There are a number of transformations that we would like to perform "if 
<something> simplifies", but I don't know if it will always have exactly 
this form of "if this part of the output pattern doesn't simplify, give 
up". In some cases, we might prefer "ok if at least one of the branches 
simplifies" for instance. Or "ok if this generates at most one extra 
statement". Even for this particular transformation, I am not sure this is 
the right condition.

Your suggestion looks good (although I understand better the version with 
a mark on plus instead of vec_cond_expr); I just want to avoid wasting 
your time if it turns out we don't use it that much in the end (I have no 
idea yet).

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 11:38   ` Marc Glisse
  2020-07-31 11:43     ` Richard Biener
@ 2020-07-31 12:50     ` Richard Sandiford
  2020-07-31 12:59       ` Richard Biener
  2020-07-31 13:01       ` Marc Glisse
  1 sibling, 2 replies; 29+ messages in thread
From: Richard Sandiford @ 2020-07-31 12:50 UTC (permalink / raw)
  To: Marc Glisse; +Cc: gcc-patches

Marc Glisse <marc.glisse@inria.fr> writes:
> On Fri, 31 Jul 2020, Richard Sandiford wrote:
>
>> Marc Glisse <marc.glisse@inria.fr> writes:
>>> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
>>> + (simplify
>>> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
>>> +  (with
>>> +   {
>>> +     tree rhs1, rhs2 = NULL;
>>> +     rhs1 = fold_binary (op, type, @1, @3);
>>> +     if (rhs1 && is_gimple_val (rhs1))
>>> +       rhs2 = fold_binary (op, type, @2, @4);
>>> +   }
>>> +   (if (rhs2 && is_gimple_val (rhs2))
>>> +    (vec_cond @0 { rhs1; } { rhs2; })))))
>>> +#endif
>>
>> This one looks dangerous for potentially-trapping ops.
>
> I would expect the operation not to be folded if it can trap. Is that too 
> optimistic?

Not sure TBH.  I was thinking of “trapping” in the sense of raising
an IEEE exception, rather than in the could-throw/must-end-bb sense.
I thought match.pd applied to things like FP addition as normal and
it was up to individual patterns to check the appropriate properties.

Thanks,
Richard


* Re: VEC_COND_EXPR optimizations
  2020-07-31 12:50     ` Richard Sandiford
@ 2020-07-31 12:59       ` Richard Biener
  2020-07-31 13:01       ` Marc Glisse
  1 sibling, 0 replies; 29+ messages in thread
From: Richard Biener @ 2020-07-31 12:59 UTC (permalink / raw)
  To: Marc Glisse, GCC Patches, Richard Sandiford

On Fri, Jul 31, 2020 at 2:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Marc Glisse <marc.glisse@inria.fr> writes:
> > On Fri, 31 Jul 2020, Richard Sandiford wrote:
> >
> >> Marc Glisse <marc.glisse@inria.fr> writes:
> >>> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
> >>> + (simplify
> >>> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
> >>> +  (with
> >>> +   {
> >>> +     tree rhs1, rhs2 = NULL;
> >>> +     rhs1 = fold_binary (op, type, @1, @3);
> >>> +     if (rhs1 && is_gimple_val (rhs1))
> >>> +       rhs2 = fold_binary (op, type, @2, @4);
> >>> +   }
> >>> +   (if (rhs2 && is_gimple_val (rhs2))
> >>> +    (vec_cond @0 { rhs1; } { rhs2; })))))
> >>> +#endif
> >>
> >> This one looks dangerous for potentially-trapping ops.
> >
> > I would expect the operation not to be folded if it can trap. Is that too
> > optimistic?
>
> Not sure TBH.  I was thinking of “trapping” in the sense of raising
> an IEEE exception, rather than in the could-throw/must-end-bb sense.
> I thought match.pd applied to things like FP addition as normal and
> it was up to individual patterns to check the appropriate properties.

I think it can indeed be deferred to the simplification of (op @0 @2)
because that would be invalid if it removes an IEEE exception.
The VEC_COND_EXPR itself cannot trap.

Richard.

> Thanks,
> Richard


* Re: VEC_COND_EXPR optimizations
  2020-07-31 12:50     ` Richard Sandiford
  2020-07-31 12:59       ` Richard Biener
@ 2020-07-31 13:01       ` Marc Glisse
  2020-07-31 13:13         ` Marc Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-07-31 13:01 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Fri, 31 Jul 2020, Richard Sandiford wrote:

> Marc Glisse <marc.glisse@inria.fr> writes:
>> On Fri, 31 Jul 2020, Richard Sandiford wrote:
>>
>>> Marc Glisse <marc.glisse@inria.fr> writes:
>>>> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
>>>> + (simplify
>>>> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
>>>> +  (with
>>>> +   {
>>>> +     tree rhs1, rhs2 = NULL;
>>>> +     rhs1 = fold_binary (op, type, @1, @3);
>>>> +     if (rhs1 && is_gimple_val (rhs1))
>>>> +       rhs2 = fold_binary (op, type, @2, @4);
>>>> +   }
>>>> +   (if (rhs2 && is_gimple_val (rhs2))
>>>> +    (vec_cond @0 { rhs1; } { rhs2; })))))
>>>> +#endif
>>>
>>> This one looks dangerous for potentially-trapping ops.
>>
>> I would expect the operation not to be folded if it can trap. Is that too
>> optimistic?
>
> Not sure TBH.  I was thinking of “trapping” in the sense of raising
> an IEEE exception, rather than in the could-throw/must-end-bb sense.

That's what I understood from your message :-)

> I thought match.pd applied to things like FP addition as normal and
> it was up to individual patterns to check the appropriate properties.

Yes, and in this case I am delegating that to fold_binary, which already 
performs this check.

I tried with this C++ program

typedef double vecf __attribute__((vector_size(16)));
typedef long long veci __attribute__((vector_size(16)));
vecf f(veci c){
   return (c?1.:2.)/(c?3.:7.);
}

the folding happens by default, but not with -frounding-math, which seems 
like exactly what we want.

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations
  2020-07-31 13:01       ` Marc Glisse
@ 2020-07-31 13:13         ` Marc Glisse
  0 siblings, 0 replies; 29+ messages in thread
From: Marc Glisse @ 2020-07-31 13:13 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Fri, 31 Jul 2020, Marc Glisse wrote:

> On Fri, 31 Jul 2020, Richard Sandiford wrote:
>
>> Marc Glisse <marc.glisse@inria.fr> writes:
>>> On Fri, 31 Jul 2020, Richard Sandiford wrote:
>>> 
>>>> Marc Glisse <marc.glisse@inria.fr> writes:
>>>>> +/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
>>>>> + (simplify
>>>>> +  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
>>>>> +  (with
>>>>> +   {
>>>>> +     tree rhs1, rhs2 = NULL;
>>>>> +     rhs1 = fold_binary (op, type, @1, @3);
>>>>> +     if (rhs1 && is_gimple_val (rhs1))
>>>>> +       rhs2 = fold_binary (op, type, @2, @4);
>>>>> +   }
>>>>> +   (if (rhs2 && is_gimple_val (rhs2))
>>>>> +    (vec_cond @0 { rhs1; } { rhs2; })))))
>>>>> +#endif
>>>> 
>>>> This one looks dangerous for potentially-trapping ops.
>>> 
>>> I would expect the operation not to be folded if it can trap. Is that too
>>> optimistic?
>> 
>> Not sure TBH.  I was thinking of “trapping” in the sense of raising
>> an IEEE exception, rather than in the could-throw/must-end-bb sense.
>
> That's what I understood from your message :-)
>
>> I thought match.pd applied to things like FP addition as normal and
>> it was up to individual patterns to check the appropriate properties.
>
> Yes, and in this case I am delegating that to fold_binary, which already 
> performs this check.
>
> I tried with this C++ program
>
> typedef double vecf __attribute__((vector_size(16)));
> typedef long long veci __attribute__((vector_size(16)));
> vecf f(veci c){
>  return (c?1.:2.)/(c?3.:7.);
> }
>
> the folding happens by default, but not with -frounding-math, which seems 
> like exactly what we want.

That was for rounding. -ftrapping-math does not disable folding of

typedef double vecf __attribute__((vector_size(16)));
typedef long long veci __attribute__((vector_size(16)));
vecf f(veci c){
     vecf z={};
   return (z+1)/(z+3);
}

despite a possible inexact flag, so it doesn't disable vec_cond_expr 
folding either.

(not sure we want to fix this unless -fno-trapping-math becomes the 
default)

-- 
Marc Glisse


* VEC_COND_EXPR optimizations v2
  2020-07-30  7:49 VEC_COND_EXPR optimizations Marc Glisse
  2020-07-31 11:18 ` Richard Sandiford
  2020-07-31 11:35 ` Richard Biener
@ 2020-08-05 13:32 ` Marc Glisse
  2020-08-05 14:24   ` Richard Biener
  2 siblings, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-05 13:32 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1641 bytes --]

New version that passed bootstrap+regtest during the night.

When vector comparisons were forced to use vec_cond_expr, we lost a number of 
optimizations (my fault for not adding enough testcases to prevent that). 
This patch tries to unwrap vec_cond_expr a bit so some optimizations can 
still happen.

I wasn't planning to add all those transformations together, but adding one 
caused a regression, whose fix introduced a second regression, etc.

Restricting to constant folding would not be sufficient; we also need at 
least things like X|0 or X&X. The transformations are quite conservative 
with :s, folding only if everything simplifies; we may want to relax 
this later. And of course we are going to miss things like a?b:c + a?c:b 
-> b+c.

In terms of number of operations, some transformations turning 2 
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look 
like a gain... I expect the bit_not disappears in most cases, and 
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4) 
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not 
-1?), but that doesn't matter for this patch.

2020-08-05  Marc Glisse  <marc.glisse@inria.fr>

 	PR tree-optimization/95906
 	PR target/70314
 	* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
 	(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
 	(op (c ? a : b)): Update to match the new transformations.

 	* gcc.dg/tree-ssa/andnot-2.c: New file.
 	* gcc.dg/tree-ssa/pr95906.c: Likewise.
 	* gcc.target/i386/pr70314.c: Likewise.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/x-diff; name=vec7.patch, Size: 4623 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index a052c9e3dbc..f9297fcadbe 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3436,20 +3436,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (integer_zerop (@0))
    @2)))
 
-/* Sink unary operations to constant branches, but only if we do fold it to
-   constants.  */
+#if GIMPLE
+/* Sink unary operations to branches, but only if we do fold both.  */
 (for op (negate bit_not abs absu)
  (simplify
-  (op (vec_cond @0 VECTOR_CST@1 VECTOR_CST@2))
-  (with
-   {
-     tree cst1, cst2;
-     cst1 = const_unop (op, type, @1);
-     if (cst1)
-       cst2 = const_unop (op, type, @2);
-   }
-   (if (cst1 && cst2)
-    (vec_cond @0 { cst1; } { cst2; })))))
+  (op (vec_cond:s @0 @1 @2))
+  (vec_cond @0 (op! @1) (op! @2))))
+
+/* Sink binary operation to branches, but only if we can fold it.  */
+(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
+	 rdiv trunc_div ceil_div floor_div round_div
+	 trunc_mod ceil_mod floor_mod round_mod min max)
+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (vec_cond @0 (op! @1 @3) (op! @2 @4)))
+
+/* (c ? a : b) op d  -->  c ? (a op d) : (b op d) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) @3)
+  (vec_cond @0 (op! @1 @3) (op! @2 @3)))
+ (simplify
+  (op @3 (vec_cond:s @0 @1 @2))
+  (vec_cond @0 (op! @3 @1) (op! @3 @2))))
+#endif
+
+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_and @0 @3) @1 @2)))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_ior @0 @3) @1 @2)))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_ior @0 (bit_not @3)) @2 @1)))
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
+
+/* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
+ (if (types_match (@0, @1))
+  (vec_cond (bit_and @0 @1) @2 @3)))
+(simplify
+ (vec_cond @0 @2 (vec_cond:s @1 @2 @3))
+ (if (types_match (@0, @1))
+  (vec_cond (bit_ior @0 @1) @2 @3)))
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @2)
+ (if (types_match (@0, @1))
+  (vec_cond (bit_ior (bit_not @0) @1) @2 @3)))
+(simplify
+ (vec_cond @0 @3 (vec_cond:s @1 @2 @3))
+ (if (types_match (@0, @1))
+  (vec_cond (bit_and (bit_not @0) @1) @2 @3)))
 
 /* Simplification moved from fold_cond_expr_with_comparison.  It may also
    be extended.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
new file mode 100644
index 00000000000..e0955ce3ffd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x){
+  vec y = x < 10;
+  return y & (y == 0);
+}
+
+/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c
new file mode 100644
index 00000000000..3d820a58e93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+// FIXME: this should further optimize to a MAX_EXPR
+typedef signed char v16i8 __attribute__((vector_size(16)));
+v16i8 f(v16i8 a, v16i8 b)
+{
+    v16i8 cmp = (a > b);
+    return (cmp & a) | (~cmp & b);
+}
+
+/* { dg-final { scan-tree-dump-not "bit_(and|ior)_expr" "forwprop3" } } */
+/* { dg-final { scan-tree-dump-times "vec_cond_expr" 1 "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr70314.c b/gcc/testsuite/gcc.target/i386/pr70314.c
new file mode 100644
index 00000000000..aad8dd9b57e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr70314.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-march=skylake-avx512 -O2" } */
+/* { dg-final { scan-assembler-times "cmp" 2 } } */
+/* { dg-final { scan-assembler-not "and" } } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x, vec y){
+  return (x < 5) & (y < 8);
+}
+
+/* On x86_64, currently
+	vpcmpq	$2, .LC1(%rip), %xmm1, %k1
+	vpcmpq	$2, .LC0(%rip), %xmm0, %k0{%k1}
+	vpmovm2q	%k0, %xmm0
+*/


* Re: VEC_COND_EXPR optimizations v2
  2020-08-05 13:32 ` VEC_COND_EXPR optimizations v2 Marc Glisse
@ 2020-08-05 14:24   ` Richard Biener
  2020-08-06  8:17     ` Christophe Lyon
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-08-05 14:24 UTC (permalink / raw)
  To: Marc Glisse; +Cc: GCC Patches

On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> New version that passed bootstrap+regtest during the night.
>
> When vector comparisons were forced to use vec_cond_expr, we lost a number of
> optimizations (my fault for not adding enough testcases to prevent that).
> This patch tries to unwrap vec_cond_expr a bit so some optimizations can
> still happen.
>
> I wasn't planning to add all those transformations together, but adding one
> caused a regression, whose fix introduced a second regression, etc.
>
> Restricting to constant folding would not be sufficient, we also need at
> least things like X|0 or X&X. The transformations are quite conservative
> with :s and folding only if everything simplifies, we may want to relax
> this later. And of course we are going to miss things like a?b:c + a?c:b
> -> b+c.
>
> In terms of number of operations, some transformations turning 2
> VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look
> like a gain... I expect the bit_not disappears in most cases, and
> VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
>
> I am a bit confused that with avx512 we get types like "vector(4)
> <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
> -1?), but that doesn't matter for this patch.

OK.

Thanks,
Richard.

> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR tree-optimization/95906
>         PR target/70314
>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
>         (op (c ? a : b)): Update to match the new transformations.
>
>         * gcc.dg/tree-ssa/andnot-2.c: New file.
>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
>         * gcc.target/i386/pr70314.c: Likewise.
>
> --
> Marc Glisse


* Re: VEC_COND_EXPR optimizations v2
  2020-08-05 14:24   ` Richard Biener
@ 2020-08-06  8:17     ` Christophe Lyon
  2020-08-06  9:05       ` Marc Glisse
  2020-08-06 10:29       ` Richard Biener
  0 siblings, 2 replies; 29+ messages in thread
From: Christophe Lyon @ 2020-08-06  8:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: Marc Glisse, GCC Patches

Hi,


On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote:
> >
> > New version that passed bootstrap+regtest during the night.
> >
> > When vector comparisons were forced to use vec_cond_expr, we lost a number of
> > optimizations (my fault for not adding enough testcases to prevent that).
> > This patch tries to unwrap vec_cond_expr a bit so some optimizations can
> > still happen.
> >
> > I wasn't planning to add all those transformations together, but adding one
> > caused a regression, whose fix introduced a second regression, etc.
> >
> > Restricting to constant folding would not be sufficient, we also need at
> > least things like X|0 or X&X. The transformations are quite conservative
> > with :s and folding only if everything simplifies, we may want to relax
> > this later. And of course we are going to miss things like a?b:c + a?c:b
> > -> b+c.
> >
> > In terms of number of operations, some transformations turning 2
> > VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look
> > like a gain... I expect the bit_not disappears in most cases, and
> > VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
> >
> > I am a bit confused that with avx512 we get types like "vector(4)
> > <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
> > -1?), but that doesn't matter for this patch.
>
> OK.
>
> Thanks,
> Richard.
>
> > 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
> >
> >         PR tree-optimization/95906
> >         PR target/70314
> >         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> >         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> >         (op (c ? a : b)): Update to match the new transformations.
> >
> >         * gcc.dg/tree-ssa/andnot-2.c: New file.
> >         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> >         * gcc.target/i386/pr70314.c: Likewise.
> >

I think this patch is causing several ICEs on arm-none-linux-gnueabihf
--with-cpu cortex-a9 --with-fpu neon-fp16:
  Executed from: gcc.c-torture/compile/compile.exp
    gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
compiler error)
    gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
  Executed from: gcc.dg/dg.exp
    gcc.dg/pr87746.c (internal compiler error)
  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
    gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
  Executed from: gcc.dg/vect/vect.exp
    gcc.dg/vect/pr59591-1.c (internal compiler error)
    gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
    gcc.dg/vect/pr86927.c (internal compiler error)
    gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
    gcc.dg/vect/slp-cond-5.c (internal compiler error)
    gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
    gcc.dg/vect/vect-23.c (internal compiler error)
    gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
    gcc.dg/vect/vect-24.c (internal compiler error)
    gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
    gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
    gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
compiler error)

Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
-finline-functions
during RTL pass: expand
/gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
compiler error: in do_store_flag, at expr.c:12259
0x8feb26 do_store_flag
        /gcc/expr.c:12259
0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
        /gcc/expr.c:9617
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        /gcc/expr.c:10159
0x91174e expand_expr
        /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
        /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
        /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        /gcc/expr.c:10159
0x91174e expand_expr
        /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
        /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
        /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        /gcc/expr.c:10159
0x91174e expand_expr
        /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
        /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
        /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        /gcc/expr.c:10159
0x91174e expand_expr
        /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
        /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
        /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        /gcc/expr.c:10159
0x91174e expand_expr
        /gcc/expr.h:282

Christophe


> > --
> > Marc Glisse

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-06  8:17     ` Christophe Lyon
@ 2020-08-06  9:05       ` Marc Glisse
  2020-08-06 11:25         ` Christophe Lyon
  2020-08-06 10:29       ` Richard Biener
  1 sibling, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-06  9:05 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Biener, GCC Patches

On Thu, 6 Aug 2020, Christophe Lyon wrote:

>>> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
>>>
>>>         PR tree-optimization/95906
>>>         PR target/70314
>>>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
>>>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
>>>         (op (c ? a : b)): Update to match the new transformations.
>>>
>>>         * gcc.dg/tree-ssa/andnot-2.c: New file.
>>>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
>>>         * gcc.target/i386/pr70314.c: Likewise.
>>>
>
> I think this patch is causing several ICEs on arm-none-linux-gnueabihf
> --with-cpu cortex-a9 --with-fpu neon-fp16:
>  Executed from: gcc.c-torture/compile/compile.exp
>    gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
> compiler error)
>    gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
>  Executed from: gcc.dg/dg.exp
>    gcc.dg/pr87746.c (internal compiler error)
>  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
>    gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)

I tried a cross from x86_64-linux with current master

.../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
make

the build stops at some point with an error, but by then I have xgcc and
cc1 in build/gcc.

I copied 2 of the testcases and compiled

./xgcc pr87746.c -Ofast -S -B.
./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.

without getting any ICE.

Is there a machine on the compile farm where this is easy to reproduce?
Or could you attach the .optimized dump that corresponds to the
backtrace below? It looks like we end up with a comparison with an
unexpected return type.

>  Executed from: gcc.dg/vect/vect.exp
>    gcc.dg/vect/pr59591-1.c (internal compiler error)
>    gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
>    gcc.dg/vect/pr86927.c (internal compiler error)
>    gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
>    gcc.dg/vect/slp-cond-5.c (internal compiler error)
>    gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
>    gcc.dg/vect/vect-23.c (internal compiler error)
>    gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
>    gcc.dg/vect/vect-24.c (internal compiler error)
>    gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
>    gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
>    gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
> compiler error)
>
> Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
> -finline-functions
> during RTL pass: expand
> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
> compiler error: in do_store_flag, at expr.c:12259
> 0x8feb26 do_store_flag
>        /gcc/expr.c:12259
> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>        /gcc/expr.c:9617
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>        /gcc/expr.c:10159
> 0x91174e expand_expr
>        /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>        /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>        /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>        /gcc/expr.c:10159
> 0x91174e expand_expr
>        /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>        /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>        /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>        /gcc/expr.c:10159
> 0x91174e expand_expr
>        /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>        /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>        /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>        /gcc/expr.c:10159
> 0x91174e expand_expr
>        /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>        /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>        /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>        /gcc/expr.c:10159
> 0x91174e expand_expr
>        /gcc/expr.h:282
>
> Christophe

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations v2
  2020-08-06  8:17     ` Christophe Lyon
  2020-08-06  9:05       ` Marc Glisse
@ 2020-08-06 10:29       ` Richard Biener
  2020-08-06 11:11         ` Marc Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-08-06 10:29 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Marc Glisse, GCC Patches

On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon
<christophe.lyon@linaro.org> wrote:
>
> Hi,
>
>
> On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote:
> > >
> > > New version that passed bootstrap+regtest during the night.
> > >
> > > When vector comparisons were forced to use vec_cond_expr, we lost a number of
> > > optimizations (my fault for not adding enough testcases to prevent that).
> > > This patch tries to unwrap vec_cond_expr a bit so some optimizations can
> > > still happen.
> > >
> > > I wasn't planning to add all those transformations together, but adding one
> > > caused a regression, whose fix introduced a second regression, etc.
> > >
> > > Restricting to constant folding would not be sufficient, we also need at
> > > least things like X|0 or X&X. The transformations are quite conservative
> > > with :s and folding only if everything simplifies, we may want to relax
> > > this later. And of course we are going to miss things like a?b:c + a?c:b
> > > -> b+c.
> > >
> > > In terms of number of operations, some transformations turning 2
> > > VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look
> > > like a gain... I expect the bit_not disappears in most cases, and
> > > VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
> > >
> > > I am a bit confused that with avx512 we get types like "vector(4)
> > > <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
> > > -1?), but that doesn't matter for this patch.
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
> > >
> > >         PR tree-optimization/95906
> > >         PR target/70314
> > >         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> > >         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> > >         (op (c ? a : b)): Update to match the new transformations.
> > >
> > >         * gcc.dg/tree-ssa/andnot-2.c: New file.
> > >         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> > >         * gcc.target/i386/pr70314.c: Likewise.
> > >
>
> I think this patch is causing several ICEs on arm-none-linux-gnueabihf
> --with-cpu cortex-a9 --with-fpu neon-fp16:
>   Executed from: gcc.c-torture/compile/compile.exp
>     gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
> compiler error)
>     gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
>   Executed from: gcc.dg/dg.exp
>     gcc.dg/pr87746.c (internal compiler error)
>   Executed from: gcc.dg/tree-ssa/tree-ssa.exp
>     gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
>   Executed from: gcc.dg/vect/vect.exp
>     gcc.dg/vect/pr59591-1.c (internal compiler error)
>     gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
>     gcc.dg/vect/pr86927.c (internal compiler error)
>     gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
>     gcc.dg/vect/slp-cond-5.c (internal compiler error)
>     gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
>     gcc.dg/vect/vect-23.c (internal compiler error)
>     gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
>     gcc.dg/vect/vect-24.c (internal compiler error)
>     gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
>     gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
>     gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
> compiler error)
>
> Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
> -finline-functions
> during RTL pass: expand
> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
> compiler error: in do_store_flag, at expr.c:12259
> 0x8feb26 do_store_flag
>         /gcc/expr.c:12259
> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>         /gcc/expr.c:9617
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         /gcc/expr.c:10159
> 0x91174e expand_expr
>         /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>         /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>         /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         /gcc/expr.c:10159
> 0x91174e expand_expr
>         /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>         /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>         /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         /gcc/expr.c:10159
> 0x91174e expand_expr
>         /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>         /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>         /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         /gcc/expr.c:10159
> 0x91174e expand_expr
>         /gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> rtx_def**, expand_modifier)
>         /gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
>         /gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         /gcc/expr.c:10159
> 0x91174e expand_expr
>         /gcc/expr.h:282

Hmm, I guess we might need to verify that the VEC_COND_EXPRs
can be RTL expanded, at least if the folding triggers after vector
lowering (but needing to lower a previously expandable VEC_COND_EXPR
would be similarly bad).  So we may need to handle VEC_COND_EXPRs
like VEC_PERMs and thus need to check target support.  Ick.

Richard.

> Christophe
>
>
> > > --
> > > Marc Glisse


* Re: VEC_COND_EXPR optimizations v2
  2020-08-06 10:29       ` Richard Biener
@ 2020-08-06 11:11         ` Marc Glisse
  0 siblings, 0 replies; 29+ messages in thread
From: Marc Glisse @ 2020-08-06 11:11 UTC (permalink / raw)
  To: Richard Biener; +Cc: Christophe Lyon, GCC Patches

On Thu, 6 Aug 2020, Richard Biener wrote:

> On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon
> <christophe.lyon@linaro.org> wrote:
>>
>> Hi,
>>
>>
>> On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>>>>
>>>> New version that passed bootstrap+regtest during the night.
>>>>
>>>> When vector comparisons were forced to use vec_cond_expr, we lost a number of
>>>> optimizations (my fault for not adding enough testcases to prevent that).
>>>> This patch tries to unwrap vec_cond_expr a bit so some optimizations can
>>>> still happen.
>>>>
>>>> I wasn't planning to add all those transformations together, but adding one
>>>> caused a regression, whose fix introduced a second regression, etc.
>>>>
>>>> Restricting to constant folding would not be sufficient, we also need at
>>>> least things like X|0 or X&X. The transformations are quite conservative
>>>> with :s and folding only if everything simplifies, we may want to relax
>>>> this later. And of course we are going to miss things like a?b:c + a?c:b
>>>> -> b+c.
>>>>
>>>> In terms of number of operations, some transformations turning 2
>>>> VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look
>>>> like a gain... I expect the bit_not disappears in most cases, and
>>>> VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.
>>>>
>>>> I am a bit confused that with avx512 we get types like "vector(4)
>>>> <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
>>>> -1?), but that doesn't matter for this patch.
>>>
>>> OK.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
>>>>
>>>>         PR tree-optimization/95906
>>>>         PR target/70314
>>>>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
>>>>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
>>>>         (op (c ? a : b)): Update to match the new transformations.
>>>>
>>>>         * gcc.dg/tree-ssa/andnot-2.c: New file.
>>>>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
>>>>         * gcc.target/i386/pr70314.c: Likewise.
>>>>
>>
>> I think this patch is causing several ICEs on arm-none-linux-gnueabihf
>> --with-cpu cortex-a9 --with-fpu neon-fp16:
>>   Executed from: gcc.c-torture/compile/compile.exp
>>     gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
>> -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
>> compiler error)
>>     gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
>>   Executed from: gcc.dg/dg.exp
>>     gcc.dg/pr87746.c (internal compiler error)
>>   Executed from: gcc.dg/tree-ssa/tree-ssa.exp
>>     gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
>>   Executed from: gcc.dg/vect/vect.exp
>>     gcc.dg/vect/pr59591-1.c (internal compiler error)
>>     gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
>>     gcc.dg/vect/pr86927.c (internal compiler error)
>>     gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
>>     gcc.dg/vect/slp-cond-5.c (internal compiler error)
>>     gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
>>     gcc.dg/vect/vect-23.c (internal compiler error)
>>     gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
>>     gcc.dg/vect/vect-24.c (internal compiler error)
>>     gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
>>     gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
>>     gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
>> compiler error)
>>
>> Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
>> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
>> -finline-functions
>> during RTL pass: expand
>> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
>> compiler error: in do_store_flag, at expr.c:12259
>> 0x8feb26 do_store_flag
>>         /gcc/expr.c:12259
>> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>> expand_modifier)
>>         /gcc/expr.c:9617
>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>         /gcc/expr.c:10159
>> 0x91174e expand_expr
>>         /gcc/expr.h:282
>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>> rtx_def**, expand_modifier)
>>         /gcc/expr.c:8065
>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>> expand_modifier)
>>         /gcc/expr.c:9950
>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>         /gcc/expr.c:10159
>> 0x91174e expand_expr
>>         /gcc/expr.h:282
>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>> rtx_def**, expand_modifier)
>>         /gcc/expr.c:8065
>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>> expand_modifier)
>>         /gcc/expr.c:9950
>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>         /gcc/expr.c:10159
>> 0x91174e expand_expr
>>         /gcc/expr.h:282
>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>> rtx_def**, expand_modifier)
>>         /gcc/expr.c:8065
>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>> expand_modifier)
>>         /gcc/expr.c:9950
>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>         /gcc/expr.c:10159
>> 0x91174e expand_expr
>>         /gcc/expr.h:282
>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>> rtx_def**, expand_modifier)
>>         /gcc/expr.c:8065
>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>> expand_modifier)
>>         /gcc/expr.c:9950
>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> expand_modifier, rtx_def**, bool)
>>         /gcc/expr.c:10159
>> 0x91174e expand_expr
>>         /gcc/expr.h:282
>
> Hmm, I guess we might need to verify that the VEC_COND_EXPRs
> can be RTL expanded, at least if the folding triggers after vector
> lowering (but needing to lower a previously expandable VEC_COND_EXPR
> would be similarly bad).  So we may need to handle VEC_COND_EXPRs
> like VEC_PERMs and thus need to check target support.  Ick.

Maybe. I'd like to see the gimple that arm fails to expand, and whether 
that's really a limitation of the hardware or just some simple case 
missing in the target or in the expansion code. Is it that we had 
(a<b)?-1:0, which arm can handle, and because of the transformation we 
now have to expand a plain c=a<b, which arm cannot?

If someone can confirm the breakage, please feel free to revert that 
patch, but also please give some details about how this breaks or provide 
a simple way to reproduce.

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations v2
  2020-08-06  9:05       ` Marc Glisse
@ 2020-08-06 11:25         ` Christophe Lyon
  2020-08-06 11:42           ` Marc Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Christophe Lyon @ 2020-08-06 11:25 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Richard Biener, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 5574 bytes --]

On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>
> >>> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
> >>>
> >>>         PR tree-optimization/95906
> >>>         PR target/70314
> >>>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> >>>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> >>>         (op (c ? a : b)): Update to match the new transformations.
> >>>
> >>>         * gcc.dg/tree-ssa/andnot-2.c: New file.
> >>>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> >>>         * gcc.target/i386/pr70314.c: Likewise.
> >>>
> >
> > I think this patch is causing several ICEs on arm-none-linux-gnueabihf
> > --with-cpu cortex-a9 --with-fpu neon-fp16:
> >  Executed from: gcc.c-torture/compile/compile.exp
> >    gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
> > -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
> > compiler error)
> >    gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
> >  Executed from: gcc.dg/dg.exp
> >    gcc.dg/pr87746.c (internal compiler error)
> >  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
> >    gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
>
> I tried a cross from x86_64-linux with current master
>
> .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
> make
>
> it stops at some point with an error, but I have xgcc and cc1 in
> build/gcc.
>
> I copied 2 of the testcases and compiled
>
> ./xgcc pr87746.c -Ofast -S -B.
> ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.
>
> without getting any ICE.

Sorry for the delay, I had to reproduce the problem manually.
>
> Is there a machine on the compile farm where this is easy to reproduce?
I don't think there is any arm machine in the compile farm.

> Or could you attach the .optimized dump that corresponds to the
> backtrace below? It looks like we end up with a comparison with an
> unexpected return type.
>

I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details,
here is the log.
Is that what you need?

Thanks,

Christophe

> >  Executed from: gcc.dg/vect/vect.exp
> >    gcc.dg/vect/pr59591-1.c (internal compiler error)
> >    gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
> >    gcc.dg/vect/pr86927.c (internal compiler error)
> >    gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
> >    gcc.dg/vect/slp-cond-5.c (internal compiler error)
> >    gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
> >    gcc.dg/vect/vect-23.c (internal compiler error)
> >    gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
> >    gcc.dg/vect/vect-24.c (internal compiler error)
> >    gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
> >    gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
> >    gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
> > compiler error)
> >
> > Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
> > -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
> > -finline-functions
> > during RTL pass: expand
> > /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
> > compiler error: in do_store_flag, at expr.c:12259
> > 0x8feb26 do_store_flag
> >        /gcc/expr.c:12259
> > 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> > expand_modifier)
> >        /gcc/expr.c:9617
> > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >        /gcc/expr.c:10159
> > 0x91174e expand_expr
> >        /gcc/expr.h:282
> > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> > rtx_def**, expand_modifier)
> >        /gcc/expr.c:8065
> > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> > expand_modifier)
> >        /gcc/expr.c:9950
> > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >        /gcc/expr.c:10159
> > 0x91174e expand_expr
> >        /gcc/expr.h:282
> > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> > rtx_def**, expand_modifier)
> >        /gcc/expr.c:8065
> > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> > expand_modifier)
> >        /gcc/expr.c:9950
> > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >        /gcc/expr.c:10159
> > 0x91174e expand_expr
> >        /gcc/expr.h:282
> > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> > rtx_def**, expand_modifier)
> >        /gcc/expr.c:8065
> > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> > expand_modifier)
> >        /gcc/expr.c:9950
> > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >        /gcc/expr.c:10159
> > 0x91174e expand_expr
> >        /gcc/expr.h:282
> > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> > rtx_def**, expand_modifier)
> >        /gcc/expr.c:8065
> > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> > expand_modifier)
> >        /gcc/expr.c:9950
> > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >        /gcc/expr.c:10159
> > 0x91174e expand_expr
> >        /gcc/expr.h:282
> >
> > Christophe
>
> --
> Marc Glisse

[-- Attachment #2: pr87746.c.162t.ifcvt --]
[-- Type: application/octet-stream, Size: 17421 bytes --]


;; Function move_replacements (move_replacements, funcdef_no=0, decl_uid=5678, cgraph_uid=1, symbol_order=1)

Creating dr for replacements[i_16].subreg_loc
analyze_innermost: success.
	base_address: &replacements
	offset from base address: 0
	constant offset from base address: 4
	step: 12
	base alignment: 4
	base misalignment: 0
	offset alignment: 64
	step alignment: 4
	base_object: replacements
	Access function 0: 32
	Access function 1: {0, +, 1}_1
Creating dr for replacements[i_16].where
analyze_innermost: success.
	base_address: &replacements
	offset from base address: 0
	constant offset from base address: 0
	step: 12
	base alignment: 4
	base misalignment: 0
	offset alignment: 64
	step alignment: 4
	base_object: replacements
	Access function 0: 0
	Access function 1: {0, +, 1}_1
Creating dr for replacements[i_16].where
analyze_innermost: success.
	base_address: &replacements
	offset from base address: 0
	constant offset from base address: 0
	step: 12
	base alignment: 4
	base misalignment: 0
	offset alignment: 64
	step alignment: 4
	base_object: replacements
	Access function 0: 0
	Access function 1: {0, +, 1}_1
Creating dr for replacements[i_16].subreg_loc
analyze_innermost: success.
	base_address: &replacements
	offset from base address: 0
	constant offset from base address: 4
	step: 12
	base alignment: 4
	base misalignment: 0
	offset alignment: 64
	step alignment: 4
	base_object: replacements
	Access function 0: 32
	Access function 1: {0, +, 1}_1
Creating dr for replacements[i_16].subreg_loc
analyze_innermost: success.
	base_address: &replacements
	offset from base address: 0
	constant offset from base address: 4
	step: 12
	base alignment: 4
	base misalignment: 0
	offset alignment: 64
	step alignment: 4
	base_object: replacements
	Access function 0: 32
	Access function 1: {0, +, 1}_1
----------[3]-------------
----------[4]-------------
----------[5]-------------
----------[6]-------------
----------[12]-------------
----------[7]-------------
----------[10]-------------
-------------------------
replacements[i_16].subreg_loc = y_9(D);
-------------------------
_2 = replacements[i_16].where;
-------------------------
replacements[i_16].where = y_9(D);
-------------------------
replacements[i_16].subreg_loc = 0B;
-------------------------
i_16 = PHI <0(9), i_13(10)>
-------------------------
.MEM_17 = PHI <.MEM_6(D)(9), .MEM_4(10)>
-------------------------
.MEM_4 = PHI <.MEM_12(4), .MEM_11(6), .MEM_17(12)>
Applying if-conversion

Updating SSA:
Registering new PHI nodes in block #2
Registering new PHI nodes in block #9
Registering new PHI nodes in block #20
Registering new PHI nodes in block #21
Registering new PHI nodes in block #3
Updating SSA information for statement _1 = replacements[i_16].subreg_loc;
Updating SSA information for statement if (_1 == x_8(D))
Registering new PHI nodes in block #5
Updating SSA information for statement _2 = replacements[i_16].where;
Updating SSA information for statement if (_2 == x_8(D))
Registering new PHI nodes in block #12
Registering new PHI nodes in block #6
Updating SSA information for statement replacements[i_16].where = y_9(D);
Updating SSA information for statement replacements[i_16].subreg_loc = 0B;
Registering new PHI nodes in block #4
Updating SSA information for statement replacements[i_16].subreg_loc = y_9(D);
Registering new PHI nodes in block #7
Updating SSA information for statement i_13 = i_16 + 1;
Updating SSA information for statement if (n_replacements_7(D) > i_13)
Registering new PHI nodes in block #10
Registering new PHI nodes in block #22
Registering new PHI nodes in block #13
Updating SSA information for statement _27 = replacements[i_16].subreg_loc;
Updating SSA information for statement if (_1 == x_8(D))
Registering new PHI nodes in block #14
Updating SSA information for statement _28 = replacements[i_16].where;
Updating SSA information for statement if (_2 == x_8(D))
Registering new PHI nodes in block #15
Registering new PHI nodes in block #16
Updating SSA information for statement replacements[i_16].where = y_9(D);
Updating SSA information for statement replacements[i_16].subreg_loc = 0B;
Registering new PHI nodes in block #17
Updating SSA information for statement replacements[i_16].subreg_loc = y_9(D);
Registering new PHI nodes in block #18
Updating SSA information for statement i_33 = i_16 + 1;
Updating SSA information for statement if (n_replacements_7(D) > i_13)
Registering new PHI nodes in block #19
Registering new PHI nodes in block #8

SSA replacement table
N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j

i_25 -> { i_16 }
.MEM_26 -> { .MEM_17 }
_27 -> { _1 }
_28 -> { _2 }
.MEM_29 -> { .MEM_10 }
.MEM_30 -> { .MEM_11 }
.MEM_31 -> { .MEM_12 }
.MEM_32 -> { .MEM_4 }
i_33 -> { i_13 }
Incremental SSA update started at block: 2
Number of blocks in CFG: 23
Number of blocks to update: 17 ( 74%)
Affected blocks: 3 4 5 6 7 8 10 12 13 14 15 16 17 18 19 21 22


Removing basic block 4
;; basic block 4, loop depth 1
;;  pred:      
;;  succ:      


Removing basic block 5
;; basic block 5, loop depth 1
;;  pred:      
;;  succ:      


Removing basic block 6
;; basic block 6, loop depth 1
;;  pred:      
;;  succ:      


Removing basic block 12
;; basic block 12, loop depth 1
;;  pred:      
;;  succ:      


Merging blocks 3 and 7
Region does not contain all edges into the entry block, skipping its PHIs.
Processing block 0: BB3
Cannot trust state of predecessor edge 10 -> 3, marking executable
Value numbering stmt = _1 = replacements[i_16].subreg_loc;
Setting value number of _1 to _1 (changed)
Making available beyond BB3 _1 for value _1
Value numbering stmt = _ifc__34 = replacements[i_16].subreg_loc;
Setting value number of _ifc__34 to _1 (changed)
Replaced replacements[i_16].subreg_loc with _1 in all uses of _ifc__34 = replacements[i_16].subreg_loc;
Value numbering stmt = _ifc__35 = y_9(D);
Setting value number of _ifc__35 to y_9(D) (changed)
Value numbering stmt = _ifc__36 = _1 == x_8(D) ? _ifc__35 : _ifc__34;
Setting value number of _ifc__36 to _ifc__36 (changed)
Making available beyond BB3 _ifc__36 for value _ifc__36
Value numbering stmt = replacements[i_16].subreg_loc = _ifc__36;
No store match
Value numbering store replacements[i_16].subreg_loc to _ifc__36
Setting value number of .MEM_12 to .MEM_12 (changed)
Value numbering stmt = _2 = replacements[i_16].where;
Setting value number of _2 to _2 (changed)
Making available beyond BB3 _2 for value _2
Value numbering stmt = _18 = _1 == x_8(D);
Setting value number of _18 to _18 (changed)
Making available beyond BB3 _18 for value _18
Value numbering stmt = _5 = ~_18;
Setting value number of _5 to _5 (changed)
Making available beyond BB3 _5 for value _5
Value numbering stmt = _14 = _2 == x_8(D);
Setting value number of _14 to _14 (changed)
Making available beyond BB3 _14 for value _14
Value numbering stmt = _15 = _5 & _14;
Setting value number of _15 to _15 (changed)
Making available beyond BB3 _15 for value _15
Value numbering stmt = _ifc__37 = replacements[i_16].where;
Setting value number of _ifc__37 to _2 (changed)
Replaced replacements[i_16].where with _2 in all uses of _ifc__37 = replacements[i_16].where;
Value numbering stmt = _ifc__38 = y_9(D);
Setting value number of _ifc__38 to y_9(D) (changed)
Value numbering stmt = _ifc__39 = _15 ? _ifc__38 : _ifc__37;
Setting value number of _ifc__39 to _ifc__39 (changed)
Making available beyond BB3 _ifc__39 for value _ifc__39
Value numbering stmt = replacements[i_16].where = _ifc__39;
No store match
Value numbering store replacements[i_16].where to _ifc__39
Setting value number of .MEM_10 to .MEM_10 (changed)
Value numbering stmt = _ifc__40 = replacements[i_16].subreg_loc;
Setting value number of _ifc__40 to _ifc__36 (changed)
Replaced replacements[i_16].subreg_loc with _ifc__36 in all uses of _ifc__40 = replacements[i_16].subreg_loc;
Value numbering stmt = _ifc__41 = 0B;
RHS 0B simplified to 0B
Setting value number of _ifc__41 to 0B (changed)
Value numbering stmt = _ifc__42 = _15 ? _ifc__41 : _ifc__40;
Setting value number of _ifc__42 to _ifc__42 (changed)
Making available beyond BB3 _ifc__42 for value _ifc__42
Value numbering stmt = replacements[i_16].subreg_loc = _ifc__42;
No store match
Value numbering store replacements[i_16].subreg_loc to _ifc__42
Setting value number of .MEM_11 to .MEM_11 (changed)
Value numbering stmt = _19 = _1 == x_8(D);
Setting value number of _19 to _18 (changed)
Replaced _1 == x_8(D) with _18 in all uses of _19 = _1 == x_8(D);
Value numbering stmt = _20 = ~_19;
Setting value number of _20 to _5 (changed)
Replaced ~_19 with _5 in all uses of _20 = ~_19;
Value numbering stmt = _21 = _2 == x_8(D);
Setting value number of _21 to _14 (changed)
Replaced _2 == x_8(D) with _14 in all uses of _21 = _2 == x_8(D);
Value numbering stmt = _22 = ~_21;
Setting value number of _22 to _22 (changed)
Making available beyond BB3 _22 for value _22
Value numbering stmt = _23 = _20 & _22;
Setting value number of _23 to _23 (changed)
Making available beyond BB3 _23 for value _23
Value numbering stmt = i_13 = i_16 + 1;
Setting value number of i_13 to i_13 (changed)
Making available beyond BB3 i_13 for value i_13
Value numbering stmt = if (n_replacements_7(D) > i_13)
RPO iteration over 1 blocks visited 1 blocks in total discovering 1 executable blocks iterating 1.0 times, a block was visited max. 1 times
RPO tracked 28 values available at 12 locations and 28 lattice elements
Removing dead stmt _21 = _2 == x_8(D);
Removing dead stmt _20 = ~_19;
Removing dead stmt _19 = _1 == x_8(D);
Removing dead stmt _ifc__41 = 0B;
Removing dead stmt _ifc__40 = replacements[i_16].subreg_loc;
Removing dead stmt _ifc__38 = y_9(D);
Removing dead stmt _ifc__37 = replacements[i_16].where;
Removing dead stmt _ifc__35 = y_9(D);
Removing dead stmt _ifc__34 = replacements[i_16].subreg_loc;
Delete dead stmt in bb#3
_23 = _5 & _22;
Delete dead stmt in bb#3
_22 = ~_14;
  Deleted dead store: replacements[i_16].subreg_loc = _ifc__36;

Removing basic block 9
;; basic block 9, loop depth 0
;;  pred:      
;;  succ:       20


Removing basic block 15
;; basic block 15, loop depth 1
;;  pred:      
;;  succ:       18


fix_loop_structure: fixing up loops for function
move_replacements (struct rtx_def * * x, struct rtx_def * * y, int n_replacements)
{
  int i;
  struct rtx_def * * _1;
  struct rtx_def * * _2;
  _Bool _5;
  _Bool _14;
  _Bool _15;
  _Bool _18;
  _Bool _24;
  struct rtx_def * * _27;
  struct rtx_def * * _28;
  struct rtx_def * * _ifc__36;
  struct rtx_def * * _ifc__39;
  struct rtx_def * * _ifc__42;

;;   basic block 2, loop depth 0, count 118111598 (estimated locally), maybe hot
;;    prev block 0, next block 20, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY [always]  count:118111598 (estimated locally) (FALLTHRU,EXECUTABLE)
  if (n_replacements_7(D) > 0)
    goto <bb 20>; [89.00%]
  else
    goto <bb 8>; [11.00%]
;;    succ:       20 [89.0% (guessed)]  count:105119322 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                8 [11.0% (guessed)]  count:12992276 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 20, loop depth 0, count 105119322 (estimated locally), maybe hot
;;   Invalid sum of outgoing probabilities 200.0%
;;    prev block 2, next block 21, flags: (NEW)
;;    pred:       2 [89.0% (guessed)]  count:105119322 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  _24 = .LOOP_VECTORIZED (1, 2);
  if (_24 != 0)
    goto <bb 21>; [100.00%]
  else
    goto <bb 22>; [100.00%]
;;    succ:       22 [always]  count:105119322 (estimated locally) (FALSE_VALUE)
;;                21 [always]  count:105119322 (estimated locally) (TRUE_VALUE)

;;   basic block 21, loop depth 0, count 105119322 (estimated locally), maybe hot
;;    prev block 20, next block 3, flags: (NEW)
;;    pred:       20 [always]  count:105119322 (estimated locally) (TRUE_VALUE)
;;    succ:       3 [always]  count:105119322 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 3, loop depth 1, count 955630226 (estimated locally), maybe hot
;;    prev block 21, next block 10, flags: (NEW, REACHABLE, VISITED)
;;    pred:       10 [always]  count:850510902 (estimated locally) (FALLTHRU,EXECUTABLE)
;;                21 [always]  count:105119322 (estimated locally) (FALLTHRU,EXECUTABLE)
  # i_16 = PHI <i_13(10), 0(21)>
  _1 = replacements[i_16].subreg_loc;
  _ifc__36 = _1 != x_8(D) ? _1 : y_9(D);
  _2 = replacements[i_16].where;
  _18 = _1 == x_8(D);
  _5 = ~_18;
  _14 = _2 == x_8(D);
  _15 = _5 & _14;
  _ifc__39 = _15 ? y_9(D) : _2;
  replacements[i_16].where = _ifc__39;
  _ifc__42 = _15 ? 0B : _ifc__36;
  replacements[i_16].subreg_loc = _ifc__42;
  i_13 = i_16 + 1;
  if (n_replacements_7(D) > i_13)
    goto <bb 10>; [89.00%]
  else
    goto <bb 8>; [11.00%]
;;    succ:       10 [89.0% (guessed)]  count:850510902 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                8 [11.0% (guessed)]  count:105119324 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 10, loop depth 1, count 850510902 (estimated locally), maybe hot
;;    prev block 3, next block 22, flags: (NEW, VISITED)
;;    pred:       3 [89.0% (guessed)]  count:850510902 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  goto <bb 3>; [100.00%]
;;    succ:       3 [always]  count:850510902 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 22, loop depth 0, count 105119322 (estimated locally), maybe hot
;;    prev block 10, next block 13, flags: (NEW)
;;    pred:       20 [always]  count:105119322 (estimated locally) (FALSE_VALUE)
;;    succ:       13 [always]  count:105119322 (estimated locally) (FALLTHRU)

;;   basic block 13, loop depth 1, count 955630225 (estimated locally), maybe hot
;;    prev block 22, next block 17, flags: (NEW, REACHABLE, VISITED)
;;    pred:       19 [always]  count:850510902 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                22 [always]  count:105119322 (estimated locally) (FALLTHRU)
  # i_25 = PHI <i_33(19), 0(22)>
  _27 = replacements[i_25].subreg_loc;
  if (_27 == x_8(D))
    goto <bb 17>; [30.00%]
  else
    goto <bb 14>; [70.00%]
;;    succ:       17 [30.0% (guessed)]  count:286689065 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                14 [70.0% (guessed)]  count:668941160 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 17, loop depth 1, count 286689065 (estimated locally), maybe hot
;;    prev block 13, next block 14, flags: (NEW, REACHABLE, VISITED)
;;    pred:       13 [30.0% (guessed)]  count:286689065 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  replacements[i_25].subreg_loc = y_9(D);
  goto <bb 18>; [100.00%]
;;    succ:       18 [always]  count:286689065 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 14, loop depth 1, count 668941161 (estimated locally), maybe hot
;;    prev block 17, next block 16, flags: (NEW, REACHABLE, VISITED)
;;    pred:       13 [70.0% (guessed)]  count:668941160 (estimated locally) (FALSE_VALUE,EXECUTABLE)
  _28 = replacements[i_25].where;
  if (_28 == x_8(D))
    goto <bb 16>; [30.00%]
  else
    goto <bb 18>; [70.00%]
;;    succ:       16 [30.0% (guessed)]  count:200682346 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                18 [70.0% (guessed)]  count:468258815 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 16, loop depth 1, count 200682346 (estimated locally), maybe hot
;;    prev block 14, next block 18, flags: (NEW, REACHABLE, VISITED)
;;    pred:       14 [30.0% (guessed)]  count:200682346 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  replacements[i_25].where = y_9(D);
  replacements[i_25].subreg_loc = 0B;
;;    succ:       18 [always]  count:200682346 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 18, loop depth 1, count 955630226 (estimated locally), maybe hot
;;    prev block 16, next block 19, flags: (NEW, REACHABLE, VISITED)
;;    pred:       14 [70.0% (guessed)]  count:468258815 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                16 [always]  count:200682346 (estimated locally) (FALLTHRU,EXECUTABLE)
;;                17 [always]  count:286689065 (estimated locally) (FALLTHRU,EXECUTABLE)
  i_33 = i_25 + 1;
  if (n_replacements_7(D) > i_33)
    goto <bb 19>; [89.00%]
  else
    goto <bb 8>; [11.00%]
;;    succ:       19 [89.0% (guessed)]  count:850510902 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                8 [11.0% (guessed)]  count:105119324 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 19, loop depth 1, count 850510902 (estimated locally), maybe hot
;;    prev block 18, next block 8, flags: (NEW, VISITED)
;;    pred:       18 [89.0% (guessed)]  count:850510902 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  goto <bb 13>; [100.00%]
;;    succ:       13 [always]  count:850510902 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)

;;   basic block 8, loop depth 0, count 118111600 (estimated locally), maybe hot
;;   Invalid sum of incoming counts 223230924 (estimated locally), should be 118111600 (estimated locally)
;;    prev block 19, next block 1, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 [11.0% (guessed)]  count:12992276 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                3 [11.0% (guessed)]  count:105119324 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                18 [11.0% (guessed)]  count:105119324 (estimated locally) (FALSE_VALUE,EXECUTABLE)
  return;
;;    succ:       EXIT [always]  count:118111600 (estimated locally)

}



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-06 11:25         ` Christophe Lyon
@ 2020-08-06 11:42           ` Marc Glisse
  2020-08-06 12:00             ` Christophe Lyon
  0 siblings, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-06 11:42 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Biener, GCC Patches

On Thu, 6 Aug 2020, Christophe Lyon wrote:

> On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote:
>>
>> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>>
>>>>> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
>>>>>
>>>>>         PR tree-optimization/95906
>>>>>         PR target/70314
>>>>>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
>>>>>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
>>>>>         (op (c ? a : b)): Update to match the new transformations.
>>>>>
>>>>>         * gcc.dg/tree-ssa/andnot-2.c: New file.
>>>>>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
>>>>>         * gcc.target/i386/pr70314.c: Likewise.
>>>>>
>>>
>>> I think this patch is causing several ICEs on arm-none-linux-gnueabihf
>>> --with-cpu cortex-a9 --with-fpu neon-fp16:
>>>  Executed from: gcc.c-torture/compile/compile.exp
>>>    gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
>>> -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
>>> compiler error)
>>>    gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
>>>  Executed from: gcc.dg/dg.exp
>>>    gcc.dg/pr87746.c (internal compiler error)
>>>  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
>>>    gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
>>
>> I tried a cross from x86_64-linux with current master
>>
>> .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
>> make
>>
>> it stops at some point with an error, but I have xgcc and cc1 in
>> build/gcc.
>>
>> I copied 2 of the testcases and compiled
>>
>> ./xgcc pr87746.c -Ofast -S -B.
>> ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.
>>
>> without getting any ICE.
>
> Sorry for the delay, I had to reproduce the problem manually.
>>
>> Is there a machine on the compile farm where this is easy to reproduce?
> I don't think there is any arm machine in the compile farm.
>
>> Or could you attach the .optimized dump that corresponds to the
>> backtrace below? It looks like we end up with a comparison with an
>> unexpected return type.
>>
>
> I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details,
> here is the log.
> Is that what you need?

Thanks.
The one from -fdump-tree-optimized would be closer to the ICE.
Though it would also be convenient to know which stmt is being expanded 
when we ICE, etc.

Was I on the right track configuring with 
--target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 
--with-fpu=neon-fp16
then compiling without any special option?

> Thanks,
>
> Christophe
>
>>>  Executed from: gcc.dg/vect/vect.exp
>>>    gcc.dg/vect/pr59591-1.c (internal compiler error)
>>>    gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
>>>    gcc.dg/vect/pr86927.c (internal compiler error)
>>>    gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
>>>    gcc.dg/vect/slp-cond-5.c (internal compiler error)
>>>    gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
>>>    gcc.dg/vect/vect-23.c (internal compiler error)
>>>    gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
>>>    gcc.dg/vect/vect-24.c (internal compiler error)
>>>    gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
>>>    gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
>>>    gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
>>> compiler error)
>>>
>>> Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
>>> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
>>> -finline-functions
>>> during RTL pass: expand
>>> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
>>> compiler error: in do_store_flag, at expr.c:12259
>>> 0x8feb26 do_store_flag
>>>        /gcc/expr.c:12259
>>> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>>> expand_modifier)
>>>        /gcc/expr.c:9617
>>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>>> expand_modifier, rtx_def**, bool)
>>>        /gcc/expr.c:10159
>>> 0x91174e expand_expr
>>>        /gcc/expr.h:282
>>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>>> rtx_def**, expand_modifier)
>>>        /gcc/expr.c:8065
>>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>>> expand_modifier)
>>>        /gcc/expr.c:9950
>>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>>> expand_modifier, rtx_def**, bool)
>>>        /gcc/expr.c:10159
>>> 0x91174e expand_expr
>>>        /gcc/expr.h:282
>>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>>> rtx_def**, expand_modifier)
>>>        /gcc/expr.c:8065
>>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>>> expand_modifier)
>>>        /gcc/expr.c:9950
>>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>>> expand_modifier, rtx_def**, bool)
>>>        /gcc/expr.c:10159
>>> 0x91174e expand_expr
>>>        /gcc/expr.h:282
>>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>>> rtx_def**, expand_modifier)
>>>        /gcc/expr.c:8065
>>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>>> expand_modifier)
>>>        /gcc/expr.c:9950
>>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>>> expand_modifier, rtx_def**, bool)
>>>        /gcc/expr.c:10159
>>> 0x91174e expand_expr
>>>        /gcc/expr.h:282
>>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
>>> rtx_def**, expand_modifier)
>>>        /gcc/expr.c:8065
>>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
>>> expand_modifier)
>>>        /gcc/expr.c:9950
>>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>>> expand_modifier, rtx_def**, bool)
>>>        /gcc/expr.c:10159
>>> 0x91174e expand_expr
>>>        /gcc/expr.h:282
>>>
>>> Christophe
>>
>> --
>> Marc Glisse
>

-- 
Marc Glisse


* Re: VEC_COND_EXPR optimizations v2
  2020-08-06 11:42           ` Marc Glisse
@ 2020-08-06 12:00             ` Christophe Lyon
  2020-08-06 18:07               ` Marc Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Christophe Lyon @ 2020-08-06 12:00 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Richard Biener, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 6686 bytes --]

On Thu, 6 Aug 2020 at 13:42, Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>
> > On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote:
> >>
> >> On Thu, 6 Aug 2020, Christophe Lyon wrote:
> >>
> >>>>> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
> >>>>>
> >>>>>         PR tree-optimization/95906
> >>>>>         PR target/70314
> >>>>>         * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
> >>>>>         (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
> >>>>>         (op (c ? a : b)): Update to match the new transformations.
> >>>>>
> >>>>>         * gcc.dg/tree-ssa/andnot-2.c: New file.
> >>>>>         * gcc.dg/tree-ssa/pr95906.c: Likewise.
> >>>>>         * gcc.target/i386/pr70314.c: Likewise.
> >>>>>
> >>>
> >>> I think this patch is causing several ICEs on arm-none-linux-gnueabihf
> >>> --with-cpu cortex-a9 --with-fpu neon-fp16:
> >>>  Executed from: gcc.c-torture/compile/compile.exp
> >>>    gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
> >>> -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
> >>> compiler error)
> >>>    gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
> >>>  Executed from: gcc.dg/dg.exp
> >>>    gcc.dg/pr87746.c (internal compiler error)
> >>>  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
> >>>    gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
> >>
> >> I tried a cross from x86_64-linux with current master
> >>
> >> .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
> >> make
> >>
> >> it stops at some point with an error, but I have xgcc and cc1 in
> >> build/gcc.
> >>
> >> I copied 2 of the testcases and compiled
> >>
> >> ./xgcc pr87746.c -Ofast -S -B.
> >> ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.
> >>
> >> without getting any ICE.
> >
> > Sorry for the delay, I had to reproduce the problem manually.
> >>
> >> Is there a machine on the compile farm where this is easy to reproduce?
> > I don't think there is any arm machine in the compile farm.
> >
> >> Or could you attach the .optimized dump that corresponds to the
> >> backtrace below? It looks like we end up with a comparison with an
> >> unexpected return type.
> >>
> >
> > I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details,
> > here is the log.
> > Is that what you need?
>
> Thanks.
> The one from -fdump-tree-optimized would be closer to the ICE.
Here it is.

> Though it would also be convenient to know which stmt is being expanded
> when we ICE, etc.
I think it's when expanding
_96 = _86 | _95;
(that's the value of "stmt" in expand_gimple_stmt_1 when we enter do_store_flag)

> Was I on the right track configuring with
> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
> --with-fpu=neon-fp16
> then compiling without any special option?
>
Maybe you also need --with-float=hard; I don't remember if it's
implied by the 'hf' target suffix.
(I saw similar problems with arm-none-linux-gnueabi anyway.)
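Putting the pieces from this thread together, the full reproduction recipe would look roughly like this (a sketch assembled from the commands quoted above; --with-float=hard is the suggested extra option and may be implied by the 'hf' target suffix):

```shell
# Cross build from x86_64-linux, options collected from this thread.
.../configure --target=arm-none-linux-gnueabihf \
    --enable-languages=c,c++ --with-system-zlib --disable-nls \
    --with-cpu=cortex-a9 --with-fpu=neon-fp16 --with-float=hard
make

# Then, from build/gcc, compile the failing testcases:
./xgcc pr87746.c -Ofast -S -B.
./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.
```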

> > Thanks,
> >
> > Christophe
> >
> >>>  Executed from: gcc.dg/vect/vect.exp
> >>>    gcc.dg/vect/pr59591-1.c (internal compiler error)
> >>>    gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
> >>>    gcc.dg/vect/pr86927.c (internal compiler error)
> >>>    gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
> >>>    gcc.dg/vect/slp-cond-5.c (internal compiler error)
> >>>    gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
> >>>    gcc.dg/vect/vect-23.c (internal compiler error)
> >>>    gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
> >>>    gcc.dg/vect/vect-24.c (internal compiler error)
> >>>    gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
> >>>    gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
> >>>    gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
> >>> compiler error)
> >>>
> >>> Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
> >>> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
> >>> -finline-functions
> >>> during RTL pass: expand
> >>> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
> >>> compiler error: in do_store_flag, at expr.c:12259
> >>> 0x8feb26 do_store_flag
> >>>        /gcc/expr.c:12259
> >>> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> >>> expand_modifier)
> >>>        /gcc/expr.c:9617
> >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >>> expand_modifier, rtx_def**, bool)
> >>>        /gcc/expr.c:10159
> >>> 0x91174e expand_expr
> >>>        /gcc/expr.h:282
> >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> >>> rtx_def**, expand_modifier)
> >>>        /gcc/expr.c:8065
> >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> >>> expand_modifier)
> >>>        /gcc/expr.c:9950
> >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >>> expand_modifier, rtx_def**, bool)
> >>>        /gcc/expr.c:10159
> >>> 0x91174e expand_expr
> >>>        /gcc/expr.h:282
> >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> >>> rtx_def**, expand_modifier)
> >>>        /gcc/expr.c:8065
> >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> >>> expand_modifier)
> >>>        /gcc/expr.c:9950
> >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >>> expand_modifier, rtx_def**, bool)
> >>>        /gcc/expr.c:10159
> >>> 0x91174e expand_expr
> >>>        /gcc/expr.h:282
> >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> >>> rtx_def**, expand_modifier)
> >>>        /gcc/expr.c:8065
> >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> >>> expand_modifier)
> >>>        /gcc/expr.c:9950
> >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >>> expand_modifier, rtx_def**, bool)
> >>>        /gcc/expr.c:10159
> >>> 0x91174e expand_expr
> >>>        /gcc/expr.h:282
> >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
> >>> rtx_def**, expand_modifier)
> >>>        /gcc/expr.c:8065
> >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> >>> expand_modifier)
> >>>        /gcc/expr.c:9950
> >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >>> expand_modifier, rtx_def**, bool)
> >>>        /gcc/expr.c:10159
> >>> 0x91174e expand_expr
> >>>        /gcc/expr.h:282
> >>>
> >>> Christophe
> >>
> >> --
> >> Marc Glisse
> >
>
> --
> Marc Glisse

[-- Attachment #2: pr87746.c.237t.optimized --]
[-- Type: application/octet-stream, Size: 5177 bytes --]


;; Function move_replacements (move_replacements, funcdef_no=0, decl_uid=5678, cgraph_uid=1, symbol_order=1)

Removing basic block 15
Removing basic block 16
Removing basic block 17
Removing basic block 18
Removing basic block 19
Removing basic block 20
Removing basic block 21
Removing basic block 22
move_replacements (struct rtx_def * * x, struct rtx_def * * y, int n_replacements)
{
  unsigned int ivtmp.34;
  unsigned int ivtmp.27;
  vector(4) unsigned int vect_patt_19.23;
  vector(4) unsigned int vect_patt_35.19;
  vector(4) unsigned int vect__ifc__36.15;
  vector(4) unsigned int vect__1.13;
  vector(4) unsigned int vect__1.12;
  vector(4) unsigned int vect_array.11[3];
  unsigned int[12] * {ref-all} vectp_replacements.9;
  int tmp.8;
  unsigned int niters_vector_mult_vf.7;
  unsigned int bnd.6;
  unsigned int niters.5;
  int i;
  sizetype _1;
  struct replacement[150] * _2;
  void * _14;
  unsigned int _26;
  unsigned int _30;
  sizetype _36;
  struct rtx_def * * _55;
  struct rtx_def * * _58;
  unsigned int _64;
  unsigned int _76;
  vector(4) unsigned int vect_cst__77;
  unsigned int _78;
  vector(4) unsigned int vect_cst__79;
  vector(4) <signed-boolean:32> _86;
  vector(4) <signed-boolean:32> _95;
  vector(4) <signed-boolean:32> _96;
  unsigned int _119;
  unsigned int _122;
  unsigned int _125;
  unsigned int _128;
  unsigned int _132;
  unsigned int _135;
  unsigned int _138;
  unsigned int _141;

  <bb 2> [local count: 118111598]:
  if (n_replacements_7(D) > 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 14>; [11.00%]

  <bb 3> [local count: 105119322]:
  niters.5_52 = (unsigned int) n_replacements_7(D);
  _64 = niters.5_52 + 4294967295;
  if (_64 <= 61)
    goto <bb 7>; [10.00%]
  else
    goto <bb 4>; [90.00%]

  <bb 4> [local count: 94607390]:
  bnd.6_67 = niters.5_52 >> 2;
  _76 = (unsigned int) x_8(D);
  vect_cst__77 = {_76, _76, _76, _76};
  _78 = (unsigned int) y_9(D);
  vect_cst__79 = {_78, _78, _78, _78};
  ivtmp.34_143 = (unsigned int) &replacements;
  _30 = bnd.6_67 * 48;
  _26 = _30 + ivtmp.34_143;

  <bb 5> [local count: 567644337]:
  # ivtmp.34_42 = PHI <ivtmp.34_143(4), ivtmp.34_130(5)>
  vectp_replacements.9_71 = (unsigned int[12] * {ref-all}) ivtmp.34_42;
  vect_array.11 = .LOAD_LANES (MEM <unsigned int[12]> [(struct rtx_def * * *)vectp_replacements.9_71]);
  vect__1.12_73 = vect_array.11[0];
  vect__1.13_74 = vect_array.11[1];
  vect_array.11 ={v} {CLOBBER};
  vect__ifc__36.15_81 = .VCONDU (vect__1.13_74, vect_cst__77, vect__1.13_74, vect_cst__79, 113);
  _86 = vect__1.13_74 == vect_cst__77;
  _95 = vect__1.12_73 != vect_cst__77;
  _96 = _86 | _95;
  vect_patt_35.19_99 = .VCOND (_96, { 0, 0, 0, 0 }, vect__1.12_73, vect_cst__79, 107);
  vect_patt_19.23_116 = .VCOND (_96, { 0, 0, 0, 0 }, vect__ifc__36.15_81, { 0, 0, 0, 0 }, 107);
  _119 = BIT_FIELD_REF <vect_patt_35.19_99, 32, 0>;
  MEM[base: vectp_replacements.9_71, offset: 0B] = _119;
  _122 = BIT_FIELD_REF <vect_patt_35.19_99, 32, 32>;
  MEM[base: vectp_replacements.9_71, offset: 12B] = _122;
  _125 = BIT_FIELD_REF <vect_patt_35.19_99, 32, 64>;
  MEM[base: vectp_replacements.9_71, offset: 24B] = _125;
  _128 = BIT_FIELD_REF <vect_patt_35.19_99, 32, 96>;
  MEM[base: vectp_replacements.9_71, offset: 36B] = _128;
  _132 = BIT_FIELD_REF <vect_patt_19.23_116, 32, 0>;
  MEM[base: vectp_replacements.9_71, offset: 4B] = _132;
  _135 = BIT_FIELD_REF <vect_patt_19.23_116, 32, 32>;
  MEM[base: vectp_replacements.9_71, offset: 16B] = _135;
  _138 = BIT_FIELD_REF <vect_patt_19.23_116, 32, 64>;
  MEM[base: vectp_replacements.9_71, offset: 28B] = _138;
  _141 = BIT_FIELD_REF <vect_patt_19.23_116, 32, 96>;
  MEM[base: vectp_replacements.9_71, offset: 40B] = _141;
  ivtmp.34_130 = ivtmp.34_42 + 48;
  if (_26 != ivtmp.34_130)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]

  <bb 6> [local count: 105119322]:
  niters_vector_mult_vf.7_68 = niters.5_52 & 4294967292;
  tmp.8_69 = (int) niters_vector_mult_vf.7_68;
  if (niters.5_52 == niters_vector_mult_vf.7_68)
    goto <bb 13>; [25.00%]
  else
    goto <bb 7>; [75.00%]

  <bb 7> [local count: 81467474]:
  # i_65 = PHI <0(3), tmp.8_69(6)>
  _1 = (sizetype) i_65;
  _36 = _1 * 12;
  _2 = &replacements + _36;
  ivtmp.27_75 = (unsigned int) _2;

  <bb 8> [local count: 822903805]:
  # i_53 = PHI <i_65(7), i_57(9)>
  # ivtmp.27_24 = PHI <ivtmp.27_75(7), ivtmp.27_16(9)>
  _14 = (void *) ivtmp.27_24;
  _55 = MEM[base: _14, offset: 4B];
  if (x_8(D) == _55)
    goto <bb 12>; [30.00%]
  else
    goto <bb 10>; [70.00%]

  <bb 9> [local count: 822903806]:
  i_57 = i_53 + 1;
  ivtmp.27_16 = ivtmp.27_24 + 12;
  if (n_replacements_7(D) > i_57)
    goto <bb 8>; [89.00%]
  else
    goto <bb 13>; [11.00%]

  <bb 10> [local count: 576032666]:
  _58 = MEM[base: _14, offset: 0B];
  if (x_8(D) == _58)
    goto <bb 11>; [30.00%]
  else
    goto <bb 9>; [70.00%]

  <bb 11> [local count: 172809798]:
  MEM[base: _14, offset: 0B] = y_9(D);
  MEM[base: _14, offset: 4B] = 0B;
  goto <bb 9>; [100.00%]

  <bb 12> [local count: 246871139]:
  MEM[base: _14, offset: 4B] = y_9(D);
  goto <bb 9>; [100.00%]

  <bb 13> [local count: 116799249]:

  <bb 14> [local count: 118111600]:
  return;

}



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-06 12:00             ` Christophe Lyon
@ 2020-08-06 18:07               ` Marc Glisse
  2020-08-07  6:38                 ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-06 18:07 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: Richard Biener, GCC Patches

On Thu, 6 Aug 2020, Christophe Lyon wrote:

>> Was I on the right track configuring with
>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
>> --with-fpu=neon-fp16
>> then compiling without any special option?
>
> Maybe you also need --with-float=hard, I don't remember if it's
> implied by the 'hf' target suffix

Thanks! That's what I was missing to reproduce the issue. Now I can
reproduce it with just

typedef unsigned int vec __attribute__((vector_size(16)));
typedef int vi __attribute__((vector_size(16)));
vi f(vec a,vec b){
     return a==5 | b==7;
}

with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1

   _1 = a_5(D) == { 5, 5, 5, 5 };
   _3 = b_6(D) == { 7, 7, 7, 7 };
   _9 = _1 | _3;
   _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);

we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
false), while with -fdisable-tree-forwprop4 we do manage to expand

   _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);

It doesn't make much sense to me that we can expand the more complicated
form and not the simpler form of the same operation (both compare a to 5
and produce a vector of -1 or 0 of the same size), especially when the
target has an instruction (vceq) that does just what we want.

Introducing boolean vectors was fine, but I think they should be real 
types, that we can operate on, not be forced to appear only as the first 
argument of a vcond.

I can think of 2 natural ways to improve things: either implement vector 
comparisons in the ARM backend (possibly by forwarding to their existing 
code for vcond), or in the generic expansion code try using vcond if the 
direct comparison opcode is not provided.

We can temporarily revert my patch, but I would like it to be temporary. 
Since aarch64 seems to handle the same code just fine, maybe someone who 
knows arm could copy the relevant code over?

Does my message make sense, do people have comments?

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-06 18:07               ` Marc Glisse
@ 2020-08-07  6:38                 ` Richard Biener
  2020-08-07  8:33                   ` Marc Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-08-07  6:38 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Christophe Lyon, GCC Patches

On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>
> >> Was I on the right track configuring with
> >> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
> >> --with-fpu=neon-fp16
> >> then compiling without any special option?
> >
> > Maybe you also need --with-float=hard, I don't remember if it's
> > implied by the 'hf' target suffix
>
> Thanks! That's what I was missing to reproduce the issue. Now I can
> reproduce it with just
>
> typedef unsigned int vec __attribute__((vector_size(16)));
> typedef int vi __attribute__((vector_size(16)));
> vi f(vec a,vec b){
>      return a==5 | b==7;
> }
>
> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1
>
>    _1 = a_5(D) == { 5, 5, 5, 5 };
>    _3 = b_6(D) == { 7, 7, 7, 7 };
>    _9 = _1 | _3;
>    _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);
>
> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
> false), while with -fdisable-tree-forwprop4 we do manage to expand
>
>    _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
>
> It doesn't make much sense to me that we can expand the more complicated
> form and not the simpler form of the same operation (both compare a to 5
> and produce a vector of -1 or 0 of the same size), especially when the
> target has an instruction (vceq) that does just what we want.
>
> Introducing boolean vectors was fine, but I think they should be real
> types, that we can operate on, not be forced to appear only as the first
> argument of a vcond.
>
> I can think of 2 natural ways to improve things: either implement vector
> comparisons in the ARM backend (possibly by forwarding to their existing
> code for vcond), or in the generic expansion code try using vcond if the
> direct comparison opcode is not provided.
>
> We can temporarily revert my patch, but I would like it to be temporary.
> Since aarch64 seems to handle the same code just fine, maybe someone who
> knows arm could copy the relevant code over?
>
> Does my message make sense, do people have comments?

So what complicates things now (and to some extent pre-existed when you
used AVX512 which _could_ operate on boolean vectors) is that we
have split out the condition from VEC_COND_EXPR to separate stmts
but we do not expect backends to be able to code-generate the separate
form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
to .VCOND[U] "merging" the compares again.  Now that process breaks
down once we have things like _9 = _1 | _3;  -  at some point I argued
that we should handle vector compares [and operations on boolean vectors]
as well in ISEL but then when it came up again for some reason I
disregarded that again.

Thus - we don't want to go back to fixing up the generic expansion code
(which looks at one instruction at a time and is restricted by TER single-use
restrictions).  Instead we want to deal with this in ISEL which should
behave more intelligently.  In the above case it might involve turning
the _1 and _3 defs into .VCOND [with different result type], doing
_9 in that type and then somehow dealing with _7 ... but this eventually
means undoing the match simplification that introduced the code?

Not sure if that helps though.
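
For reference, the merging ISEL performs today can be sketched on the
earlier testcase (simplified GIMPLE, adapted from the dumps quoted above):

```
  /* before ISEL: comparison and selection are separate statements */
  _1 = a_5(D) == { 5, 5, 5, 5 };
  _2 = VEC_COND_EXPR <_1, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;

  /* after ISEL: the compare is merged back into the conditional */
  _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
```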

Richard.

> --
> Marc Glisse

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-07  6:38                 ` Richard Biener
@ 2020-08-07  8:33                   ` Marc Glisse
  2020-08-07  8:47                     ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-07  8:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Christophe Lyon, GCC Patches

On Fri, 7 Aug 2020, Richard Biener wrote:

> On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>>
>> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>>
>>>> Was I on the right track configuring with
>>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
>>>> --with-fpu=neon-fp16
>>>> then compiling without any special option?
>>>
>>> Maybe you also need --with-float=hard, I don't remember if it's
>>> implied by the 'hf' target suffix
>>
>> Thanks! That's what I was missing to reproduce the issue. Now I can
>> reproduce it with just
>>
>> typedef unsigned int vec __attribute__((vector_size(16)));
>> typedef int vi __attribute__((vector_size(16)));
>> vi f(vec a,vec b){
>>      return a==5 | b==7;
>> }
>>
>> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1
>>
>>    _1 = a_5(D) == { 5, 5, 5, 5 };
>>    _3 = b_6(D) == { 7, 7, 7, 7 };
>>    _9 = _1 | _3;
>>    _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);
>>
>> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
>> false), while with -fdisable-tree-forwprop4 we do manage to expand
>>
>>    _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
>>
>> It doesn't make much sense to me that we can expand the more complicated
>> form and not the simpler form of the same operation (both compare a to 5
>> and produce a vector of -1 or 0 of the same size), especially when the
>> target has an instruction (vceq) that does just what we want.
>>
>> Introducing boolean vectors was fine, but I think they should be real
>> types, that we can operate on, not be forced to appear only as the first
>> argument of a vcond.
>>
>> I can think of 2 natural ways to improve things: either implement vector
>> comparisons in the ARM backend (possibly by forwarding to their existing
>> code for vcond), or in the generic expansion code try using vcond if the
>> direct comparison opcode is not provided.
>>
>> We can temporarily revert my patch, but I would like it to be temporary.
>> Since aarch64 seems to handle the same code just fine, maybe someone who
>> knows arm could copy the relevant code over?
>>
>> Does my message make sense, do people have comments?
>
> So what complicates things now (and to some extent pre-existed when you
> used AVX512 which _could_ operate on boolean vectors) is that we
> have split out the condition from VEC_COND_EXPR to separate stmts
> but we do not expect backends to be able to code-generate the separate
> form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
> to .VCOND[U] "merging" the compares again.  Now that process breaks
> down once we have things like _9 = _1 | _3;  -  at some point I argued
> that we should handle vector compares [and operations on boolean vectors]
> as well in ISEL but then when it came up again for some reason I
> disregarded that again.
>
> Thus - we don't want to go back to fixing up the generic expansion code
> (which looks at one instruction at a time and is restricted by TER single-use
> restrictions).

Here, it would only be handling comparisons left over by ISEL because they 
do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would 
be more consistent, but then we may have to teach the vector lowering pass 
about this.

> Instead we want to deal with this in ISEL which should
> behave more intelligently.  In the above case it might involve turning
> the _1 and _3 defs into .VCOND [with different result type], doing
> _9 in that type and then somehow dealing with _7 ... but this eventually
> means undoing the match simplification that introduced the code?

For targets that do not have compact boolean vectors, 
VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. 
Removing it to allow more simplifications makes sense, reintroducing it 
for expansion can make sense as well, I think it is ok if the second one 
reverses the first, but very locally, without having to propagate a change 
of type to the uses. So on ARM we could turn _1 and _3 into .VCOND 
producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or 
does using bool vectors like that seem bad? (maybe they aren't guaranteed 
to be equivalent to signed integer vectors with values 0 and -1 and we 
need to query the target to know if that is the case, or introduce an 
extra .VCOND)

Also, we only want to replace comparisons with .VCOND if the target 
doesn't handle the comparison directly. For AVX512, we do want to produce 
SImode bool vectors (for k* registers) and operate on them directly, 
that's the whole point of introducing the bool vectors (if bool vectors 
were only used to feed VEC_COND_EXPR and were all turned into .VCOND 
before expansion, I don't see the point of specifying different types for 
bool vectors for AVX512 vs non-AVX512, as it would make no difference on 
what is passed to the backend).

I think targets should provide some number of operations, including for 
instance bit_and and bit_ior on bool vectors, and be a bit consistent 
about what they provide, it becomes unmanageable in the middle-end 
otherwise...

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-07  8:33                   ` Marc Glisse
@ 2020-08-07  8:47                     ` Richard Biener
  2020-08-07 12:15                       ` Marc Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2020-08-07  8:47 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Christophe Lyon, GCC Patches

On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Fri, 7 Aug 2020, Richard Biener wrote:
>
> > On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote:
> >>
> >> On Thu, 6 Aug 2020, Christophe Lyon wrote:
> >>
> >>>> Was I on the right track configuring with
> >>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
> >>>> --with-fpu=neon-fp16
> >>>> then compiling without any special option?
> >>>
> >>> Maybe you also need --with-float=hard, I don't remember if it's
> >>> implied by the 'hf' target suffix
> >>
> >> Thanks! That's what I was missing to reproduce the issue. Now I can
> >> reproduce it with just
> >>
> >> typedef unsigned int vec __attribute__((vector_size(16)));
> >> typedef int vi __attribute__((vector_size(16)));
> >> vi f(vec a,vec b){
> >>      return a==5 | b==7;
> >> }
> >>
> >> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1
> >>
> >>    _1 = a_5(D) == { 5, 5, 5, 5 };
> >>    _3 = b_6(D) == { 7, 7, 7, 7 };
> >>    _9 = _1 | _3;
> >>    _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);
> >>
> >> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
> >> false), while with -fdisable-tree-forwprop4 we do manage to expand
> >>
> >>    _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
> >>
> >> It doesn't make much sense to me that we can expand the more complicated
> >> form and not the simpler form of the same operation (both compare a to 5
> >> and produce a vector of -1 or 0 of the same size), especially when the
> >> target has an instruction (vceq) that does just what we want.
> >>
> >> Introducing boolean vectors was fine, but I think they should be real
> >> types, that we can operate on, not be forced to appear only as the first
> >> argument of a vcond.
> >>
> >> I can think of 2 natural ways to improve things: either implement vector
> >> comparisons in the ARM backend (possibly by forwarding to their existing
> >> code for vcond), or in the generic expansion code try using vcond if the
> >> direct comparison opcode is not provided.
> >>
> >> We can temporarily revert my patch, but I would like it to be temporary.
> >> Since aarch64 seems to handle the same code just fine, maybe someone who
> >> knows arm could copy the relevant code over?
> >>
> >> Does my message make sense, do people have comments?
> >
> > So what complicates things now (and to some extent pre-existed when you
> > used AVX512 which _could_ operate on boolean vectors) is that we
> > have split out the condition from VEC_COND_EXPR to separate stmts
> > but we do not expect backends to be able to code-generate the separate
> > form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
> > to .VCOND[U] "merging" the compares again.  Now that process breaks
> > down once we have things like _9 = _1 | _3;  -  at some point I argued
> > that we should handle vector compares [and operations on boolean vectors]
> > as well in ISEL but then when it came up again for some reason I
> > disregarded that again.
> >
> > Thus - we don't want to go back to fixing up the generic expansion code
> > (which looks at one instruction at a time and is restricted by TER single-use
> > restrictions).
>
> Here, it would only be handling comparisons left over by ISEL because they
> do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would
> be more consistent, but then we may have to teach the vector lowering pass
> about this.
>
> > Instead we want to deal with this in ISEL which should
> > behave more intelligently.  In the above case it might involve turning
> > the _1 and _3 defs into .VCOND [with different result type], doing
> > _9 in that type and then somehow dealing with _7 ... but this eventually
> > means undoing the match simplification that introduced the code?
>
> For targets that do not have compact boolean vectors,
> VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality.
> Removing it to allow more simplifications makes sense, reintroducing it
> for expansion can make sense as well, I think it is ok if the second one
> reverses the first, but very locally, without having to propagate a change
> of type to the uses. So on ARM we could turn _1 and _3 into .VCOND
> producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or
> does using bool vectors like that seem bad? (maybe they aren't guaranteed
> to be equivalent to signed integer vectors with values 0 and -1 and we
> need to query the target to know if that is the case, or introduce an
> extra .VCOND)
>
> Also, we only want to replace comparisons with .VCOND if the target
> doesn't handle the comparison directly. For AVX512, we do want to produce
> SImode bool vectors (for k* registers) and operate on them directly,
> that's the whole point of introducing the bool vectors (if bool vectors
> were only used to feed VEC_COND_EXPR and were all turned into .VCOND
> before expansion, I don't see the point of specifying different types for
> bool vectors for AVX512 vs non-AVX512, as it would make no difference on
> what is passed to the backend).
>
> I think targets should provide some number of operations, including for
> instance bit_and and bit_ior on bool vectors, and be a bit consistent
> about what they provide, it becomes unmanageable in the middle-end
> otherwise...

It's currently somewhat of a mess I guess.  It might help to
restrict the simplification patterns to passes before vector lowering
so maybe add something like (or re-use)
canonicalize_math[_after_vectorization]_p ()?

Richard.

>
> --
> Marc Glisse

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-07  8:47                     ` Richard Biener
@ 2020-08-07 12:15                       ` Marc Glisse
  2020-08-07 13:04                         ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Marc Glisse @ 2020-08-07 12:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: Christophe Lyon, GCC Patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6247 bytes --]

On Fri, 7 Aug 2020, Richard Biener wrote:

> On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote:
>>
>> On Fri, 7 Aug 2020, Richard Biener wrote:
>>
>>> On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>>>>
>>>> On Thu, 6 Aug 2020, Christophe Lyon wrote:
>>>>
>>>>>> Was I on the right track configuring with
>>>>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
>>>>>> --with-fpu=neon-fp16
>>>>>> then compiling without any special option?
>>>>>
>>>>> Maybe you also need --with-float=hard, I don't remember if it's
>>>>> implied by the 'hf' target suffix
>>>>
>>>> Thanks! That's what I was missing to reproduce the issue. Now I can
>>>> reproduce it with just
>>>>
>>>> typedef unsigned int vec __attribute__((vector_size(16)));
>>>> typedef int vi __attribute__((vector_size(16)));
>>>> vi f(vec a,vec b){
>>>>      return a==5 | b==7;
>>>> }
>>>>
>>>> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1
>>>>
>>>>    _1 = a_5(D) == { 5, 5, 5, 5 };
>>>>    _3 = b_6(D) == { 7, 7, 7, 7 };
>>>>    _9 = _1 | _3;
>>>>    _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);
>>>>
>>>> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
>>>> false), while with -fdisable-tree-forwprop4 we do manage to expand
>>>>
>>>>    _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
>>>>
>>>> It doesn't make much sense to me that we can expand the more complicated
>>>> form and not the simpler form of the same operation (both compare a to 5
>>>> and produce a vector of -1 or 0 of the same size), especially when the
>>>> target has an instruction (vceq) that does just what we want.
>>>>
>>>> Introducing boolean vectors was fine, but I think they should be real
>>>> types, that we can operate on, not be forced to appear only as the first
>>>> argument of a vcond.
>>>>
>>>> I can think of 2 natural ways to improve things: either implement vector
>>>> comparisons in the ARM backend (possibly by forwarding to their existing
>>>> code for vcond), or in the generic expansion code try using vcond if the
>>>> direct comparison opcode is not provided.
>>>>
>>>> We can temporarily revert my patch, but I would like it to be temporary.
>>>> Since aarch64 seems to handle the same code just fine, maybe someone who
>>>> knows arm could copy the relevant code over?
>>>>
>>>> Does my message make sense, do people have comments?
>>>
>>> So what complicates things now (and to some extent pre-existed when you
>>> used AVX512 which _could_ operate on boolean vectors) is that we
>>> have split out the condition from VEC_COND_EXPR to separate stmts
>>> but we do not expect backends to be able to code-generate the separate
>>> form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
>>> to .VCOND[U] "merging" the compares again.  Now that process breaks
>>> down once we have things like _9 = _1 | _3;  -  at some point I argued
>>> that we should handle vector compares [and operations on boolean vectors]
>>> as well in ISEL but then when it came up again for some reason I
>>> disregarded that again.
>>>
>>> Thus - we don't want to go back to fixing up the generic expansion code
>>> (which looks at one instruction at a time and is restricted by TER single-use
>>> restrictions).
>>
>> Here, it would only be handling comparisons left over by ISEL because they
>> do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would
>> be more consistent, but then we may have to teach the vector lowering pass
>> about this.
>>
>>> Instead we want to deal with this in ISEL which should
>>> behave more intelligently.  In the above case it might involve turning
>>> the _1 and _3 defs into .VCOND [with different result type], doing
>>> _9 in that type and then somehow dealing with _7 ... but this eventually
>>> means undoing the match simplification that introduced the code?
>>
>> For targets that do not have compact boolean vectors,
>> VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality.
>> Removing it to allow more simplifications makes sense, reintroducing it
>> for expansion can make sense as well, I think it is ok if the second one
>> reverses the first, but very locally, without having to propagate a change
>> of type to the uses. So on ARM we could turn _1 and _3 into .VCOND
>> producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or
>> does using bool vectors like that seem bad? (maybe they aren't guaranteed
>> to be equivalent to signed integer vectors with values 0 and -1 and we
>> need to query the target to know if that is the case, or introduce an
>> extra .VCOND)
>>
>> Also, we only want to replace comparisons with .VCOND if the target
>> doesn't handle the comparison directly. For AVX512, we do want to produce
>> SImode bool vectors (for k* registers) and operate on them directly,
>> that's the whole point of introducing the bool vectors (if bool vectors
>> were only used to feed VEC_COND_EXPR and were all turned into .VCOND
>> before expansion, I don't see the point of specifying different types for
>> bool vectors for AVX512 vs non-AVX512, as it would make no difference on
>> what is passed to the backend).
>>
>> I think targets should provide some number of operations, including for
>> instance bit_and and bit_ior on bool vectors, and be a bit consistent
>> about what they provide, it becomes unmanageable in the middle-end
>> otherwise...
>
> It's currently somewhat of a mess I guess.  It might help to
> restrict the simplification patterns to passes before vector lowering
> so maybe add something like (or re-use)
> canonicalize_math[_after_vectorization]_p ()?

That's indeed a nice way to gain time to sort out the mess :-)

This patch solved the issue on ARM for at least one testcase. I have a 
bootstrap+regtest running on x86_64-linux-gnu; at least the tests added by 
the previous patch still pass.

2020-08-05  Marc Glisse  <marc.glisse@inria.fr>

 	* generic-match-head.c (optimize_vectors_before_lowering_p): New
 	function.
 	* gimple-match-head.c (optimize_vectors_before_lowering_p):
 	Likewise.
 	* match.pd ((v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Use it.

-- 
Marc Glisse

[-- Attachment #2: Type: TEXT/x-diff; name=lvec.patch, Size: 3832 bytes --]

diff --git a/gcc/generic-match-head.c b/gcc/generic-match-head.c
index 2454baac9d4..fdb528d9686 100644
--- a/gcc/generic-match-head.c
+++ b/gcc/generic-match-head.c
@@ -80,6 +80,16 @@ canonicalize_math_after_vectorization_p ()
   return false;
 }
 
+/* Return true if we can still perform transformations that may introduce
+   vector operations that are not supported by the target. Vector lowering
+   normally handles those, but after that pass, it becomes unsafe.  */
+
+static inline bool
+optimize_vectors_before_lowering_p ()
+{
+  return true;
+}
+
 /* Return true if successive divisions can be optimized.
    Defer to GIMPLE opts.  */
 
diff --git a/gcc/gimple-match-head.c b/gcc/gimple-match-head.c
index 9b3e7298d87..4a65be703b9 100644
--- a/gcc/gimple-match-head.c
+++ b/gcc/gimple-match-head.c
@@ -1158,6 +1158,16 @@ canonicalize_math_after_vectorization_p ()
   return !cfun || (cfun->curr_properties & PROP_gimple_lvec) != 0;
 }
 
+/* Return true if we can still perform transformations that may introduce
+   vector operations that are not supported by the target. Vector lowering
+   normally handles those, but after that pass, it becomes unsafe.  */
+
+static inline bool
+optimize_vectors_before_lowering_p ()
+{
+  return !cfun || (cfun->curr_properties & PROP_gimple_lvec) == 0;
+}
+
 /* Return true if pow(cst, x) should be optimized into exp(log(cst) * x).
    As a workaround for SPEC CPU2017 628.pop2_s, don't do it if arg0
    is an exact integer, arg1 = phi_res +/- cst1 and phi_res = PHI <cst2, ...>
diff --git a/gcc/match.pd b/gcc/match.pd
index d8e3927d3c7..7e5c5a6eae6 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3461,40 +3461,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (vec_cond @0 (op! @3 @1) (op! @3 @2))))
 #endif
 
-/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+/* (v ? w : 0) ? a : b is just (v & w) ? a : b
+   Currently disabled after pass lvec because ARM understands
+   VEC_COND_EXPR<v==w,-1,0> but not a plain v==w fed to BIT_IOR_EXPR.  */
 (simplify
  (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
- (if (types_match (@0, @3))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
   (vec_cond (bit_and @0 @3) @1 @2)))
 (simplify
  (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2)
- (if (types_match (@0, @3))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
   (vec_cond (bit_ior @0 @3) @1 @2)))
 (simplify
  (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2)
- (if (types_match (@0, @3))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
   (vec_cond (bit_ior @0 (bit_not @3)) @2 @1)))
 (simplify
  (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2)
- (if (types_match (@0, @3))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
   (vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
 
 /* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
 (simplify
  (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
- (if (types_match (@0, @1))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1))
   (vec_cond (bit_and @0 @1) @2 @3)))
 (simplify
  (vec_cond @0 @2 (vec_cond:s @1 @2 @3))
- (if (types_match (@0, @1))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1))
   (vec_cond (bit_ior @0 @1) @2 @3)))
 (simplify
  (vec_cond @0 (vec_cond:s @1 @2 @3) @2)
- (if (types_match (@0, @1))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1))
   (vec_cond (bit_ior (bit_not @0) @1) @2 @3)))
 (simplify
  (vec_cond @0 @3 (vec_cond:s @1 @2 @3))
- (if (types_match (@0, @1))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1))
   (vec_cond (bit_and (bit_not @0) @1) @2 @3)))
 
 /* Simplification moved from fold_cond_expr_with_comparison.  It may also

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: VEC_COND_EXPR optimizations v2
  2020-08-07 12:15                       ` Marc Glisse
@ 2020-08-07 13:04                         ` Richard Biener
  0 siblings, 0 replies; 29+ messages in thread
From: Richard Biener @ 2020-08-07 13:04 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Christophe Lyon, GCC Patches

On Fri, Aug 7, 2020 at 2:15 PM Marc Glisse <marc.glisse@inria.fr> wrote:
>
> On Fri, 7 Aug 2020, Richard Biener wrote:
>
> > On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote:
> >>
> >> On Fri, 7 Aug 2020, Richard Biener wrote:
> >>
> >>> On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote:
> >>>>
> >>>> On Thu, 6 Aug 2020, Christophe Lyon wrote:
> >>>>
> >>>>>> Was I on the right track configuring with
> >>>>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
> >>>>>> --with-fpu=neon-fp16
> >>>>>> then compiling without any special option?
> >>>>>
> >>>>> Maybe you also need --with-float=hard, I don't remember if it's
> >>>>> implied by the 'hf' target suffix
> >>>>
> >>>> Thanks! That's what I was missing to reproduce the issue. Now I can
> >>>> reproduce it with just
> >>>>
> >>>> typedef unsigned int vec __attribute__((vector_size(16)));
> >>>> typedef int vi __attribute__((vector_size(16)));
> >>>> vi f(vec a,vec b){
> >>>>      return a==5 | b==7;
> >>>> }
> >>>>
> >>>> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1
> >>>>
> >>>>    _1 = a_5(D) == { 5, 5, 5, 5 };
> >>>>    _3 = b_6(D) == { 7, 7, 7, 7 };
> >>>>    _9 = _1 | _3;
> >>>>    _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);
> >>>>
> >>>> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
> >>>> false), while with -fdisable-tree-forwprop4 we do manage to expand
> >>>>
> >>>>    _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112);
> >>>>
> >>>> It doesn't make much sense to me that we can expand the more complicated
> >>>> form and not the simpler form of the same operation (both compare a to 5
> >>>> and produce a vector of -1 or 0 of the same size), especially when the
> >>>> target has an instruction (vceq) that does just what we want.
> >>>>
> >>>> Introducing boolean vectors was fine, but I think they should be real
> >>>> types, that we can operate on, not be forced to appear only as the first
> >>>> argument of a vcond.
> >>>>
> >>>> I can think of 2 natural ways to improve things: either implement vector
> >>>> comparisons in the ARM backend (possibly by forwarding to their existing
> >>>> code for vcond), or in the generic expansion code try using vcond if the
> >>>> direct comparison opcode is not provided.
> >>>>
> >>>> We can temporarily revert my patch, but I would like it to be temporary.
> >>>> Since aarch64 seems to handle the same code just fine, maybe someone who
> >>>> knows arm could copy the relevant code over?
> >>>>
> >>>> Does my message make sense, do people have comments?
> >>>
> >>> So what complicates things now (and to some extent pre-existed when you
> >>> used AVX512 which _could_ operate on boolean vectors) is that we
> >>> have split out the condition from VEC_COND_EXPR to separate stmts
> >>> but we do not expect backends to be able to code-generate the separate
> >>> form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
> >>> to .VCOND[U] "merging" the compares again.  Now that process breaks
> >>> down once we have things like _9 = _1 | _3;  -  at some point I argued
> >>> that we should handle vector compares [and operations on boolean vectors]
> >>> in ISEL as well, but when it came up again I disregarded that for some
> >>> reason.
> >>>
> >>> Thus - we don't want to go back to fixing up the generic expansion code
> >>> (which looks at one instruction at a time and is restricted by TER single-use
> >>> restrictions).
> >>
> >> Here, it would only be handling comparisons left over by ISEL because they
> >> do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would
> >> be more consistent, but then we may have to teach the vector lowering pass
> >> about this.
> >>
> >>> Instead we want to deal with this in ISEL which should
> >>> behave more intelligently.  In the above case it might involve turning
> >>> the _1 and _3 defs into .VCOND [with different result type], doing
> >>> _9 in that type and then somehow dealing with _7 ... but this eventually
> >>> means undoing the match simplification that introduced the code?
> >>
> >> For targets that do not have compact boolean vectors,
> >> VEC_COND_EXPR<*,-1,0> does nothing; it is just there as a technicality.
> >> Removing it to allow more simplifications makes sense, reintroducing it
> >> for expansion can make sense as well, I think it is ok if the second one
> >> reverses the first, but very locally, without having to propagate a change
> >> of type to the uses. So on ARM we could turn _1 and _3 into .VCOND
> >> producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or
> >> does using bool vectors like that seem bad? (maybe they aren't guaranteed
> >> to be equivalent to signed integer vectors with values 0 and -1 and we
> >> need to query the target to know if that is the case, or introduce an
> >> extra .VCOND)
> >>
> >> Also, we only want to replace comparisons with .VCOND if the target
> >> doesn't handle the comparison directly. For AVX512, we do want to produce
> >> SImode bool vectors (for k* registers) and operate on them directly,
> >> that's the whole point of introducing the bool vectors (if bool vectors
> >> were only used to feed VEC_COND_EXPR and were all turned into .VCOND
> >> before expansion, I don't see the point of specifying different types for
> >> bool vectors for AVX512 vs non-AVX512, as it would make no difference on
> >> what is passed to the backend).
> >>
> >> I think targets should provide some number of operations, including for
> >> instance bit_and and bit_ior on bool vectors, and be a bit consistent
> >> about what they provide, it becomes unmanageable in the middle-end
> >> otherwise...
> >
> > It's currently somewhat of a mess I guess.  It might help to
> > restrict the simplification patterns to passes before vector lowering
> > so maybe add something like (or re-use)
> > canonicalize_math[_after_vectorization]_p ()?
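[A sketch of what such a gating predicate might look like.  This is
GCC-internal code, not standalone-compilable, and the exact property flag
checked is an assumption; the function name comes from the ChangeLog below,
but the actual implementation in the patch may differ:]

```c
/* Sketch only.  Return true while the vector lowering pass has not yet
   run, so that match.pd can restrict vector simplifications that might
   otherwise produce forms lowering cannot handle.  */
static inline bool
optimize_vectors_before_lowering_p ()
{
  return !cfun || (cfun->curr_properties & PROP_gimple_lvec) == 0;
}
```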
>
> That's indeed a nice way to gain time to sort out the mess :-)
>
> This patch solved the issue on ARM for at least 1 testcase. I have a
> bootstrap+regtest running on x86_64-linux-gnu, at least the tests added in
> the previous patch still work.

So let's go with this for now (assuming bootstrap / testing works).  I guess
we should have a look at whether the foldings make anything worse.
If you hit a case where vector lowering is required with folding but not
without it, that's likely such a case.  Of course, that would then mainly
call for improving vector lowering itself.

Richard.

> 2020-08-05  Marc Glisse  <marc.glisse@inria.fr>
>
>         * generic-match-head.c (optimize_vectors_before_lowering_p): New
>         function.
>         * gimple-match-head.c (optimize_vectors_before_lowering_p):
>         Likewise.
>         * match.pd ((v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Use it.
>
> --
> Marc Glisse


Thread overview: 29+ messages
2020-07-30  7:49 VEC_COND_EXPR optimizations Marc Glisse
2020-07-31 11:18 ` Richard Sandiford
2020-07-31 11:38   ` Marc Glisse
2020-07-31 11:43     ` Richard Biener
2020-07-31 11:57       ` Marc Glisse
2020-07-31 12:50     ` Richard Sandiford
2020-07-31 12:59       ` Richard Biener
2020-07-31 13:01       ` Marc Glisse
2020-07-31 13:13         ` Marc Glisse
2020-07-31 11:35 ` Richard Biener
2020-07-31 11:39   ` Richard Biener
2020-07-31 11:47     ` Richard Biener
2020-07-31 12:08       ` Richard Biener
2020-07-31 12:12       ` Marc Glisse
2020-08-05 13:32 ` VEC_COND_EXPR optimizations v2 Marc Glisse
2020-08-05 14:24   ` Richard Biener
2020-08-06  8:17     ` Christophe Lyon
2020-08-06  9:05       ` Marc Glisse
2020-08-06 11:25         ` Christophe Lyon
2020-08-06 11:42           ` Marc Glisse
2020-08-06 12:00             ` Christophe Lyon
2020-08-06 18:07               ` Marc Glisse
2020-08-07  6:38                 ` Richard Biener
2020-08-07  8:33                   ` Marc Glisse
2020-08-07  8:47                     ` Richard Biener
2020-08-07 12:15                       ` Marc Glisse
2020-08-07 13:04                         ` Richard Biener
2020-08-06 10:29       ` Richard Biener
2020-08-06 11:11         ` Marc Glisse
