public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Fold truncations of left shifts in match.pd
@ 2022-05-30 12:23 Roger Sayle
  2022-06-01 15:00 ` Jeff Law
  2022-06-02  8:48 ` Richard Biener
  0 siblings, 2 replies; 5+ messages in thread
From: Roger Sayle @ 2022-05-30 12:23 UTC (permalink / raw)
  To: 'GCC Patches'



Whilst investigating PR 55278, I noticed that the tree-ssa optimizers
aren't eliminating the promotions of shifts to "int" inserted by the
c-family front-ends, instead leaving this simplification to the RTL
optimizers.  This patch allows match.pd to perform this narrowing
itself, earlier, rewriting (T)(X << C) to (T)X << C when the constant C
is known to be valid for the (narrower) type T.
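
For example (an illustration, not part of the patch, and assuming a
16-bit short), the first shift below can be narrowed, but the second
cannot, because its shift count is not valid for the narrower type:

short narrow_ok (short x) { return (short)((int)x << 5); }   /* -> (short)x << 5 */
short no_narrow (short x) { return (short)((int)x << 20); }  /* 20 >= 16: unchanged */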

Hence for this simple test case:
short foo(short x) { return x << 5; }

the .optimized dump currently looks like:

short int foo (short int x)
{
  int _1;
  int _2;
  short int _4;

  <bb 2> [local count: 1073741824]:
  _1 = (int) x_3(D);
  _2 = _1 << 5;
  _4 = (short int) _2;
  return _4;
}

but with this patch, now becomes:

short int foo (short int x)
{
  short int _2;

  <bb 2> [local count: 1073741824]:
  _2 = x_1(D) << 5;
  return _2;
}

This is always reasonable, as RTL expansion knows how to use
widening optabs if it makes sense at the RTL level to perform
the shift in a wider mode.

Of course, there's often a catch.  The above simplification not only
reduces the number of statements in gimple, but also enables further
optimizations, for example the recognition of rotate idioms and
bswap16.  Alas, optimizing things earlier than anticipated requires
several testsuite changes [though all these tests have been confirmed
to generate identical assembly code on x86_64].  The only significant
change is that the vectorization pass previously wouldn't vectorize
rotations if the backend doesn't explicitly provide an optab for them.
This is curious, as the same rotate will vectorize when expressed as
ior(lshift,rshift), and likewise RTL expansion will generate the
iorv(lshiftv,rshiftv) sequence if required for a vector mode rotation.
Hence this patch includes a tweak to the optabs test in
tree-vect-stmts.cc's vectorizable_shift to better reflect the
functionality supported by RTL expansion.
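
As an illustration of the knock-on effects (a sketch of the kind of
idiom affected, not taken verbatim from the testsuite), a 16-bit rotate
such as:

unsigned short rot8 (unsigned short x)
{
  return (x << 8) | (x >> 8);    /* truncation narrows the left shift,
                                    exposing a rotate/bswap16 idiom.  */
}

can now be recognized as a rotation (or byte swap) in gimple rather
than only at the RTL level.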

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-05-30  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * match.pd (convert (lshift @1 INTEGER_CST@2)): Narrow integer
        left shifts by a constant when the result is truncated, and the
        shift constant is well-defined for the narrower mode.
        * tree-vect-stmts.cc (vectorizable_shift): Rotations by
        constants are vectorizable, if the backend supports logical
        shifts and IOR logical operations in the required vector mode.

gcc/testsuite/ChangeLog
        * gcc.dg/fold-convlshift-4.c: New test case.
        * gcc.dg/optimize-bswaphi-1.c: Update found bswap count.
        * gcc.dg/tree-ssa/pr61839_3.c: Shift is now optimized before VRP.
        * gcc.dg/vect/vect-over-widen-1-big-array.c: Remove obsolete tests.
        * gcc.dg/vect/vect-over-widen-1.c: Likewise.
        * gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
        * gcc.dg/vect/vect-over-widen-3.c: Likewise.
        * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
        * gcc.dg/vect/vect-over-widen-4.c: Likewise.


Thanks in advance,
Roger
--


[-- Attachment #2: patchls3b.txt --]
[-- Type: text/plain, Size: 8714 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index c2fed9b..05b34d6 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3637,6 +3637,16 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       && !integer_zerop (@3))
   (lshift (convert @2) @3)))
 
+/* Narrow a lshift by constant.  */
+(simplify
+ (convert (lshift:s@0 @1 INTEGER_CST@2))
+ (if (INTEGRAL_TYPE_P (type)
+      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
+      && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0))
+      && !integer_zerop (@2)
+      && wi::ltu_p (wi::to_wide (@2), TYPE_PRECISION (type)))
+  (lshift (convert @1) (convert @2))))
+
 /* Simplifications of conversions.  */
 
 /* Basic strip-useless-type-conversions / strip_nops.  */
diff --git a/gcc/testsuite/gcc.dg/fold-convlshift-4.c b/gcc/testsuite/gcc.dg/fold-convlshift-4.c
new file mode 100644
index 0000000..001627f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-convlshift-4.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+short foo(short x)
+{
+  return x << 5;
+}
+
+/* { dg-final { scan-tree-dump-not "\\(int\\)" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "\\(short int\\)" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
index d045da9..a5d8bfd 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
@@ -68,4 +68,4 @@ get_unaligned_16_be (unsigned char *p)
 
 
 /* { dg-final { scan-tree-dump-times "16 bit load in target endianness found at" 4 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 5 "bswap" } } */
+/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c
index bc2126f..38cf792 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c
@@ -1,6 +1,6 @@
 /* PR tree-optimization/61839.  */
 /* { dg-do run } */
-/* { dg-options "-O2 -fdump-tree-vrp -fdump-tree-optimized -fdisable-tree-ethread -fdisable-tree-threadfull1" } */
+/* { dg-options "-O2 -fdump-tree-optimized -fdisable-tree-ethread -fdisable-tree-threadfull1" } */
 
 __attribute__ ((noinline))
 int foo (int a, unsigned b)
@@ -21,6 +21,4 @@ int main ()
   foo (-1, b);
 }
 
-/* Scan for c [12, 13] << 8 in function foo.  */
-/* { dg-final { scan-tree-dump-times "3072 : 3328" 1  "vrp1" } } */
 /* { dg-final { scan-tree-dump-times "3072" 0  "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c
index 9e5f464..9a5141ee 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c
@@ -58,9 +58,7 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 8} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 5} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c
index c2d0797..f2d284c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c
@@ -62,9 +62,7 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 8} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 5} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
index 37da7c9..6f89aac 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
@@ -59,9 +59,7 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 8} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 9} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c
index 4138480..a1e1182 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c
@@ -57,9 +57,7 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 8} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 9} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c
index 514337c..03a6e67 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c
@@ -62,9 +62,7 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 8} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 5} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c
index 3d536d5..0ef377f 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c
@@ -66,9 +66,7 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 3} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 3} "vect" } } */
-/* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* << 8} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 5} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 8327e9d..3c54f04 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -5841,6 +5841,14 @@ vectorizable_shift (vec_info *vinfo,
     }
   vec_mode = TYPE_MODE (vectype);
   icode = (int) optab_handler (optab, vec_mode);
+
+  /* RTL expansion knows how to expand rotates using shift/or.  */
+  if (icode == CODE_FOR_nothing
+      && (code == LROTATE_EXPR || code == RROTATE_EXPR)
+      && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
+      && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
+    icode = (int) optab_handler (lshr_optab, vec_mode);
+
   if (icode == CODE_FOR_nothing)
     {
       if (dump_enabled_p ())


* Re: [PATCH] Fold truncations of left shifts in match.pd
  2022-05-30 12:23 [PATCH] Fold truncations of left shifts in match.pd Roger Sayle
@ 2022-06-01 15:00 ` Jeff Law
  2022-06-02  8:48 ` Richard Biener
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff Law @ 2022-06-01 15:00 UTC (permalink / raw)
  To: gcc-patches



On 5/30/2022 6:23 AM, Roger Sayle wrote:
> [... full patch description and ChangeLog quoted; snipped -- see the
> original message above ...]
So the worry here would be stuff like narrowing the source operand
leading to partial register stalls.  But as you indicated, if the target
really wants to do the shift in a wider mode, it can.  Furthermore, the
place to make that decision is at the gimple->rtl border, IMHO.

OK.

jeff

P.S.  There may still be another old BZ for the lack of narrowing
inhibiting vectorization, IIRC.  I don't recall the specifics well
enough to hazard a guess whether this patch will help or not.


* Re: [PATCH] Fold truncations of left shifts in match.pd
  2022-05-30 12:23 [PATCH] Fold truncations of left shifts in match.pd Roger Sayle
  2022-06-01 15:00 ` Jeff Law
@ 2022-06-02  8:48 ` Richard Biener
  2022-06-02 10:55   ` Roger Sayle
  1 sibling, 1 reply; 5+ messages in thread
From: Richard Biener @ 2022-06-02  8:48 UTC (permalink / raw)
  To: Roger Sayle; +Cc: GCC Patches

On Mon, May 30, 2022 at 2:24 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> [... full patch description quoted; snipped -- see the original
> message above ...]

Can the lshift pattern be merged with the preceding one?  It looks
awfully similar.  Possibly even do

  (if (wi::ltu_p (....))
   (lshift (convert @1) (convert @2))
   { build_zero_cst (type); })

for when the truncation leaves us with zero?
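
(For instance -- an illustrative case -- with a 16-bit short,
(short)((int)x << 16) is always zero: the truncation discards every bit
the shift brought in.)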

+  /* RTL expansion knows how to expand rotates using shift/or.  */
+  if (icode == CODE_FOR_nothing
+      && (code == LROTATE_EXPR || code == RROTATE_EXPR)
+      && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
+      && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
+    icode = (int) optab_handler (lshr_optab, vec_mode);

but we then get the vector costing wrong.  Also note that vector
lowering will figure the rotate is not supported and do its own
"lowering" using IOR.  Also, it seems that this only handles the case of
vector-by-scalar (aka uniform vector) rotates; otherwise it will expand
to scalar operations.

That said, the appropriate way to deal with this is in
tree-vect-patterns.cc, where vect_recog_rotate_pattern already exists
and should detect this rotate, so the above hunk shouldn't be
necessary - instead, eventually, the pattern recognition routine needs
improving?

Thanks,
Richard.




* RE: [PATCH] Fold truncations of left shifts in match.pd
  2022-06-02  8:48 ` Richard Biener
@ 2022-06-02 10:55   ` Roger Sayle
  2022-06-02 11:02     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Roger Sayle @ 2022-06-02 10:55 UTC (permalink / raw)
  To: 'Richard Biener'; +Cc: 'GCC Patches'


Hi Richard,
> +  /* RTL expansion knows how to expand rotates using shift/or.  */
> +  if (icode == CODE_FOR_nothing
> +      && (code == LROTATE_EXPR || code == RROTATE_EXPR)
> +      && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
> +      && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
> +    icode = (int) optab_handler (lshr_optab, vec_mode);
>
> but we then get the vector costing wrong.

The issue is that we currently get the (relative) vector costing wrong.
For gcc.dg/vect/pr98674.c, the vectorizer thinks the scalar code
requires two shifts and an ior, so it believes it's profitable to
vectorize this loop using two vector shifts and a vector ior.  But once
match.pd simplifies the truncate and recognizes the HImode rotate, we
end up with:

pr98674.c:6:16: note:   ==> examining statement: _6 = _1 r>> 8;
pr98674.c:6:16: note:   vect_is_simple_use: vectype vector(8) short int
pr98674.c:6:16: note:   vect_is_simple_use: operand 8, type of def: constant
pr98674.c:6:16: missed:   op not supported by target.
pr98674.c:8:33: missed:   not vectorized: relevant stmt not supported: _6 = _1 r>> 8;
pr98674.c:6:16: missed:  bad operation or unsupported loop bound.
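
(For context, the loop in pr98674.c is of roughly this shape -- a
reconstruction inferred from the dump above, not copied from the test:

void foo (short *x, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = ((unsigned short)x[i] << 8) | ((unsigned short)x[i] >> 8);
}

i.e. a HImode rotate by 8 in an otherwise vectorizable loop.)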


Clearly, it's a win to vectorize HImode rotates when the backend can
perform 8 (or 16) rotations at a time, even if that takes 3 vector
instructions where a scalar rotate can be performed in a single
instruction.  Fundamentally, vectorization may still be
desirable/profitable even when the backend doesn't provide an optab.

The current situation, where the i386 backend provides expanders to
lower rotations (or vcond) into individual instruction sequences, also
interferes with vector costing.  It's the vector cost function that
needs to be fixed, not the generated code made worse (or the backend
bloated by performing its own RTL expansion workarounds).

Is it instead ok to mark pr98674.c as XFAIL (a regression)?
The tweak to tree-vect-stmts.cc was based on the assumption that we wished
to continue vectorizing this loop.  Improving scalar code generation really
shouldn't disable vectorization like this.


Cheers,
Roger
--




* Re: [PATCH] Fold truncations of left shifts in match.pd
  2022-06-02 10:55   ` Roger Sayle
@ 2022-06-02 11:02     ` Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2022-06-02 11:02 UTC (permalink / raw)
  To: Roger Sayle; +Cc: GCC Patches

On Thu, Jun 2, 2022 at 12:55 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> [... quoted vectorizable_shift hunk and costing discussion snipped ...]
>
> Clearly, it's a win to vectorize HImode rotates when the backend can
> perform 8 (or 16) rotations at a time, even if that takes 3 vector
> instructions where a scalar rotate can be performed in a single
> instruction.  Fundamentally, vectorization may still be
> desirable/profitable even when the backend doesn't provide an optab.

Yes, as said, it's tree-vect-patterns.cc's job to handle this
not-natively-supported rotate by rewriting it.  Can you check why
vect_recog_rotate_pattern does not do this?  Ah, the code only handles
!TYPE_UNSIGNED (type) - not sure why though (for rotates it should not
matter, and for the lowered sequence we can convert to the desired
signedness to get arithmetic/logical shifts)?
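
For reference, the lowered sequence is essentially the usual shift/or
expansion -- a sketch for a 16-bit rotate-right by 8, with the shifts
performed in an unsigned type so that they are logical rather than
arithmetic:

  unsigned short t = (unsigned short) x;
  short r = (short) ((t >> 8) | (t << 8));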

> [... i386 expander / costing remarks snipped ...]
>
> Is it instead ok to mark pr98674.c as XFAIL (a regression)?
> The tweak to tree-vect-stmts.cc was based on the assumption that we wished
> to continue vectorizing this loop.  Improving scalar code generation really
> shouldn't disable vectorization like this.

Yes, see above where the fix needs to be.  The pattern will then expose
the shift and ior to the vectorizer, which are then properly costed.

Richard.


