public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand
@ 2024-01-11  7:13 fxue at os dot amperecomputing.com
  2024-01-11  7:30 ` [Bug target/113326] " pinskia at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-01-11  7:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

            Bug ID: 113326
           Summary: Optimize vector shift with constant delta on
                    shifting-count operand
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

For a shift by an induction variable, loop vectorization can generate a series
of vector shifts whose shift-count operands advance by a constant delta. We
could use the result of the first vector shift as a base and transform the
others into shifts of that result by a uniform delta, which would eliminate
the non-uniform shift-count vector constants and thereby save a load on most
architectures.

 int test(int array[16]);

 int foo(int value)
 {
   int array[16];

   for (int i = 0; i < 16; i++)
      array[i] = value >> i;

   return test(array);
 }

 // Current
 vect_value = { value, value, value, value };
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
 vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
 vect_shift_3 = vect_value >> { 12, 13, 14, 15 };

 // To be optimized to
 vect_value = { value, value, value, value };
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
 vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
 vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
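
The rewrite relies on the identity (value >> j) >> d == value >> (j + d),
which holds as long as j + d stays below the type width (for negative values
it also relies on GCC defining >> on signed integers as an arithmetic shift).
A minimal scalar sanity check of that identity:

 #include <assert.h>

 int main(void)
 {
   int value = -12345;  // negative, to exercise the arithmetic-shift case

   for (int j = 0; j < 4; j++)
     for (int d = 4; d <= 12; d += 4)
       assert(((value >> j) >> d) == (value >> (j + d)));

   return 0;
 }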


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
@ 2024-01-11  7:30 ` pinskia at gcc dot gnu.org
  2024-01-11  7:44 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-11  7:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
          Component|tree-optimization           |target
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2024-01-11
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note on aarch64 with SVE, you should be able to generate those constants
without a load, using the index instruction (though aarch64 does not
currently do so).  So this is definitely target-specific.
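
A minimal sketch with the ACLE SVE intrinsics (assuming an SVE-enabled
toolchain; svindex_u32 maps to the index instruction and materializes
{ base, base+1, ... } in a register without a load; the function name is
illustrative only):
```
#include <arm_sve.h>

/* Build the per-lane shift counts with "index" instead of loading a
   constant-pool entry, then do the arithmetic right shift.  */
svint32_t shift_by_lane (svint32_t value, uint32_t base)
{
  svbool_t pg = svptrue_b32 ();                /* all lanes active */
  svuint32_t counts = svindex_u32 (base, 1);   /* { base, base+1, ... } */
  return svasr_s32_x (pg, value, counts);
}
```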

Plus, this simplification happens after the vectorizer, which produces:
```
  <bb 3> [local count: 252544065]:
  # i_13 = PHI <i_10(5), 0(2)>
  # ivtmp_3 = PHI <ivtmp_2(5), 16(2)>
  # vect_vec_iv_.4_11 = PHI <_16(5), { 0, 1, 2, 3 }(2)>
  # vectp_array.6_19 = PHI <vectp_array.6_20(5), &array(2)>
  # ivtmp_22 = PHI <ivtmp_23(5), 0(2)>
  _16 = vect_vec_iv_.4_11 + { 4, 4, 4, 4 };
  vect__1.5_18 = vect_cst__17 >> vect_vec_iv_.4_11;
  _1 = value_8(D) >> i_13;
  MEM <vector(4) int> [(int *)vectp_array.6_19] = vect__1.5_18;
  i_10 = i_13 + 1;
  ivtmp_2 = ivtmp_3 - 1;
  vectp_array.6_20 = vectp_array.6_19 + 16;
  ivtmp_23 = ivtmp_22 + 1;
  if (ivtmp_23 < 4)
    goto <bb 5>; [75.00%]
  else
    goto <bb 4>; [25.00%]
```


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
  2024-01-11  7:30 ` [Bug target/113326] " pinskia at gcc dot gnu.org
@ 2024-01-11  7:44 ` pinskia at gcc dot gnu.org
  2024-01-11  7:52 ` fxue at os dot amperecomputing.com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-11  7:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Basically this requires an "un-shift" pass, and most likely it should be done
at the RTL level, though that might be too late.
Maybe isel?


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
  2024-01-11  7:30 ` [Bug target/113326] " pinskia at gcc dot gnu.org
  2024-01-11  7:44 ` pinskia at gcc dot gnu.org
@ 2024-01-11  7:52 ` fxue at os dot amperecomputing.com
  2024-01-11  7:56 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-01-11  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #3 from Feng Xue <fxue at os dot amperecomputing.com> ---

(In reply to Andrew Pinski from comment #1)
> Note on aarch64 with SVE, you should be able to generate those constants
> without a load, using the index instruction.
OK, thanks for the note. That still requires an extra instruction, whereas
with the proposed transform the constant delta could be encoded directly in
the shift instruction as an immediate operand.

> Basically this requires an "un-shift" pass, and most likely it should be
> done at the RTL level, though that might be too late.
> Maybe isel?
I'm thinking of adding the processing to pass_lower_vector_ssa, which also
contains other peephole vector SSA optimizations, not just lowering.


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
                   ` (2 preceding siblings ...)
  2024-01-11  7:52 ` fxue at os dot amperecomputing.com
@ 2024-01-11  7:56 ` pinskia at gcc dot gnu.org
  2024-01-11  8:10 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-11  7:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Feng Xue from comment #3)
> (In reply to Andrew Pinski from comment #1)
> > Note on aarch64 with SVE, you should be able to generate those constants
> > without a load, using the index instruction.
> OK, thanks for the note. That still requires an extra instruction, whereas
> with the proposed transform the constant delta could be encoded directly in
> the shift instruction as an immediate operand.
> 
> > Basically this requires an "un-shift" pass, and most likely it should be
> > done at the RTL level, though that might be too late.
> > Maybe isel?
> I'm thinking of adding the processing to pass_lower_vector_ssa, which also
> contains other peephole vector SSA optimizations, not just lowering.

It should be in isel, like the other vector instruction selection that
happens there.

pass_lower_vector_ssa is only for lowering generic vectors.


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
                   ` (3 preceding siblings ...)
  2024-01-11  7:56 ` pinskia at gcc dot gnu.org
@ 2024-01-11  8:10 ` pinskia at gcc dot gnu.org
  2024-01-11  9:13 ` rguenth at gcc dot gnu.org
  2024-01-11  9:58 ` fxue at os dot amperecomputing.com
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-11  8:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
One more thing:
```
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
 vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
 vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
```
vs
```
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
 vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
 vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
```

The first has fully independent operations, while in the second case
vect_shift_1 through vect_shift_3 all depend on vect_shift_0 (though they
are independent of each other).

On cores which have many vector units, the first one might be faster than
the second one.  So this needs a cost model too.
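
As a rough illustration, assume a hypothetical core with two vector shift
pipes and a 2-cycle shift latency:
```
            independent form           dependent form
cycle 0     issue shift_0, shift_1     issue shift_0
cycle 1     issue shift_2, shift_3     (wait on shift_0)
cycle 2     shift_0, shift_1 done      issue shift_1, shift_2
cycle 3     shift_2, shift_3 done      issue shift_3
cycle 4                                shift_1, shift_2 done
cycle 5                                shift_3 done
```
The dependent form trades a couple of cycles of latency for the removed
constant-pool loads, and which side wins is exactly what a cost model would
have to decide.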


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
                   ` (4 preceding siblings ...)
  2024-01-11  8:10 ` pinskia at gcc dot gnu.org
@ 2024-01-11  9:13 ` rguenth at gcc dot gnu.org
  2024-01-11  9:58 ` fxue at os dot amperecomputing.com
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-11  9:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #5)
> One more thing:
> ```
>  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
>  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
>  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
>  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> ```
> vs
> ```
>  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
>  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
>  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
>  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> ```
> 
> the first has fully independent operations while in the second case, there
> is one dependent and the are independent operations.
> 
> On cores which has many vector units the first one might be faster than the
> second one.  So this needs a cost model too.

Note the vectorizer has the shift values dependent as well (across
iterations); we just constant-propagate after unrolling here.

Note this is basically asking for "strength reduction" of expensive
constants, which could be more generally useful and not only for this
specific shift case.  Consider the same example but with an add instead
of a shift: the exact same set of constants will appear.
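
For instance, a hypothetical add variant of the testcase from the report
(reusing its test declaration) produces the exact same constant vectors:
```
int bar (int value)
{
  int array[16];

  for (int i = 0; i < 16; i++)
    array[i] = value + i;

  return test (array);
}

/* After vectorization and unrolling:
     vect_add_0 = vect_value + { 0, 1, 2, 3 };
     vect_add_1 = vect_value + { 4, 5, 6, 7 };  ...
   which could likewise become vect_add_1 = vect_add_0 + { 4, 4, 4, 4 },
   and so on.  */
```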


* [Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
  2024-01-11  7:13 [Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand fxue at os dot amperecomputing.com
                   ` (5 preceding siblings ...)
  2024-01-11  9:13 ` rguenth at gcc dot gnu.org
@ 2024-01-11  9:58 ` fxue at os dot amperecomputing.com
  6 siblings, 0 replies; 8+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-01-11  9:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #7 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > One more thing:
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
> >  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
> >  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> > ```
> > vs
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
> >  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
> >  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> > ```
> > 
> > The first has fully independent operations, while in the second case
> > vect_shift_1 through vect_shift_3 all depend on vect_shift_0 (though
> > they are independent of each other).
> > 
> > On cores which have many vector units, the first one might be faster
> > than the second one.  So this needs a cost model too.
> 
> Note the vectorizer has the shift values dependent as well (across
> iterations); we just constant-propagate after unrolling here.
> 
> Note this is basically asking for "strength reduction" of expensive
> constants, which could be more generally useful and not only for this
> specific shift case.  Consider the same example but with an add instead
> of a shift: the exact same set of constants will appear.

It is. But so far I have only found that vector shift gets this special
treatment of constant operands, based on their numerical pattern; I'm not
sure any other operator would.

BTW, here is a scalar version of strength reduction for shift:

  int a = value >> n;
  int b = value >> (n + 6);

  ==>

  int a = value >> n;
  int b = a >> 6;      // (n + 6) is not needed

But this is not covered by the current scalar strength-reduction pass.
