public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i
@ 2021-11-09  2:07 crazylht at gmail dot com
  2021-11-09 11:48 ` [Bug tree-optimization/103144] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2021-11-09  2:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

            Bug ID: 103144
           Summary: vectorizer failed to recognize shift>>=1 in loop as
                    shift>>i
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
  Target Milestone: ---
              Host: x86_64-pc-linux-gnu

#include<stdint.h>

void
foo1 (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
{
    for (int64_t i = 0; i != 64; i++)
    {
      uint64_t shift_i = shift >> i;
      pdst[i] = psrc[i] + shift_i;
    }
}

void
foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
{
    for (int64_t i = 0; i != 64; i++)
    {
      pdst[i] = psrc[i] + shift;
      shift >>= 1;
    }
}

gcc can vectorize foo1 but not foo since there's cross-iteration dep for shift
>>= 1, but shift >>= 1 is just shift >> i in loop.

Where should it be handled, can vect_recog_pattern handle this?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
@ 2021-11-09 11:48 ` rguenth at gcc dot gnu.org
  2021-12-01  1:30 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-11-09 11:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-11-09
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
There are two choices.

1) implement more general induction support and vectorize it as

    shiftv = { shift, shift >>1, shift>>2, shift>>3 };
    for (int64_t i = 0; i != 64; i++)
    {
      pdst[i] = psrc[i] + shiftv;
      shiftv >>= {4, 4, 4, 4};
    }

(I think I've seen related bugreports doing bit-flip "induction" with XOR)

2) add the ability to have patterns for PHIs and do what you suggest,
pattern-replacing the shift PHI def with a non-PHI pattern shift >> i
after doing cycle analysis - then the PHI and the shift >>= 1 should become
!STMT_VINFO_RELEVANT.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
  2021-11-09 11:48 ` [Bug tree-optimization/103144] " rguenth at gcc dot gnu.org
@ 2021-12-01  1:30 ` crazylht at gmail dot com
  2021-12-01  8:14 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2021-12-01  1:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
Another issue is for SLP, when trip count is small and loop is completely
unrolled. SLP failed to generate vlshr_optab.

#include<stdint.h>
void
foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
{
    for (int64_t i = 0; i != 4; i++)
    {
      pdst[i] = psrc[i] + shift;
      shift >>= 1;
    }
}

.175t.slp1 dump:

  _8 = *psrc_12(D);
  _28 = _8 + shift_10(D);
  shift_30 = shift_10(D) >> 1;
  _39 = psrc_12(D) + 8;
  _40 = *_39;
  _41 = pdst_13(D) + 8;
  _42 = shift_30 + _40;
  shift_44 = shift_30 >> 1;
  _53 = psrc_12(D) + 16;
  _54 = *_53;
  _55 = pdst_13(D) + 16;
  _56 = shift_44 + _54;
  shift_58 = shift_44 >> 1;
  _3 = psrc_12(D) + 24;
  _4 = *_3;
  _5 = pdst_13(D) + 24;
  _6 = _4 + shift_58;
  _15 = {_28, _42, _56, _6};
  vectp.8_18 = pdst_13(D);
  MEM <vector(4) long unsigned int> [(uint64_t *)vectp.8_18] = _15;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
  2021-11-09 11:48 ` [Bug tree-optimization/103144] " rguenth at gcc dot gnu.org
  2021-12-01  1:30 ` crazylht at gmail dot com
@ 2021-12-01  8:14 ` rguenther at suse dot de
  2021-12-01  8:49 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenther at suse dot de @ 2021-12-01  8:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 1 Dec 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144
> 
> --- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
> Another issue is for SLP, when trip count is small and loop is completely
> unrolled. SLP failed to generate vlshr_optab.
> 
> #include<stdint.h>
> void
> foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
> {
>     for (int64_t i = 0; i != 4; i++)
>     {
>       pdst[i] = psrc[i] + shift;
>       shift >>= 1;
>     }
> }
> 
> .175t.slp1 dump:
> 
>   _8 = *psrc_12(D);
>   _28 = _8 + shift_10(D);
>   shift_30 = shift_10(D) >> 1;
>   _39 = psrc_12(D) + 8;
>   _40 = *_39;
>   _41 = pdst_13(D) + 8;
>   _42 = shift_30 + _40;
>   shift_44 = shift_30 >> 1;
>   _53 = psrc_12(D) + 16;
>   _54 = *_53;
>   _55 = pdst_13(D) + 16;
>   _56 = shift_44 + _54;
>   shift_58 = shift_44 >> 1;
>   _3 = psrc_12(D) + 24;
>   _4 = *_3;
>   _5 = pdst_13(D) + 24;
>   _6 = _4 + shift_58;
>   _15 = {_28, _42, _56, _6};
>   vectp.8_18 = pdst_13(D);
>   MEM <vector(4) long unsigned int> [(uint64_t *)vectp.8_18] = _15;

The issue is that the shifts are dependent on each other:

  shift_30 = shift_10(D) >> 1;
  shift_44 = shift_30 >> 1;
  shift_58 = shift_44 >> 1;

and one addend is shift_10(D) which isn't shifted (SLP build does
not support inserting no-op operations into some lanes like +- 0
or */ 1 or in this case >> 0).  So first of all we either need
to simplify the above to

  shift_30 = shift_10(D) >> 1;
  shift_44 = shift_10(D) >> 2;
  shift_58 = shift_10(D) >> 3;

then the shift_10(D) >> 0 issue remains.

With

void
foo (uint64_t* __restrict pdst, uint64_t* psrc, uint64_t shift)
{
    for (int64_t i = 0; i != 4; i++)
    {
      shift >>= 1;
      pdst[i] = psrc[i] + shift;
    }
}

you see SLP discovery works but we vectorize it as

  shift_23 = shift_10(D) >> 1;
  shift_37 = shift_23 >> 1;
  shift_51 = shift_37 >> 1;
  _17 = {shift_10(D), shift_23, shift_37, shift_51};
  vect_shift_23.10_8 = _17 >> 1;

which eats from the profitability.  We also fail to improve
such code (if it were written by a user) later realizing
we could merge the scalar shifts with the vector shift.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
                   ` (2 preceding siblings ...)
  2021-12-01  8:14 ` rguenther at suse dot de
@ 2021-12-01  8:49 ` rguenth at gcc dot gnu.org
  2022-09-07  0:52 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-12-01  8:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So neither cunroll itself nor the triggered FRE afterwards do what forwprop
would do - fold each stmt.  There's no propagation into them so FRE doesn't
fold,
but we don't fold at stmt creation time (that's just copy_bb, so
understandable).
Supposedly cunroll could invoke a per-BB forwprop or the region FRE could be
teached to fold all stmts instead of just the modified ones (but it's the
outermost loop so only the scalar-cleanup whole function FRE pass triggers
here).

Ideally cunroll IL copying would do folding & FRE on-the-fly.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
                   ` (3 preceding siblings ...)
  2021-12-01  8:49 ` rguenth at gcc dot gnu.org
@ 2022-09-07  0:52 ` cvs-commit at gcc dot gnu.org
  2022-09-07  0:57 ` crazylht at gmail dot com
  2022-09-07  0:57 ` crazylht at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-09-07  0:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:c13223b790bbc5e4a3f5605e057eac59b61b2c85

commit r13-2503-gc13223b790bbc5e4a3f5605e057eac59b61b2c85
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Aug 4 09:04:22 2022 +0800

    Extend vectorizer to handle nonlinear induction for neg, mul/lshift/rshift
with a constant.

    For neg, the patch create a vec_init as [ a, -a, a, -a, ...  ] and no
    vec_step is needed to update vectorized iv since vf is always multiple
    of 2(negative * negative is positive).

    For shift, the patch create a vec_init as [ a, a >> c, a >> 2*c, ..]
    as vec_step as [ c * nunits, c * nunits, c * nunits, ... ], vectorized iv
is
    updated as vec_def = vec_init >>/<< vec_step.

    For mul, the patch create a vec_init as [ a, a * c, a * pow(c, 2), ..]
    as vec_step as [ pow(c,nunits), pow(c,nunits),...] iv is updated as vec_def
=
    vec_init * vec_step.

    The patch handles nonlinear iv for
    1. Integer type only, floating point is not handled.
    2. No slp_node.
    3. iv_loop should be same as vector loop, not nested loop.
    4. No UD is created, for mul, use unsigned mult to avoid UD, for
       shift, shift count should be less than type precision.

    gcc/ChangeLog:

            PR tree-optimization/103144
            * tree-vect-loop.cc (vect_is_nonlinear_iv_evolution): New function.
            (vect_analyze_scalar_cycles_1): Detect nonlinear iv by upper
function.
            (vect_create_nonlinear_iv_init): New function.
            (vect_peel_nonlinear_iv_init): Ditto.
            (vect_create_nonlinear_iv_step): Ditto
            (vect_create_nonlinear_iv_vec_step): Ditto
            (vect_update_nonlinear_iv): Ditto
            (vectorizable_nonlinear_induction): Ditto.
            (vectorizable_induction): Call
            vectorizable_nonlinear_induction when induction_type is not
            vect_step_op_add.
            * tree-vect-loop-manip.cc (vect_update_ivs_after_vectorizer):
            Update nonlinear iv for epilogue loop.
            * tree-vectorizer.h (enum vect_induction_op_type): New enum.
            (STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE): New Macro.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr103144-mul-1.c: New test.
            * gcc.target/i386/pr103144-mul-2.c: New test.
            * gcc.target/i386/pr103144-neg-1.c: New test.
            * gcc.target/i386/pr103144-neg-2.c: New test.
            * gcc.target/i386/pr103144-shift-1.c: New test.
            * gcc.target/i386/pr103144-shift-2.c: New test.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
                   ` (4 preceding siblings ...)
  2022-09-07  0:52 ` cvs-commit at gcc dot gnu.org
@ 2022-09-07  0:57 ` crazylht at gmail dot com
  2022-09-07  0:57 ` crazylht at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2022-09-07  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
Fixed in GCC13.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/103144] vectorizer failed to recognize shift>>=1 in loop as shift>>i
  2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
                   ` (5 preceding siblings ...)
  2022-09-07  0:57 ` crazylht at gmail dot com
@ 2022-09-07  0:57 ` crazylht at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: crazylht at gmail dot com @ 2022-09-07  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103144

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-09-07  0:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-09  2:07 [Bug tree-optimization/103144] New: vectorizer failed to recognize shift>>=1 in loop as shift>>i crazylht at gmail dot com
2021-11-09 11:48 ` [Bug tree-optimization/103144] " rguenth at gcc dot gnu.org
2021-12-01  1:30 ` crazylht at gmail dot com
2021-12-01  8:14 ` rguenther at suse dot de
2021-12-01  8:49 ` rguenth at gcc dot gnu.org
2022-09-07  0:52 ` cvs-commit at gcc dot gnu.org
2022-09-07  0:57 ` crazylht at gmail dot com
2022-09-07  0:57 ` crazylht at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).