[Bug tree-optimization/98339] New: GCC could not vectorize loop with conditional reduced add and store

From: wwwhhhyyy333 at gmail dot com @ 2020-12-17  6:03 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98339

            Bug ID: 98339
           Summary: GCC could not vectorize loop with conditional reduced
                    add and store
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

For testcase

void foo(
    int* restrict x,
    int n,
    int start,
    int m,
    int* restrict ret
)
{
    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
            ret[0] += x[i];
    }
}

With -O3 -mavx2 the loop is not vectorized: ret[0] += x[i] becomes a zero-step
MASK_STORE inside the loop, and data-reference (dr) analysis fails for a
zero-step store.

But the version with loop store motion applied manually

void foo2(
    int* restrict x,
    int n,
    int start,
    int m,
    int* restrict ret
)
{
    int tmp = 0;

    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
            tmp += x[i];
    }

    ret[0] += tmp;
}

can be vectorized.

godbolt: https://godbolt.org/z/Kcv8hP

There is no LIM pass between ifcvt and vect, and the current LIM cannot handle
MASK_STORE. Is there any way to vectorize foo, for example by doing loop store
motion in ifcvt instead of creating a MASK_STORE?


* [Bug tree-optimization/98339] GCC could not vectorize loop with conditional reduced add and store
From: rguenth at gcc dot gnu.org @ 2021-01-04 15:57 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98339

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Target|                            |x86_64-*-*
     Ever confirmed|0                           |1
             Blocks|                            |53947
   Last reconfirmed|                            |2021-01-04

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that we need to vectorize this as a reduction, and since there's
no "masked scalar store" on GIMPLE, LIM by itself doesn't help.  The reason
LIM doesn't apply store-motion here is the _load_, which can trap.  LIM would
like to do

  int ret0 = ret[0];
  bool stored = false;
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if (pos <= m)
        {
          ret0 += x[i];
          stored = true;
        }
    }
  if (stored)
    ret[0] = ret0;

but as you can see the unconditional load breaks this.  LIM would need to
be changed to handle the whole load-update-store sequence delaying the
load as well (thereby re-associating the reduction).
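Written out by hand, the delayed load-update-store sequence might look like the
following minimal sketch.  The name foo_delayed and the exact placement of the
stored flag are illustrative assumptions, not what LIM emits today.  Note that
summing into a zero-initialized temporary re-associates the additions, which is
the re-association mentioned above.

```c
/* Hypothetical hand-written form of the delayed load-update-store sequence.
   foo_delayed is an illustrative name, not GCC output. */
void foo_delayed(int *restrict x, int n, int start, int m, int *restrict ret)
{
    int tmp = 0;     /* partial sum, starts at 0 instead of ret[0] */
    int stored = 0;  /* set once any iteration takes the branch */

    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
        {
            tmp += x[i];
            stored = 1;
        }
    }

    /* ret[0] is loaded and stored only if the original loop would have
       accessed it, so no new trapping access is introduced. */
    if (stored)
        ret[0] += tmp;
}
```

In this form the loop body no longer touches ret at all, so it has the same
shape as the reduction in foo2.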

An alternative would be to split the loop and apply store-motion to the tail.

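    int i;
    /* Find the first iteration that satisfies the condition. */
    for (i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
            break;
    }
    if (i < n)
    {
        /* ret[0] is known to be accessed, so the load cannot add a trap. */
        int ret0 = ret[0];
        for (int j = 0; j < n; j++)
        {
            int pos = start + j;
            if (pos <= m)
                ret0 += x[j];
        }
        ret[0] = ret0;
    }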

We can then vectorize the second loop.

At the source level the fix is to make sure the load from ret[0] doesn't trap.
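One way to express that at the source level is to load and store ret[0]
unconditionally, as in the minimal sketch below.  The name foo_uncond is
hypothetical, and the rewrite assumes ret[0] is always valid to read and write,
even when no iteration satisfies pos <= m; it is not equivalent to foo if that
assumption does not hold.

```c
/* Hypothetical source-level variant; foo_uncond is an illustrative name.
   Assumes ret[0] is readable and writable regardless of the condition. */
void foo_uncond(int *restrict x, int n, int start, int m, int *restrict ret)
{
    int ret0 = ret[0];  /* unconditional load, hoisted out of the loop */

    for (int i = 0; i < n; i++)
    {
        int pos = start + i;
        if (pos <= m)
            ret0 += x[i];  /* conditional reduction on a local */
    }

    ret[0] = ret0;  /* unconditional store after the loop */
}
```

The loop here is a conditional reduction on a local variable, the same shape as
foo2 in the original report, so it should be a candidate for vectorization in
the same way; this has not been checked against a specific GCC version.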


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
