[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
@ 2020-05-19  9:39 ` guojiufu at gcc dot gnu.org
  2020-05-21  5:29 ` guojiufu at gcc dot gnu.org
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-19  9:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Jiu Fu Guo <guojiufu at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org

--- Comment #25 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
> 
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

With this kind of widening,  we can see ~5% performance improvement for xz on
some platforms.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
  2020-05-19  9:39 ` [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison guojiufu at gcc dot gnu.org
@ 2020-05-21  5:29 ` guojiufu at gcc dot gnu.org
  2020-05-27  9:32 ` guojiufu at gcc dot gnu.org
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-21  5:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #26 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
Had a test on spec2017 xz_r by changing the specified loop manually, on
ppc64le.

original loop (this loops occur three times in code):
                        while (++len != len_limit)
                                if (pb[len] != cur[len])
                                        break;
changed to loop:
typedef long long __attribute__((may_alias)) TYPEE;

  for(++len; len + sizeof(TYPEE) <= len_limit; len += sizeof(TYPEE)) {
    long long a = *((TYPEE*)(cur+len));
    long long b = *((TYPEE*)(pb+len));
    if (a != b) {
      break; //to optimize len can be move forward here.
      }
    }
  for (;len != len_limit; ++len)
    if (pb[len] != cur[len])
      break;

We can see xz_r runtime improved from 433s to 382s(>12%).
It would be very valuable to do this kind of widening reading/checking.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
  2020-05-19  9:39 ` [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison guojiufu at gcc dot gnu.org
  2020-05-21  5:29 ` guojiufu at gcc dot gnu.org
@ 2020-05-27  9:32 ` guojiufu at gcc dot gnu.org
  2020-05-27  9:45 ` guojiufu at gcc dot gnu.org
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-27  9:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #27 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Wilco from comment #13)
> So to add some real numbers to the discussion, the average number of
> iterations is 4.31. Frequency stats (16 includes all iterations > 16 too):
> 
> 1: 29.0
> 2: 4.2
> 3: 1.0
> 4: 36.7
> 5: 8.7
> 6: 3.4
> 7: 3.0
> 8: 2.6
> 9: 2.1
> 10: 1.9
> 11: 1.6
> 12: 1.2
> 13: 0.9
> 14: 0.8
> 15: 0.7
> 16: 2.1
> 

Find one interesting thing:
If using widen reading for the run which > 16 iterations, we can see the
performance is significantly improved(>18%) for xz_r in spec.
This means that the frequency is small for >16, while it still costs a big part
of the runtime.

if (len_limit - len > 16)                                         
    {                                                                   
      for(++len; len + sizeof(TYPEE) <= len_limit; len += sizeof(TYPEE)) 
        {                                                               
          long long a = *((TYPEE*)(cur+len));                           
          long long b = *((TYPEE*)(pb+len));                            
          if (a != b) {                                                 
            int lz = __builtin_ctzll(a ^ b);                            
            len += lz / 8;                                              
            goto found;                                                 
            break;                                                      
          }                                                             
        }                                                               
      for (;len != len_limit; ++len)                                    
        if (pb[len] != cur[len])                                        
          break;                                                        
    found:;                                                             
    }    
else
 { xxxx original loop}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2020-05-27  9:32 ` guojiufu at gcc dot gnu.org
@ 2020-05-27  9:45 ` guojiufu at gcc dot gnu.org
  2020-05-27 11:21 ` wilco at gcc dot gnu.org
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-27  9:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #28 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #27)

> > 12: 1.2
> > 13: 0.9
> > 14: 0.8
> > 15: 0.7
> > 16: 2.1
> > 
> 
> Find one interesting thing:
> If using widen reading for the run which > 16 iterations, we can see the
> performance is significantly improved(>18%) for xz_r in spec.
> This means that the frequency is small for >16, while it still costs a big
> part of the runtime.
> 

Oh, Recheck frequency in my test, the frequency is big (99.8%) for >16
iterations.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2020-05-27  9:45 ` guojiufu at gcc dot gnu.org
@ 2020-05-27 11:21 ` wilco at gcc dot gnu.org
  2020-05-27 13:06 ` guojiufu at gcc dot gnu.org
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2020-05-27 11:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #29 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #28)
> (In reply to Jiu Fu Guo from comment #27)
> 
> > > 12: 1.2
> > > 13: 0.9
> > > 14: 0.8
> > > 15: 0.7
> > > 16: 2.1
> > > 
> > 
> > Find one interesting thing:
> > If using widen reading for the run which > 16 iterations, we can see the
> > performance is significantly improved(>18%) for xz_r in spec.
> > This means that the frequency is small for >16, while it still costs a big
> > part of the runtime.
> > 
> 
> Oh, Recheck frequency in my test, the frequency is big (99.8%) for >16
> iterations.

The frequency for >16 iterations is small, 2.1%. The limit is generally large,
but the actual number of iterations is what matters because of the early exit.

The key question remains whether it is legal to assume the limit implies the
memory is valid and use wider accesses.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2020-05-27 11:21 ` wilco at gcc dot gnu.org
@ 2020-05-27 13:06 ` guojiufu at gcc dot gnu.org
  2020-05-27 13:25 ` wilco at gcc dot gnu.org
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-27 13:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #30 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Wilco from comment #29)
> (In reply to Jiu Fu Guo from comment #28)
> > > 
> > > Find one interesting thing:
> > > If using widen reading for the run which > 16 iterations, we can see the
> > > performance is significantly improved(>18%) for xz_r in spec.
> > > This means that the frequency is small for >16, while it still costs a big
> > > part of the runtime.
> > > 
> > 
> > Oh, Recheck frequency in my test, the frequency is big (99.8%) for >16
> > iterations.
> 
> The frequency for >16 iterations is small, 2.1%. The limit is generally
> large, but the actual number of iterations is what matters because of the
> early exit.
Right, the actual number of executed iterations relates to runtime. 
> 
> The key question remains whether it is legal to assume the limit implies the
> memory is valid and use wider accesses.
If unaligned access is support, it would be valid to access the memory.
Otherwise, checking like ((pb&7) == (cur & 7)) would cost an additional test,
and it may not sure likely to be true.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2020-05-27 13:06 ` guojiufu at gcc dot gnu.org
@ 2020-05-27 13:25 ` wilco at gcc dot gnu.org
  2020-05-28  2:18 ` guojiufu at gcc dot gnu.org
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2020-05-27 13:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #31 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #30)
> (In reply to Wilco from comment #29)

> > The key question remains whether it is legal to assume the limit implies the
> > memory is valid and use wider accesses.

> If unaligned access is support, it would be valid to access the memory.
> Otherwise, checking like ((pb&7) == (cur & 7)) would cost an additional
> test, and it may not sure likely to be true.

If both pointers are aligned it would be safe indeed, but if unaligned they
will access up to 7 bytes after the first match. And that's not safe without
knowing the underlying buffer bounds. See eg. comment #3.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2020-05-27 13:25 ` wilco at gcc dot gnu.org
@ 2020-05-28  2:18 ` guojiufu at gcc dot gnu.org
  2020-05-28  2:30 ` guojiufu at gcc dot gnu.org
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-28  2:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #32 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Wilco from comment #31)
> (In reply to Jiu Fu Guo from comment #30)
> > (In reply to Wilco from comment #29)
> 
> > > The key question remains whether it is legal to assume the limit implies the
> > > memory is valid and use wider accesses.
> 
> > If unaligned access is support, it would be valid to access the memory.
> > Otherwise, checking like ((pb&7) == (cur & 7)) would cost an additional
> > test, and it may not sure likely to be true.
> 
> If both pointers are aligned it would be safe indeed, but if unaligned they
> will access up to 7 bytes after the first match. And that's not safe without
> knowing the underlying buffer bounds. See eg. comment #3.

Extra information is needed, before using widen reading: if the pointers are
aligned, and make sure the underlying buffer is large enough. Additional
checking is needed and the original loop may need to keep.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2020-05-28  2:18 ` guojiufu at gcc dot gnu.org
@ 2020-05-28  2:30 ` guojiufu at gcc dot gnu.org
  2020-05-28 12:44 ` wilco at gcc dot gnu.org
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-05-28  2:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #33 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
It would be relatively easy if the target supports unaligned access.  like
read64ne in
https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
Then the alignment issue is relaxed.   It may be safer if we can prove/assume
the underlying buffer is enough, like array accessing or pointer+index
accessing in a loop.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2020-05-28  2:30 ` guojiufu at gcc dot gnu.org
@ 2020-05-28 12:44 ` wilco at gcc dot gnu.org
  2020-05-31 20:31 ` wschmidt at linux dot ibm.com
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2020-05-28 12:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #34 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #33)
> It would be relatively easy if the target supports unaligned access.  like
> read64ne in
> https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
> Then the alignment issue is relaxed.   It may be safer if we can
> prove/assume the underlying buffer is enough, like array accessing or
> pointer+index accessing in a loop.

Yes, without unaligned support you can't use a wider access.

If we can't prove the buffer bounds then we'd have to use a page cross check
before every unaligned access.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2020-05-28 12:44 ` wilco at gcc dot gnu.org
@ 2020-05-31 20:31 ` wschmidt at linux dot ibm.com
  2020-06-02  2:49 ` guojiufu at gcc dot gnu.org
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wschmidt at linux dot ibm.com @ 2020-05-31 20:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #35 from wschmidt at linux dot ibm.com ---
Hi Jeff,

Just a quick comment.  We should never discuss raw runtimes of SPEC 
benchmarks on Power hardware in public.  It's okay to talk about 
improvements (>12% in this case), but not wall clock time.  Not a big 
deal, but there are some legal reasons regarding SPEC that cause us to 
be a little careful.

Thanks!
Bill

On 5/21/20 12:29 AM, guojiufu at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
>
> --- Comment #26 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
> Had a test on spec2017 xz_r by changing the specified loop manually, on
> ppc64le.
>
> original loop (this loops occur three times in code):
>                          while (++len != len_limit)
>                                  if (pb[len] != cur[len])
>                                          break;
> changed to loop:
> typedef long long __attribute__((may_alias)) TYPEE;
>
>    for(++len; len + sizeof(TYPEE) <= len_limit; len += sizeof(TYPEE)) {
>      long long a = *((TYPEE*)(cur+len));
>      long long b = *((TYPEE*)(pb+len));
>      if (a != b) {
>        break; //to optimize len can be move forward here.
>        }
>      }
>    for (;len != len_limit; ++len)
>      if (pb[len] != cur[len])
>        break;
>
> We can see xz_r runtime improved from 433s to 382s(>12%).
> It would be very valuable to do this kind of widening reading/checking.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2020-05-31 20:31 ` wschmidt at linux dot ibm.com
@ 2020-06-02  2:49 ` guojiufu at gcc dot gnu.org
  2020-06-02  6:35 ` rguenther at suse dot de
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-02  2:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #36 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
> 
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

Unaligned reading is supported on some platforms already, and reading
multi-bytes(64/128bits) takes far less cost than reading 8bits multi-times,
extremely, dword reading may cost the same cycles as byte reading.
As the above discussions, there are still a few kinds of stuff need to take
care of.  I’m wondering if we could introduce this as a compiler optimization
in some circumstances.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2020-06-02  2:49 ` guojiufu at gcc dot gnu.org
@ 2020-06-02  6:35 ` rguenther at suse dot de
  2020-06-04 14:15 ` guojiufu at gcc dot gnu.org
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2020-06-02  6:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #37 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 2 Jun 2020, guojiufu at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> 
> --- Comment #36 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
> (In reply to Jakub Jelinek from comment #10)
> > If the compiler knew say from PGO that pos is usually a multiple of certain
> > power of two and that the loop usually iterates many times (I guess the
> > latter can be determined from comparing the bb count of the loop itself and
> > its header), it could emit something like:
> > static int func2(int max, int pos, unsigned char *cur)
> > {
> >   unsigned char *p = cur + pos;
> >   int len = 0;
> >   if (max > 32 && (pos & 7) == 0)
> >     {
> >       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
> >       while (++len != l)
> >         if (p[len] != cur[len])
> >           goto end;
> >       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> > long *) &p[len];
> >       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> > long *) &cur[len];
> >       while (len + 8 < max)
> >         {
> >           if (*p2++ != *cur2++)
> >             break;
> >           len += 8;
> >         }
> >       --len;
> >     }
> >   while (++len != max)
> >     if (p[len] != cur[len])
> >       break;
> > end:
> >   return cur[len];
> > }
> > 
> > or so (untested).  Of course, it could be done using SIMD too if there is a
> > way to terminate the loop if any of the elts is different and could be done
> > in that case at 16 or 32 or 64 characters at a time etc.
> > But, without knowing that pos is typically some power of two this would just
> > waste code size, dealing with the unaligned cases would be more complicated
> > (one can't read the next elt until proving that the current one is all
> > equal), so it would need to involve some rotations (or permutes for SIMD).
> 
> Unaligned reading is supported on some platforms already, and reading
> multi-bytes(64/128bits) takes far less cost than reading 8bits multi-times,
> extremely, dword reading may cost the same cycles as byte reading.
> As the above discussions, there are still a few kinds of stuff need to take
> care of.  I’m wondering if we could introduce this as a compiler optimization
> in some circumstances.

The challenge is as always to not regress cases too much where it ends
up not profitable rather than just making it work for SPEC ;)  Luckily
chasing SPEC numbers isn't our primary objective.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2020-06-02  6:35 ` rguenther at suse dot de
@ 2020-06-04 14:15 ` guojiufu at gcc dot gnu.org
  2020-06-08  2:45 ` guojiufu at gcc dot gnu.org
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-04 14:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #38 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #37)
> On Tue, 2 Jun 2020, guojiufu at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> > 
...
> > Unaligned reading is supported on some platforms already, and reading
> > multi-bytes(64/128bits) takes far less cost than reading 8bits multi-times,
> > extremely, dword reading may cost the same cycles as byte reading.
> > As the above discussions, there are still a few kinds of stuff need to take
> > care of.  I’m wondering if we could introduce this as a compiler optimization
> > in some circumstances.
> 
> The challenge is as always to not regress cases too much where it ends
> up not profitable rather than just making it work for SPEC ;)  Luckily
> chasing SPEC numbers isn't our primary objective.

Yes.  An optimization would benefit general workloads. 
I had also tangled about which would be the best way to do, as Jakub and all
guys discussed above.  Through wider reading (vs. unroll/peel), it may be
helpful for a serial byte reading.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2020-06-04 14:15 ` guojiufu at gcc dot gnu.org
@ 2020-06-08  2:45 ` guojiufu at gcc dot gnu.org
  2020-06-08  9:22 ` wilco at gcc dot gnu.org
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-08  2:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #39 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
I’m thinking to draft a patch for this optimization.  If any suggestions,
please point out, thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2020-06-08  2:45 ` guojiufu at gcc dot gnu.org
@ 2020-06-08  9:22 ` wilco at gcc dot gnu.org
  2020-06-09  3:46 ` guojiufu at gcc dot gnu.org
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2020-06-08  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #40 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #39)
> I’m thinking to draft a patch for this optimization.  If any suggestions,
> please point out, thanks.

Which optimization to be precise? Besides unrolling I haven't seen a proposal
for an optimization which is both safe and generally applicable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2020-06-08  9:22 ` wilco at gcc dot gnu.org
@ 2020-06-09  3:46 ` guojiufu at gcc dot gnu.org
  2020-06-09  3:49 ` guojiufu at gcc dot gnu.org
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-09  3:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #41 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Wilco from comment #40)
> (In reply to Jiu Fu Guo from comment #39)
> > I’m thinking to draft a patch for this optimization.  If any suggestions,
> > please point out, thanks.
> 
> Which optimization to be precise? Besides unrolling I haven't seen a
> proposal for an optimization which is both safe and generally applicable.

1. For unroll, there are still branches in the loop. And then need careful
merge on those reading and comparison.  Another thing about unroll would be
that, if we prefer to optimize this early in GIMPLE, we still not GIMPLE unroll
on it.
while (len != max)
{
    if (p[len] != cur[len])
      break; ++len;
    if (p[len] != cur[len])
      break; ++len;
    if (p[len] != cur[len])
      break; ++len;
....
}

2. Also thinking about if it makes sense to enhance GIMPLE vectorization pass. 
In an aspect that using a vector to read and compare, also need to handle/merge
compares into vector compare and handle early exit carefully.
if (len + 8 < max && buffers not cross page) ///(p&4K) == (p+8)&4k? 4k:pagesize
 while (len != max)
{
 vec a = xx p;
 vec b = xx cur;
 if (a != b) /// may not only for comparison 
  {....;break;}
 len += 8;
}

3. Introduce a new stand-alone pass to optimize reading/computing shorter types
into large(dword/vector) reading/computing.

Thanks a lot for your comments/suggestions!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2020-06-09  3:46 ` guojiufu at gcc dot gnu.org
@ 2020-06-09  3:49 ` guojiufu at gcc dot gnu.org
  2020-06-10  9:56 ` guojiufu at gcc dot gnu.org
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-09  3:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #42 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #41)
> (In reply to Wilco from comment #40)
> > (In reply to Jiu Fu Guo from comment #39)
> > > I’m thinking to draft a patch for this optimization.  If any suggestions,
> > > please point out, thanks.
> > 
> > Which optimization to be precise? Besides unrolling I haven't seen a
> > proposal for an optimization which is both safe and generally applicable.
> 
> 1. For unroll, there are still branches in the loop. And then need careful
> merge on those reading and comparison.  Another thing about unroll would be
> that, if we prefer to optimize this early in GIMPLE, we still not GIMPLE
> unroll on it.
> while (len != max)
> {
>     if (p[len] != cur[len])
>       break; ++len;
>     if (p[len] != cur[len])
>       break; ++len;
>     if (p[len] != cur[len])
>       break; ++len;
> ....
> }

Besides we do not have GIMPLE unroll for this kind of code, this may also
likely be transformed to vectorization.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2020-06-09  3:49 ` guojiufu at gcc dot gnu.org
@ 2020-06-10  9:56 ` guojiufu at gcc dot gnu.org
  2020-06-10 10:03 ` wilco at gcc dot gnu.org
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-10  9:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #43 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
To handle vectorization for this kind of code, it needs to overcome the hard
issue mentioned in comment #5: the loop has 2 exits.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2020-06-10  9:56 ` guojiufu at gcc dot gnu.org
@ 2020-06-10 10:03 ` wilco at gcc dot gnu.org
  2020-06-12  4:20 ` guojiufu at gcc dot gnu.org
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2020-06-10 10:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #44 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #43)
> To handle vectorization for this kind of code, it needs to overcome the hard
> issue mentioned in comment #5: the loop has 2 exits.

Yes and that also implies vector loads are unsafe unless they are non-faulting.
Few ISAs have such support.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (19 preceding siblings ...)
  2020-06-10 10:03 ` wilco at gcc dot gnu.org
@ 2020-06-12  4:20 ` guojiufu at gcc dot gnu.org
  2020-06-12  4:47 ` guojiufu at gcc dot gnu.org
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-12  4:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #45 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Wilco from comment #44)
> (In reply to Jiu Fu Guo from comment #43)
> > To handle vectorization for this kind of code, it needs to overcome the hard
> > issue mentioned in comment #5: the loop has 2 exits.
> 
> Yes and that also implies vector loads are unsafe unless they are
> non-faulting. Few ISAs have such support.

Thanks a lot for your kind comments!

For this case, the first match for (p[len] != cur[len]) very very possible
happen at middle of vector. Vector loading reads more data than required.  Also
as comment #3 and comment #7 said, those data are in question if they are
really accessible.

Since this is a loop, can we assume it is safe if reading data less than the
upbound of the loop and if two buffers are not cross pages?
something like: "while (++len != max - sizeof(T)) { T a = *(T*)(p+len)..}" and
"p & page_size_mask == (p+len) & page_size_mask".

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (20 preceding siblings ...)
  2020-06-12  4:20 ` guojiufu at gcc dot gnu.org
@ 2020-06-12  4:47 ` guojiufu at gcc dot gnu.org
  2020-06-17 15:27 ` guojiufu at gcc dot gnu.org
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-12  4:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #46 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jiu Fu Guo from comment #45)
> (In reply to Wilco from comment #44)
> > (In reply to Jiu Fu Guo from comment #43)
> > > To handle vectorization for this kind of code, it needs to overcome the hard
> > > issue mentioned in comment #5: the loop has 2 exits.
> > 
> > Yes and that also implies vector loads are unsafe unless they are
> > non-faulting. Few ISAs have such support.
> 
> Since this is a loop, can we assume it is safe if reading data less than the
> upbound of the loop and if two buffers are not cross pages?
> something like: "while (++len != max - sizeof(T)) { T a = *(T*)(p+len)..}"
> and "p & page_size_mask == (p+len) & page_size_mask".

In other words, if the underlying buffers cover the low-bound and upbound of
the loop.  We may assume it, but hard to prove it.  In which circumstances, we
can prove it?  Or does it make sense to provide a flag to let user confirm it
is safe?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (21 preceding siblings ...)
  2020-06-12  4:47 ` guojiufu at gcc dot gnu.org
@ 2020-06-17 15:27 ` guojiufu at gcc dot gnu.org
  2021-05-04 12:32 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 26+ messages in thread
From: guojiufu at gcc dot gnu.org @ 2020-06-17 15:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #47 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
memcmp is using wider reading in glibc; strncmp does not use wider reading.
memcmp is using "void *" as arguments, while strncmp is "char *".

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (22 preceding siblings ...)
  2020-06-17 15:27 ` guojiufu at gcc dot gnu.org
@ 2021-05-04 12:32 ` rguenth at gcc dot gnu.org
  2022-03-14 12:23 ` d_vampile at 163 dot com
  2022-03-15 17:22 ` wilco at gcc dot gnu.org
  25 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-04 12:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (23 preceding siblings ...)
  2021-05-04 12:32 ` rguenth at gcc dot gnu.org
@ 2022-03-14 12:23 ` d_vampile at 163 dot com
  2022-03-15 17:22 ` wilco at gcc dot gnu.org
  25 siblings, 0 replies; 26+ messages in thread
From: d_vampile at 163 dot com @ 2022-03-14 12:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

d_vampile <d_vampile at 163 dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |d_vampile at 163 dot com

--- Comment #48 from d_vampile <d_vampile at 163 dot com> ---
(In reply to Jiu Fu Guo from comment #41)
> (In reply to Wilco from comment #40)
> > (In reply to Jiu Fu Guo from comment #39)
> > > I’m thinking to draft a patch for this optimization.  If any suggestions,
> > > please point out, thanks.
> > 
> > Which optimization to be precise? Besides unrolling I haven't seen a
> > proposal for an optimization which is both safe and generally applicable.
> 
> 1. For unroll, there are still branches in the loop. And then need careful
> merge on those reading and comparison.  Another thing about unroll would be
> that, if we prefer to optimize this early in GIMPLE, we still not GIMPLE
> unroll on it.
> while (len != max)
> {
>     if (p[len] != cur[len])
>       break; ++len;
>     if (p[len] != cur[len])
>       break; ++len;
>     if (p[len] != cur[len])
>       break; ++len;
> ....
> }
> 
> 2. Also thinking about if it makes sense to enhance GIMPLE vectorization
> pass.  In an aspect that using a vector to read and compare, also need to
> handle/merge compares into vector compare and handle early exit carefully.
> if (len + 8 < max && buffers not cross page) ///(p&4K) == (p+8)&4k?
> 4k:pagesize
>  while (len != max)
> {
>  vec a = xx p;
>  vec b = xx cur;
>  if (a != b) /// may not only for comparison 
>   {....;break;}
>  len += 8;
> }
> 
> 3. Introduce a new stand-alone pass to optimize reading/computing shorter
> types into large(dword/vector) reading/computing.
> 
> Thanks a lot for your comments/suggestions!

Any progress or patches for the new pass mentioned in point 3? Or new ideas?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
       [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
                   ` (24 preceding siblings ...)
  2022-03-14 12:23 ` d_vampile at 163 dot com
@ 2022-03-15 17:22 ` wilco at gcc dot gnu.org
  25 siblings, 0 replies; 26+ messages in thread
From: wilco at gcc dot gnu.org @ 2022-03-15 17:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #49 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to d_vampile from comment #48)
> (In reply to Jiu Fu Guo from comment #41)
> > (In reply to Wilco from comment #40)
> > > (In reply to Jiu Fu Guo from comment #39)
> > > > I’m thinking to draft a patch for this optimization.  If any suggestions,
> > > > please point out, thanks.
> > > 
> > > Which optimization to be precise? Besides unrolling I haven't seen a
> > > proposal for an optimization which is both safe and generally applicable.
> > 
> > 1. For unroll, there are still branches in the loop. And then need careful
> > merge on those reading and comparison.  Another thing about unroll would be
> > that, if we prefer to optimize this early in GIMPLE, we still not GIMPLE
> > unroll on it.
> > while (len != max)
> > {
> >     if (p[len] != cur[len])
> >       break; ++len;
> >     if (p[len] != cur[len])
> >       break; ++len;
> >     if (p[len] != cur[len])
> >       break; ++len;
> > ....
> > }
> > 
> > 2. Also thinking about if it makes sense to enhance GIMPLE vectorization
> > pass.  In an aspect that using a vector to read and compare, also need to
> > handle/merge compares into vector compare and handle early exit carefully.
> > if (len + 8 < max && buffers not cross page) ///(p&4K) == (p+8)&4k?
> > 4k:pagesize
> >  while (len != max)
> > {
> >  vec a = xx p;
> >  vec b = xx cur;
> >  if (a != b) /// may not only for comparison 
> >   {....;break;}
> >  len += 8;
> > }
> > 
> > 3. Introduce a new stand-alone pass to optimize reading/computing shorter
> > types into large(dword/vector) reading/computing.
> > 
> > Thanks a lot for your comments/suggestions!
> 
> Any progress or patches for the new pass mentioned in point 3? Or new ideas?

A new standalone pass would have to redo a lot of the work of the existing loop
optimizers and vectorizer. And you wouldn't be able to skip the extra checks to
guarantee safety of the optimization. This is why these optimizations are
typically done in the source code.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-03-15 17:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-88398-4@http.gcc.gnu.org/bugzilla/>
2020-05-19  9:39 ` [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison guojiufu at gcc dot gnu.org
2020-05-21  5:29 ` guojiufu at gcc dot gnu.org
2020-05-27  9:32 ` guojiufu at gcc dot gnu.org
2020-05-27  9:45 ` guojiufu at gcc dot gnu.org
2020-05-27 11:21 ` wilco at gcc dot gnu.org
2020-05-27 13:06 ` guojiufu at gcc dot gnu.org
2020-05-27 13:25 ` wilco at gcc dot gnu.org
2020-05-28  2:18 ` guojiufu at gcc dot gnu.org
2020-05-28  2:30 ` guojiufu at gcc dot gnu.org
2020-05-28 12:44 ` wilco at gcc dot gnu.org
2020-05-31 20:31 ` wschmidt at linux dot ibm.com
2020-06-02  2:49 ` guojiufu at gcc dot gnu.org
2020-06-02  6:35 ` rguenther at suse dot de
2020-06-04 14:15 ` guojiufu at gcc dot gnu.org
2020-06-08  2:45 ` guojiufu at gcc dot gnu.org
2020-06-08  9:22 ` wilco at gcc dot gnu.org
2020-06-09  3:46 ` guojiufu at gcc dot gnu.org
2020-06-09  3:49 ` guojiufu at gcc dot gnu.org
2020-06-10  9:56 ` guojiufu at gcc dot gnu.org
2020-06-10 10:03 ` wilco at gcc dot gnu.org
2020-06-12  4:20 ` guojiufu at gcc dot gnu.org
2020-06-12  4:47 ` guojiufu at gcc dot gnu.org
2020-06-17 15:27 ` guojiufu at gcc dot gnu.org
2021-05-04 12:32 ` rguenth at gcc dot gnu.org
2022-03-14 12:23 ` d_vampile at 163 dot com
2022-03-15 17:22 ` wilco at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).