public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz
@ 2023-03-03 14:56 vincenzo.innocente at cern dot ch
  2023-03-03 19:45 ` [Bug tree-optimization/109011] " pinskia at gcc dot gnu.org
                   ` (22 more replies)
  0 siblings, 23 replies; 24+ messages in thread
From: vincenzo.innocente at cern dot ch @ 2023-03-03 14:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

            Bug ID: 109011
           Summary: missed optimization in presence of __builtin_ctz
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the following code, foo does not vectorize, but bar does.
clang vectorizes foo using a pattern that invokes vplzcntd.

(code made a bit complex to make vectorization "relevant") 

see https://godbolt.org/z/5fa1zbPeG

#include <cstdint>
uint32_t x[256];
uint32_t y[256];
uint32_t w[256];
uint32_t z[256];

void foo() {
  for (int i = 0; i < 256; i++) {
    auto p = x[i] ? __builtin_ctz(x[i]) : y[i];
    z[i] = w[i] * p;
  }
}

void bar() {
  for (int j = 0; j < 256; j += 8)
    for (int i = j; i < j + 8; i++) {
      // auto p = x[i] ? x[i] : y[i];
      auto p = x[i] ? __builtin_ctz(x[i]) : y[i];
      z[i] = w[i] * p;
    }
}


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
@ 2023-03-03 19:45 ` pinskia at gcc dot gnu.org
  2023-03-03 19:48 ` pinskia at gcc dot gnu.org
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-03 19:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On aarch64, both get vectorized.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
  2023-03-03 19:45 ` [Bug tree-optimization/109011] " pinskia at gcc dot gnu.org
@ 2023-03-03 19:48 ` pinskia at gcc dot gnu.org
  2023-03-03 20:21 ` jakub at gcc dot gnu.org
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-03 19:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
             Target|                            |x86_64-*-*

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>clang vectorize foo using a pattern that invokes vplzcntd


But clang does not vectorize bar :).

Note the x86_64 options were: "-O3 -march=skylake-avx512"


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
  2023-03-03 19:45 ` [Bug tree-optimization/109011] " pinskia at gcc dot gnu.org
  2023-03-03 19:48 ` pinskia at gcc dot gnu.org
@ 2023-03-03 20:21 ` jakub at gcc dot gnu.org
  2023-03-03 20:39 ` jakub at gcc dot gnu.org
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-03 20:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Seems they are vectorizing __builtin_ctz (x) as bitsize - .CLZ ((x - 1) & ~x)
for CLZ_DEFINED_VALUE_AT_ZERO 2 with value bitsize.
Perhaps we should pattern match it in tree-vect-patterns.cc that way if clz is
vectorizable (it is with -mavx512cd) and ctz is not.
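
As a scalar illustration of the rewrite described above, here is a minimal sketch for 32-bit elements. The helper clz32_defined is hypothetical: it only models a clz that returns the bit size at zero (as vplzcntd does) and is not a GCC internal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clz with a defined result of 32 at zero, modelling
   CLZ_DEFINED_VALUE_AT_ZERO kind 2 with value bitsize.  */
unsigned
clz32_defined (uint32_t x)
{
  return x ? (unsigned) __builtin_clz (x) : 32u;
}

/* ctz (x) = bitsize - clz ((x - 1) & ~x).  The mask (x - 1) & ~x keeps
   exactly the bits below the lowest set bit of x.  */
unsigned
ctz32_via_clz (uint32_t x)
{
  return 32u - clz32_defined ((x - 1u) & ~x);
}

/* Compare against __builtin_ctz on nonzero samples only, because
   __builtin_ctz (0) is undefined.  */
void
ctz32_via_clz_self_check (void)
{
  for (unsigned bit = 0; bit < 32; ++bit)
    {
      uint32_t x = (UINT32_C (1) << bit) | UINT32_C (0x80000000);
      assert (ctz32_via_clz (x) == (unsigned) __builtin_ctz (x));
    }
  /* At zero the rewrite yields the bit size.  */
  assert (ctz32_via_clz (0) == 32u);
}
```

The sketch relies on the defined-at-zero clz: for an odd x the mask is 0, and the subtraction needs clz (0) == 32 to produce 0.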


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (2 preceding siblings ...)
  2023-03-03 20:21 ` jakub at gcc dot gnu.org
@ 2023-03-03 20:39 ` jakub at gcc dot gnu.org
  2023-03-03 20:46 ` jakub at gcc dot gnu.org
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-03 20:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Hacker's Delight also has a variant for popcount, either
.POPCOUNT ((x - 1) & ~x) or bitsize - .POPCOUNT (x | -x), though a question is
whether there are any targets which have vector popcount but have neither
vector clz nor vector ctz for some mode.
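
The two popcount forms can be checked on scalars with a short sketch like the one below. The function names are made up for illustration; only the GCC builtins are real.

```c
#include <assert.h>
#include <stdint.h>

/* ctz (x) = popcount ((x - 1) & ~x): count the bits below the lowest
   set bit of x.  */
unsigned
ctz32_via_popcount_andn (uint32_t x)
{
  return (unsigned) __builtin_popcount ((x - 1u) & ~x);
}

/* ctz (x) = bitsize - popcount (x | -x): x | -x has the lowest set bit
   of x and every bit above it set.  */
unsigned
ctz32_via_popcount_or (uint32_t x)
{
  return 32u - (unsigned) __builtin_popcount (x | (0u - x));
}

/* Compare both forms against __builtin_ctz on nonzero samples.  */
void
ctz32_via_popcount_self_check (void)
{
  for (unsigned bit = 0; bit < 32; ++bit)
    {
      uint32_t x = (UINT32_C (1) << bit) | UINT32_C (0x80000000);
      unsigned expected = (unsigned) __builtin_ctz (x);
      assert (ctz32_via_popcount_andn (x) == expected);
      assert (ctz32_via_popcount_or (x) == expected);
    }
  /* Both forms yield the bit size for a zero input.  */
  assert (ctz32_via_popcount_andn (0) == 32u);
  assert (ctz32_via_popcount_or (0) == 32u);
}
```

Unlike the clz-based form, neither variant needs any special handling for zero, since popcount is defined for every input.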


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (3 preceding siblings ...)
  2023-03-03 20:39 ` jakub at gcc dot gnu.org
@ 2023-03-03 20:46 ` jakub at gcc dot gnu.org
  2023-03-04  0:18 ` jakub at gcc dot gnu.org
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-03 20:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
And to answer myself: x86 has vplzcnt* just for 32-bit and 64-bit elements with
-mavx512cd (perhaps also -mavx512vl, depending on the vector size), whereas
vector popcount also exists for 8-bit and 16-bit elements (guarded by different
options).
And with popcount it would be 3 instructions instead of 4, though I don't know
about their latencies etc.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (4 preceding siblings ...)
  2023-03-03 20:46 ` jakub at gcc dot gnu.org
@ 2023-03-04  0:18 ` jakub at gcc dot gnu.org
  2023-03-04 11:14 ` jakub at gcc dot gnu.org
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-04  0:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Oh, and optabs.cc expands ctz using clz as (bitsize-1) - .CLZ(x & -x), which is
one operation fewer if andn isn't supported, but on the other hand is undefined
at zero (so it could be used for __builtin_ctz, but not for .CTZ if
CTZ_DEFINED_VALUE_AT_ZERO is 2).
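
A scalar sketch of this expansion for 32-bit values, to make the zero case concrete. The second function is a hypothetical variant that plugs in a clz defined as 32 at zero, purely to show what the formula would return there.

```c
#include <assert.h>
#include <stdint.h>

/* ctz (x) = (bitsize - 1) - clz (x & -x), valid only for x != 0:
   x & -x isolates the lowest set bit of x.  */
unsigned
ctz32_via_clz_isolate (uint32_t x)
{
  assert (x != 0);
  return 31u - (unsigned) __builtin_clz (x & (0u - x));
}

/* Hypothetical variant using a clz that returns 32 at zero.  For x == 0
   the formula gives 31 - 32 = -1 rather than 32, so it cannot stand in
   for a ctz that is defined at zero.  */
int
ctz32_isolate_with_defined_clz (uint32_t x)
{
  uint32_t low = x & (0u - x);
  int clz = low ? __builtin_clz (low) : 32;
  return 31 - clz;
}
```

For nonzero inputs both functions agree with __builtin_ctz; they differ from the (x - 1) & ~x form only in what happens at zero.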


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (5 preceding siblings ...)
  2023-03-04  0:18 ` jakub at gcc dot gnu.org
@ 2023-03-04 11:14 ` jakub at gcc dot gnu.org
  2023-03-04 12:22 ` jakub at gcc dot gnu.org
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-04 11:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Also, I wonder why vect_recog_popcount_pattern handles only popcount, can't it
handle clz/ctz as well?
I mean for
void
foo (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_popcountll (q[i]);
}

void
bar (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_clzll (q[i]);
}
with -O3 -mavx512{bw,cd,vl,dq,bitalg,vpopcntdq} we have in *.optimized in the
inner loop nice:
  vect__4.7_40 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.25_1 * 1];
  vect_patt_24.8_41 = .POPCOUNT (vect__4.7_40);
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.25_1 * 1] = vect_patt_24.8_41;
but in the other loop
  vect__4.36_39 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.56_1 * 1];
  vect__4.37_41 = MEM <vector(8) long long int> [(long long int *)q_12(D) + 64B + ivtmp.56_1 * 1];
  vect__5.38_42 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.36_39);
  vect__5.38_43 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.37_41);
  _44 = .CLZ (vect__5.38_42);
  _45 = .CLZ (vect__5.38_43);
  vect__6.39_46 = VEC_PACK_TRUNC_EXPR <_44, _45>;
  vect__8.40_47 = [vec_unpack_lo_expr] vect__6.39_46;
  vect__8.40_48 = [vec_unpack_hi_expr] vect__6.39_46;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.56_1 * 1] = vect__8.40_47;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + 64B + ivtmp.56_1 * 1] = vect__8.40_48;
So, we need to handle twice as many vectors regardless of unrolling, perform
the vector V8DI->V8DI clz twice, then pack the results and immediately unpack
them again.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (6 preceding siblings ...)
  2023-03-04 11:14 ` jakub at gcc dot gnu.org
@ 2023-03-04 12:22 ` jakub at gcc dot gnu.org
  2023-03-04 14:01 ` jakub at gcc dot gnu.org
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-04 12:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |jakub at gcc dot gnu.org

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
And, similarly to the fallback for ctz there should be fallback for ffs.
I'll handle this for GCC 14.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (7 preceding siblings ...)
  2023-03-04 12:22 ` jakub at gcc dot gnu.org
@ 2023-03-04 14:01 ` jakub at gcc dot gnu.org
  2023-03-04 15:08 ` jakub at gcc dot gnu.org
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-04 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 54584
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54584&action=edit
gcc13-pr109011.patch

Untested patch to just extend the popcount handling to clz, ctz and ffs, though
for now only if they have corresponding optabs implemented.  This improves the
__builtin_clzll testcase above, but doesn't help with ctz or ffs.
Those will need to be added incrementally.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (8 preceding siblings ...)
  2023-03-04 14:01 ` jakub at gcc dot gnu.org
@ 2023-03-04 15:08 ` jakub at gcc dot gnu.org
  2023-03-06  5:26 ` crazylht at gmail dot com
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-04 15:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #54584|0                           |1
        is obsolete|                            |

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 54585
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54585&action=edit
gcc13-pr109011.patch

Small fix for a theoretical problem in the patch.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (9 preceding siblings ...)
  2023-03-04 15:08 ` jakub at gcc dot gnu.org
@ 2023-03-06  5:26 ` crazylht at gmail dot com
  2023-03-06  7:01 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: crazylht at gmail dot com @ 2023-03-06  5:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Jakub Jelinek from comment #3)
> Seems they are vectorizing __builtin_ctz (x) as bitsize - .CLZ ((x - 1) &
> ~x) for CLZ_DEFINED_VALUE_AT_ZERO 2 with value bitsize.
> Perhaps we should pattern match it in tree-vect-patterns.cc that way if clz
> is vectorizable (it is with -mavx512cd) and ctz is not.

Yes, I'm going to add a pattern match named vect_recog_ctz_as_clz_or_popcnt.

BTW, it looks like only x86 has vector clz but not ctz; other targets have
either both or neither.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (10 preceding siblings ...)
  2023-03-06  5:26 ` crazylht at gmail dot com
@ 2023-03-06  7:01 ` jakub at gcc dot gnu.org
  2023-03-06  7:59 ` crazylht at gmail dot com
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-06  7:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Jakub Jelinek from comment #3)
> > Seems they are vectorizing __builtin_ctz (x) as bitsize - .CLZ ((x - 1) &
> > ~x) for CLZ_DEFINED_VALUE_AT_ZERO 2 with value bitsize.
> > Perhaps we should pattern match it in tree-vect-patterns.cc that way if clz
> > is vectorizable (it is with -mavx512cd) and ctz is not.
> 
> Yes, I'm going to add a pattern match named vect_recog_ctz_as_clz_or_popcnt.
> 
> BTW, It looks like only x86 have vector clz but not ctz, for other targets,
> they either have both or neither.

I have already the patch half written.  Yes, for clz, but various other targets
don't have ffs and do have clz etc.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (11 preceding siblings ...)
  2023-03-06  7:01 ` jakub at gcc dot gnu.org
@ 2023-03-06  7:59 ` crazylht at gmail dot com
  2023-03-06  8:11 ` jakub at gcc dot gnu.org
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: crazylht at gmail dot com @ 2023-03-06  7:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Jakub Jelinek from comment #12)
> (In reply to Hongtao.liu from comment #11)
> > (In reply to Jakub Jelinek from comment #3)
> > > Seems they are vectorizing __builtin_ctz (x) as bitsize - .CLZ ((x - 1) &
> > > ~x) for CLZ_DEFINED_VALUE_AT_ZERO 2 with value bitsize.
> > > Perhaps we should pattern match it in tree-vect-patterns.cc that way if clz
> > > is vectorizable (it is with -mavx512cd) and ctz is not.
> > 
> > Yes, I'm going to add a pattern match named vect_recog_ctz_as_clz_or_popcnt.
> > 
> > BTW, It looks like only x86 have vector clz but not ctz, for other targets,
> > they either have both or neither.
> 
> I have already the patch half written.  Yes, for clz, but various other
> targets don't have ffs and do have clz etc.

It looks like ffs is *just* ctz with defined behavior for zero, so we can
handle it exactly the same as ctz in the same pattern match
(bitsize - .CLZ ((x - 1) & ~x) or .POPCOUNT ((x - 1) & ~x)) when
CLZ_DEFINED_VALUE_AT_ZERO is 2.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (12 preceding siblings ...)
  2023-03-06  7:59 ` crazylht at gmail dot com
@ 2023-03-06  8:11 ` jakub at gcc dot gnu.org
  2023-03-06  8:23 ` crazylht at gmail dot com
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-06  8:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #13)
> It looks like ffs is *just* ctz with defined behavior for zero, so we can
> handle it exactly the same as ctz in the same pattern match((bitsize - .CLZ
> ((x - 1) & ~x)) or .POPCOUNT ((x - 1) & ~x)) when CLZ_DEFINED_VALUE_AT_ZERO
> 2.

No, ffs(x) is ctz(x) + 1 for all x != 0, and 0 for x == 0.  But yes, we can
generally handle it similarly.  Let me attach a patch.
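
The relationship stated above can be written down as a small scalar sketch. The function names are hypothetical; the check only uses samples below 2^31 so that the cast to int preserves the value.

```c
#include <assert.h>
#include <stdint.h>

/* ffs (x) is ctz (x) + 1 for x != 0, and 0 for x == 0.  */
int
ffs32_via_ctz (uint32_t x)
{
  return x ? __builtin_ctz (x) + 1 : 0;
}

/* Compare against __builtin_ffs, which takes an int argument.  */
void
ffs32_via_ctz_self_check (void)
{
  assert (ffs32_via_ctz (0) == __builtin_ffs (0));
  for (unsigned bit = 0; bit < 31; ++bit)
    {
      /* Keep bit 30 set as well so the input has more than one bit.  */
      uint32_t x = (UINT32_C (1) << bit) | UINT32_C (0x40000000);
      assert (ffs32_via_ctz (x) == __builtin_ffs ((int) x));
    }
}
```

The explicit test for zero is what separates ffs from a ctz that merely has a defined value at zero: a ctz defined as 32 at zero would give 33 after the increment, not 0.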


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (13 preceding siblings ...)
  2023-03-06  8:11 ` jakub at gcc dot gnu.org
@ 2023-03-06  8:23 ` crazylht at gmail dot com
  2023-03-06  8:32 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: crazylht at gmail dot com @ 2023-03-06  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #15 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Jakub Jelinek from comment #14)
> (In reply to Hongtao.liu from comment #13)
> > It looks like ffs is *just* ctz with defined behavior for zero, so we can
> > handle it exactly the same as ctz in the same pattern match((bitsize - .CLZ
> > ((x - 1) & ~x)) or .POPCOUNT ((x - 1) & ~x)) when CLZ_DEFINED_VALUE_AT_ZERO
> > 2.
> 
> No, ffs(x) is ctz(x) + 1 for all x != 0, and 0 for x == 0.  But yes, we can
> generally handle it similarly.  Let me attach a patch.

Oh, I see.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (14 preceding siblings ...)
  2023-03-06  8:23 ` crazylht at gmail dot com
@ 2023-03-06  8:32 ` jakub at gcc dot gnu.org
  2023-03-06  9:22 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-06  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 54590
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54590&action=edit
gcc13-pr109011-2.patch

Here is what I have right now; it is totally untested and will need further
work so that the two pattern recognizers work together nicely.  (I think the
one modified by the earlier patch will need to trigger first for ctz/ffs, and
accept them even if they don't have a direct optab, as long as ctz (for ffs),
clz or popcount is available.  Then the new one can rewrite the result.)


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (15 preceding siblings ...)
  2023-03-06  8:32 ` jakub at gcc dot gnu.org
@ 2023-03-06  9:22 ` jakub at gcc dot gnu.org
  2023-03-14  9:24 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-06  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #17 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
A testcase for the normal SI -> SI case might be something like the functions
that follow, compiled with e.g.
-O3 -mavx512{bw,cd,vl,dq,bitalg,vpopcntdq} -mbmi -mlzcnt
or so (the intent of the last 2 options is to make clz/ctz defined at zero in
GIMPLE).
Plus a similar testcase with long long * instead of int * and ll-suffixed
builtins.
And eventually also with unsigned char * and unsigned short *.

void
foo (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_popcount (q[i]);
}

void
bar (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_clz (q[i]);
}

void
baz (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_ffs (q[i]);
}

void
qux (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_ctz (q[i]);
}

void
corge (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = q[i] ? __builtin_clz (q[i]) : 32;
}

void
grault (int *p, int *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = q[i] ? __builtin_ctz (q[i]) : 32;
}


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (16 preceding siblings ...)
  2023-03-06  9:22 ` jakub at gcc dot gnu.org
@ 2023-03-14  9:24 ` crazylht at gmail dot com
  2023-03-16  6:32 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: crazylht at gmail dot com @ 2023-03-14  9:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #18 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Jakub Jelinek from comment #16)
> Created attachment 54590 [details]
> gcc13-pr109011-2.patch
> 
> Here is what I have right now, totally untested and will need further work
> so that the two pattern recognizers work together nicely (I think the one
> modified by the earlier patch will for ctz/ffs need to trigger first and
> allow
> ctz/ffs and be done even if for these they don't have direct optab but have
> ctz (for ffs), clz or popcount.  And then let the new one rewrite it.

For ffs/ctz generated by vect_recog_popcount_clz_ctz_ffs_pattern, we can call
vect_recog_ctz_ffs_pattern again with a little adjustment so that call_stmt
doesn't come from stmt_vinfo->stmt but from parameter.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (17 preceding siblings ...)
  2023-03-14  9:24 ` crazylht at gmail dot com
@ 2023-03-16  6:32 ` crazylht at gmail dot com
  2023-04-19  9:15 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: crazylht at gmail dot com @ 2023-03-16  6:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #19 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 54678
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54678&action=edit
gcc13-pr109011-3.patch

Fix an ICE when gimple_call_lhs (call_stmt) is NULL in
vect_recog_ctz_ffs_pattern, recognize ctz/ffs generated by
vect_recog_popcount_clz_ctz_ffs_pattern.

Add testcases for them; bootstrapped and regtested on icelake.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (18 preceding siblings ...)
  2023-03-16  6:32 ` crazylht at gmail dot com
@ 2023-04-19  9:15 ` cvs-commit at gcc dot gnu.org
  2023-04-20  9:56 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-04-19  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #20 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ade0a1ee5c6707b950ba284adcfed0514866c12d

commit r14-65-gade0a1ee5c6707b950ba284adcfed0514866c12d
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Apr 19 11:14:23 2023 +0200

    tree-vect-patterns: Improve __builtin_{clz,ctz,ffs}ll vectorization [PR109011]

    For __builtin_popcountll tree-vect-patterns.cc has
    vect_recog_popcount_pattern, which improves the vectorized code.
    Without that the vectorization is always multi-type vectorization
    in the loop (at least int and long long types) where we emit two
    .POPCOUNT calls with long long arguments and int return value and then
    widen to long long, so effectively after vectorization do the
    V?DImode -> V?DImode popcount twice, then pack the result into V?SImode
    and immediately unpack.

    The following patch extends that handling to __builtin_{clz,ctz,ffs}ll
    builtins as well (as long as there is an optab for them; more to come
    later).

    x86 can do __builtin_popcountll with -mavx512vpopcntdq, __builtin_clzll
    with -mavx512cd, ppc can do __builtin_popcountll and __builtin_clzll
    with -mpower8-vector and __builtin_ctzll with -mpower9-vector, s390
    can do __builtin_{popcount,clz,ctz}ll with -march=z13 -mzarch (i.e. VX).

    2023-04-19  Jakub Jelinek  <jakub@redhat.com>

            PR tree-optimization/109011
            * tree-vect-patterns.cc (vect_recog_popcount_pattern): Rename to ...
            (vect_recog_popcount_clz_ctz_ffs_pattern): ... this.  Handle also
            CLZ, CTZ and FFS.  Remove vargs variable, use
            gimple_build_call_internal rather than gimple_build_call_internal_vec.
            (vect_vect_recog_func_ptrs): Adjust popcount entry.

            * gcc.dg/vect/pr109011-1.c: New test.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (19 preceding siblings ...)
  2023-04-19  9:15 ` cvs-commit at gcc dot gnu.org
@ 2023-04-20  9:56 ` cvs-commit at gcc dot gnu.org
  2023-04-20 17:45 ` cvs-commit at gcc dot gnu.org
  2023-04-24  1:35 ` cvs-commit at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-04-20  9:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #21 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:705b0d2b62318b3935214f08a1cf023b1117acb8

commit r14-108-g705b0d2b62318b3935214f08a1cf023b1117acb8
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Apr 20 11:55:16 2023 +0200

    tree-vect-patterns: Pattern recognize ctz or ffs using clz, popcount or ctz [PR109011]

    The following patch makes it possible to vectorize __builtin_ffs*/.FFS
    even if we just have vector .CTZ support, or
    __builtin_ffs*/.FFS/__builtin_ctz*/.CTZ if we just have vector .CLZ or
    .POPCOUNT support.
    It uses various expansions from Hacker's Delight book as well as GCC's
    expansion, in particular:
    .CTZ (X) = PREC - .CLZ ((X - 1) & ~X)
    .CTZ (X) = .POPCOUNT ((X - 1) & ~X)
    .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
    .FFS (X) = PREC - .CLZ (X & -X)
    .CTZ (X) = PREC - .POPCOUNT (X | -X)
    .FFS (X) = (PREC + 1) - .POPCOUNT (X | -X)
    .FFS (X) = .CTZ (X) + 1
    where the first one can be only used if both CTZ and CLZ have value
    defined at zero (kind 2) and both have value of PREC there.
    If the original has value defined at zero and the latter doesn't
    for other forms or if it doesn't have matching value for that case,
    a COND_EXPR is added for that afterwards.

    The patch also modifies vect_recog_popcount_clz_ctz_ffs_pattern
    such that the two can work together.
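
The .FFS expansions listed above, together with the zero fixup, can be sketched on 32-bit scalars as follows. This is only an illustration under stated assumptions: clz_defined is a hypothetical stand-in for a .CLZ that is defined at zero with value PREC, and the conditional expression in ffs32_via_popcount plays the role of the COND_EXPR; it is not taken from the patch itself.

```c
#include <assert.h>
#include <stdint.h>

#define PREC 32

/* Hypothetical stand-in for a .CLZ defined at zero with value PREC.  */
int
clz_defined (uint32_t x)
{
  return x ? __builtin_clz (x) : PREC;
}

/* .FFS (X) = PREC - .CLZ (X & -X).  With clz (0) == PREC this already
   gives 0 at zero, so no fixup is needed in this model.  */
int
ffs32_via_clz (uint32_t x)
{
  return PREC - clz_defined (x & (0u - x));
}

/* .FFS (X) = (PREC + 1) - .POPCOUNT (X | -X).  The raw formula yields
   PREC + 1 at zero, so a select restores ffs (0) == 0.  */
int
ffs32_via_popcount (uint32_t x)
{
  int raw = (PREC + 1) - __builtin_popcount (x | (0u - x));
  return x ? raw : 0;
}

/* Compare both forms against __builtin_ffs on samples below 2^31.  */
void
ffs32_self_check (void)
{
  assert (ffs32_via_clz (0) == 0);
  assert (ffs32_via_popcount (0) == 0);
  for (unsigned bit = 0; bit < 31; ++bit)
    {
      uint32_t x = (UINT32_C (1) << bit) | UINT32_C (0x40000000);
      int expected = __builtin_ffs ((int) x);
      assert (ffs32_via_clz (x) == expected);
      assert (ffs32_via_popcount (x) == expected);
    }
}
```

Whether a fixup is required depends on what the replacement operation returns at zero, which is why the two forms differ here.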

    2023-04-20  Jakub Jelinek  <jakub@redhat.com>

            PR tree-optimization/109011
            * tree-vect-patterns.cc (vect_recog_ctz_ffs_pattern): New function.
            (vect_recog_popcount_clz_ctz_ffs_pattern): Move vect_pattern_detected
            call later.  Don't punt for IFN_CTZ or IFN_FFS if it doesn't have
            direct optab support, but has instead IFN_CLZ, IFN_POPCOUNT or
            for IFN_FFS IFN_CTZ support, use vect_recog_ctz_ffs_pattern for that
            case.
            (vect_vect_recog_func_ptrs): Add ctz_ffs entry.

            * gcc.dg/vect/pr109011-1.c: Remove -mpower9-vector from
            dg-additional-options.
            (baz, qux): Remove functions and corresponding dg-final.
            * gcc.dg/vect/pr109011-2.c: New test.
            * gcc.dg/vect/pr109011-3.c: New test.
            * gcc.dg/vect/pr109011-4.c: New test.
            * gcc.dg/vect/pr109011-5.c: New test.


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (20 preceding siblings ...)
  2023-04-20  9:56 ` cvs-commit at gcc dot gnu.org
@ 2023-04-20 17:45 ` cvs-commit at gcc dot gnu.org
  2023-04-24  1:35 ` cvs-commit at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-04-20 17:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #22 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:87c9bae4e32b54829dce0a93ff735412d5f684f8

commit r14-121-g87c9bae4e32b54829dce0a93ff735412d5f684f8
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Apr 20 19:44:27 2023 +0200

    tree-vect-patterns: One small vect_recog_ctz_ffs_pattern tweak [PR109011]

    I've noticed I made a typo: this late in the function, ifn is
    always IFN_CTZ or IFN_FFS, never IFN_CLZ.

    Due to this typo, we weren't using the originally intended
    .CTZ (X) = .POPCOUNT ((X - 1) & ~X)
    but
    .CTZ (X) = PREC - .POPCOUNT (X | -X)
    instead when we want to emit __builtin_ctz*/.CTZ using .POPCOUNT.
    Both compute the same value, both are defined at 0 with the
    same value (PREC), and both need the same number of GIMPLE statements,
    but I think the former ought to be preferred: lots of targets
    have andn as a single operation rather than two, and putting
    a -1 constant into a vector register is often cheaper than
    broadcasting the power-of-two PREC value into a vector.
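
To make the comparison concrete, here is a small scalar sketch of the two .POPCOUNT-based forms. It is an illustration only, assuming PREC is 32; the function names ctz_andn_form, ctz_or_neg_form and forms_agree are hypothetical and do not appear in the patch. It only checks that the two forms agree, including at zero; it says nothing about the relative cost on a given target.

```c
#include <assert.h>
#include <stdint.h>

#define PREC 32  /* assumed element precision for this sketch */

/* Preferred form: .CTZ (X) = .POPCOUNT ((X - 1) & ~X).
   Uses the constant -1 (as X + -1) and an and-not.  */
static unsigned ctz_andn_form(uint32_t x)
{
    return (unsigned)__builtin_popcount((x - 1) & ~x);
}

/* Alternative form: .CTZ (X) = PREC - .POPCOUNT (X | -X).
   Needs the constant PREC available for the final subtraction.  */
static unsigned ctz_or_neg_form(uint32_t x)
{
    return PREC - (unsigned)__builtin_popcount(x | -x);
}

/* Returns 1 if both forms give the expected result for every bit
   position and both yield PREC for a zero input, 0 otherwise.  */
static int forms_agree(void)
{
    if (ctz_andn_form(0) != PREC || ctz_or_neg_form(0) != PREC)
        return 0;
    for (unsigned bit = 0; bit < PREC; bit++) {
        /* Lowest set bit at position 'bit', plus the top bit as noise.  */
        uint32_t x = (UINT32_C(1) << bit) | UINT32_C(0x80000000);
        if (ctz_andn_form(x) != bit || ctz_or_neg_form(x) != bit)
            return 0;
    }
    return 1;
}
```

Neither form needs a separate fixup when the wanted value at zero is PREC, which matches the statement that both are defined at 0 with the same value.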

    2023-04-20  Jakub Jelinek  <jakub@redhat.com>

            PR tree-optimization/109011
            * tree-vect-patterns.cc (vect_recog_ctz_ffs_pattern): Use
            .CTZ (X) = .POPCOUNT ((X - 1) & ~X) in preference to
            .CTZ (X) = PREC - .POPCOUNT (X | -X).


* [Bug tree-optimization/109011] missed optimization in presence of __builtin_ctz
  2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
                   ` (21 preceding siblings ...)
  2023-04-20 17:45 ` cvs-commit at gcc dot gnu.org
@ 2023-04-24  1:35 ` cvs-commit at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-04-24  1:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #23 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:8311c26757657fe8ffa28ca1539d02d141bb8292

commit r14-182-g8311c26757657fe8ffa28ca1539d02d141bb8292
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Mar 15 13:41:06 2023 +0800

    Add testcases for ffs/ctz vectorization.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/109011
            * gcc.target/i386/pr109011-b1.c: New test.
            * gcc.target/i386/pr109011-b2.c: New test.
            * gcc.target/i386/pr109011-d1.c: New test.
            * gcc.target/i386/pr109011-d2.c: New test.
            * gcc.target/i386/pr109011-q1.c: New test.
            * gcc.target/i386/pr109011-q2.c: New test.
            * gcc.target/i386/pr109011-w1.c: New test.
            * gcc.target/i386/pr109011-w2.c: New test.


Thread overview: 24+ messages
2023-03-03 14:56 [Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz vincenzo.innocente at cern dot ch
2023-03-03 19:45 ` [Bug tree-optimization/109011] " pinskia at gcc dot gnu.org
2023-03-03 19:48 ` pinskia at gcc dot gnu.org
2023-03-03 20:21 ` jakub at gcc dot gnu.org
2023-03-03 20:39 ` jakub at gcc dot gnu.org
2023-03-03 20:46 ` jakub at gcc dot gnu.org
2023-03-04  0:18 ` jakub at gcc dot gnu.org
2023-03-04 11:14 ` jakub at gcc dot gnu.org
2023-03-04 12:22 ` jakub at gcc dot gnu.org
2023-03-04 14:01 ` jakub at gcc dot gnu.org
2023-03-04 15:08 ` jakub at gcc dot gnu.org
2023-03-06  5:26 ` crazylht at gmail dot com
2023-03-06  7:01 ` jakub at gcc dot gnu.org
2023-03-06  7:59 ` crazylht at gmail dot com
2023-03-06  8:11 ` jakub at gcc dot gnu.org
2023-03-06  8:23 ` crazylht at gmail dot com
2023-03-06  8:32 ` jakub at gcc dot gnu.org
2023-03-06  9:22 ` jakub at gcc dot gnu.org
2023-03-14  9:24 ` crazylht at gmail dot com
2023-03-16  6:32 ` crazylht at gmail dot com
2023-04-19  9:15 ` cvs-commit at gcc dot gnu.org
2023-04-20  9:56 ` cvs-commit at gcc dot gnu.org
2023-04-20 17:45 ` cvs-commit at gcc dot gnu.org
2023-04-24  1:35 ` cvs-commit at gcc dot gnu.org
