[Bug c/110062] New: missed vectorization in graphicsmagick

* [Bug c/110062] New: missed vectorization in graphicsmagick
@ 2023-05-31 13:20 hubicka at gcc dot gnu.org
  2023-06-01  9:22 ` [Bug tree-optimization/110062] " crazylht at gmail dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-31 13:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

            Bug ID: 110062
           Summary: missed vectorization in graphicsmagick
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Phoronix reports a 31% performance difference between GCC 13 and clang on the
sharpen benchmark of GraphicsMagick.  On zen3 I can reproduce only 4%, but the
benchmark spends nearly all of its time in a single short inner loop:

214
  97.56%  gm               gm                          [.] ConvolveImage.◆
   0.88%  gm               libgomp.so.1.0.0            [.] 0x000000000002▒
   0.67%  gm               libc.so.6                   [.] __memmove_avx_▒

GCC version:
  2.38 │500:┌─→vmovss      (%r8,%rax,4),%xmm2
  0.04 │    │  movzbl      0x2(%rdx,%rax,4),%ebp
  0.09 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1
  7.44 │    │  movzbl      0x1(%rdx,%rax,4),%ebp
  0.16 │    │  vfmadd231ss %xmm1,%xmm2,%xmm7
 30.23 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1
  2.38 │    │  movzbl      (%rdx,%rax,4),%ebp
  0.03 │    │  inc         %rax
  0.00 │    │  vfmadd231ss %xmm1,%xmm2,%xmm9
 22.80 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1
  1.03 │    │  vfmadd231ss %xmm1,%xmm2,%xmm10
 30.49 │    ├──cmp         %rax,%rbx
  0.18 │    └──jne         500

Clang version:
  0.00 │1e70:┌─→movzbl       0x2(%rdx,%rsi,4),%r9d
  0.05 │     │  vbroadcastss (%rcx,%rsi,4),%xmm3
  0.56 │     │  movzwl       (%rdx,%rsi,4),%r11d
  0.05 │     │  inc          %rsi
  0.00 │     │  vcvtsi2ss    %r9d,%xmm10,%xmm2
  0.71 │     │  vfmadd231ss  %xmm2,%xmm3,%xmm0
  1.17 │     │  vmovd        %r11d,%xmm2
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2
  0.06 │     │  vcvtdq2ps    %xmm2,%xmm2
  0.89 │     │  vfmadd231ps  %xmm2,%xmm3,%xmm1
  1.98 │     ├──cmp          %rsi,%r10
  0.00 │     └──jne          1e70
  0.00 │      ↑ jmp          1630

Probably the same issue as PR109812, but this one reproduces on Zen CPUs and
the loop is even shorter.


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
@ 2023-06-01  9:22 ` crazylht at gmail dot com
  2023-06-02  7:33 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: crazylht at gmail dot com @ 2023-06-01  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
One of the vectorizer issues is related to PR110018.


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
  2023-06-01  9:22 ` [Bug tree-optimization/110062] " crazylht at gmail dot com
@ 2023-06-02  7:33 ` rguenth at gcc dot gnu.org
  2023-06-06 20:22 ` hubicka at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-02  7:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |WAITING
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-06-02

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Can you produce a testcase for the loop?


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
  2023-06-01  9:22 ` [Bug tree-optimization/110062] " crazylht at gmail dot com
  2023-06-02  7:33 ` rguenth at gcc dot gnu.org
@ 2023-06-06 20:22 ` hubicka at gcc dot gnu.org
  2023-06-07  6:43 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-06-06 20:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
#include <stddef.h>
struct pixel {float red, green, blue, opacity;};
struct ipixel {unsigned char red, green, blue, opacity;};
void test(float *k, struct ipixel *r, int width, int columns, struct ipixel *q)
{
        struct pixel pixel;
        for (int v=0; v < width; v++)
                      {
                        for (int u=0; u < width; u++)
                         {
                            pixel.red+=k[u]*r[u].red;
                            pixel.green+=k[u]*r[u].green;
                            pixel.blue+=k[u]*r[u].blue;
                          }                   
                        k+=width;
                        r+=(size_t) columns+width;
                      }
                    q->red=pixel.red;
                    q->green=pixel.green;
                    q->blue=pixel.blue;
                    q->opacity=255;
}


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-06-06 20:22 ` hubicka at gcc dot gnu.org
@ 2023-06-07  6:43 ` rguenth at gcc dot gnu.org
  2023-06-07 14:43 ` hubicka at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-07  6:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we fail to vectorize the outer loop (with double reduction) because of

t.c:7:25: note:   === vect_analyze_data_ref_accesses ===
t.c:7:25: note:   Detected interleaving load _7->red and _7->green
t.c:7:25: note:   Detected interleaving load _7->red and _7->blue
t.c:7:25: note:   grouped access in outer loop.
t.c:7:25: missed:   not vectorized: complicated access pattern.

For vectorizing the inner loop, SLP discovery fails because of a non-grouped
load: r[u].{red,green,blue} is handled but k[u] is not.  I think this is a
well-known limitation (that ought to be fixed).  We then vectorize the loop
with interleaving and peeling for gaps, but profitability says 'width' needs
to be 16.  We also vectorize the epilog.

I suppose the vectorized body isn't entered?

Note outer loop vectorization likely isn't profitable even if implemented,
so the SLP failure is the thing to fix (which should be easy).  Need to
find the duplicate bug for this.


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-06-07  6:43 ` rguenth at gcc dot gnu.org
@ 2023-06-07 14:43 ` hubicka at gcc dot gnu.org
  2023-06-16 12:23 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-06-07 14:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
When sharpening, the number of iterations depends on the sharpen radius.  I am
not sure what it is for the benchmark, but in normal situations the number of
iterations is indeed not very large.

However, clang simply SLP-vectorizes the red and green channels into a vector
of size 2.
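
For illustration only, here is a minimal C sketch of the shape of the code clang generates: red and green share a two-lane float vector while blue stays scalar.  It uses the GCC/clang `vector_size` extension, and the helper name `convolve_row_rg_vec` and the `out[3]` result layout are assumptions made for this sketch, not part of the testcase or of GraphicsMagick.  Unlike the testcase, the accumulators are zero-initialised so that the result is well defined.

```c
#include <assert.h>

/* Two-lane float vector (GCC/clang vector extension). */
typedef float v2sf __attribute__((vector_size(8)));

struct ipixel { unsigned char red, green, blue, opacity; };

/* Hypothetical helper: accumulate one kernel row.  Red and green share a
   two-lane vector, blue is kept scalar.  out[] receives red, green, blue. */
void convolve_row_rg_vec(const float *k, const struct ipixel *r,
                         int width, float out[3])
{
    assert(width >= 0);

    v2sf rg = { 0.0f, 0.0f };   /* lane 0: red, lane 1: green */
    float blue = 0.0f;

    for (int u = 0; u < width; u++) {
        v2sf kk = { k[u], k[u] };                          /* broadcast k[u] */
        v2sf px = { (float)r[u].red, (float)r[u].green };  /* widen two bytes */
        rg += kk * px;                                     /* two-lane multiply-add */
        blue += k[u] * (float)r[u].blue;                   /* scalar multiply-add */
    }

    out[0] = rg[0];
    out[1] = rg[1];
    out[2] = blue;
}
```

Calling it with a two-tap kernel `{0.5f, 2.0f}` and pixels `{10,20,30,255}` and `{1,2,3,255}` accumulates 7, 14 and 21 for red, green and blue.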


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-06-07 14:43 ` hubicka at gcc dot gnu.org
@ 2023-06-16 12:23 ` rguenth at gcc dot gnu.org
  2023-06-19  2:15 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-16 12:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, we would also be able to vectorize just the red and green channel:

t.c:18:27: note: ***** Analysis succeeded with vector mode V4SF
t.c:18:27: note: SLPing BB part
t.c:18:27: note: Costing subgraph:
t.c:18:27: note: node 0x420b6c8 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: q_45(D)->red = _29;
t.c:18:27: note:        stmt 0 q_45(D)->red = _29;
t.c:18:27: note:        stmt 1 q_45(D)->green = _31;
t.c:18:27: note:        children 0x420b750
t.c:18:27: note: node (external) 0x420b750 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note:        stmt 0 _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:        stmt 1 _31 = (unsigned char) pixel$green_84;
t.c:18:27: note:        children 0x420b7d8
t.c:18:27: note: node 0x420b7d8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:        stmt 0 pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note:        stmt 1 pixel$green_84 = PHI <_144(11), pixel$green_61(D)(10)>
t.c:18:27: note:        children 0x420b860 0x420be38
t.c:18:27: note: node 0x420b860 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _142 = PHI <_143(4)>
t.c:18:27: note:        stmt 0 _142 = PHI <_143(4)>
t.c:18:27: note:        stmt 1 _144 = PHI <_145(4)>
t.c:18:27: note:        children 0x420b8e8
t.c:18:27: note: node 0x420b8e8 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _143 = PHI <_12(3)>
t.c:18:27: note:        stmt 0 _143 = PHI <_12(3)>
t.c:18:27: note:        stmt 1 _145 = PHI <_17(3)>
t.c:18:27: note:        children 0x420b970
t.c:18:27: note: node 0x420b970 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _12 = _11 + pixel$red_80;
t.c:18:27: note:        stmt 0 _12 = _11 + pixel$red_80;
t.c:18:27: note:        stmt 1 _17 = _16 + pixel$green_82;
t.c:18:27: note:        children 0x420b9f8 0x420bca0
t.c:18:27: note: node 0x420b9f8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _11 = _4 * _10;
t.c:18:27: note:        stmt 0 _11 = _4 * _10;
t.c:18:27: note:        stmt 1 _16 = _4 * _15;
t.c:18:27: note:        children 0x420ba80 0x420bb08
t.c:18:27: note: node (external) 0x420ba80 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:        { _4, _4 }
t.c:18:27: note: node 0x420bb08 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _10 = (float) _9;
t.c:18:27: note:        stmt 0 _10 = (float) _9;
t.c:18:27: note:        stmt 1 _15 = (float) _14;
t.c:18:27: note:        children 0x420bb90
t.c:18:27: note: node (external) 0x420bb90 (max_nunits=2, refcnt=1) vector(2) int
t.c:18:27: note:        stmt 0 _9 = (int) _8;
t.c:18:27: note:        stmt 1 _14 = (int) _13;
t.c:18:27: note:        children 0x420bc18
t.c:18:27: note: node 0x420bc18 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: _8 = _7->red;
t.c:18:27: note:        stmt 0 _8 = _7->red;
t.c:18:27: note:        stmt 1 _13 = _7->green;
t.c:18:27: note: node 0x420bca0 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:        stmt 0 pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note:        stmt 1 pixel$green_82 = PHI <_17(9), pixel$green_85(5)>
t.c:18:27: note:        children 0x420b970 0x420bd28
t.c:18:27: note: node 0x420bd28 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:        stmt 0 pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note:        stmt 1 pixel$green_85 = PHI <_145(8), pixel$green_61(D)(7)>
t.c:18:27: note:        children 0x420b8e8 0x420bdb0
t.c:18:27: note: node (external) 0x420bdb0 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:        { pixel$red_60(D), pixel$green_61(D) }
t.c:18:27: note: node (external) 0x420be38 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note:        { pixel$red_60(D), pixel$green_61(D) }

But the '(external)' nodes show that we're missing support for some operations:

t.c:18:27: note:   ==> examining statement: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   conversion not supported by target.
t.c:18:27: note:   vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed:   no optab.
t.c:18:27: missed:   not vectorized: relevant stmt not supported: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note:   Building vector operands of 0x4215e90 from scalars instead

that's float -> unsigned char

for the stores:

                    q->red=pixel.red;
                    q->green=pixel.green;

We then cut the SLP graph off at that node; we do not consider keeping
the remainder and materializing the sources of the conversions from vector
components.  That is, we are not trying to split the SLP graph at
such edges but simply throw away the unreachable bits.

So there are three issues: this BB SLP issue, the issue that we're not
vectorizing the loop, and possibly the issue that we're not able to vectorize
this conversion.

You btw didn't show me whether clang vectorizes the store (and this
conversion).  clang 13 does

        vcvttps2dq      %xmm1, %xmm1
        vpackusdw       %xmm1, %xmm1, %xmm1
        vpackuswb       %xmm1, %xmm1, %xmm1
        vcvttss2si      %xmm0, %eax
        jmp     .LBB0_9
.LBB0_1:
                                        # implicit-def: $al
                                        # implicit-def: $xmm1
.LBB0_9:
        vpextrb $0, %xmm1, (%r8)
        vpextrb $1, %xmm1, 1(%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)

So it doesn't vectorize the stores, and it vectorizes the conversions
by converting to int and then packing twice, first to short and then to char.
Since it extracts the bytes anyway, I suppose clang's code would have been
faster had it extracted the two floats and done scalar conversions, as it
does for blue.
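
As a rough aid to reading that sequence, the following C sketch models a single lane of vcvttps2dq, vpackusdw and vpackuswb in scalar code.  The function names are made up for this sketch; the saturation rules follow the documented instruction semantics (vpackusdw saturates a signed 32-bit value to 0..65535, vpackuswb reads its 16-bit input as signed and saturates to 0..255).

```c
#include <assert.h>
#include <stdint.h>

/* One lane of vcvttps2dq: truncate toward zero.
   This model assumes the input fits in int32_t. */
int32_t cvtt_lane(float f)
{
    assert(f > -2147483648.0f && f < 2147483648.0f);
    return (int32_t)f;
}

/* One lane of vpackusdw: signed 32-bit -> unsigned 16-bit with saturation. */
uint16_t packusdw_lane(int32_t v)
{
    if (v < 0) return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* One lane of vpackuswb: the 16-bit input is interpreted as signed,
   then saturated to the unsigned 8-bit range. */
uint8_t packuswb_lane(uint16_t w)
{
    int32_t s = (w >= 0x8000) ? (int32_t)w - 0x10000 : (int32_t)w;
    if (s < 0) return 0;
    if (s > 255) return 255;
    return (uint8_t)s;
}

/* The whole float -> int -> short -> char chain for one lane. */
uint8_t float_to_u8_via_packs(float f)
{
    return packuswb_lane(packusdw_lane(cvtt_lane(f)));
}
```

For values in 0..255 this agrees with the C cast `(unsigned char) f`.  Outside that range the C conversion is undefined while the pack sequence saturates, so the two only need to match for in-range inputs.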


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-06-16 12:23 ` rguenth at gcc dot gnu.org
@ 2023-06-19  2:15 ` crazylht at gmail dot com
  2023-06-21 12:01 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: crazylht at gmail dot com @ 2023-06-19  2:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---

> pixel$red_60(D)(10)>, type of def: internal
> t.c:18:27: missed:   no optab.
> t.c:18:27: missed:   not vectorized: relevant stmt not supported: _29 =
> (unsigned char) pixel$red_78;
> t.c:18:27: note:   Building vector operands of 0x4215e90 from scalars instead
> 
> that's float -> unsigned char
> 
A patch has been posted to support vectorized conversion between float and unsigned char:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620402.html


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-06-19  2:15 ` crazylht at gmail dot com
@ 2023-06-21 12:01 ` rguenth at gcc dot gnu.org
  2023-06-21 12:53 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-21 12:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since r14-2007-g6f19cf7526168f we vectorize the loop, but without SLP,
which means we get interleaving and a vectorization factor of 64.  Turning
off loop vectorization yields the following, which is now comparable to
what clang does.  Of course the interleaving produced by loop vectorization
is inefficient in the end ...

        .p2align 4
        .p2align 3
.L3:
        movq    %rax, %rdx
        movq    %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss  (%rdx), %xmm1
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        addq    $4, %rax
        vpmovzxbd       %xmm0, %xmm0
        vmovsldup       %xmm1, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vfmadd231ps     %xmm4, %xmm0, %xmm2
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vfmadd231ss     %xmm0, %xmm1, %xmm3
        cmpq    %rsi, %rdx
        jne     .L4
        incl    %r9d
        movq    %r11, %rax
        addq    %rbx, %rdi
        addq    %rbp, %rsi
        cmpl    %r9d, %r10d
        je      .L2
        addq    %rbp, %r11
        jmp     .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq      %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq    %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil     %xmm3, %eax
        popq    %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)
        ret

The loop cost modeling looks like

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 1156
  Vector prologue cost: 24
  Vector epilogue cost: 5488
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 5512
  prologue iterations: 0
  epilogue iterations: 32
  Calculated minimum iters for profitability: 33
t.c:9:23: note:    Runtime profitability threshold = 64
t.c:9:23: note:    Static estimate profitability threshold = 64

and we get a VF == 32 vectorized epilog as well:

t.c:9:23: note:  Cost model analysis: 
  Vector inside of loop cost: 620
  Vector prologue cost: 12
  Vector epilogue cost: 2752
  Scalar iteration cost: 168
  Scalar outside cost: 32 
  Vector outside cost: 2764
  prologue iterations: 0
  epilogue iterations: 16
  Calculated minimum iters for profitability: 17
t.c:9:23: note:    Runtime profitability threshold = 32
t.c:9:23: note:    Static estimate profitability threshold = 32

so at least we'll enter the BB SLP optimized scalar epilog in the likely case.
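
To make the effect of those thresholds concrete, here is a simplified C model of how a trip count is divided between the main vector loop, the vectorized epilog and the scalar epilog.  The struct and function names are hypothetical, and the model deliberately ignores peeling for gaps and any other runtime checks GCC emits; it only uses the two thresholds (64 and 32) from the cost-model output above.

```c
#include <assert.h>

/* How many scalar iterations are covered by each loop version. */
struct iter_split {
    int main_vec;    /* covered by the main vector loop */
    int epilog_vec;  /* covered by the vectorized epilog */
    int scalar;      /* left for the scalar epilog */
};

/* Simplified model: a vector loop is entered only when at least one full
   vector iteration (vf scalar iterations) remains. */
struct iter_split split_iterations(int n, int vf_main, int vf_epilog)
{
    assert(n >= 0 && vf_main > 0 && vf_epilog > 0);

    struct iter_split s = { 0, 0, 0 };
    s.main_vec = (n / vf_main) * vf_main;

    int rem = n - s.main_vec;
    s.epilog_vec = (rem / vf_epilog) * vf_epilog;

    s.scalar = rem - s.epilog_vec;
    return s;
}
```

With `vf_main = 64` and `vf_epilog = 32`, a trip count of 7 leaves all 7 iterations in the scalar epilog, while a trip count of 100 is split into 64, 32 and 4.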


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2023-06-21 12:01 ` rguenth at gcc dot gnu.org
@ 2023-06-21 12:53 ` rguenth at gcc dot gnu.org
  2023-07-31 11:28 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-21 12:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note SLPing k[u] won't help to reduce the VF, only selecting a smaller vector
size would.  The alternative is to have a power-of-two group size by using
masking for the 'opacity' field.
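
A sketch of what a power-of-two group could look like at the source level, assuming the GCC/clang `vector_size` extension: every pixel is treated as a four-lane group and the opacity lane is forced to zero, which stands in for a real masked load.  The helper name `convolve_row_masked` and the `out[4]` layout are assumptions for illustration; this is not the transformation GCC implements.

```c
#include <assert.h>

/* Four-lane float vector (GCC/clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

struct ipixel { unsigned char red, green, blue, opacity; };

/* Hypothetical helper: accumulate one kernel row with a group size of 4.
   Lane 3 (opacity) is masked out by substituting zero for the loaded value. */
void convolve_row_masked(const float *k, const struct ipixel *r,
                         int width, float out[4])
{
    assert(width >= 0);

    v4sf acc = { 0.0f, 0.0f, 0.0f, 0.0f };

    for (int u = 0; u < width; u++) {
        v4sf kk = { k[u], k[u], k[u], k[u] };     /* broadcast k[u] */
        v4sf px = { (float)r[u].red, (float)r[u].green,
                    (float)r[u].blue, 0.0f };     /* opacity lane masked */
        acc += kk * px;
    }

    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
}
```

One 32-bit load then covers a whole pixel and no permutes are needed to separate the channels, at the cost of one wasted lane per pixel.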


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2023-06-21 12:53 ` rguenth at gcc dot gnu.org
@ 2023-07-31 11:28 ` rguenth at gcc dot gnu.org
  2023-11-25 13:33 ` hubicka at gcc dot gnu.org
  2023-11-27  7:29 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-31 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
We now also apply SLP when vectorizing the loop, but as said the high VF is
probably prohibitive and causes quite some spilling:

.L7:
        vmovdqu (%r14), %ymm2
        vmovdqu 32(%r14), %ymm1
        subq    $-128, %r14
        subq    $-128, %rdx
        vmovups -128(%rdx), %ymm10
        vmovdqu -64(%r14), %ymm0
        vpshufb .LC7(%rip), %ymm2, %ymm4
        vmovups -96(%rdx), %ymm9
        vmovups -64(%rdx), %ymm8
        vpshufb .LC8(%rip), %ymm1, %ymm3
        vpermq  $78, %ymm4, %ymm4
        vpermq  $78, %ymm3, %ymm3
...
        vmulps  %ymm7, %ymm0, %ymm0
        vaddps  136(%rsp), %ymm0, %ymm7
        vaddps  %ymm3, %ymm15, %ymm15
        vmovaps %ymm4, 168(%rsp)
        vmovaps %ymm7, 136(%rsp)
        cmpq    %r13, %r14
        jne     .L7

Maybe we should more aggressively reject vectorization when the VF is
equal to the number of lanes of the vector with the smallest element.  When
we then also detect SLP this usually means BB-level SLP can do something.
Note we now fail to support V2SF -> V2QI; I am not sure what changed here.
vectorizable_conversion doesn't support float->int->short->char but
only float->char, float->int->char or float->short->char, but
at least for 2-element vectors we don't support these (the vectorizer
could support extra intermediate steps as well).


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2023-07-31 11:28 ` rguenth at gcc dot gnu.org
@ 2023-11-25 13:33 ` hubicka at gcc dot gnu.org
  2023-11-27  7:29 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-25 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
trunk -O3 -flto -march=native -fopenmp
    Operation: Sharpen:
        257
        256
        256

    Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
        257
        256
        256

    Average: 256 Iterations Per Minute
clang17 -O3 -flto -march=native -fopenmp
   Operation: Sharpen:
        257
        256
        256
    Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

The inner loop is:
  0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │    │  movzbl      (%rdx,%rax,4),%esi
  0.00 │    │  inc         %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp         %rax,%r13
  0.35 │    └──jne         460

so it still does not get....


* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
  2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2023-11-25 13:33 ` hubicka at gcc dot gnu.org
@ 2023-11-27  7:29 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-27  7:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #11)
As said, the VF is going to be prohibitively large; likely the vector code
is never entered and the loop above is the scalar "epilogue".
