public inbox for gcc-bugs@sourceware.org
* [Bug c/110062] New: missed vectorization in graphicsmagick
@ 2023-05-31 13:20 hubicka at gcc dot gnu.org
2023-06-01 9:22 ` [Bug tree-optimization/110062] " crazylht at gmail dot com
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-31 13:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
Bug ID: 110062
Summary: missed vectorization in graphicsmagick
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
Phoronix claims a 31% performance difference between gcc13 and clang on the sharpen
benchmark of graphicsmagick. On zen3 I reproduce only 4%, but the benchmark
has only a single short inner loop:
214
97.56% gm gm [.] ConvolveImage.
0.88% gm libgomp.so.1.0.0 [.] 0x000000000002
0.67% gm libc.so.6 [.] __memmove_avx_
GCC version:
2.38 │500:┌─→vmovss (%r8,%rax,4),%xmm2
0.04 │ │ movzbl 0x2(%rdx,%rax,4),%ebp
0.09 │ │ vcvtsi2ss %ebp,%xmm0,%xmm1
7.44 │ │ movzbl 0x1(%rdx,%rax,4),%ebp
0.16 │ │ vfmadd231ss %xmm1,%xmm2,%xmm7
30.23 │ │ vcvtsi2ss %ebp,%xmm0,%xmm1
2.38 │ │ movzbl (%rdx,%rax,4),%ebp
0.03 │ │ inc %rax
0.00 │ │ vfmadd231ss %xmm1,%xmm2,%xmm9
22.80 │ │ vcvtsi2ss %ebp,%xmm0,%xmm1
1.03 │ │ vfmadd231ss %xmm1,%xmm2,%xmm10
30.49 │ ├──cmp %rax,%rbx
0.18 │ └──jne 500
Clang version:
0.00 │1e70:┌─→movzbl 0x2(%rdx,%rsi,4),%r9d
0.05 │ │ vbroadcastss (%rcx,%rsi,4),%xmm3
0.56 │ │ movzwl (%rdx,%rsi,4),%r11d
0.05 │ │ inc %rsi
0.00 │ │ vcvtsi2ss %r9d,%xmm10,%xmm2
0.71 │ │ vfmadd231ss %xmm2,%xmm3,%xmm0
1.17 │ │ vmovd %r11d,%xmm2
0.00 │ │ vpmovzxbd %xmm2,%xmm2
0.06 │ │ vcvtdq2ps %xmm2,%xmm2
0.89 │ │ vfmadd231ps %xmm2,%xmm3,%xmm1
1.98 │ ├──cmp %rsi,%r10
0.00 │ └──jne 1e70
0.00 │ ↑ jmp 1630
Probably the same issue as PR109812, but it reproduces on Zen CPUs and the loop
is even shorter.
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: crazylht at gmail dot com @ 2023-06-01 9:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
One of the vectorizer issues is related to PR110018.
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-06-02 7:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
Status|UNCONFIRMED |WAITING
Ever confirmed|0 |1
Last reconfirmed| |2023-06-02
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Can you produce a testcase for the loop?
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: hubicka at gcc dot gnu.org @ 2023-06-06 20:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
#include <stddef.h>
struct pixel {float red, green, blue, opacity;};
struct ipixel {unsigned char red, green, blue, opacity;};
void test(float *k, struct ipixel *r, int width, int columns, struct ipixel *q)
{
  struct pixel pixel; /* accumulators are left uninitialized in this reduced testcase */
  for (int v=0; v < width; v++)
    {
      for (int u=0; u < width; u++)
        {
          pixel.red+=k[u]*r[u].red;
          pixel.green+=k[u]*r[u].green;
          pixel.blue+=k[u]*r[u].blue;
        }
      k+=width;
      r+=(size_t) columns+width;
    }
  q->red=pixel.red;
  q->green=pixel.green;
  q->blue=pixel.blue;
  q->opacity=255;
}
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-06-07 6:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Status|NEW |ASSIGNED
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we fail to vectorize the outer loop (with double reduction) because of
t.c:7:25: note: === vect_analyze_data_ref_accesses ===
t.c:7:25: note: Detected interleaving load _7->red and _7->green
t.c:7:25: note: Detected interleaving load _7->red and _7->blue
t.c:7:25: note: grouped access in outer loop.
t.c:7:25: missed: not vectorized: complicated access pattern.
For vectorizing the inner loop, SLP discovery fails because of a non-grouped
load: r[u].{red,green,blue} is handled but k[u] is not. I think this is a
well-known limitation (that ought to be fixed). We then vectorize the loop with
interleaving and peeling for gaps, but profitability says 'width' needs to
be 16. We also vectorize the epilog.
I suppose the vectorized body isn't entered?
Note outer loop vectorization likely isn't profitable even if implemented,
so the SLP failure is the thing to fix (which should be easy). Need to
find the duplicate bug for this.
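To illustrate why a small 'width' would never reach the vectorized body, here is a
rough sketch of a loop guarded by a runtime profitability threshold. The names
weighted_sum, VF and THRESHOLD are hypothetical, the value 16 is only taken from
the remark above, and the inner "vector" iteration is a scalar stand-in rather
than what GCC actually emits.

```c
/* Hypothetical vectorization factor and runtime threshold; the analysis
   above only says that 'width' needs to be 16 for the vector body to run. */
enum { VF = 16, THRESHOLD = 16 };

/* Weighted sum of one channel, structured like a versioned loop:
   a "vector" main loop that is only entered for n >= THRESHOLD,
   followed by a scalar epilogue for the remaining iterations.
   *vector_iters reports how many VF-wide iterations were executed. */
float weighted_sum(const float *k, const unsigned char *x, int n,
                   int *vector_iters)
{
    float acc = 0.0f;
    int i = 0;
    *vector_iters = 0;

    if (n >= THRESHOLD) {
        for (; i + VF <= n; i += VF) {
            /* Stand-in for one vector iteration covering VF elements. */
            for (int j = 0; j < VF; j++)
                acc += k[i + j] * x[i + j];
            ++*vector_iters;
        }
    }

    /* Scalar epilogue: the only code that runs when n < THRESHOLD. */
    for (; i < n; i++)
        acc += k[i] * x[i];

    return acc;
}
```

With a typical small sharpen kernel (say n = 5) the guard fails and only the
scalar epilogue executes, so the quality of the scalar code is what matters.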
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: hubicka at gcc dot gnu.org @ 2023-06-07 14:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
In sharpening, the number of iterations depends on the sharpen radius. Not sure what
it is for the benchmark, but in normal situations the number of iterations is
indeed not very large.
However, clang simply SLP-vectorizes the red and green channels into a vector of
size 2.
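As a sketch of that shape (red and green in a two-lane vector, blue kept
scalar), the inner loop could be written by hand with the GCC/Clang vector
extension as below. The function name row_rg_vector_b_scalar and the out[]
parameter are made up for illustration; this is not the code either compiler
generates, only a source-level picture of it.

```c
/* Two-lane float vector using the GCC/Clang vector extension. */
typedef float v2sf __attribute__((vector_size(8)));

struct ipixel { unsigned char red, green, blue, opacity; };

/* Accumulate one kernel row: red and green share a 2-lane vector,
   blue stays scalar.  out[0..2] receive the red, green and blue sums. */
void row_rg_vector_b_scalar(const float *k, const struct ipixel *r,
                            int width, float out[3])
{
    v2sf rg = { 0.0f, 0.0f };
    float b = 0.0f;

    for (int u = 0; u < width; u++) {
        v2sf kk = { k[u], k[u] };                 /* broadcast k[u] */
        v2sf px = { (float) r[u].red, (float) r[u].green };
        rg += kk * px;                            /* 2-lane multiply-add */
        b += k[u] * r[u].blue;                    /* scalar blue channel */
    }

    out[0] = rg[0];
    out[1] = rg[1];
    out[2] = b;
}
```

This mirrors the vfmadd231ps / vfmadd231ss pair in the clang loop from the
original report: one packed multiply-add for two channels and one scalar
multiply-add for the third.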
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-06-16 12:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, we would also be able to vectorize just the red and green channels:
t.c:18:27: note: ***** Analysis succeeded with vector mode V4SF
t.c:18:27: note: SLPing BB part
t.c:18:27: note: Costing subgraph:
t.c:18:27: note: node 0x420b6c8 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: q_45(D)->red = _29;
t.c:18:27: note: stmt 0 q_45(D)->red = _29;
t.c:18:27: note: stmt 1 q_45(D)->green = _31;
t.c:18:27: note: children 0x420b750
t.c:18:27: note: node (external) 0x420b750 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: stmt 0 _29 = (unsigned char) pixel$red_78;
t.c:18:27: note: stmt 1 _31 = (unsigned char) pixel$green_84;
t.c:18:27: note: children 0x420b7d8
t.c:18:27: note: node 0x420b7d8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note: stmt 0 pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>
t.c:18:27: note: stmt 1 pixel$green_84 = PHI <_144(11), pixel$green_61(D)(10)>
t.c:18:27: note: children 0x420b860 0x420be38
t.c:18:27: note: node 0x420b860 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _142 = PHI <_143(4)>
t.c:18:27: note: stmt 0 _142 = PHI <_143(4)>
t.c:18:27: note: stmt 1 _144 = PHI <_145(4)>
t.c:18:27: note: children 0x420b8e8
t.c:18:27: note: node 0x420b8e8 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _143 = PHI <_12(3)>
t.c:18:27: note: stmt 0 _143 = PHI <_12(3)>
t.c:18:27: note: stmt 1 _145 = PHI <_17(3)>
t.c:18:27: note: children 0x420b970
t.c:18:27: note: node 0x420b970 (max_nunits=2, refcnt=2) vector(2) float
t.c:18:27: note: op template: _12 = _11 + pixel$red_80;
t.c:18:27: note: stmt 0 _12 = _11 + pixel$red_80;
t.c:18:27: note: stmt 1 _17 = _16 + pixel$green_82;
t.c:18:27: note: children 0x420b9f8 0x420bca0
t.c:18:27: note: node 0x420b9f8 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _11 = _4 * _10;
t.c:18:27: note: stmt 0 _11 = _4 * _10;
t.c:18:27: note: stmt 1 _16 = _4 * _15;
t.c:18:27: note: children 0x420ba80 0x420bb08
t.c:18:27: note: node (external) 0x420ba80 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note: { _4, _4 }
t.c:18:27: note: node 0x420bb08 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: _10 = (float) _9;
t.c:18:27: note: stmt 0 _10 = (float) _9;
t.c:18:27: note: stmt 1 _15 = (float) _14;
t.c:18:27: note: children 0x420bb90
t.c:18:27: note: node (external) 0x420bb90 (max_nunits=2, refcnt=1) vector(2) int
t.c:18:27: note: stmt 0 _9 = (int) _8;
t.c:18:27: note: stmt 1 _14 = (int) _13;
t.c:18:27: note: children 0x420bc18
t.c:18:27: note: node 0x420bc18 (max_nunits=2, refcnt=1) vector(2) unsigned char
t.c:18:27: note: op template: _8 = _7->red;
t.c:18:27: note: stmt 0 _8 = _7->red;
t.c:18:27: note: stmt 1 _13 = _7->green;
t.c:18:27: note: node 0x420bca0 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note: stmt 0 pixel$red_80 = PHI <_12(9), pixel$red_79(5)>
t.c:18:27: note: stmt 1 pixel$green_82 = PHI <_17(9), pixel$green_85(5)>
t.c:18:27: note: children 0x420b970 0x420bd28
t.c:18:27: note: node 0x420bd28 (max_nunits=2, refcnt=1) vector(2) float
t.c:18:27: note: op template: pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note: stmt 0 pixel$red_79 = PHI <_143(8), pixel$red_60(D)(7)>
t.c:18:27: note: stmt 1 pixel$green_85 = PHI <_145(8), pixel$green_61(D)(7)>
t.c:18:27: note: children 0x420b8e8 0x420bdb0
t.c:18:27: note: node (external) 0x420bdb0 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note: { pixel$red_60(D), pixel$green_61(D) }
t.c:18:27: note: node (external) 0x420be38 (max_nunits=1, refcnt=1) vector(2) float
t.c:18:27: note: { pixel$red_60(D), pixel$green_61(D) }
But the '(external)' nodes show that we're missing support for some operations:
t.c:18:27: note: ==> examining statement: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note: vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed: conversion not supported by target.
t.c:18:27: note: vect_is_simple_use: operand pixel$red_78 = PHI <_142(11), pixel$red_60(D)(10)>, type of def: internal
t.c:18:27: missed: no optab.
t.c:18:27: missed: not vectorized: relevant stmt not supported: _29 = (unsigned char) pixel$red_78;
t.c:18:27: note: Building vector operands of 0x4215e90 from scalars instead
That's the float -> unsigned char conversion
for the stores:
q->red=pixel.red;
q->green=pixel.green;
We then cut the SLP off from that node; we're not considering keeping
the remains and materializing the sources of the conversions from vector
components. That is, we're not trying to split the SLP graph at
such edges but simply throw away the unreachable bits.
So there's this BB SLP issue, the issue that we're not vectorizing the loop,
and possibly the issue that we're not able to vectorize this conversion.
Btw, you didn't show me whether clang vectorizes the store (and this
conversion). clang 13 does:
vcvttps2dq %xmm1, %xmm1
vpackusdw %xmm1, %xmm1, %xmm1
vpackuswb %xmm1, %xmm1, %xmm1
vcvttss2si %xmm0, %eax
jmp .LBB0_9
.LBB0_1:
# implicit-def: $al
# implicit-def: $xmm1
.LBB0_9:
vpextrb $0, %xmm1, (%r8)
vpextrb $1, %xmm1, 1(%r8)
movb %al, 2(%r8)
movb $-1, 3(%r8)
So it doesn't vectorize the stores, and it vectorizes the conversions
by converting to int and then packing twice, first to short and then to char.
Since it extracts the bytes anyway, I suppose clang would have been
faster extracting the two floats and doing scalar conversions, as it
does for blue.
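The conversion chain in the clang output (vcvttps2dq, then vpackusdw, then
vpackuswb) can be modelled per lane in scalar C as in the sketch below. The
helper names are hypothetical and the model only covers inputs that fit in a
32-bit int. Note that a plain C cast from float to unsigned char is undefined
for out-of-range values, so the saturation seen here is a property of this
instruction sequence, not something the source code requires.

```c
#include <stdint.h>

/* Truncating float -> int32 conversion (per-lane effect of vcvttps2dq,
   for inputs that fit in int32). */
static int32_t cvtt(float f) { return (int32_t) f; }

/* Signed 32-bit -> unsigned 16-bit with saturation (vpackusdw per lane). */
static uint16_t pack_us_dw(int32_t v)
{
    return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t) v;
}

/* Signed 16-bit -> unsigned 8-bit with saturation (vpackuswb per lane). */
static uint8_t pack_us_wb(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t) v;
}

/* float -> int -> short -> char, one lane at a time. */
uint8_t float_to_u8_chain(float f)
{
    uint16_t w = pack_us_dw(cvtt(f));
    /* vpackuswb reinterprets the 16-bit lane as signed; the cast below
       relies on the usual two's-complement wrap for values above 32767. */
    return pack_us_wb((int16_t) w);
}
```

For channel values in the normal 0..255 range the chain is the identity on the
truncated value, which is all the testcase needs.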
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: crazylht at gmail dot com @ 2023-06-19 2:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
> pixel$red_60(D)(10)>, type of def: internal
> t.c:18:27: missed: no optab.
> t.c:18:27: missed: not vectorized: relevant stmt not supported: _29 =
> (unsigned char) pixel$red_78;
> t.c:18:27: note: Building vector operands of 0x4215e90 from scalars instead
>
> that's float -> unsigned char
>
A patch is posted to support vectorization between float and unsigned char
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620402.html
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-06-21 12:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop, but without SLP,
which means we get interleaving and a vectorization factor of 64. Turning
off loop vectorization yields the following, which is now comparable to
what clang does. Of course the loop-vectorized interleaving is inefficient
in the end ...
.p2align 4
.p2align 3
.L3:
movq %rax, %rdx
movq %rdi, %rax
.p2align 4
.p2align 3
.L4:
vpinsrw $0, (%rax), %xmm0, %xmm0
vmovss (%rdx), %xmm1
movzbl 2(%rax), %ecx
addq $4, %rdx
addq $4, %rax
vpmovzxbd %xmm0, %xmm0
vmovsldup %xmm1, %xmm4
vcvtdq2ps %xmm0, %xmm0
vfmadd231ps %xmm4, %xmm0, %xmm2
vcvtsi2ssl %ecx, %xmm5, %xmm0
vfmadd231ss %xmm0, %xmm1, %xmm3
cmpq %rsi, %rdx
jne .L4
incl %r9d
movq %r11, %rax
addq %rbx, %rdi
addq %rbp, %rsi
cmpl %r9d, %r10d
je .L2
addq %rbp, %r11
jmp .L3
.p2align 4
.p2align 3
.L2:
vcvttps2dq %xmm2, %xmm2
vpmovdb %xmm2, %xmm2
popq %rbx
.cfi_def_cfa_offset 16
vcvttss2sil %xmm3, %eax
popq %rbp
.cfi_def_cfa_offset 8
vpextrw $0, %xmm2, (%r8)
movb %al, 2(%r8)
movb $-1, 3(%r8)
ret
The loop cost modeling looks like
t.c:9:23: note: Cost model analysis:
Vector inside of loop cost: 1156
Vector prologue cost: 24
Vector epilogue cost: 5488
Scalar iteration cost: 168
Scalar outside cost: 32
Vector outside cost: 5512
prologue iterations: 0
epilogue iterations: 32
Calculated minimum iters for profitability: 33
t.c:9:23: note: Runtime profitability threshold = 64
t.c:9:23: note: Static estimate profitability threshold = 64
and we get a VF == 32 vectorized epilog as well:
t.c:9:23: note: Cost model analysis:
Vector inside of loop cost: 620
Vector prologue cost: 12
Vector epilogue cost: 2752
Scalar iteration cost: 168
Scalar outside cost: 32
Vector outside cost: 2764
prologue iterations: 0
epilogue iterations: 16
Calculated minimum iters for profitability: 17
t.c:9:23: note: Runtime profitability threshold = 32
t.c:9:23: note: Static estimate profitability threshold = 32
so at least we'll enter the BB SLP optimized scalar epilog in the likely case.
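As a rough cross-check of the two "Calculated minimum iters" lines, the sketch
below searches for the smallest iteration count at which a simple scalar cost
model exceeds a simple vector cost model. This is a simplified model written
for this note, not GCC's implementation; the function name and parameters are
hypothetical, the VF values 64 and 32 are taken from the text above, and the
prologue is ignored since both dumps report zero prologue iterations.

```c
/* Smallest iteration count n for which the modelled scalar cost exceeds
   the modelled vector cost:
       scalar_iter * n + scalar_outside
         > vec_inside * (n - epilogue_iters) / vf + vec_outside
   Both sides are multiplied by vf to stay in integer arithmetic. */
long long min_profitable_iters(long long vec_inside, long long vec_outside,
                               long long scalar_iter, long long scalar_outside,
                               long long vf, long long epilogue_iters)
{
    for (long long n = 1; n < 1000000; n++) {
        long long scalar = scalar_iter * vf * n + scalar_outside * vf;
        long long vector = vec_inside * (n - epilogue_iters) + vec_outside * vf;
        if (scalar > vector)
            return n;
    }
    return -1; /* never profitable within the search bound */
}
```

Feeding in the numbers from the first dump (1156, 5512, 168, 32, VF 64, 32
epilogue iterations) gives 33, and the second dump (620, 2764, 168, 32, VF 32,
16 epilogue iterations) gives 17, matching the reported values. The reported
runtime thresholds of 64 and 32 coincide with the respective VF.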
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-06-21 12:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note SLPing k[u] won't help to reduce the VF, only selecting a smaller vector
size would. The alternative is to have a power-of-two group size by using
masking for the 'opacity' field.
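A hand-written picture of the power-of-two idea, assuming the whole 4-byte
pixel is treated as a group of four lanes and the 'opacity' lane is masked
off, might look like the sketch below. It uses the GCC/Clang vector extension
and __builtin_convertvector; the function name row_four_lanes and the out[]
parameter are invented for illustration, and this is not a description of how
the vectorizer would implement masking.

```c
/* Four-lane vectors using the GCC/Clang vector extension. */
typedef float v4sf __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

struct ipixel { unsigned char red, green, blue, opacity; };

/* Accumulate one kernel row with all four channels in one vector.
   The opacity lane is masked to zero so it never contributes. */
void row_four_lanes(const float *k, const struct ipixel *r,
                    int width, float out[4])
{
    /* Lane mask: all-ones for red/green/blue, zero for opacity. */
    const v4si keep = { -1, -1, -1, 0 };
    v4sf acc = { 0.0f, 0.0f, 0.0f, 0.0f };

    for (int u = 0; u < width; u++) {
        v4si px = { r[u].red, r[u].green, r[u].blue, r[u].opacity };
        px &= keep;                                   /* drop the opacity lane */
        v4sf pf = __builtin_convertvector(px, v4sf);  /* int -> float per lane */
        v4sf kk = { k[u], k[u], k[u], k[u] };         /* broadcast k[u] */
        acc += kk * pf;
    }

    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
}
```

The group size becomes four instead of three, at the price of one lane of
wasted work per pixel.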
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-07-31 11:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
We now also apply SLP when vectorizing the loop, but as said, the high VF is
probably prohibitive and causes quite a bit of spilling:
.L7:
vmovdqu (%r14), %ymm2
vmovdqu 32(%r14), %ymm1
subq $-128, %r14
subq $-128, %rdx
vmovups -128(%rdx), %ymm10
vmovdqu -64(%r14), %ymm0
vpshufb .LC7(%rip), %ymm2, %ymm4
vmovups -96(%rdx), %ymm9
vmovups -64(%rdx), %ymm8
vpshufb .LC8(%rip), %ymm1, %ymm3
vpermq $78, %ymm4, %ymm4
vpermq $78, %ymm3, %ymm3
...
vmulps %ymm7, %ymm0, %ymm0
vaddps 136(%rsp), %ymm0, %ymm7
vaddps %ymm3, %ymm15, %ymm15
vmovaps %ymm4, 168(%rsp)
vmovaps %ymm7, 136(%rsp)
cmpq %r13, %r14
jne .L7
Maybe we should more aggressively reject vectorization when the VF is
equal to the number of vector lanes for the smallest element type. When we
then also detect SLP, this usually means BB-level SLP can do something.
Note we fail to support V2SF -> V2QI now, not sure what changed here.
vectorizable_conversion doesn't support float->int->short->char but
only either float->char, float->int->char or float->short->char, but
at least for 2-element vectors we don't support these (the vectorizer
could support extra intermediate steps as well).
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: hubicka at gcc dot gnu.org @ 2023-11-25 13:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
257
256
256
Average: 256 Iterations Per Minute
clang17 -O3 -flto -march=native -fopenmp
Operation: Sharpen:
257
256
256
Average: 256 Iterations Per Minute
So I guess I will need to try on zen3 to see if there is any difference.
The inner loop is:
0.00 │460:┌─→movzbl 0x2(%rdx,%rax,4),%esi
0.02 │ │ vmovss (%r8,%rax,4),%xmm2
0.95 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
20.22 │ │ movzbl 0x1(%rdx,%rax,4),%esi
0.01 │ │ vfmadd231ss %xmm1,%xmm2,%xmm3
11.97 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
18.76 │ │ movzbl (%rdx,%rax,4),%esi
0.00 │ │ inc %rax
0.72 │ │ vfmadd231ss %xmm1,%xmm2,%xmm4
12.55 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
14.95 │ │ vfmadd231ss %xmm1,%xmm2,%xmm5
15.93 │ ├──cmp %rax,%r13
0.35 │ └──jne 460
so it still does not get....
* [Bug tree-optimization/110062] missed vectorization in graphicsmagick
From: rguenth at gcc dot gnu.org @ 2023-11-27 7:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #11)
> trunk -O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> GCC13 -O3 -flto -march=native -fopenmp
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> clang17 -O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
> Average: 256 Iterations Per Minute
>
> So I guess I will need to try on zen3 to see if there is any difference.
>
> the internal loop is:
> 0.00 │460:┌─→movzbl 0x2(%rdx,%rax,4),%esi
> 0.02 │ │ vmovss (%r8,%rax,4),%xmm2
> 0.95 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
> 20.22 │ │ movzbl 0x1(%rdx,%rax,4),%esi
> 0.01 │ │ vfmadd231ss %xmm1,%xmm2,%xmm3
> 11.97 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
> 18.76 │ │ movzbl (%rdx,%rax,4),%esi
> 0.00 │ │ inc %rax
> 0.72 │ │ vfmadd231ss %xmm1,%xmm2,%xmm4
> 12.55 │ │ vcvtsi2ss %esi,%xmm0,%xmm1
> 14.95 │ │ vfmadd231ss %xmm1,%xmm2,%xmm5
> 15.93 │ ├──cmp %rax,%r13
> 0.35 │ └──jne 460
>
>
> so it still does not get....
As said, the VF is going to be prohibitively large; likely the vector code
is never entered and the above is the scalar "epilogue".
Thread overview: 13+ messages
2023-05-31 13:20 [Bug c/110062] New: missed vectorization in graphicsmagick hubicka at gcc dot gnu.org
2023-06-01 9:22 ` [Bug tree-optimization/110062] " crazylht at gmail dot com
2023-06-02 7:33 ` rguenth at gcc dot gnu.org
2023-06-06 20:22 ` hubicka at gcc dot gnu.org
2023-06-07 6:43 ` rguenth at gcc dot gnu.org
2023-06-07 14:43 ` hubicka at gcc dot gnu.org
2023-06-16 12:23 ` rguenth at gcc dot gnu.org
2023-06-19 2:15 ` crazylht at gmail dot com
2023-06-21 12:01 ` rguenth at gcc dot gnu.org
2023-06-21 12:53 ` rguenth at gcc dot gnu.org
2023-07-31 11:28 ` rguenth at gcc dot gnu.org
2023-11-25 13:33 ` hubicka at gcc dot gnu.org
2023-11-27 7:29 ` rguenth at gcc dot gnu.org