public inbox for gcc-bugs@sourceware.org
* [Bug target/107916] New: PPC VSX code generation for OpenZFS
From: dje at gcc dot gnu.org @ 2022-11-29 14:38 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Bug ID: 107916
Summary: PPC VSX code generation for OpenZFS
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dje at gcc dot gnu.org
CC: bergner at gcc dot gnu.org, segher at gcc dot gnu.org
Target Milestone: ---
Target: powerpc64le-*-linux
https://github.com/openzfs/zfs/pull/14234
GCC codegen https://gcc.godbolt.org/z/bhPo9sWsx
Clang codegen https://gcc.godbolt.org/z/4rTEe3WMG
Clang's output is relatively compact and efficient:
.LBB0_2: # =>This Inner Loop Header: Depth=1
lxvd2x 1, 0, 4
addi 4, 4, 16
xxswapd 1, 1
xxmrghw 40, 0, 1
xxmrglw 41, 0, 1
vaddudm 7, 7, 8
vaddudm 6, 6, 9
vaddudm 1, 7, 1
vaddudm 5, 6, 5
vaddudm 0, 1, 0
vaddudm 4, 5, 4
vaddudm 3, 0, 3
vaddudm 2, 4, 2
bdnz .LBB0_2
GCC's output is rather less efficient.
* [Bug target/107916] PPC VSX code generation for OpenZFS
From: dje at gcc dot gnu.org @ 2022-11-29 14:39 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
David Edelsohn <dje at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2022-11-29
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #1 from David Edelsohn <dje at gcc dot gnu.org> ---
Confirmed.
* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 14:54 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What      |Removed                     |Added
----------------------------------------------------------------------------
Keywords  |                            |missed-optimization
Summary   |PPC VSX code generation for |vector_size(32) is
          |OpenZFS                     |inefficient for VSX on
          |                            |powerpc64
Component |target                      |middle-end
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Reduced testcase:
```
#include <stdint.h>
typedef uint32_t u32x4 __attribute__ ((vector_size (16)));
typedef uint32_t u32x8 __attribute__ ((vector_size (32)));
typedef uint64_t u64x4 __attribute__ ((vector_size (32)));
#pragma GCC push_options
#if defined(__x86_64__)
#ifdef __clang_major__
#pragma clang attribute push(__attribute__((target("avx2"))), \
apply_to = function)
#else
#pragma GCC target ("avx2")
#endif
#elif defined(__powerpc64__)
#ifdef __clang_major__
#pragma clang attribute push(__attribute__((target("vsx,block-ops-unaligned-vsx,power8-vector"))), \
apply_to = function)
#else
#pragma GCC target ("vsx,block-ops-unaligned-vsx,power8-vector,power9-vector")
#endif
#endif
void f(int n, u32x8 *a, u32x8 *b)
{
u32x8 c = {0};
for(int i = 0; i < n; i++)
c+=*a;
*b += c;
}
#ifdef __clang_major__
#if defined(__x86_64__) || defined(__powerpc64__)
#pragma clang attribute pop
#endif
#else
#pragma GCC pop_options
#endif
```
Basically, what is going wrong is that c is being forced onto the stack,
whereas I had expected c's PHI node to be split during vector lowering.
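A possible source-level workaround, as a minimal sketch: split the 32-byte
accumulator by hand into two 16-byte halves, so that each loop-carried value
has a vector mode the target supports. The name `f_split` and the `may_alias`
typedef are assumptions made for this illustration and are not taken from the
OpenZFS code; whether this actually avoids the stack traffic on a given target
and GCC version would need to be checked in the generated assembly.

```c
#include <stdint.h>

/* 16-byte vector; may_alias lets it view the halves of a 32-byte vector. */
typedef uint32_t u32x4 __attribute__ ((vector_size (16), may_alias));
/* 32-byte vector, as in the reduced testcase. */
typedef uint32_t u32x8 __attribute__ ((vector_size (32)));

/* Hypothetical variant of f() with the accumulator split by hand. */
void f_split(int n, const u32x8 *a, u32x8 *b)
{
  /* A u32x8 is 32-byte aligned, so both 16-byte halves are aligned too. */
  const u32x4 *a_half = (const u32x4 *) a;
  u32x4 *b_half = (u32x4 *) b;

  /* Two independent accumulators instead of one 32-byte value. */
  u32x4 c_lo = {0};
  u32x4 c_hi = {0};

  for (int i = 0; i < n; i++)
    {
      c_lo += a_half[0];
      c_hi += a_half[1];
    }

  b_half[0] += c_lo;
  b_half[1] += c_hi;
}
```

Calling `f_split(n, a, b)` should leave the same value in `*b` as `f(n, a, b)`
from the testcase above.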
* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 14:55 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
aarch64 has a similar issue too:
.L3:
add w1, w1, 1
add v0.4s, v5.4s, v2.4s
add v1.4s, v4.4s, v3.4s
mov v2.16b, v0.16b
mov v3.16b, v1.16b
cmp w0, w1
bne .L3
Though it is not as bad: it is just extra moves inside the loop, since there
is an OI mode there.
This is a generic vector lowering issue, I think.
* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 15:07 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Reduced even further; just compiling with `-O2 -mvsx` is enough to show the
issue:
```
typedef unsigned u32x8 __attribute__ ((vector_size (32)));
void f(int n, u32x8 *a, u32x8 *b)
{
u32x8 c = {0};
for(int i = 0; i < n; i++)
c+=*a;
*b += c;
}
```
With the above you can also see the issue on x86_64 with just -O2 (without
turning on AVX-512 or anything):
.L3:
movdqa xmm4, XMMWORD PTR [rsp-32]
movdqa xmm5, XMMWORD PTR [rsp-16]
add eax, 1
paddd xmm4, xmm2
paddd xmm5, xmm3
movaps XMMWORD PTR [rsp-32], xmm4
movaps XMMWORD PTR [rsp-16], xmm5
cmp edi, eax
jne .L3
Note the extra loads and stores of the accumulator on every iteration.
* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: rguenth at gcc dot gnu.org @ 2022-11-30 8:48 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Version|unknown |13.0
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Vector lowering only lowers "operations"; it doesn't touch data transfers,
that is, {reg,mem} <-> {mem,reg} copies, even if they are performed as part of
PHI node copies. In the end this means variables with unsupported vector modes
will be expanded to stack variables. Note there's a later forwprop pass which
will deal with the loads/stores in most cases (but that's really an
afterthought); nothing handles the (loop) PHI node case, so we end up with
<bb 4> [local count: 955630225]:
# c_15 = PHI <c_12(4), { 0, 0, 0, 0, 0, 0, 0, 0 }(3)>
# i_17 = PHI <i_13(4), 0(3)>
_4 = BIT_FIELD_REF <c_15, 128, 0>;
_6 = _4 + _5;
_18 = BIT_FIELD_REF <c_15, 128, 128>;
_19 = _14 + _18;
c_12 = {_6, _19};
i_13 = i_17 + 1;
if (n_7(D) != i_13)
goto <bb 4>; [89.00%]
else
goto <bb 5>; [11.00%]
<bb 5> [local count: 118111600]:
# c_16 = PHI <c_12(4), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
Vector lowering would need to work more like Complex lowering to improve
things here. I'm not sure if stmt-by-stmt lowering of PHIs and
other reg-reg copies will give the desired results (esp. when backedges
are involved).
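To illustrate the analogy, here is a hedged sketch of the kind of source-level
transformation that complex lowering corresponds to: a complex accumulator in
a loop is decomposed into separate real and imaginary scalars, so each
component gets its own loop-carried value. The function names are
hypothetical, `__real__`/`__imag__` are GNU C extensions, and this is a
hand-written illustration rather than the output of any GCC pass.

```c
/* As written in the source: one complex-valued accumulator. */
double _Complex sum_complex(int n, const double _Complex *a)
{
  double _Complex c = 0;
  for (int i = 0; i < n; i++)
    c += a[i];
  return c;
}

/* Hand-written component-wise form: the accumulator (and hence its
   loop PHI) becomes two independent scalars.  */
double _Complex sum_components(int n, const double _Complex *a)
{
  double re = 0.0;
  double im = 0.0;
  for (int i = 0; i < n; i++)
    {
      re += __real__ a[i];
      im += __imag__ a[i];
    }

  /* Reassemble the complex result from its two parts. */
  double _Complex c;
  __real__ c = re;
  __imag__ c = im;
  return c;
}
```

The vector equivalent would treat a `vector_size(32)` PHI as two 16-byte
components in the same way.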
* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: pinskia at gcc dot gnu.org @ 2024-04-03 16:52 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |liuhongt at gcc dot gnu.org
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 114570 has been marked as a duplicate of this bug. ***
* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: pinskia at gcc dot gnu.org @ 2024-04-03 16:54 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ajidala at gmail dot com
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 100745 has been marked as a duplicate of this bug. ***