public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/88492] SLP optimization generates ugly code
[not found] <bug-88492-4@http.gcc.gnu.org/bugzilla/>
@ 2021-04-14 16:36 ` ptomsich at gcc dot gnu.org
2021-04-14 17:12 ` tnfchris at gcc dot gnu.org
` (2 subsequent siblings)
3 siblings, 0 replies; 4+ messages in thread
From: ptomsich at gcc dot gnu.org @ 2021-04-14 16:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
ptomsich at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ptomsich at gcc dot gnu.org
--- Comment #6 from ptomsich at gcc dot gnu.org ---
With the current master, the test case generates (with -mcpu=neoverse-n1):
        .arch armv8.2-a+crc+fp16+rcpc+dotprod+profile
        .file "pr88492.c"
        .text
        .align 2
        .p2align 5,,15
        .global test_slp
        .type test_slp, %function
test_slp:
.LFB0:
        .cfi_startproc
        ldr q2, [x0]
        adrp x1, .LC0
        ldr q16, [x1, #:lo12:.LC0]
        uxtl v4.8h, v2.8b
        uxtl2 v2.8h, v2.16b
        uxtl v0.4s, v4.4h
        uxtl v6.4s, v2.4h
        uxtl2 v4.4s, v4.8h
        uxtl2 v2.4s, v2.8h
        mov v1.16b, v0.16b
        mov v7.16b, v6.16b
        mov v5.16b, v4.16b
        mov v3.16b, v2.16b
        tbl v0.16b, {v0.16b - v1.16b}, v16.16b
        tbl v6.16b, {v6.16b - v7.16b}, v16.16b
        tbl v4.16b, {v4.16b - v5.16b}, v16.16b
        tbl v2.16b, {v2.16b - v3.16b}, v16.16b
        add v0.4s, v0.4s, v4.4s
        add v6.4s, v6.4s, v2.4s
        add v0.4s, v0.4s, v6.4s
        addv s0, v0.4s
        fmov w0, s0
        ret
        .cfi_endproc
.LFE0:
        .size test_slp, .-test_slp
which contrasts with LLVM13 (with -mcpu=neoverse-n1):
test_slp: // @test_slp
        .cfi_startproc
// %bb.0: // %entry
        ldr q0, [x0]
        movi v1.16b, #1
        movi v2.2d, #0000000000000000
        udot v2.4s, v0.16b, v1.16b
        addv s0, v2.4s
        fmov w0, s0
        ret
.Lfunc_end0:
        .size test_slp, .Lfunc_end0-test_slp
or (LLVM13 w/o the mcpu-option):
        .type test_slp,@function
test_slp: // @test_slp
        .cfi_startproc
// %bb.0: // %entry
        ldr q0, [x0]
        ushll2 v1.8h, v0.16b, #0
        ushll v0.8h, v0.8b, #0
        uaddl2 v2.4s, v0.8h, v1.8h
        uaddl v0.4s, v0.4h, v1.4h
        add v0.4s, v0.4s, v2.4s
        addv s0, v0.4s
        fmov w0, s0
        ret
.Lfunc_end0:
        .size test_slp, .Lfunc_end0-test_slp
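
For readers unfamiliar with the dot-product idiom in the -mcpu=neoverse-n1
LLVM output: udot against a vector of all-ones bytes sums each group of four
bytes into one 32-bit lane, and addv then adds the four lanes, so the whole
function reduces to a 16-byte sum. A minimal scalar sketch in C follows; the
names udot_model, sum16_via_udot and udot_selfcheck are hypothetical and exist
only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "udot Vd.4S, Vn.16B, Vm.16B": every 32-bit lane of acc
   accumulates the dot product of its own group of four unsigned bytes. */
void udot_model(uint32_t acc[4], const uint8_t n[16], const uint8_t m[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += (uint32_t)n[4 * lane + k] * m[4 * lane + k];
}

/* Mirrors the LLVM sequence: movi #1, movi #0, udot, addv. */
uint32_t sum16_via_udot(const uint8_t b[16])
{
    uint8_t ones[16];
    uint32_t acc[4] = { 0, 0, 0, 0 };

    for (int i = 0; i < 16; i++)
        ones[i] = 1;
    udot_model(acc, b, ones);
    return acc[0] + acc[1] + acc[2] + acc[3];   /* addv */
}

/* Compares the model against a plain byte-by-byte sum. */
void udot_selfcheck(void)
{
    uint8_t b[16];
    uint32_t plain = 0;

    for (int i = 0; i < 16; i++) {
        b[i] = (uint8_t)(i * 37 + 11);
        plain += b[i];
    }
    assert(sum16_via_udot(b) == plain);
}
```

Calling udot_selfcheck() from a small main() is enough to exercise the model.
It describes what the instruction sequence computes, not how either compiler
arrives at it.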
* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: tnfchris at gcc dot gnu.org @ 2021-04-14 17:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):
> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
>
> test_slp: // @test_slp
> .cfi_startproc
> // %bb.0: // %entry
> ldr q0, [x0]
> movi v1.16b, #1
> movi v2.2d, #0000000000000000
> udot v2.4s, v0.16b, v1.16b
> addv s0, v2.4s
> fmov w0, s0
> ret
> .Lfunc_end0:
> .size test_slp, .Lfunc_end0-test_slp
>
> or (LLVM13 w/o the mcpu-option):
>
> .type test_slp,@function
> test_slp: // @test_slp
> .cfi_startproc
> // %bb.0: // %entry
> ldr q0, [x0]
> ushll2 v1.8h, v0.16b, #0
> ushll v0.8h, v0.8b, #0
> uaddl2 v2.4s, v0.8h, v1.8h
> uaddl v0.4s, v0.4h, v1.4h
> add v0.4s, v0.4s, v2.4s
> addv s0, v0.4s
> fmov w0, s0
> ret
> .Lfunc_end0:
> .size test_slp, .Lfunc_end0-test_slp
It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.
Clang has simply reordered the loads, because the loop is simple enough to
reduce to just
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
Which GCC also handles fine.
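
To make the commutativity point concrete, here is a small sketch. It assumes
the test case stores b[1] into tmp[i][2] and b[2] into tmp[i][1], as the later
examples in this comment do; the test case itself is not quoted in this thread,
so that store order is an assumption. The function names are hypothetical.
Because the second loop adds up every element of tmp, the order in which the
bytes were stored cannot change the result.

```c
#include <assert.h>

/* Assumed shape of the test case: the middle two stores are swapped. */
int sum_permuted(const unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;

    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][2] = b[1];
        tmp[i][1] = b[2];
        tmp[i][3] = b[3];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Same reduction with the stores in memory order, as in the loop above. */
int sum_in_order(const unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;

    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Both variants must agree on any 16-byte input. */
void check_reorder_is_harmless(void)
{
    unsigned char b[16];

    for (int i = 0; i < 16; i++)
        b[i] = (unsigned char)(i * 17 + 3);
    assert(sum_permuted(b) == sum_in_order(b));
}
```

The equivalence holds only because every element feeds one big addition. Once
the lanes carry different operations, the store order is no longer free to
change.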
As Richi mentioned before:
> I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads, for the above vectorization looks somewhat
> pointless given we end up decomposing the result again.
Optimizing for this particular example seems too narrow a focus, as the real
code does "other" things.
For example, both GCC and Clang fall apart with
int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] + b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}
which has about the same access pattern as the real code.
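
A note for anyone reproducing this: the variant above reads b[7] in its last
iteration, after b has already advanced by 12, so it touches 20 bytes of input
rather than 16. Below is a minimal harness sketch under that observation. The
function body is copied from the example above; the names test_slp_mixed and
mixed_selfcheck, and the input pattern, are hypothetical.

```c
#include <assert.h>

/* Body copied from the mixed add/sub example; only the name is new. */
int test_slp_mixed(unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;

    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] + b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] + b[7];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

void mixed_selfcheck(void)
{
    unsigned char buf[20];   /* last iteration reads b[12 + 7], i.e. buf[19] */

    for (int i = 0; i < 20; i++)
        buf[i] = (unsigned char)i;

    /* Each difference is -4 and wraps in unsigned int. Converting the
       running total back to int is implementation-defined; GCC and Clang
       both reduce modulo 2^32, which gives 128 for this input. */
    assert(test_slp_mixed(buf) == 128);
}
```

Hand-checking the input 0..19: the two subtracting columns contribute -16 each,
the adding columns contribute 72 and 88, for a total of 128.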
If you change the operations, you'll notice that for other examples like
int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] - b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] - b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}
GCC handles this better (but we are let down by register allocation).
To me it seems quite unlikely that actual code would be written like that, but
I guess a case could be made for trying to reassociate loads as well.
* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: pinskia at gcc dot gnu.org @ 2022-01-04 19:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Similar to PR 99412.
* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: pinskia at gcc dot gnu.org @ 2024-02-27 7:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I noticed that once I add V4QI and V2HI support to the aarch64 backend, this
code gets even worse.