public inbox for gcc-bugs@sourceware.org
* [Bug target/99161] New: Suboptimal SVE code for ld4/st4 MLA code
@ 2021-02-19 10:27 ktkachov at gcc dot gnu.org
From: ktkachov at gcc dot gnu.org @ 2021-02-19 10:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161
Bug ID: 99161
Summary: Suboptimal SVE code for ld4/st4 MLA code
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
void ld_st_4 (char *x)
{
  for (int i = 0; i < 4096; i += 4)
    {
      char r = x[i];
      char g = x[i + 1];
      char b = x[i + 2];
      char a = x[i + 3];
      char smoosh = (r + g + b) * a;
      x[i] = r - smoosh;
      x[i+1] = g + smoosh;
      x[i+2] = b - smoosh;
      x[i+3] = a + smoosh;
    }
}
Compiling this with -O3 (no SVE) gives a nice loop on aarch64:
ld_st_4(char*):
        add     x1, x0, 4096
.L2:
        ld4     {v0.16b - v3.16b}, [x0]
        add     v4.16b, v0.16b, v1.16b
        add     v4.16b, v4.16b, v2.16b
        mul     v4.16b, v4.16b, v3.16b
        sub     v16.16b, v0.16b, v4.16b
        add     v17.16b, v4.16b, v1.16b
        sub     v18.16b, v2.16b, v4.16b
        add     v19.16b, v4.16b, v3.16b
        st4     {v16.16b - v19.16b}, [x0], 64
        cmp     x1, x0
        bne     .L2
        ret
With -O3 -march=armv8.2-a+sve we get:
ld_st_4(char*):
        mov     x1, 0
        mov     w2, 1024
        ptrue   p0.b, all
        whilelo p1.b, wzr, w2
.L2:
        ld4b    {z0.b - z3.b}, p1/z, [x0]
        add     z4.b, z1.b, z0.b
        add     z4.b, z4.b, z2.b
        movprfx z16, z0
        mls     z16.b, p0/m, z4.b, z3.b
        movprfx z17, z1
        mla     z17.b, p0/m, z4.b, z3.b
        movprfx z18, z2
        mls     z18.b, p0/m, z4.b, z3.b
        movprfx z19, z3
        mla     z19.b, p0/m, z4.b, z3.b
        st4b    {z16.b - z19.b}, p1, [x0]
        incb    x1
        incb    x0, all, mul #4
        whilelo p1.b, w1, w2
        b.any   .L2
        ret
There are a few things that could be improved here:
* Use x0 for the loop limit (compare the pointer against the end address
  rather than keeping a separate counter in x1)
* Use a single predicate (avoid multiple incb instructions)
* Factor in the cost of movprfx somehow (i.e. the destructive semantics of
  MLA/MLS), and prefer to use mul and add/sub
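The last point depends on the fused and unfused forms giving the same result per 8-bit lane. A minimal scalar sketch of that check follows; it models a single lane with uint8_t wraparound and is not a statement about how GCC costs these instructions. The function names (lane_mla, lane_mls, lane_mul, count_mismatches, self_check) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 8-bit lane of MLA: acc + m1 * m2, modulo 256. */
static uint8_t lane_mla(uint8_t acc, uint8_t m1, uint8_t m2)
{
    return (uint8_t)(acc + m1 * m2);
}

/* Hypothetical model of one 8-bit lane of MLS: acc - m1 * m2, modulo 256. */
static uint8_t lane_mls(uint8_t acc, uint8_t m1, uint8_t m2)
{
    return (uint8_t)(acc - m1 * m2);
}

/* Unfused form: a single MUL whose result is shared by later ADD/SUB. */
static uint8_t lane_mul(uint8_t m1, uint8_t m2)
{
    return (uint8_t)(m1 * m2);
}

/* Count the (acc, m1, m2) triples of 8-bit values for which the fused
   result differs from "mul, then add/sub". */
unsigned long count_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned acc = 0; acc < 256; acc++)
        for (unsigned m1 = 0; m1 < 256; m1++)
            for (unsigned m2 = 0; m2 < 256; m2++) {
                uint8_t prod = lane_mul((uint8_t)m1, (uint8_t)m2);
                bad += lane_mla((uint8_t)acc, (uint8_t)m1, (uint8_t)m2)
                       != (uint8_t)(acc + prod);
                bad += lane_mls((uint8_t)acc, (uint8_t)m1, (uint8_t)m2)
                       != (uint8_t)(acc - prod);
            }
    return bad;
}

/* Aborts if any lane value disagrees between the two forms. */
void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

In the test case the product (r + g + b) * a is shared by all four outputs, so the unfused form needs one mul and four add/sub operations, whereas the fused form needs a movprfx in front of each of the four MLA/MLS instructions.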
A better SVE loop would look a lot like Neon:
ld_st_4(char*):
        add     x1, x0, 4096
        ptrue   p0.b, all
.L2:
        ld4b    {z0.b - z3.b}, p0/z, [x0]
        add     z4.b, z1.b, z0.b
        add     z4.b, z4.b, z2.b
        mul     z4.b, p0/m, z4.b, z3.b
        sub     z16.b, z0.b, z4.b
        add     z17.b, z4.b, z1.b
        sub     z18.b, z2.b, z4.b
        add     z19.b, z4.b, z3.b
        st4b    {z16.b - z19.b}, p0, [x0]
        incb    x0, all, mul #4
        cmp     x1, x0
        bne     .L2
        ret
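The loop above uses an all-true predicate and an exact "cmp / bne" exit, so it relies on the per-iteration step (four vectors of bytes) dividing the 4096-byte buffer evenly. A small sketch of that arithmetic is given here, assuming the vector length is a multiple of 128 bits up to 2048; the helper names (step_bytes, fixed_limit_is_exact, trip_count) are hypothetical, and whether non-power-of-two lengths need to be handled is not something the report states.

```c
#include <assert.h>
#include <stdbool.h>

enum { BUFFER_BYTES = 4096 };  /* loop bound taken from the test case */

/* Bytes advanced per iteration by "incb x0, all, mul #4":
   four vectors of vl_bits / 8 bytes each. */
static unsigned step_bytes(unsigned vl_bits)
{
    return 4u * (vl_bits / 8u);
}

/* True when the pointer lands exactly on x0 + 4096, so that an
   equality-based exit terminates and unpredicated accesses stay
   inside the buffer. */
bool fixed_limit_is_exact(unsigned vl_bits)
{
    return BUFFER_BYTES % step_bytes(vl_bits) == 0;
}

/* Number of iterations when the limit is exact. */
unsigned trip_count(unsigned vl_bits)
{
    assert(fixed_limit_is_exact(vl_bits));
    return BUFFER_BYTES / step_bytes(vl_bits);
}
```

Under this model the power-of-two lengths from 128 to 2048 bits all divide the buffer exactly, while a length such as 384 bits (a 192-byte step) does not, which is the case the whilelo-governed predicate in the generated code covers.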
* [Bug target/99161] Suboptimal SVE code for ld4/st4 MLA code
@ 2024-02-27 8:37 ` pinskia at gcc dot gnu.org
From: pinskia at gcc dot gnu.org @ 2024-02-27 8:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |13.0
             Status|UNCONFIRMED                 |RESOLVED
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Fixed in GCC 13.