[Bug target/99161] New: Suboptimal SVE code for ld4/st4 MLA code

public inbox for gcc-bugs@sourceware.org

* [Bug target/99161] New: Suboptimal SVE code for ld4/st4 MLA code
@ 2021-02-19 10:27 ktkachov at gcc dot gnu.org
  2024-02-27  8:37 ` [Bug target/99161] " pinskia at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2021-02-19 10:27 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161

            Bug ID: 99161
           Summary: Suboptimal SVE code for ld4/st4 MLA code
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

void ld_st_4 (char *x)
{
    for (int i = 0; i < 4096; i += 4)
    {
        char r = x[i];
        char g = x[i + 1];
        char b = x[i + 2];
        char a = x[i + 3];
        char smoosh = (r + g + b) * a;
        x[i] = r - smoosh;
        x[i+1] = g + smoosh;
        x[i+2] = b - smoosh;
        x[i+3] = a + smoosh;
    }
}

With -O3 (no SVE), this testcase gives a nice loop on aarch64:

ld_st_4(char*):
        add     x1, x0, 4096
.L2:
        ld4     {v0.16b - v3.16b}, [x0]
        add     v4.16b, v0.16b, v1.16b
        add     v4.16b, v4.16b, v2.16b
        mul     v4.16b, v4.16b, v3.16b
        sub     v16.16b, v0.16b, v4.16b
        add     v17.16b, v4.16b, v1.16b
        sub     v18.16b, v2.16b, v4.16b
        add     v19.16b, v4.16b, v3.16b
        st4     {v16.16b - v19.16b}, [x0], 64
        cmp     x1, x0
        bne     .L2
        ret

With -O3 -march=armv8.2-a+sve we get:

ld_st_4(char*):
        mov     x1, 0
        mov     w2, 1024
        ptrue   p0.b, all
        whilelo p1.b, wzr, w2
.L2:
        ld4b    {z0.b - z3.b}, p1/z, [x0]
        add     z4.b, z1.b, z0.b
        add     z4.b, z4.b, z2.b
        movprfx z16, z0
        mls     z16.b, p0/m, z4.b, z3.b
        movprfx z17, z1
        mla     z17.b, p0/m, z4.b, z3.b
        movprfx z18, z2
        mls     z18.b, p0/m, z4.b, z3.b
        movprfx z19, z3
        mla     z19.b, p0/m, z4.b, z3.b
        st4b    {z16.b - z19.b}, p1, [x0]
        incb    x1
        incb    x0, all, mul #4
        whilelo p1.b, w1, w2
        b.any   .L2
        ret

There are a few things that could be improved here:
* Use x0 for the limit (compare the pointer against an end address instead of
keeping a separate counter in x1)
* Use a single predicate (which also avoids the multiple incb instructions)
* Factor in the cost of movprfx somehow (i.e. the destructive semantics of
MLA/MLS), and prefer to use mul and add/sub
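
The last point can be illustrated with a scalar C model of one 4-byte group. This is only a sketch, not GCC output: the function names are hypothetical, and uint8_t is used in place of char so that the 8-bit wraparound is well defined. The MLA/MLS form copies each input into a fresh accumulator and multiplies four times, while the mul form computes the product once and then uses plain add/sub.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 4-byte group in the MLA/MLS form:
   each output starts as a copy of its input (the movprfx) and then
   accumulates sum * a destructively. */
static void group_mla_form(const uint8_t in[4], uint8_t out[4])
{
    uint8_t sum = (uint8_t)(in[0] + in[1] + in[2]);
    uint8_t a = in[3];

    out[0] = in[0]; out[0] = (uint8_t)(out[0] - sum * a);  /* movprfx + mls */
    out[1] = in[1]; out[1] = (uint8_t)(out[1] + sum * a);  /* movprfx + mla */
    out[2] = in[2]; out[2] = (uint8_t)(out[2] - sum * a);  /* movprfx + mls */
    out[3] = in[3]; out[3] = (uint8_t)(out[3] + sum * a);  /* movprfx + mla */
}

/* Hypothetical scalar model of the mul + add/sub form: the product is
   computed once and reused by four non-destructive operations. */
static void group_mul_form(const uint8_t in[4], uint8_t out[4])
{
    uint8_t sum = (uint8_t)(in[0] + in[1] + in[2]);
    uint8_t smoosh = (uint8_t)(sum * in[3]);               /* single mul */

    out[0] = (uint8_t)(in[0] - smoosh);                    /* sub */
    out[1] = (uint8_t)(in[1] + smoosh);                    /* add */
    out[2] = (uint8_t)(in[2] - smoosh);                    /* sub */
    out[3] = (uint8_t)(in[3] + smoosh);                    /* add */
}

/* Compares both forms over a grid of sample inputs and returns the
   number of groups on which they disagree. */
int count_mismatches(void)
{
    int mismatches = 0;

    for (unsigned r = 0; r < 256; r += 17)
        for (unsigned g = 0; g < 256; g += 17)
            for (unsigned b = 0; b < 256; b += 17)
                for (unsigned a = 0; a < 256; a += 17) {
                    uint8_t in[4] = { (uint8_t)r, (uint8_t)g,
                                      (uint8_t)b, (uint8_t)a };
                    uint8_t x[4], y[4];

                    group_mla_form(in, x);
                    group_mul_form(in, y);
                    for (int k = 0; k < 4; k++)
                        if (x[k] != y[k]) {
                            mismatches++;
                            break;
                        }
                }
    return mismatches;
}
```

Both forms give the same bytes because the arithmetic is modulo 256 either way, so the choice between them is purely a cost question.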

A better SVE loop would look a lot like the Neon one:
ld_st_4(char*):
        add     x1, x0, 4096
        ptrue   p0.b, all
.L2:
        ld4b    {z0.b - z3.b}, p0/z, [x0]
        add     z4.b, z1.b, z0.b
        add     z4.b, z4.b, z2.b
        mul     z4.b, p0/m, z4.b, z3.b
        sub     z16.b, z0.b, z4.b
        add     z17.b, z4.b, z1.b
        sub     z18.b, z2.b, z4.b
        add     z19.b, z4.b, z3.b
        st4b    {z16.b - z19.b}, p0, [x0]
        incb    x0, all, mul #4
        cmp     x1, x0
        bne     .L2
        ret
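
The difference in loop control between the two SVE versions can be modelled in plain C. This is a rough sketch under stated assumptions: vl_bytes is a hypothetical stand-in for the vector length in bytes, the function names are made up for illustration, and the pointer form as written only terminates when total_bytes is a multiple of 4 * vl_bytes.

```c
#include <stddef.h>

/* Models the current output: a separate counter (x1) drives the
   predicate while the pointer (x0) is advanced independently. */
size_t trips_counter_form(size_t total_bytes, size_t vl_bytes)
{
    size_t trips = 0;
    size_t i = 0;                       /* x1: counter fed to whilelo */
    size_t offset = 0;                  /* x0: byte offset into x */
    size_t groups = total_bytes / 4;    /* w2: 1024 for 4096 bytes */

    while (i < groups) {                /* whilelo p1.b, w1, w2 / b.any */
        i += vl_bytes;                  /* incb x1 */
        offset += 4 * vl_bytes;         /* incb x0, all, mul #4 */
        trips++;
    }
    (void)offset;                       /* only tracked to mirror x0 */
    return trips;
}

/* Models the suggested output: one induction variable compared against
   a precomputed limit. Assumes total_bytes is a nonzero multiple of
   4 * vl_bytes; otherwise the != test below would not stop the loop
   at the limit. */
size_t trips_pointer_form(size_t total_bytes, size_t vl_bytes)
{
    size_t trips = 0;
    size_t offset = 0;                  /* x0 */
    size_t limit = total_bytes;         /* x1 = x0 + 4096 */

    do {
        offset += 4 * vl_bytes;         /* incb x0, all, mul #4 */
        trips++;
    } while (offset != limit);          /* cmp x1, x0 / bne */
    return trips;
}
```

For the inputs where the assumption holds, both forms run the same number of iterations; the pointer form just does it with one fewer induction variable and no per-iteration predicate update.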


* [Bug target/99161] Suboptimal SVE code for ld4/st4 MLA code
  2021-02-19 10:27 [Bug target/99161] New: Suboptimal SVE code for ld4/st4 MLA code ktkachov at gcc dot gnu.org
@ 2024-02-27  8:37 ` pinskia at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-27  8:37 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |13.0
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Fixed in GCC 13.
