[Bug target/100745] New: GCC generates suboptimal assembly from vector extensions on AArch64

public inbox for gcc-bugs@sourceware.org

* [Bug target/100745] New: GCC generates suboptimal assembly from vector extensions on AArch64
From: ajidala at gmail dot com @ 2021-05-24 14:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

            Bug ID: 100745
           Summary: GCC generates suboptimal assembly from vector
                    extensions on AArch64
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ajidala at gmail dot com
  Target Milestone: ---

Created attachment 50861
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50861&action=edit
The profile.c file minimal benchmark/test case

As part of an attempt to make mpv's scaletempo2 audio filter faster, two
vectorised implementations were written:

The first one, mine, uses aarch64 intrinsics. It shows a 3.14x speedup on my
test system, and is referred to as "new" or "nicolas" in the code.
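
For readers who have not seen the intrinsics style, here is a minimal sketch of what such code tends to look like. It is not the code from profile.c: it assumes a dot-product-style kernel (the FMA operations in the GIMPLE dump further down the thread hint at this, but the report does not spell it out), and the name dot_intrinsics is hypothetical. The NEON path is only compiled on AArch64; elsewhere the scalar loop does all the work.

```c
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not taken from profile.c: dot product of two
 * float arrays of length n. */
float dot_intrinsics(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
#if defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3], one fused multiply-add per lane */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    sum = vaddvq_f32(acc); /* horizontal add of the four lanes */
#endif
    /* Scalar tail; on non-AArch64 targets this handles the whole array. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The drawback mentioned later in the report is visible here: the fast path is tied to one architecture and needs a separate fallback.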

The second one, by haasn, also referred to as "niklas" in the code, uses GCC's
vector extensions to automatically generate vectorised code for a wide variety
of architectures. It shows a smaller speedup on my system and another aarch64
test system (1.45x) but a much better speedup on x86_64 (>2x for generic,
>10x for -march=native on this Zen+ laptop, thanks to AVX).

Clang, on the other hand, compiles the vector extension code down to something
more efficient than gcc does, beating my intrinsics SIMD (even in absolute terms
compared to gcc). I believe this is due to a bug in gcc making it produce
subpar vector assembly on aarch64 in this case.
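
As a rough illustration of the vector-extension style discussed above, the sketch below writes a dot product with a generic vector type. It is not haasn's code: the function name dot_vector_ext, the dot-product kernel, and the 32-byte vector width are all assumptions made for the example. Both GCC and Clang accept the vector_size attribute, element-wise arithmetic, and subscripting used here.

```c
#include <stddef.h>
#include <string.h>

/* Eight floats per vector; the width is an assumption for this sketch. */
typedef float v8f __attribute__((vector_size(32)));

/* Hypothetical helper, not taken from profile.c. */
float dot_vector_ext(const float *a, const float *b, size_t n)
{
    v8f vsum = {0};
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8f va, vb;
        memcpy(&va, a + i, sizeof va); /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof vb);
        vsum += va * vb;               /* element-wise multiply, then accumulate */
    }
    float sum = 0.0f;
    for (size_t k = 0; k < 8; k++)
        sum += vsum[k];                /* horizontal reduction by subscripting */
    for (; i < n; i++)
        sum += a[i] * b[i];            /* scalar tail */
    return sum;
}
```

The same source compiles for any target; the compiler is left to map the 32-byte vector onto whatever registers the hardware has, which is where the two compilers differ on AArch64.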

Since we'd rather not keep platform-specific vector code around in mpv, and
clang's codegen is overall worse in non-vector code, we'd much appreciate it if
someone could look into what gcc is tripping over here.

Attached is the minimal microbenchmark profile.c. It needs no special
options or includes aside from stdio, so I have not attached a .i file; I hope
that's alright. My test system is a Cortex-A53 in-order core, though passing
-mtune and -march for that core does not fix it, and the problem also shows up
on a Cortex-A55 in-order core.

The test was compiled with gcc -O3 -o profile profile.c, though it is worth
noting that the pure C implementation performs much better under -O2 (possibly
a separate bug) while both SIMD versions are largely unaffected by this.

GCC Version: 10.2.0
Distribution: Arch Linux ARM
Platform: ROCK64 with a RK3328 (4x Cortex-A53, 2GB RAM)

The options used for building gcc can be found here, in build():
https://archlinuxarm.org/packages/aarch64/gcc/files/PKGBUILD

I've looked at the disassembly from gcc trunk on godbolt, but it did not look
different enough for me to think this has already been fixed in
trunk. If required, I can try building gcc trunk from source.


* [Bug target/100745] GCC generates suboptimal assembly from vector extensions on AArch64
From: ajidala at gmail dot com @ 2021-05-24 20:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

--- Comment #1 from Nicolas F. <ajidala at gmail dot com> ---
I'll attach a second version of profile.c, with the vector extension code
that's actually going to be used in mpv (some cleanup has been done).
Performance is unchanged. Some absolute numbers from gcc 11.1.0:

$ ./profile 
old: 811703
nicolas: 262007 (3.10x as fast)
niklas: 679524 (1.19x as fast)

Some absolute numbers from Clang -O3:

$ ./profile 
old: 1547552
nicolas: 269081 (5.75x as fast)
niklas: 246508 (6.28x as fast)

As you can see, Clang does significantly worse on the C version (yay GCC!), but
significantly better on the vector version, most importantly in absolute terms:
its code is more than twice as fast as GCC's.
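
The "x as fast" figures are plain ratios of the printed timings, and the cross-compiler comparison can be checked the same way. A minimal sketch follows; the helper name speedup is hypothetical, and the timings are in whatever unit profile.c prints.

```c
/* Ratio of two timings in the same unit; a result above 1 means
 * `candidate` is faster than `baseline`. Hypothetical helper. */
double speedup(double baseline, double candidate)
{
    return baseline / candidate;
}
```

With the numbers above, speedup(811703, 679524) is about 1.19 (GCC, "old" vs "niklas"), speedup(1547552, 246508) is about 6.28 (Clang, "old" vs "niklas"), and speedup(679524, 246508) is about 2.76, which is the gap between the two compilers on the same vector-extension code.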

Looking at GCC's assembly output, I can see some odd choices, such as shuffling
vectors around on the stack instead of using the other scratch registers
(v21-v30), whereas clang does use those scratch registers.


* [Bug target/100745] GCC generates suboptimal assembly from vector extensions on AArch64
From: ajidala at gmail dot com @ 2021-05-24 20:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

Nicolas F. <ajidala at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #50861|0                           |1
        is obsolete|                            |

--- Comment #2 from Nicolas F. <ajidala at gmail dot com> ---
Created attachment 50863
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50863&action=edit
Updated profile.c with the haasn code that will end up in mpv


* [Bug tree-optimization/100745] GCC generates suboptimal assembly from vector extensions on AArch64
From: pinskia at gcc dot gnu.org @ 2024-02-27  8:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-02-27
          Component|target                      |tree-optimization
             Status|UNCONFIRMED                 |NEW

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
```
  # vsum$0_107 = PHI <_47(11), _29(10)>
  _232 = BIT_FIELD_REF <vsum$0_107, 128, 0>;
  _231 = .FMA (_100, _101, _232);
  _230 = BIT_FIELD_REF <vsum$0_107, 128, 128>;
  _229 = .FMA (_234, _235, _230);
...
  _47 = {_231, _229};

...
```

Confirmed. I think I have seen this before: inside the loop we still keep the
generic vector together as a whole, and IIRC this is what causes the stores.
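
In the dump above, the accumulator vsum is read as two 128-bit pieces (BIT_FIELD_REF at bit offsets 0 and 128), each piece goes through its own FMA, and the results are glued back into one wide vector at the end of every iteration. The sketch below shows the same computation written with two separate 128-bit accumulators in the source, so no wide vector is carried around the loop. It is only an illustration of the pattern: the function name dot_two_halves and the dot-product kernel are assumptions, and nothing in this thread confirms that writing the code this way changes GCC's output.

```c
#include <stddef.h>
#include <string.h>

/* Four floats, i.e. one 128-bit vector. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical helper, not taken from profile.c. */
float dot_two_halves(const float *a, const float *b, size_t n)
{
    v4f lo = {0}, hi = {0}; /* two independent 128-bit accumulators */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v4f a_lo, a_hi, b_lo, b_hi;
        memcpy(&a_lo, a + i, sizeof a_lo);
        memcpy(&a_hi, a + i + 4, sizeof a_hi);
        memcpy(&b_lo, b + i, sizeof b_lo);
        memcpy(&b_hi, b + i + 4, sizeof b_hi);
        lo += a_lo * b_lo;  /* corresponds to the FMA on bits 0..127 */
        hi += a_hi * b_hi;  /* corresponds to the FMA on bits 128..255 */
    }
    v4f both = lo + hi;     /* combine once, after the loop */
    float sum = both[0] + both[1] + both[2] + both[3];
    for (; i < n; i++)
        sum += a[i] * b[i]; /* scalar tail */
    return sum;
}
```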


* [Bug tree-optimization/100745] GCC generates suboptimal assembly from vector extensions on AArch64
From: pinskia at gcc dot gnu.org @ 2024-04-03 16:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|NEW                         |RESOLVED

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Duplicate. Yes, this one is older than the other bug, but I feel PR 107916 has
the best info in it.

*** This bug has been marked as a duplicate of bug 107916 ***

