[Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics
@ 2021-01-29  6:51 spop at gcc dot gnu.org
  2021-01-29  7:39 ` [Bug target/98877] " pinskia at gcc dot gnu.org
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: spop at gcc dot gnu.org @ 2021-01-29  6:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

            Bug ID: 98877
           Summary: [AArch64] Inefficient code generated for tbl NEON
                    intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

The use of NEON intrinsics is inefficient and leads developers to prefer inline
assembly instead of intrinsics.

A similar performance bug for vmlal intrinsics was reported in
https://gcc.gnu.org/PR92665
The code generated by GCC for table lookups is also inefficient:

$ cat red.c
#include "arm_neon.h"

uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {lo, hi} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

$ gcc -O3 -S -o- red.c
fun:
        mov     v4.16b, v0.16b
        mov     v5.16b, v1.16b
        tbl     v0.16b, {v4.16b - v5.16b}, v2.16b
        ret

$ clang -O3 -S -o- red.c
fun:
        tbl     v0.16b, { v0.16b, v1.16b }, v2.16b
        ret

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
@ 2021-01-29  7:39 ` pinskia at gcc dot gnu.org
  2021-01-29  7:40 ` pinskia at gcc dot gnu.org
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-01-29  7:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am 90% sure this is just a register allocation issue.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
  2021-01-29  7:39 ` [Bug target/98877] " pinskia at gcc dot gnu.org
@ 2021-01-29  7:40 ` pinskia at gcc dot gnu.org
  2021-01-29  9:10 ` ktkachov at gcc dot gnu.org
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-01-29  7:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |91753

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
See PR 91753 for another example.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91753
[Bug 91753] Bad register allocation of multi-register types

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
  2021-01-29  7:39 ` [Bug target/98877] " pinskia at gcc dot gnu.org
  2021-01-29  7:40 ` pinskia at gcc dot gnu.org
@ 2021-01-29  9:10 ` ktkachov at gcc dot gnu.org
  2021-08-12  8:09 ` tnfchris at gcc dot gnu.org
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2021-01-29  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
                 CC|                            |ktkachov at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org
   Last reconfirmed|                            |2021-01-29
             Status|UNCONFIRMED                 |NEW

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Confirmed. I think the whole moving in and out the structure modes (OImode,
XImode and friends) really hurts codegen at the RTL level.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-01-29  9:10 ` ktkachov at gcc dot gnu.org
@ 2021-08-12  8:09 ` tnfchris at gcc dot gnu.org
  2021-08-22  9:30 ` pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-12  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877
Bug 98877 depends on bug 91753, which changed state.

Bug 91753 Summary: Bad register allocation of multi-register types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91753

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-08-12  8:09 ` tnfchris at gcc dot gnu.org
@ 2021-08-22  9:30 ` pinskia at gcc dot gnu.org
  2021-08-22  9:30 ` pinskia at gcc dot gnu.org
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-22  9:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Here is another example where GCC messes up:

#include "arm_neon.h"
uint8x16_t g(void);
uint8x16_t fun(uint8x16_t lo, uint8x16_t hi, uint8x16_t idx) {
  uint8x16x2_t tab = { .val = {g(), g()} };
  uint8x16_t res = vqtbl2q_u8(tab, idx);
  return res;
}

Note clang/LLVM messes the above one up even worse.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-08-22  9:30 ` pinskia at gcc dot gnu.org
@ 2021-08-22  9:30 ` pinskia at gcc dot gnu.org
  2021-08-22 10:14 ` tnfchris at gcc dot gnu.org
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-22  9:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-08-22  9:30 ` pinskia at gcc dot gnu.org
@ 2021-08-22 10:14 ` tnfchris at gcc dot gnu.org
  2024-01-26  0:24 ` pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-22 10:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|11.0                        |12.0

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We're in the process of rewriting these intrinsics, should be fixed in GCC 12.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-08-22 10:14 ` tnfchris at gcc dot gnu.org
@ 2024-01-26  0:24 ` pinskia at gcc dot gnu.org
  2024-02-27  8:34 ` pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-26  0:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
In the original testcase, there are still extra movs.

For the testcase in comment #4, it is fixed on the trunk and we now get:
```
fun:
        stp     x29, x30, [sp, -48]!
        mov     x29, sp
        str     q2, [sp, 32]
        bl      g
        str     q0, [sp, 16]
        bl      g
        ldp     q30, q2, [sp, 16]
        mov     v31.16b, v0.16b
        ldp     x29, x30, [sp], 48
        tbl     v0.16b, {v30.16b - v31.16b}, v2.16b
        ret
```

Maybe the issue is only with arguments now.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-01-26  0:24 ` pinskia at gcc dot gnu.org
@ 2024-02-27  8:34 ` pinskia at gcc dot gnu.org
  2024-02-27 19:28 ` rsandifo at gcc dot gnu.org
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-27  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>Maybe the issue is only with arguments now.


Actually I think this is still a subreg vs ra issue.


(insn 8 5 9 2 (set (subreg:V16QI (reg/v:V2x16QI 100 [ __tab ]) 0)
        (reg/v:V16QI 102 [ lo ])) -1
     (nil))
(insn 9 8 10 2 (set (subreg:V16QI (reg/v:V2x16QI 100 [ __tab ]) 16)
        (reg/v:V16QI 103 [ hi ])) -1
     (nil))
(insn 10 9 11 2 (set (reg:V16QI 101 [ <retval> ])
        (unspec:V16QI [
                (reg/v:V2x16QI 100 [ __tab ])
                (reg/v:V16QI 104 [ idx ])
            ] UNSPEC_TBL))
"/opt/compiler-explorer/arm64/gcc-trunk-20240227/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/14.0.1/include/arm_neon.h":19566:43
-1
     (nil))

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-02-27  8:34 ` pinskia at gcc dot gnu.org
@ 2024-02-27 19:28 ` rsandifo at gcc dot gnu.org
  2024-02-28  9:12 ` tnfchris at gcc dot gnu.org
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-02-27 19:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #8 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
The reason early_ra doesn't help with the original testcase is that early_ra
punts on any non-move instruction that has a hard register destination.  And it
does that because it can't cope well with cases where hard-coded destinations
force the wrong choice (unlike the proper allocators, which can change the
destination where necessary).  The restriction is needed to avoid regressing
SVE ACLE tests.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-02-27 19:28 ` rsandifo at gcc dot gnu.org
@ 2024-02-28  9:12 ` tnfchris at gcc dot gnu.org
  2024-02-29  5:45 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-28  9:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
While RA should be able to deal with this,
shouldn't we also just lower TBLs in gimple?

This no reason why this can't be a VEC_PERM_EXPR which would also get the
copies
removed at the gimple level and allows us to optimize this to something else
depending on the index.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-02-28  9:12 ` tnfchris at gcc dot gnu.org
@ 2024-02-29  5:45 ` pinskia at gcc dot gnu.org
  2024-02-29  7:26 ` tnfchris at gcc dot gnu.org
  2024-02-29  7:27 ` tnfchris at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-29  5:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #9)
> While RA should be able to deal with this,
> shouldn't we also just lower TBLs in gimple?
> 
> This no reason why this can't be a VEC_PERM_EXPR which would also get the
> copies
> removed at the gimple level and allows us to optimize this to something else
> depending on the index.

Yes there is a reason, `out of range` values for VEC_PERM is undefined while
tbl is well defined  ( If an index is out of range for the table, the result
for that lookup is 0.).

For tbx, it is well defined also (If an index is out of range for the table,
the existing value in the vector element of the destination register is left
unchanged. ).

I think for VECTOR_CST we can fold it down to VEC_PERM_EXPR if there is no out
of bounds value though.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2024-02-29  5:45 ` pinskia at gcc dot gnu.org
@ 2024-02-29  7:26 ` tnfchris at gcc dot gnu.org
  2024-02-29  7:27 ` tnfchris at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-29  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #10)
> (In reply to Tamar Christina from comment #9)
> > While RA should be able to deal with this,
> > shouldn't we also just lower TBLs in gimple?
> > 
> > This no reason why this can't be a VEC_PERM_EXPR which would also get the
> > copies
> > removed at the gimple level and allows us to optimize this to something else
> > depending on the index.
> 
> Yes there is a reason, `out of range` values for VEC_PERM is undefined while
> tbl is well defined  ( If an index is out of range for the table, the result
> for that lookup is 0.).
> 

I don't think that's not a good reason. The out of range values can be made
implementation defined. i.e. mid-end shouldn't care about them.

not lowering this in gimple means we lose a heck of a lot of optimizations that
are impossible to cover in RTL.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
  2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-02-29  7:26 ` tnfchris at gcc dot gnu.org
@ 2024-02-29  7:27 ` tnfchris at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-29  7:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
and it's not the first time we have conditional lowering. We already do so for
e.g. shifts, where shifting by an amount => bitsize of a vector element is
defined behavior or AArch64.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-02-29  7:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-29  6:51 [Bug target/98877] New: [AArch64] Inefficient code generated for tbl NEON intrinsics spop at gcc dot gnu.org
2021-01-29  7:39 ` [Bug target/98877] " pinskia at gcc dot gnu.org
2021-01-29  7:40 ` pinskia at gcc dot gnu.org
2021-01-29  9:10 ` ktkachov at gcc dot gnu.org
2021-08-12  8:09 ` tnfchris at gcc dot gnu.org
2021-08-22  9:30 ` pinskia at gcc dot gnu.org
2021-08-22  9:30 ` pinskia at gcc dot gnu.org
2021-08-22 10:14 ` tnfchris at gcc dot gnu.org
2024-01-26  0:24 ` pinskia at gcc dot gnu.org
2024-02-27  8:34 ` pinskia at gcc dot gnu.org
2024-02-27 19:28 ` rsandifo at gcc dot gnu.org
2024-02-28  9:12 ` tnfchris at gcc dot gnu.org
2024-02-29  5:45 ` pinskia at gcc dot gnu.org
2024-02-29  7:26 ` tnfchris at gcc dot gnu.org
2024-02-29  7:27 ` tnfchris at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).