public inbox for gcc-bugs@sourceware.org
* [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation
@ 2020-06-29 12:49 rsandifo at gcc dot gnu.org
2021-08-12 8:01 ` [Bug target/95962] " tnfchris at gcc dot gnu.org
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2020-06-29 12:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
Bug ID: 95962
Summary: Inefficient code for simple arm_neon.h iota operation
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
Target Milestone: ---
Target: aarch64*-*-*
For:
#include <arm_neon.h>

int32x4_t
foo (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return vld1q_s32 (array);
}
we produce:
foo:
.LFB4217:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x0, 2
        mov     x1, 4294967296
        movk    x0, 0x3, lsl 32
        stp     x1, x0, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret
In contrast, clang produces essentially perfect code:
        adrp    x8, .LCPI0_0
        ldr     q0, [x8, :lo12:.LCPI0_0]
        ret
I think the problem is a combination of two things:
- __builtin_aarch64_ld1v4si & co. are treated as general
  functions rather than pure functions, so in principle
  they could write to the given address. This stops us
  promoting the array to a constant.
- The loads could be reduced to native gimple-level
  operations, at least on little-endian targets.
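As a rough illustration of the second point, the load can be written so that the compiler sees only a plain memory read. This is a sketch, not how arm_neon.h is implemented: `v4si` uses the GNU C vector extension as a stand-in for int32x4_t, and `load_v4si` and `foo_plain` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GNU C vector extension (GCC and clang); a stand-in for int32x4_t.  */
typedef int32_t v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: a 16-byte load expressed as a plain read.
   Nothing here can write through P, so the compiler is free to
   treat the source array as constant data.  */
static v4si
load_v4si (const int32_t *p)
{
  v4si v;
  assert (p != NULL);
  memcpy (&v, p, sizeof v);
  return v;
}

/* Same shape as the testcase above, minus the opaque builtin.  */
v4si
foo_plain (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return load_v4si (array);
}
```

With the load visible in this form, an optimising compiler has the information it needs to drop the stack copy of `array`.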
IMO this is a bug rather than an enhancement. Intrinsics only
exist to optimise code, and what GCC is doing falls short
of what users should reasonably expect.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
From: tnfchris at gcc dot gnu.org @ 2021-08-12 8:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2021-08-12
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
CC| |tnfchris at gcc dot gnu.org
--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We generate the correct code at -O3 but not -O2.
At -O3 we generate
foo:
        adrp    x0, .LC0
        sub     sp, sp, #16
        ldr     q0, [x0, #:lo12:.LC0]
        add     sp, sp, 16
        ret
where the problem seems to be that at -O2 store merging has broken up the
construction of `array` into two separate memory accesses:
  MEM <unsigned long> [(int *)&array] = 4294967296;
  MEM <unsigned long> [(int *)&array + 8B] = 12884901890;
whereas at -O3 we still have a single assignment:
  MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };
I'm not sure that making these loads gimple-level would help on its own: we'd
still have the explicit MEMs created by store merging.
Perhaps we should just make store merging allow TImode merges and split them in
the backend if needed.
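For reference, the two 64-bit constants in the -O2 dump are just adjacent pairs of array elements packed in little-endian order. A small sketch of the arithmetic, using shifts so that it does not depend on the host byte order; `pack_le` and `check_store_merging_constants` are hypothetical names used only here:

```c
#include <assert.h>
#include <stdint.h>

/* Pack two adjacent int32_t elements the way a little-endian 64-bit
   store sees them: LO at the lower address, HI at the higher one.  */
static uint64_t
pack_le (int32_t lo, int32_t hi)
{
  return (uint64_t) (uint32_t) lo | ((uint64_t) (uint32_t) hi << 32);
}

/* Elements { 0, 1 } and { 2, 3 } give the two merged constants.  */
void
check_store_merging_constants (void)
{
  assert (pack_le (0, 1) == UINT64_C (4294967296));   /* 1 << 32 */
  assert (pack_le (2, 3) == UINT64_C (12884901890));  /* (3 << 32) | 2 */
}
```

The same values show up in the -O2 assembly in the original report: `mov x1, 4294967296` for the first pair, and `mov x0, 2` followed by `movk x0, 0x3, lsl 32` for the second.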
* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
From: rsandifo at gcc dot gnu.org @ 2021-08-20 11:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #1)
> We generate the correct code at -O3 but not -O2.
>
> At -O3 we generate
>
> foo:
> adrp x0, .LC0
> sub sp, sp, #16
> ldr q0, [x0, #:lo12:.LC0]
> add sp, sp, 16
> ret
>
> where the problem seems to be that at -O2 store merging has broken up the
> construction of `array` into two separate memory accesses:
>
> MEM <unsigned long> [(int *)&array] = 4294967296;
> MEM <unsigned long> [(int *)&array + 8B] = 12884901890;
>
> whereas at -O3 we still have a single assignment:
>
> MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };
>
> I'm not sure even if we made these loads gimple level if that would help.
> we'd still have the explicit MEMs created by store merging.
If we folded them to gimple loads, the gimple optimisers should replace
the MEM with an assignment of the VECTOR_CST { 0, 1, 2, 3 } to an SSA name,
with the function returning the SSA name.
expand will convert this back into a memory access, in the form of
an RTL constant pool load. But that will avoid the stack temporary
and thus the pointless stack adjustments.
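In source terms, the folded form would be roughly equivalent to returning a vector constant directly. The sketch below assumes the GNU C vector extension as a stand-in for int32x4_t; `v4si`, `foo_folded` and `check_foo_folded` are hypothetical names, and the code does not show what GCC emits internally.

```c
#include <assert.h>
#include <stdint.h>

/* GNU C vector extension (GCC and clang); a stand-in for int32x4_t.  */
typedef int32_t v4si __attribute__ ((vector_size (16)));

/* No array and no load: the result is a compile-time vector constant,
   so no stack temporary is needed to build it.  */
v4si
foo_folded (void)
{
  return (v4si) { 0, 1, 2, 3 };
}

/* Check that element I of the result is I, i.e. the iota sequence.  */
void
check_foo_folded (void)
{
  v4si v = foo_folded ();
  for (int i = 0; i < 4; i++)
    assert (v[i] == i);
}
```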
* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
From: tnfchris at gcc dot gnu.org @ 2021-11-15 15:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
This is now fixed on trunk, at least for ld1/st1.
Was this ticket about the general problem for loads or just the ld1/st1
examples?
I'll leave it open since we still need to do the rest of the loads/stores.
* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
From: rsandifo at gcc dot gnu.org @ 2021-12-03 17:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> This is now fixed on trunk, at least for ld1/st1.
Nice!
> Was this ticket about the general problem for loads or just the ld1/st1
> examples?
>
> I'll leave it open since we still need to do the rest of the loads/stores.
Yeah, agree we should keep it open for {ld,st}[234].