public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t`
@ 2023-03-09 10:21 dorazzsoft at gmail dot com
2023-04-02 17:04 ` [Bug target/109078] " pinskia at gcc dot gnu.org
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: dorazzsoft at gmail dot com @ 2023-03-09 10:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109078
Bug ID: 109078
Summary: Missing optimization on aarch64 for types like
`float32x4x2_t`
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dorazzsoft at gmail dot com
Target Milestone: ---
Here is a simple code: https://godbolt.org/z/3qMTTfcfx
#include <arm_neon.h>
#include <stddef.h>
#include <stdbool.h>
void simple_gemm(
float* restrict out,
float const* restrict a,
float const* restrict b,
size_t k, bool zero_out
) {
register float32x4x2_t o0;
o0.val[0] = vdupq_n_f32(0.0f);
o0.val[1] = vdupq_n_f32(0.0f);
// begin dot
{
register float32x4_t a0;
register float32x4x2_t b0;
while (k >= 1) {
b0 = vld1q_f32_x2(b);
a0 = vdupq_n_f32(a[0]);
o0.val[0] = vfmaq_f32(o0.val[0], a0, b0.val[0]);
o0.val[1] = vfmaq_f32(o0.val[1], a0, b0.val[1]);
b += 8;
a += 1;
k -= 1;
}
} // end dot
// begin writeback
{
if (!zero_out) {
register float32x4x2_t t0;
t0 = vld1q_f32_x2(out);
o0.val[0] = vaddq_f32(o0.val[0], t0.val[0]);
o0.val[1] = vaddq_f32(o0.val[1], t0.val[1]);
}
// TODO: both clang and gcc generates redundant mov because of bad register
allocation.
vst1q_f32_x2(out, o0);
} // end writeback
}
The assembly generated:
simple_gemm:
movi v3.4s, 0
and w4, w4, 255
mov v4.16b, v3.16b
cbz x3, .L2
.L3:
ld1 {v0.4s - v1.4s}, [x2], 32
subs x3, x3, #1
ld1r {v2.4s}, [x1], 4
fmla v3.4s, v2.4s, v0.4s
fmla v4.4s, v2.4s, v1.4s
bne .L3
.L2:
cbnz w4, .L4
ld1 {v0.4s - v1.4s}, [x0]
fadd v3.4s, v3.4s, v0.4s
fadd v4.4s, v4.4s, v1.4s
.L4:
mov v0.16b, v3.16b
mov v1.16b, v4.16b
st1 {v0.4s - v1.4s}, [x0]
ret
The two values of float32x4x2_t o0 are assigned to v3 and v4. They should be
able to be used directly as operands of st1, so the mov at L4 is redundant.
I also found that in some code, the register pair may not be neighboring, which
results in some redundant mov instructions.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/109078] Missing optimization on aarch64 for types like `float32x4x2_t`
2023-03-09 10:21 [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t` dorazzsoft at gmail dot com
@ 2023-04-02 17:04 ` pinskia at gcc dot gnu.org
2023-11-07 21:49 ` rsandifo at gcc dot gnu.org
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-04-02 17:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109078
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/109078] Missing optimization on aarch64 for types like `float32x4x2_t`
2023-03-09 10:21 [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t` dorazzsoft at gmail dot com
2023-04-02 17:04 ` [Bug target/109078] " pinskia at gcc dot gnu.org
@ 2023-11-07 21:49 ` rsandifo at gcc dot gnu.org
2023-12-07 19:41 ` cvs-commit at gcc dot gnu.org
2023-12-07 19:52 ` rsandifo at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2023-11-07 21:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109078
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2023-11-07
Ever confirmed|0 |1
Status|UNCONFIRMED |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
--- Comment #1 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Some of the SME changes I'm working on fix this, but I'm not sure how widely
we'll be able to use them on non-SME code. Assigning myself just in case.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/109078] Missing optimization on aarch64 for types like `float32x4x2_t`
2023-03-09 10:21 [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t` dorazzsoft at gmail dot com
2023-04-02 17:04 ` [Bug target/109078] " pinskia at gcc dot gnu.org
2023-11-07 21:49 ` rsandifo at gcc dot gnu.org
@ 2023-12-07 19:41 ` cvs-commit at gcc dot gnu.org
2023-12-07 19:52 ` rsandifo at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-12-07 19:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109078
--- Comment #2 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:
https://gcc.gnu.org/g:9f0f7d802482a8958d6cdc72f1fe0c8549db2182
commit r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
Author: Richard Sandiford <richard.sandiford@arm.com>
Date: Thu Dec 7 19:41:19 2023 +0000
aarch64: Add an early RA for strided registers
This pass adds a simple register allocator for FP & SIMD registers.
Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4
instructions, which require a very specific grouping structure,
and so would be difficult to exploit with general allocation.
The allocator is very simple. It gives up on anything that would
require spilling, or that it might not handle well for other reasons.
The allocator needs to track liveness at the level of individual FPRs.
Doing that fixes a lot of the PRs relating to redundant moves caused by
structure loads and stores. That particular problem is going to be
fixed more generally for GCC 15 by Lehua's RA patches.
However, the early-RA pass runs before scheduling, so it has a chance
to bag a spill-free allocation of vector code before the scheduler moves
things around. It could therefore still be useful for non-SME code
(e.g. for hand-scheduled ACLE code) even after Lehua's patches are in.
The pass is controlled by a tristate switch:
- -mearly-ra=all: run on all functions
- -mearly-ra=strided: run on functions that have access to strided
registers
- -mearly-ra=none: don't run on any function
The patch makes -mearly-ra=all the default at -O2 and above for now.
We can revisit this for GCC 15 once Lehua's patches are in;
-mearly-ra=strided might then be more appropriate.
As said previously, the pass is very naive. There's much more that we
could do, such as handling invariants better. The main focus is on not
committing to a bad allocation, rather than on handling as much as
possible.
gcc/
PR rtl-optimization/106694
PR rtl-optimization/109078
PR rtl-optimization/109391
* config.gcc: Add aarch64-early-ra.o for AArch64 targets.
* config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
* config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
* config/aarch64/aarch64.opt (mearly_ra): New option.
* doc/invoke.texi: Document it.
* common/config/aarch64/aarch64-common.cc
(aarch_option_optimization_table): Use -mearly-ra=strided by
default for -O2 and above.
* config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New
pass.
* config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
(make_pass_aarch64_early_ra): Declare.
* config/aarch64/aarch64-sme.md
(@aarch64_sme_lut<LUTI_BITS><mode>):
Add a stride_type attribute.
(@aarch64_sme_lut<LUTI_BITS><mode>_strided2): New pattern.
(@aarch64_sme_lut<LUTI_BITS><mode>_strided4): Likewise.
* config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
(svldnt1_impl::expand, svst1_impl::expand, svstn1_impl::expand):
Handle
new way of defining multi-register loads and stores.
* config/aarch64/aarch64-sve.md (@aarch64_ld1<SVE_FULLx24:mode>)
(@aarch64_ldnt1<SVE_FULLx24:mode>, @aarch64_st1<SVE_FULLx24:mode>)
(@aarch64_stnt1<SVE_FULLx24:mode>): Delete.
* config/aarch64/aarch64-sve2.md (@aarch64_<LD1_COUNT:optab><mode>)
(@aarch64_<LD1_COUNT:optab><mode>_strided2): New patterns.
(@aarch64_<LD1_COUNT:optab><mode>_strided4): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>_strided2): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>_strided4): Likewise.
* config/aarch64/aarch64.cc (aarch64_strided_registers_p): New
function.
* config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
(UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
(UNSPEC_STNT1_SVE_COUNT): Likewise.
(stride_type): New attribute.
* config/aarch64/constraints.md (Uwd, Uwt): New constraints.
* config/aarch64/iterators.md (UNSPEC_LD1_COUNT,
UNSPEC_LDNT1_COUNT)
(UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
(optab): Handle them.
(LD1_COUNT, ST1_COUNT): New iterators.
* config/aarch64/aarch64-early-ra.cc: New file.
gcc/testsuite/
PR rtl-optimization/106694
PR rtl-optimization/109078
PR rtl-optimization/109391
* gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
output test.
* gcc.target/aarch64/sve/shift_1.c: Allow reversed shifts for .s
as well as .d.
* gcc.target/aarch64/sme/strided_1.c: New test.
* gcc.target/aarch64/pr109078.c: Likewise.
* gcc.target/aarch64/pr109391.c: Likewise.
* gcc.target/aarch64/sve/pr106694.c: Likewise.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/109078] Missing optimization on aarch64 for types like `float32x4x2_t`
2023-03-09 10:21 [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t` dorazzsoft at gmail dot com
` (2 preceding siblings ...)
2023-12-07 19:41 ` cvs-commit at gcc dot gnu.org
@ 2023-12-07 19:52 ` rsandifo at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2023-12-07 19:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109078
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|ASSIGNED |RESOLVED
--- Comment #3 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Fix for this case. The patch only deals with cases that can be allocated
without spilling, but Lehua has a more general fix that should go into GCC 15.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-12-07 19:52 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-09 10:21 [Bug target/109078] New: Missing optimization on aarch64 for types like `float32x4x2_t` dorazzsoft at gmail dot com
2023-04-02 17:04 ` [Bug target/109078] " pinskia at gcc dot gnu.org
2023-11-07 21:49 ` rsandifo at gcc dot gnu.org
2023-12-07 19:41 ` cvs-commit at gcc dot gnu.org
2023-12-07 19:52 ` rsandifo at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).