public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
@ 2024-05-28 10:59 manolis.tsamis at vrull dot eu
2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-05-28 10:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Bug ID: 115258
Summary: [14 Regression][aarch64] Additional XORs generated
after
r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: manolis.tsamis at vrull dot eu
Target Milestone: ---
For the following testcase:
typedef int veci __attribute__ ((vector_size (4 * sizeof (int))));
void fun (veci *a, veci *b, veci *c) {
*c = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
}
After this commit we generate (at -O2)
adrp x3, .LC0
ldr q30, [x1]
ldr q31, [x0]
ldr q29, [x3, #:lo12:.LC0]
eor v31.16b, v31.16b, v30.16b
eor v30.16b, v31.16b, v30.16b
eor v31.16b, v31.16b, v30.16b
tbl v30.16b, {v30.16b - v31.16b}, v29.16b
str q30, [x2]
ret
Instead of
adrp x3, .LC0
ldr q30, [x0]
ldr q31, [x1]
ldr q29, [x3, #:lo12:.LC0]
tbl v30.16b, {v30.16b - v31.16b}, v29.16b
str q30, [x2]
ret
The 3 newly introduced eor instructions just swap the values of v30 and v31,
which are loaded in the reverse order compared to the old code.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug rtl-optimization/115258] [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
@ 2024-05-28 12:03 ` rguenth at gcc dot gnu.org
2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-28 12:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Target Milestone|--- |14.2
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
@ 2024-05-28 14:50 ` pinskia at gcc dot gnu.org
2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-28 14:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2024-05-28
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
Though the register allocation before was also just "lucky".
(define_insn_and_split "aarch64_combinev16qi"
[(set (match_operand:V2x16QI 0 "register_operand" "=w")
(unspec:V2x16QI [(match_operand:V16QI 1 "register_operand" "w")
(match_operand:V16QI 2 "register_operand" "w")]
UNSPEC_CONCAT))]
"TARGET_SIMD"
"#"
"&& reload_completed"
[(const_int 0)]
{
aarch64_split_combinev16qi (operands);
DONE;
}
[(set_attr "type" "multiple")]
)
It seems like for little-endian the constraint on 1 should include 0. But I
could be wrong.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
@ 2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-05-29 10:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org
--- Comment #2 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
I agree with Andrew that this seems mostly to be luck. On that basis, I'm not
sure (either way) whether we should backport the fix.
Now that we're hopefully getting better at tracking and allocating subregs, it
probably makes sense to allow the split before reload. Doing that (and
adjusting the split accordingly) seems to fix the testcase.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
` (2 preceding siblings ...)
2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
@ 2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-05-29 15:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
--- Comment #3 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
Author: Richard Sandiford <richard.sandiford@arm.com>
Date: Wed May 29 16:43:33 2024 +0100
aarch64: Split aarch64_combinev16qi before RA [PR115258]
Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).
In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input. In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.
This PR is about a test where this worst case occurred. And given the
insn description, that allocation doesn't semm unreasonable.
early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers. The upcoming RA subreg patches should
improve things further. The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.
Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely. That deserves more experimentation
though.
gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
Generalize
into a form that handles pseudo registers.
gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
` (3 preceding siblings ...)
2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
@ 2024-05-29 15:53 ` rsandifo at gcc dot gnu.org
2024-08-01 9:41 ` jakub at gcc dot gnu.org
2024-09-18 3:12 ` kugan at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-05-29 15:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Richard Sandiford <rsandifo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|[14/15 Regression] register |[14 Regression] register
|swaps for vector perm in |swaps for vector perm in
|some cases after r14-6290 |some cases after r14-6290
--- Comment #4 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Leaving open in case we do decide to backport.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
` (4 preceding siblings ...)
2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
@ 2024-08-01 9:41 ` jakub at gcc dot gnu.org
2024-09-18 3:12 ` kugan at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-08-01 9:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|14.2 |14.3
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
` (5 preceding siblings ...)
2024-08-01 9:41 ` jakub at gcc dot gnu.org
@ 2024-09-18 3:12 ` kugan at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: kugan at gcc dot gnu.org @ 2024-09-18 3:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
kugan at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |kugan at gcc dot gnu.org
--- Comment #6 from kugan at gcc dot gnu.org ---
This (In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> <rsandifo@gcc.gnu.org>:
>
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
>
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford <richard.sandiford@arm.com>
> Date: Wed May 29 16:43:33 2024 +0100
>
> aarch64: Split aarch64_combinev16qi before RA [PR115258]
>
> Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
> purpose is to put the two input data vectors into consecutive registers.
> This aarch64_combinev16qi was then split after reload into individual
> moves (from the first input to the first half of the output, and from
> the second input to the second half of the output).
>
> In the worst case, the RA might allocate things so that the destination
> of the aarch64_combinev16qi is the second input followed by the first
> input. In that case, the split form of aarch64_combinev16qi uses three
> eors to swap the registers around.
>
> This PR is about a test where this worst case occurred. And given the
> insn description, that allocation doesn't semm unreasonable.
>
> early-ra should (hopefully) mean that we're now better at allocating
> subregs of vector registers. The upcoming RA subreg patches should
> improve things further. The best fix for the PR therefore seems
> to be to split the combination before RA, so that the RA can see
> the underlying moves.
>
> Perhaps it even makes sense to do this at expand time, avoiding the need
> for aarch64_combinev16qi entirely. That deserves more experimentation
> though.
>
> gcc/
> PR target/115258
> * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
> the split before reload.
> * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
> Generalize
> into a form that handles pseudo registers.
>
> gcc/testsuite/
> PR target/115258
> * gcc.target/aarch64/pr115258.c: New test.
This is causing performance regression in some TSVC kernels and others. Here is
an example:
https://godbolt.org/z/r91nYEEsP
We now get:
.L3:
add x3, x26, x0
add x2, x25, x0
add x3, x3, 65536
add x2, x2, 65536
sub x0, x0, #16
ldr q31, [x3, 62448]
mov v28.16b, v31.16b
mov v29.16b, v31.16b
tbl v31.16b, {v28.16b - v29.16b}, v30.16b
fadd v31.4s, v31.4s, v25.4s
mov v26.16b, v31.16b
mov v27.16b, v31.16b
tbl v31.16b, {v26.16b - v27.16b}, v30.16b
str q31, [x2, 62448]
cmp x0, x27
bne .L3
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-09-18 3:12 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
2024-08-01 9:41 ` jakub at gcc dot gnu.org
2024-09-18 3:12 ` kugan at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).