public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists.
@ 2024-02-20 10:22 manolis.tsamis at vrull dot eu
2024-02-21 0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-02-20 10:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
Bug ID: 114010
Summary: Unwanted effects of using SSA free lists.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: manolis.tsamis at vrull dot eu
Target Milestone: ---
Created attachment 57469
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57469&action=edit
Testcase and example compilations on AArch64
After some experiments, I have observed a number of unwanted effects due to
re-using SSA names through free lists. Specifically, the issue is that the
recycled SSA names have unpredictable ordering and the compiler is affected by
this ordering. The effects include:
1) Worse code generation from some passes that explicitly or implicitly
depend on SSA name ordering. The most important case I have observed is that
the vectorizer can fail or create inferior code with more shuffles/moves when
the SSA names aren't monotonically increasing.
2) Missed canonicalization opportunities. Functions that do the same thing
but have different SSA ordering will usually end up diverging after various
optimization passes. The generated assembly code is also much different when it
shouldn't. Since the SSA freelists are affected by things like moving around
dead code this also makes comparing GIMPLE or assembly code much harder than
necessary.
3) Because freelists persist through optimization passes, their hidden state
can affect optimizations/codegen in unexpected ways. I have seen two similar
source files generating the exact same GIMPLE code up to some optimization pass
but then completely diverging due to different freelists. Getting something
completely different by starting from the same GIMPLE was mildly surprising at
first.
4) Although I haven't observed something worthwhile on that front, in theory
if a large chunk of names is free'd (e.g. dead code elimination) then the
pending free'd names will negatively affect performance and memory because
they're not compacted. This mostly affects all the bitmaps that use
SSA_NAME_VERSION as a key.
The easier way to experiment with this behaviour is code like below:
...some code...
if (condition that will be eliminated, but later) {
...some function call that will be inlined now...
...or some code...
}
...some other code...
I have crafted and attached a testcase (canon.cpp) that can showcase some of
these behaviors at different optimization levels.
As part of the experiments I have a modified GCC compiler that calls
`release_free_names_and_compact_live_names` after every optimization pass,
which is enough to guarantee that the generated names are in monotonically
increasing order.
Together with the testcase I have attached the AArch64 assembly code generated
at 8 different configurations: with or without the dead code in the source
file, with or without free name canonicalization, with O2 or with O3.
The observations at O2:
As per the previous notes, it can be seen that code in with-dead-code.O2.txt is
quite different from without-dead-code.O2.txt (different instruction mix,
different instruction count, different register allocation etc.). This is due
to the free'd up names causing different SSA name allocation. Compared to that
with-dead-code.canonicalized.O2.txt and without-dead-code.canonicalized.O2.txt
are identical except for the stack offsets chosen for the variables. It can be
easily seen as it diffs cleanly, both at the gimple and assembly level.
The observations at O3:
At O3 the assembly no longer cleanly diffs in either case, but it is
interesting to observe the better vectorization done due to the better ordering
of names (in almost all places that there is vectorized code):
E.g. non-canonicalized:
.L37:
ld4 {v24.16b - v27.16b}, [x1], 64
shl v23.16b, v24.16b, 5
add v28.16b, v23.16b, v24.16b
mov v23.16b, v18.16b
mla v23.16b, v17.16b, v25.16b
mov v29.16b, v23.16b
mov v23.16b, v20.16b
mla v23.16b, v19.16b, v26.16b
mov v30.16b, v23.16b
mov v23.16b, v22.16b
mla v23.16b, v21.16b, v27.16b
mov v31.16b, v23.16b
st4 {v28.16b - v31.16b}, [x0], 64
cmp x3, x1
bne .L37
Assembly for the same code, 3 instructions less:
.L37:
ld4 {v28.16b - v31.16b}, [x1], 64
mov v26.16b, v23.16b
mov v27.16b, v21.16b
shl v25.16b, v28.16b, 5
mla v26.16b, v24.16b, v29.16b
mla v27.16b, v22.16b, v30.16b
add v25.16b, v25.16b, v28.16b
mov v28.16b, v19.16b
mla v28.16b, v20.16b, v31.16b
st4 {v25.16b - v28.16b}, [x0], 64
cmp x1, x3
bne .L37
E.g. another loop, non canonicalized names:
.L120:
ldr q30, [x0], 16
movi v29.2s, 0
ld2 {v26.16b - v27.16b}, [x4], 32
movi v25.4s, 0
zip1 v29.16b, v30.16b, v29.16b
zip2 v30.16b, v30.16b, v25.16b
umlal v29.8h, v26.8b, v28.8b
umlal2 v30.8h, v26.16b, v28.16b
uaddw v31.4s, v31.4s, v29.4h
uaddw2 v31.4s, v31.4s, v29.8h
uaddw v31.4s, v31.4s, v30.4h
uaddw2 v31.4s, v31.4s, v30.8h
cmp x5, x0
bne .L120
Assembly for the same loop, 2 instructions less (no use of zip1/2):
.L120:
ld2 {v30.16b - v31.16b}, [x4], 32
ldr q28, [x0], 16
umull v31.8h, v30.8b, v27.8b
umull2 v30.8h, v30.16b, v27.16b
uaddw v31.8h, v31.8h, v28.8b
uaddw2 v30.8h, v30.8h, v28.16b
uaddw v29.4s, v29.4s, v31.4h
uaddw2 v29.4s, v29.4s, v31.8h
uaddw v29.4s, v29.4s, v30.4h
uaddw2 v29.4s, v29.4s, v30.8h
cmp x0, x5
bne .L120
I haven't found any compile time regressions from compacting SSA names after
each optimization pass.
Although we can search for every place that is affected by the ordering of
names and try to fix it up, it looks like compacting SSA names and/or sorting
the freelists is a good start.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
@ 2024-02-21 0:06 ` pinskia at gcc dot gnu.org
2024-02-21 0:11 ` pinskia at gcc dot gnu.org
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-21 0:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Blocks| |53947
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>The most important case I have observed is that the vectorizer can fail or create inferior code with more shuffles/moves when the SSA names aren't monotonically increasing.
That should not be true.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
2024-02-21 0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
@ 2024-02-21 0:11 ` pinskia at gcc dot gnu.org
2024-02-21 0:22 ` pinskia at gcc dot gnu.org
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-21 0:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note what I had found in the past it is not exactly SSA_NAMEs that cause the
difference but rather the RTL register pesdu # causes differences in register
allocation which was exposed from the different in operands canonicalization.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
2024-02-21 0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
2024-02-21 0:11 ` pinskia at gcc dot gnu.org
@ 2024-02-21 0:22 ` pinskia at gcc dot gnu.org
2024-02-21 12:03 ` manolis.tsamis at vrull dot eu
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-21 0:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The first example (of assembly here) in comment #0 is extra moves due to the RA
not handling subreg that decent for the load/store lane. There are other bug
reports dealing with that. Why the SSA_NAMES being monotonically help is just
by an accident really.
Also:
> This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.
Most use sparse bitmaps there so it is not a big deal.
I think this should be split up in a few different bug reports really.
One for each case where better optimizations happen.
Also:
>I have seen two similar source files generating the exact same GIMPLE code up to some optimization pass but then completely diverging due to different freelists.
The only case where I have seen this happen is expand will have different pesdu
# really. Yes I noticed this effect while I did r14-569-g21e2ef2dc25de3 really.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (2 preceding siblings ...)
2024-02-21 0:22 ` pinskia at gcc dot gnu.org
@ 2024-02-21 12:03 ` manolis.tsamis at vrull dot eu
2024-02-21 12:09 ` manolis.tsamis at vrull dot eu
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-02-21 12:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #4 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
Hi Andrew,
Thank for your insights on this. Let me reply to some of your points:
(In reply to Andrew Pinski from comment #1)
> >The most important case I have observed is that the vectorizer can fail or create inferior code with more shuffles/moves when the SSA names aren't monotonically increasing.
>
> That should not be true.
Indeed, after further cleaning-up the dumps, some differences that I was
considering were just due to the diff algorithm not doing a good job and that
confused me (sigh).
So, for this example while we're in tree form I observe only naming changes,
but no different code or order of statements.
(In reply to Andrew Pinski from comment #2)
> Note what I had found in the past it is not exactly SSA_NAMEs that cause the
> difference but rather the RTL register pesdu # causes differences in
> register allocation which was exposed from the different in operands
> canonicalization.
Yes, I have also observed this and it looks to be the main issue.
(In reply to Andrew Pinski from comment #3)
> The first example (of assembly here) in comment #0 is extra moves due to the
> RA not handling subreg that decent for the load/store lane. There are other
> bug reports dealing with that. Why the SSA_NAMES being monotonically help is
> just by an accident really.
>
>
Do you happen to recall the relevant ticket(s)? I would like to have a look but
couldn't find them so far.
Also, while I agree than in some cases changes like this 'just happen' to
improve codegen in some particular case, it was in multiple experiments that
vectorized code was superior with sorted names and it never was worse with
sorted names. In most cases that I recall the version that used unsorted names
had additional shuffles of different sorts or moves. So, which anecdotal, the
effects doesn't look accidental to me in this case. I feel like there may be
some subtle difference due to the names that helps in this case?
>
> Also:
> > This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.
>
> Most use sparse bitmaps there so it is not a big deal.
>
Agreed and that's probably why I couldn't measure any non-trivial difference in
compilation times.
I should just note that there are also places that create vectors or other data
structures sized to the number of ssa_names, so in theory this could still help
in extreme cases.
> I think this should be split up in a few different bug reports really.
> One for each case where better optimizations happen.
>
Ok, the only cases that I found to be clearly better are the ones related to
vectorization. Would it help to create a ticket just for that now, or should I
wait for the discussion in this one to conclude first?
> Also:
> >I have seen two similar source files generating the exact same GIMPLE code up to some optimization pass but then completely diverging due to different freelists.
>
> The only case where I have seen this happen is expand will have different
> pesdu # really. Yes I noticed this effect while I did
> r14-569-g21e2ef2dc25de3 really.
Afaik, the codegen differences that I observed was due to the same reason, but
it nonetheless felt weird that the same GIMPLE could produce two different
w.r.t. name ordering files later on just because the freelists were different
(but invisible in the dumps). So I naturally questioned 'why don't we just
flush the freelists after every pass if it's not a performance issue'?
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (3 preceding siblings ...)
2024-02-21 12:03 ` manolis.tsamis at vrull dot eu
@ 2024-02-21 12:09 ` manolis.tsamis at vrull dot eu
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-02-21 12:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #5 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
Also, I further investigated the codegen difference in the second example (zip
+ umlal vs umull) and it looks to be some sort of RTL ordering + combine issue.
Specifically, when the we expand the RTL for the example there are some very
slight ordering differences where some non-dependent insns have swapped order.
On of these happens to precede a relevant vector statement and then in one case
combine does the umlal transformation but in the other not.
Afaik combine has some limits about the instruction window that it looks, so it
looks feasible that ordering differences in RTL can later transform into major
codegen differences in a number of ways. Other differences seem to come from
register allocation, as you mentioned.
This doesn't yet provide any useful insights whether the vectorization
improvements are accidental or not.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (4 preceding siblings ...)
2024-02-21 12:09 ` manolis.tsamis at vrull dot eu
@ 2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: ptomsich at gcc dot gnu.org @ 2024-02-22 1:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #6 from ptomsich at gcc dot gnu.org ---
(In reply to Manolis Tsamis from comment #5)
> On of these happens to precede a relevant vector statement and then
> in one case combine does the umlal transformation but in the other not.
Please attach the A/B dump-files for the combine pass.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (5 preceding siblings ...)
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
@ 2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
2024-02-22 10:28 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: ptomsich at gcc dot gnu.org @ 2024-02-22 1:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #7 from ptomsich at gcc dot gnu.org ---
(In reply to Manolis Tsamis from comment #5)
> On of these happens to precede a relevant vector statement and then
> in one case combine does the umlal transformation but in the other not.
Please attach the A/B dump-files for the combine pass.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (6 preceding siblings ...)
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
@ 2024-02-22 10:28 ` rguenth at gcc dot gnu.org
2024-02-23 14:31 ` ptomsich at gcc dot gnu.org
2024-02-23 15:02 ` manolis.tsamis at vrull dot eu
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-22 10:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's expected that SSA naming can affect code generation, there's no
"declaration order" as for VAR_DECLs we could ultimatively fall back to.
Ideally algorithms should not depend on particular ordering so it might be
good to investigate particular bad cases.
For that I'd even suggest to add a debug mechanism to "shuffle" SSA names
somewhat randomly (but note statements should be re-canonicalized
afterwards).
There are some cases with SLP vectorization of COND_EXPRs that are
problematic where both the false/true arms and the condition can
be "canonicalized" (yes, the inside the vectorizer COND_EXPRs with
embedded conditions are still a thing - thanks to pattern recog).
I don't know of any others.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (7 preceding siblings ...)
2024-02-22 10:28 ` rguenth at gcc dot gnu.org
@ 2024-02-23 14:31 ` ptomsich at gcc dot gnu.org
2024-02-23 15:02 ` manolis.tsamis at vrull dot eu
9 siblings, 0 replies; 11+ messages in thread
From: ptomsich at gcc dot gnu.org @ 2024-02-23 14:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #9 from ptomsich at gcc dot gnu.org ---
(In reply to Manolis Tsamis from comment #0)
> E.g. another loop, non canonicalized names:
>
> .L120:
> ldr q30, [x0], 16
> movi v29.2s, 0
> ld2 {v26.16b - v27.16b}, [x4], 32
> movi v25.4s, 0
> zip1 v29.16b, v30.16b, v29.16b
> zip2 v30.16b, v30.16b, v25.16b
> umlal v29.8h, v26.8b, v28.8b
> umlal2 v30.8h, v26.16b, v28.16b
> uaddw v31.4s, v31.4s, v29.4h
> uaddw2 v31.4s, v31.4s, v29.8h
> uaddw v31.4s, v31.4s, v30.4h
> uaddw2 v31.4s, v31.4s, v30.8h
> cmp x5, x0
> bne .L120
Is it just me, or are the zip1 and zip2 instructions dead?
Philipp.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
` (8 preceding siblings ...)
2024-02-23 14:31 ` ptomsich at gcc dot gnu.org
@ 2024-02-23 15:02 ` manolis.tsamis at vrull dot eu
9 siblings, 0 replies; 11+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-02-23 15:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #10 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
(In reply to ptomsich from comment #9)
> (In reply to Manolis Tsamis from comment #0)
> > E.g. another loop, non canonicalized names:
> >
> > .L120:
> > ldr q30, [x0], 16
> > movi v29.2s, 0
> > ld2 {v26.16b - v27.16b}, [x4], 32
> > movi v25.4s, 0
> > zip1 v29.16b, v30.16b, v29.16b
> > zip2 v30.16b, v30.16b, v25.16b
> > umlal v29.8h, v26.8b, v28.8b
> > umlal2 v30.8h, v26.16b, v28.16b
> > uaddw v31.4s, v31.4s, v29.4h
> > uaddw2 v31.4s, v31.4s, v29.8h
> > uaddw v31.4s, v31.4s, v30.4h
> > uaddw2 v31.4s, v31.4s, v30.8h
> > cmp x5, x0
> > bne .L120
>
> Is it just me, or are the zip1 and zip2 instructions dead?
>
> Philipp.
They certainly look dead, but they're not because the umlal/umlal2 (and other
accumulate instructions) also read from the destination register.
There looks to be a missed optimization opportunity to use just a single `movi
v25.4s, 0` here though.
Also, looking again at the generated code in the first example:
mov v23.16b, v18.16b
mla v23.16b, v17.16b, v25.16b
If I'm correct this could be folded into just
mla v18.16b, v17.16b, v25.16b
In which case most of the movs in the first and second example could be
eliminated. To me it looks like the accumulate instructions are missing some
optimizations.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-02-23 15:02 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
2024-02-21 0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
2024-02-21 0:11 ` pinskia at gcc dot gnu.org
2024-02-21 0:22 ` pinskia at gcc dot gnu.org
2024-02-21 12:03 ` manolis.tsamis at vrull dot eu
2024-02-21 12:09 ` manolis.tsamis at vrull dot eu
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
2024-02-22 1:02 ` ptomsich at gcc dot gnu.org
2024-02-22 10:28 ` rguenth at gcc dot gnu.org
2024-02-23 14:31 ` ptomsich at gcc dot gnu.org
2024-02-23 15:02 ` manolis.tsamis at vrull dot eu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).