public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists.
@ 2024-02-20 10:22 manolis.tsamis at vrull dot eu
  2024-02-21  0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-02-20 10:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

            Bug ID: 114010
           Summary: Unwanted effects of using SSA free lists.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

Created attachment 57469
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57469&action=edit
Testcase and example compilations on AArch64

After some experiments, I have observed a number of unwanted effects due to
re-using SSA names through free lists. Specifically, the issue is that the
recycled SSA names have unpredictable ordering and the compiler is affected by
this ordering. The effects include:

  1) Worse code generation from some passes that explicitly or implicitly
depend on SSA name ordering. The most important case I have observed is that
the vectorizer can fail or create inferior code with more shuffles/moves when
the SSA names aren't monotonically increasing.
  2) Missed canonicalization opportunities. Functions that do the same thing
but have different SSA ordering will usually end up diverging after various
optimization passes. The generated assembly code is also much different when it
shouldn't. Since the SSA freelists are affected by things like moving around
dead code this also makes comparing GIMPLE or assembly code much harder than
necessary.
  3) Because freelists persist through optimization passes, their hidden state
can affect optimizations/codegen in unexpected ways. I have seen two similar
source files generating the exact same GIMPLE code up to some optimization pass
but then completely diverging due to different freelists. Getting something
completely different by starting from the same GIMPLE was mildly surprising at
first.
  4) Although I haven't observed something worthwhile on that front, in theory
if a large chunk of names is free'd (e.g. dead code elimination) then the
pending free'd names will negatively affect performance and memory because
they're not compacted. This mostly affects all the bitmaps that use
SSA_NAME_VERSION as a key.

The easier way to experiment with this behaviour is code like below:

...some code...
if (condition that will be eliminated, but later) {
  ...some function call that will be inlined now...
  ...or some code...
}
...some other code...

I have crafted and attached a testcase (canon.cpp) that can showcase some of
these behaviors at different optimization levels.
As part of the experiments I have a modified GCC compiler that calls
`release_free_names_and_compact_live_names` after every optimization pass,
which is enough to guarantee that the generated names are in monotonically
increasing order.

Together with the testcase I have attached the AArch64 assembly code generated
at 8 different configurations: with or without the dead code in the source
file, with or without free name canonicalization, with O2 or with O3.

The observations at O2:

As per the previous notes, it can be seen that code in with-dead-code.O2.txt is
quite different from without-dead-code.O2.txt (different instruction mix,
different instruction count, different register allocation etc.). This is due
to the free'd up names causing different SSA name allocation. Compared to that
with-dead-code.canonicalized.O2.txt and without-dead-code.canonicalized.O2.txt
are identical except for the stack offsets chosen for the variables. It can be
easily seen as it diffs cleanly, both at the gimple and assembly level. 

The observations at O3:

At O3 the assembly no longer cleanly diffs in either case, but it is
interesting to observe the better vectorization done due to the better ordering
of names (in almost all places that there is vectorized code):

E.g. non-canonicalized:

.L37:
        ld4     {v24.16b - v27.16b}, [x1], 64
        shl     v23.16b, v24.16b, 5
        add     v28.16b, v23.16b, v24.16b
        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b
        mov     v29.16b, v23.16b
        mov     v23.16b, v20.16b
        mla     v23.16b, v19.16b, v26.16b
        mov     v30.16b, v23.16b
        mov     v23.16b, v22.16b
        mla     v23.16b, v21.16b, v27.16b
        mov     v31.16b, v23.16b
        st4     {v28.16b - v31.16b}, [x0], 64
        cmp     x3, x1
        bne     .L37

Assembly for the same code, 3 instructions less:

.L37:
        ld4     {v28.16b - v31.16b}, [x1], 64
        mov     v26.16b, v23.16b
        mov     v27.16b, v21.16b
        shl     v25.16b, v28.16b, 5
        mla     v26.16b, v24.16b, v29.16b
        mla     v27.16b, v22.16b, v30.16b
        add     v25.16b, v25.16b, v28.16b
        mov     v28.16b, v19.16b
        mla     v28.16b, v20.16b, v31.16b
        st4     {v25.16b - v28.16b}, [x0], 64
        cmp     x1, x3
        bne     .L37

E.g. another loop, non canonicalized names:

.L120:
        ldr     q30, [x0], 16
        movi    v29.2s, 0
        ld2     {v26.16b - v27.16b}, [x4], 32
        movi    v25.4s, 0
        zip1    v29.16b, v30.16b, v29.16b
        zip2    v30.16b, v30.16b, v25.16b
        umlal   v29.8h, v26.8b, v28.8b
        umlal2  v30.8h, v26.16b, v28.16b
        uaddw   v31.4s, v31.4s, v29.4h
        uaddw2  v31.4s, v31.4s, v29.8h
        uaddw   v31.4s, v31.4s, v30.4h
        uaddw2  v31.4s, v31.4s, v30.8h
        cmp     x5, x0
        bne     .L120

Assembly for the same loop, 2 instructions less (no use of zip1/2):

.L120:
        ld2     {v30.16b - v31.16b}, [x4], 32
        ldr     q28, [x0], 16
        umull   v31.8h, v30.8b, v27.8b
        umull2  v30.8h, v30.16b, v27.16b
        uaddw   v31.8h, v31.8h, v28.8b
        uaddw2  v30.8h, v30.8h, v28.16b
        uaddw   v29.4s, v29.4s, v31.4h
        uaddw2  v29.4s, v29.4s, v31.8h
        uaddw   v29.4s, v29.4s, v30.4h
        uaddw2  v29.4s, v29.4s, v30.8h
        cmp     x0, x5
        bne     .L120

I haven't found any compile time regressions from compacting SSA names after
each optimization pass.

Although we can search for every place that is affected by the ordering of
names and try to fix it up, it looks like compacting SSA names and/or sorting
the freelists is a good start.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-02-23 15:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-20 10:22 [Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists manolis.tsamis at vrull dot eu
2024-02-21  0:06 ` [Bug tree-optimization/114010] " pinskia at gcc dot gnu.org
2024-02-21  0:11 ` pinskia at gcc dot gnu.org
2024-02-21  0:22 ` pinskia at gcc dot gnu.org
2024-02-21 12:03 ` manolis.tsamis at vrull dot eu
2024-02-21 12:09 ` manolis.tsamis at vrull dot eu
2024-02-22  1:02 ` ptomsich at gcc dot gnu.org
2024-02-22  1:02 ` ptomsich at gcc dot gnu.org
2024-02-22 10:28 ` rguenth at gcc dot gnu.org
2024-02-23 14:31 ` ptomsich at gcc dot gnu.org
2024-02-23 15:02 ` manolis.tsamis at vrull dot eu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).