public inbox for gcc-bugs@sourceware.org
* [Bug target/113613] New: [14 Regression] Missing ldp/stp optimization sometimes
@ 2024-01-26  6:04 pinskia at gcc dot gnu.org
  2024-01-26  6:05 ` [Bug target/113613] " pinskia at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-26  6:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

            Bug ID: 113613
           Summary: [14 Regression] Missing ldp/stp optimization sometimes
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
                CC: acoplan at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*-*

Take:
```
typedef float __attribute__((vector_size(8))) v2sf;
v2sf a[4];
v2sf b[4];
void f()
{
  b[0] += a[0];
  b[1] += a[1];
}

```

With -O3 on the trunk we get:
```
f:
        adrp    x1, .LANCHOR0
        add     x0, x1, :lo12:.LANCHOR0
        ldr     d31, [x1, #:lo12:.LANCHOR0]
        ldr     d30, [x0, 32]
        fadd    v30.2s, v31.2s, v30.2s
        ldr     d31, [x0, 8]
        str     d30, [x1, #:lo12:.LANCHOR0]
        ldr     d30, [x0, 40]
        fadd    v30.2s, v31.2s, v30.2s
        str     d30, [x0, 8]
        ret
```

But in GCC 13 we got:
```
f:
        adrp    x1, .LANCHOR0
        add     x0, x1, :lo12:.LANCHOR0
        ldp     d1, d0, [x0]
        ldp     d3, d2, [x0, 32]
        fadd    v1.2s, v1.2s, v3.2s
        fadd    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret
```


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization sometimes
From: pinskia at gcc dot gnu.org @ 2024-01-26  6:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note we should really get v4sf but that is PR 95960.
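
As a rough illustration of what "getting v4sf" would mean here: a[0], a[1] and
b[0], b[1] are contiguous, so the two 64-bit additions could in principle be
done as a single 128-bit addition. The sketch below writes that by hand with
the GCC vector extension. The `v4sf` typedef and the function name `f_v4sf` are
assumptions made for this example and do not come from the PR.

```c
#include <assert.h>
#include <string.h>

typedef float __attribute__((vector_size(8))) v2sf;
typedef float __attribute__((vector_size(16))) v4sf;  /* assumed for this sketch */

v2sf a[4];
v2sf b[4];

/* Hand-written equivalent of f(): both additions done as one 128-bit add.  */
void f_v4sf(void)
{
  v4sf va, vb;
  assert(sizeof(v4sf) == 2 * sizeof(v2sf));
  memcpy(&va, &a[0], sizeof va);  /* a[0] and a[1] are adjacent in memory */
  memcpy(&vb, &b[0], sizeof vb);  /* likewise b[0] and b[1] */
  vb += va;                       /* element-wise add of four floats */
  memcpy(&b[0], &vb, sizeof vb);
}
```

This only shows the intended semantics; whether the compiler performs the
combination automatically is what the other PR is about.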


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization sometimes
From: pinskia at gcc dot gnu.org @ 2024-01-26  6:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I don't know whether this shows up in real programs, but it might point to
a missing optimization that real programs could also hit.

Another testcase, this time without vectors:
```
double a[4];
double b[4];
void f()
{
  b[0] += a[0];
  b[1] *= a[1];
}

```

For some reason it works with the GPRs though:
```
int a[4];
int b[4];
void f()
{
  b[0] += a[0];
  b[1] *= a[1];
}

```


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization sometimes
From: acoplan at gcc dot gnu.org @ 2024-01-26  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan <acoplan at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |acoplan at gcc dot gnu.org
   Last reconfirmed|                            |2024-01-26
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #3 from Alex Coplan <acoplan at gcc dot gnu.org> ---
Confirmed, I'll take a look.


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: acoplan at gcc dot gnu.org @ 2024-01-26  8:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan <acoplan at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org
            Summary|[14 Regression] Missing     |[14 Regression] Missing
                   |ldp/stp optimization        |ldp/stp optimization since
                   |sometimes                   |r14-6290-g9f0f7d802482a8

--- Comment #4 from Alex Coplan <acoplan at gcc dot gnu.org> ---
Interestingly, we started to miss this with the introduction of the aarch64
early RA pass, i.e. r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182.

My ldp/stp pattern rewrite was:
r14-6604-gd7ee988c491cde43d04fe25f2b3dbad9d85ded45
so we started to miss this before any of my ldp/stp patches.

Looking at what happens with the ldp/stp pass, I can see that in sched1 we've
already allocated hard regs to the vector load destinations:

    3: NOTE_INSN_BASIC_BLOCK 2
    2: NOTE_INSN_FUNCTION_BEG
   13: NOTE_INSN_DELETED
    5: debug begin stmt marker
    6: r107:DI=high(`*.LANCHOR0')
    7: r106:DI=r107:DI+low(`*.LANCHOR0')
      REG_EQUAL `*.LANCHOR0'
   14: v31:V2SF=[r107:DI+low(`*.LANCHOR0')]
   15: v30:V2SF=[r106:DI+0x20]
   16: v30:V2SF=v31:V2SF+v30:V2SF
      REG_DEAD v31:V2SF
   27: v31:V2SF=[r106:DI+0x8]
   17: [r107:DI+low(`*.LANCHOR0')]=v30:V2SF
      REG_DEAD r107:DI
      REG_DEAD v30:V2SF
   18: debug begin stmt marker
   28: v30:V2SF=[r106:DI+0x28]
   29: v30:V2SF=v31:V2SF+v30:V2SF
      REG_DEAD v31:V2SF
   30: [r106:DI+0x8]=v30:V2SF
      REG_DEAD r106:DI
      REG_DEAD v30:V2SF
   33: NOTE_INSN_DELETED

and then there's nothing that the early ldp/stp pass can do because the
would-be load pair candidates already use the same (hard) transfer register due
to early RA:

merge_pairs [L=1], cand vecs (14) x (27)
analyzing pair (load=1): (14,27)
punting on ldp due to reg conflcits (14,27)
merge_pairs [L=1], cand vecs (15) x (28)
analyzing pair (load=1): (15,28)
punting on ldp due to reg conflcits (15,28)
merge_pairs [L=0], cand vecs (17) x (30)
analyzing pair (load=0): (17,30)
pair (17,30): rejecting base 106 due to dataflow hazards (28,29)
can't form pair (17,30) due to dataflow hazards
starting the processing of deferred insns
ending the processing of deferred insns

CCing Richard S for an opinion.
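
To make the "reg conflicts" rejection concrete, here is a minimal sketch of the
condition in a simplified model. It is not the pass's actual logic: the
`load_cand` struct and `can_form_ldp` are hypothetical names, and insns 14 and
27 are both treated as addressing off r106 (in the dump, insn 14 really goes
through r107 plus the low part of the anchor). The point is only that a fused
load needs two distinct destination registers, which a pass working on
already-allocated hard registers cannot provide by itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a load candidate; not GCC's data structure.  */
struct load_cand {
  int dest_reg;  /* hard register written by the load, e.g. 31 for v31 */
  int base_reg;  /* base address register */
  int offset;    /* byte offset from the base */
  int size;      /* access size in bytes */
};

/* Return true if LO and HI could be merged into one LDP in this model.  */
static bool can_form_ldp(const struct load_cand *lo, const struct load_cand *hi)
{
  assert(lo->size > 0 && hi->size > 0);
  if (lo->base_reg != hi->base_reg || lo->size != hi->size)
    return false;
  if (hi->offset != lo->offset + lo->size)
    return false;  /* the accesses must be adjacent */
  /* After fusion both values exist at once, so they need distinct registers.  */
  return lo->dest_reg != hi->dest_reg;
}
```

Modelling insns 14 and 27 as `{31, 106, 0, 8}` and `{31, 106, 8, 8}` makes
`can_form_ldp` return false; giving the second load a different destination
register (say 30) makes it return true.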


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: acoplan at gcc dot gnu.org @ 2024-01-26  9:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Alex Coplan <acoplan at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW
           Assignee|acoplan at gcc dot gnu.org         |unassigned at gcc dot gnu.org


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: acoplan at gcc dot gnu.org @ 2024-01-26  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #5 from Alex Coplan <acoplan at gcc dot gnu.org> ---
It looks like the current ordering of passes is:

early_ra
sched1
ldp_fusion1
early_remat

ISTM that ldp_fusion1 should probably be running before early_ra, but we found
that running ldp_fusion1 before sched1 could lead to increased register
pressure. Hmm.


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: acoplan at gcc dot gnu.org @ 2024-01-26  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #6 from Alex Coplan <acoplan at gcc dot gnu.org> ---
FWIW, if I move ldp_fusion1 before early_ra, with:

diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 769d48f4faa..3853f6bf7a4 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -18,6 +18,7 @@
    along with GCC; see the file COPYING3.  If not see
    <http://www.gnu.org/licenses/>.  */

+INSERT_PASS_BEFORE (pass_sched, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_sched, 1, pass_aarch64_early_ra);
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
 INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
@@ -25,5 +26,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
-INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
 INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);

we get:

f:
.LFB0:
        .cfi_startproc
        adrp    x0, .LANCHOR0
        add     x0, x0, :lo12:.LANCHOR0
        ldp     d31, d30, [x0]
        ldp     d29, d28, [x0, 32]
        fadd    v29.2s, v31.2s, v29.2s
        fadd    v28.2s, v30.2s, v28.2s
        stp     d29, d28, [x0]
        ret

Note that this does use more registers, though, so it's not necessarily a clear
win in the general case (particularly if register pressure is already high).


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: rsandifo at gcc dot gnu.org @ 2024-01-26 10:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rsandifo at gcc dot gnu.org

--- Comment #7 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
early-ra does try to avoid reusing registers too soon, to increase scheduling
freedom.  But in this case I imagine it handles the two statements as separate
regions.  Should be fixable by carrying across a round-robin counter.


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: cvs-commit at gcc dot gnu.org @ 2024-02-23 14:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

--- Comment #8 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:ff442719cdb64c9df9d069af88e90d51bee6fb56

commit r14-9157-gff442719cdb64c9df9d069af88e90d51bee6fb56
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Feb 23 14:12:55 2024 +0000

    aarch64: Spread out FPR usage between RA regions [PR113613]

    early-ra already had code to do regrename-style "broadening"
    of the allocation, to promote scheduling freedom.  However,
    the pass divides the function into allocation regions
    and this broadening only worked within a single region.
    This meant that if a basic block contained one subblock
    of FPR use, followed by a point at which no FPRs were live,
    followed by another subblock of FPR use, the two subblocks
    would tend to reuse the same registers.  This in turn meant
    that it wasn't possible to form LDP/STP pairs between them.

    The failure to form LDPs and STPs in the testcase was a
    regression from GCC 13.

    The patch adds a simple heuristic to prefer less recently
    used registers in the event of a tie.

    gcc/
            PR target/113613
            * config/aarch64/aarch64-early-ra.cc
            (early_ra::m_current_region): New member variable.
            (early_ra::m_fpr_recency): Likewise.
            (early_ra::start_new_region): Bump m_current_region.
            (early_ra::allocate_colors): Prefer less recently used registers
            in the event of a tie.  Add a comment to explain why we prefer(ed)
            higher-numbered registers.
            (early_ra::find_oldest_color): Prefer less recently used registers
            here too.
            (early_ra::finalize_allocation): Update recency information for
            allocated registers.
            (early_ra::process_blocks): Initialize m_current_region and
            m_fpr_recency.

    gcc/testsuite/
            PR target/113613
            * gcc.target/aarch64/pr113613.c: New test.
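
The heuristic described in the commit message can be sketched roughly as
follows. This is a toy model under stated assumptions, not the code in
aarch64-early-ra.cc: `ra_state`, `start_new_region` and `pick_fpr` are
hypothetical names (only loosely echoing the ChangeLog), every free register
is assumed to tie on all other criteria, and recency is therefore the only
thing that distinguishes them. Remaining ties go to the higher-numbered
register.

```c
#include <assert.h>

#define NUM_FPRS 32

/* Hypothetical allocator state for this sketch.  */
struct ra_state {
  unsigned current_region;         /* bumped at the start of each region */
  unsigned fpr_recency[NUM_FPRS];  /* region of last allocation; 0 = never used */
};

static void start_new_region(struct ra_state *s)
{
  s->current_region++;
}

/* Pick a register from FREE_MASK (bit r set means FPR r is free).
   Prefer the least recently used one; break remaining ties in favour
   of the higher-numbered register.  */
static int pick_fpr(struct ra_state *s, unsigned free_mask)
{
  assert(free_mask != 0);
  int best = -1;
  for (int r = NUM_FPRS - 1; r >= 0; r--) {
    if (!(free_mask & (1u << r)))
      continue;
    if (best < 0 || s->fpr_recency[r] < s->fpr_recency[best])
      best = r;
  }
  s->fpr_recency[best] = s->current_region;  /* record the allocation */
  return best;
}
```

In this model, a first region that needs two registers gets 31 and 30. A second
region that starts with all FPRs free again gets 29 and 28 rather than reusing
31 and 30, which is what leaves the loads in the two regions with distinct
destination registers.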


* [Bug target/113613] [14 Regression] Missing ldp/stp optimization since r14-6290-g9f0f7d802482a8
From: rsandifo at gcc dot gnu.org @ 2024-02-23 14:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113613

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Fixed.
