public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies
@ 2021-01-21 14:31 tnfchris at gcc dot gnu.org
  2021-01-21 14:35 ` [Bug rtl-optimization/98782] " jgreenhalgh at gcc dot gnu.org
                   ` (44 more replies)
  0 siblings, 45 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-01-21 14:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

            Bug ID: 98782
           Summary: IRA artificially creating spills due to BB frequencies
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: ra
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: jgreenhalgh at gcc dot gnu.org, vmakarov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50020
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50020&action=edit
dumps and source

Hi,

James and I have been investigating the exchange2 regression that has been
haunting trunk since:

        commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92
        Author: Jan Hubicka <jh@suse.cz>
        Date:   Tue Aug 11 12:02:32 2020 +0200
    Do not combine PRED_LOOP_GUARD and PRED_LOOP_GUARD_WITH_RECURSION

That patch fixed the prediction frequency for the basic blocks in some
of the recursively specialized functions in exchange2. Unfortunately, it
also created a large (12+%) performance regression on multiple targets
(including x86).

After initially blaming the new prediction frequencies, and after significant
work from Jan and Martin, we now have good confidence in the probabilities.
However, they appear to be exposing issues with the probability-based cost
models in IRA, causing additional spills once register pressure has been
artificially limited by excluding caller-saved registers across a call site in
the loop.

This testcase (tuned for AArch64) shows the issue:

    void bar (int, int, int, int);
    int foo (int x, char* foo) {
      int tmp = x * 753;
      int t2 = tmp + 7;
      int t3 = tmp * 7;
      int c1 = 753;
      int c2 = c1 + 7;
      int c3 = c1 * 7;
      for (int i = 0; i < 1024; i++) {
        if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER))
          bar(x, tmp, t2, t3);
        c1 += foo[i+1];
        c2 *= foo[i+1];
        c3 += c2;
      }
      return c1 + c2 + c3;
    }

You can see the difference in the basic block labeled .L2 (using -ffixed-<reg>
to provoke register pressure):

Good compile:
  gcc -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23
-ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w22, w22, w0
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x7, x19
        bne     .L5

Bad compile:
  gcc -DSPILLER=0.51 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23
-ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w21, w21, w0
        mul     w20, w20, w0
        ldr     x0, [sp, 64]         <<<<<< Reload of x0
        add     w22, w22, w20
        cmp     x0, x19
        bne     .L5


Neither of us is an expert in this area by any means, but I think what we're
seeing can be explained by this line in the IRA dump:

Good:
   Allocno a5r104 of GENERAL_REGS(24) has 17 avail. regs  4-15 18-22, node: 
4-15 18-22 (confl regs =  0-3 16-17 23-85)
Bad:
    Allocno a5r104 of GENERAL_REGS(24) has 4 avail. regs  19-22, node:  19-22
(confl regs =  0-3 16-17 23-85)

In the bad case the number of available registers goes down across the board,
but this particular allocno has so few left over that the spill occurs.

The change in available registers comes from this code in
setup_profitable_hardregs:

              if (ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j]
                  /* Do not remove HARD_REGNO for static chain pointer
                     pseudo when non-local goto is used.  */
                  && ! non_spilled_static_chain_regno_p (ALLOCNO_REGNO (a)))
                CLEAR_HARD_REG_BIT (data->profitable_hard_regs,
                                    hard_regno);

Both sides of the ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j] comparison make
use of frequency, but there is some asymmetry between them.

In the case of the bigger exchange2 regression, if we just revert
1118a3ff9d3ad6a64bba25dc01e7703325e23d92 (which affects the BB frequencies) we
get a slightly higher score than with GCC 10, indicating that the changes in
IPA are indeed sound.

It also gives us a comparison where the entire sequence up to reload is exactly
the same, aside from the counts in the BB and the frequencies.

Good BB:

;;  succ:       66 [always]  count:53687092 (estimated locally) (FALLTHRU)
;; lr  out       29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out     29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 107374184 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:53687092 (estimated locally)
;;              65 [always]  count:53687092 (estimated locally) (FALLTHRU)

Bad BB:

;;  succ:       66 [always]  count:3487081 (estimated locally) (FALLTHRU)
;; lr  out       29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out     29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 6974163 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:3487082 (estimated locally)
;;              65 [always]  count:3487081 (estimated locally) (FALLTHRU)

From there everything seems to change, including the costs for the register classes:

Good:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:1445,14170
FP_LO_REGS:1445,14170 FP_REGS:1445,14170 POINTER_AND_FP_REGS:1445,14170
MEM:1156,11336
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:12725,12725
FP_LO_REGS:12725,12725 FP_REGS:12725,12725 POINTER_AND_FP_REGS:12725,12725
MEM:10180,10180
Allocno a8r112 of GENERAL_REGS(30) has 26 avail. regs  1-15 18-28, node:  1-15
18-28 (confl regs =  0 16-17 29-85)

Bad:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:85,610 FP_LO_REGS:85,610
FP_REGS:85,610 POINTER_AND_FP_REGS:85,610 MEM:68,488
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:525,525 FP_LO_REGS:525,525
FP_REGS:525,525 POINTER_AND_FP_REGS:525,525 MEM:420,420
Allocno a8r112 of GENERAL_REGS(30) has 10 avail. regs  19-28, node:  19-28
(confl regs =  0 16-17 29-85)

and a new spill a89 is created for a value that is actually live through all
branches inside the loop:

Good IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-17 19-25 27-28 30)@112080
          2:( 1-15 19-25 27-28)@196432
            3:( 19-25 27-28)@280

Bad IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-22 24 28 30)@121976
          2:( 1-15 18-22 24 28)@116576
            3:( 19-22 24 28)@8864
      Spill a87(r99,l2)
      Spill a89(r112,l2)

In the exchange2 example the function call seems thousands of instructions
away, but it is part of the main loop. The call is self-recursive (before
specialization) and only needs x0. So it looks like caller-saved registers are
being severely penalized here.

That is to say, the likelihood of the RA needing a caller-saved register is so
high that it should just use one, even in the presence of the function call.
Whether it spills the caller-saved register or the temporary (as it does now)
makes no difference if it has to spill at the call site anyway; but if it does
not, this would result in better code.
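
The trade-off described above can be sketched numerically. This is purely
illustrative: the move cost of 4 matches the AArch64 mem-move cost in the dumps
below, but the iteration count and call probability are hypothetical, and real
IRA costing is far more involved.

```python
# Hypothetical sketch: keeping a value spilled for the whole loop pays a
# reload every iteration; keeping it in a caller-saved register pays a
# save + restore only on iterations that actually take the call path.
MOVE = 4  # cost of one memory move (matches the AArch64 dumps)

def spill_whole_loop(iters):
    # value lives in memory: reload on every iteration
    return iters * MOVE

def spill_around_call(iters, call_prob):
    # value lives in a caller-saved reg: save + restore on call paths only
    return iters * call_prob * (2 * MOVE)

# Even with the call taken every other iteration, spilling around the
# call site is no worse than keeping the value in memory for the loop.
assert spill_around_call(1024, 0.5) == spill_whole_loop(1024)
assert spill_around_call(1024, 0.4) < spill_whole_loop(1024)
```

As soon as the call probability drops below one save/restore per two reloads,
spilling at the call site wins, which is the intuition behind the comment above.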

We would appreciate any tips, help, or workarounds to mitigate this, as the
regression is quite significant.

I have attached for convenience the dump files and generated code for the good
and bad cases for the example above.

In exchange2 this code is generated in the
__brute_force_MOD_digits_2.constprop.4 specialization.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
@ 2021-01-21 14:35 ` jgreenhalgh at gcc dot gnu.org
  2021-01-22 10:12 ` fxue at os dot amperecomputing.com
                   ` (43 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: jgreenhalgh at gcc dot gnu.org @ 2021-01-21 14:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

James Greenhalgh <jgreenhalgh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-01-21
     Ever confirmed|0                           |1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
  2021-01-21 14:35 ` [Bug rtl-optimization/98782] " jgreenhalgh at gcc dot gnu.org
@ 2021-01-22 10:12 ` fxue at os dot amperecomputing.com
  2021-01-29 13:34 ` tnfchris at gcc dot gnu.org
                   ` (42 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: fxue at os dot amperecomputing.com @ 2021-01-22 10:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Feng Xue <fxue at os dot amperecomputing.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fxue at os dot amperecomputing.com

--- Comment #1 from Feng Xue <fxue at os dot amperecomputing.com> ---
The value "foo + 1024" is spilled in both cases, but in different ways. In the
bad case it is spilled outside the loop and only reloaded inside; in the good
case the spill/reload pair occurs around the call to "bar", which may also
account for the extra cost of using caller-saved registers. It seems that IRA
has two different strategies for handling spilling.

[Bad case]
foo:
        stp     x29, x30, [sp, -80]!
        mov     w5, 753
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        mov     x19, x1
        mul     w1, w0, w5
        stp     x21, x22, [sp, 32]
        mov     w22, 5271
        add     w2, w1, 7
        mov     w21, w5
        mul     w3, w0, w22
        mov     w20, 760
        mov     w22, 0
        str     w0, [sp, 76]
        add     x0, x19, 1024
        str     x0, [sp, 64]      // Spill (foo + 1024)
        .p2align 3,,7
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldr     w0, [sp, 76]
        stp     w1, w2, [sp, 56]
        str     w3, [sp, 72]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldp     w1, w2, [sp, 56]
        add     w21, w21, w0
        ldr     w3, [sp, 72]
        mul     w20, w20, w0
        ldr     x0, [sp, 64]     // Reload (foo + 1024)
        add     w22, w22, w20
        cmp     x19, x0
        bne     .L5
        b       .L4
        .p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
        add     w21, w21, w0
        mul     w20, w20, w0
        ldr     x0, [sp, 64]     // Reload (foo + 1024)
        add     w22, w22, w20
        cmp     x0, x19
        bne     .L5
.L4:
        add     w0, w20, w21
        add     w0, w0, w22
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 80
        ret

[Good case:]
foo:
        stp     x29, x30, [sp, -80]!
        mov     w5, 753
        add     x7, x1, 1024
        mul     w2, w0, w5
        mov     x29, sp
        stp     x21, x22, [sp, 32]
        mov     w21, 5271
        mov     w22, w5
        stp     x19, x20, [sp, 16]
        mov     x19, x1
        mul     w3, w0, w21
        stp     w2, w0, [sp, 72]  // Spill x(%w0)
        add     w2, w2, 7         // t2(%w2)
        mov     w21, 0
        mov     w20, 760
        .p2align 3,,7
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldp     w1, w0, [sp, 72]  // Reload x 
        stp     w2, w3, [sp, 56]  // Spill t2
        str     x7, [sp, 64]      // Spill (foo + 1024)
        bl      bar
        ldrb    w0, [x19, 1]!
        ldr     x7, [sp, 64]      // Reload (foo + 1024)
        add     w22, w22, w0
        ldp     w2, w3, [sp, 56]  // Reload t2
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x19, x7
        bne     .L5
        b       .L4
        .p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
        add     w22, w22, w0
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x7, x19
        bne     .L5
.L4:
        add     w0, w20, w22
        add     w0, w0, w21
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 80
        ret

Even in the good case we could expect better spill/reload generation. Referring
to the comments above, "x" and "t2" are similar, both loop-invariant, but they
are handled differently. Spilling "t2" inside the loop is worse than spilling
it outside, as IRA does for "x".

Both issues could come down to the same thing.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
  2021-01-21 14:35 ` [Bug rtl-optimization/98782] " jgreenhalgh at gcc dot gnu.org
  2021-01-22 10:12 ` fxue at os dot amperecomputing.com
@ 2021-01-29 13:34 ` tnfchris at gcc dot gnu.org
  2021-02-05 12:02 ` [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in " tnfchris at gcc dot gnu.org
                   ` (41 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-01-29 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Just an update on what I know so far.

There seems to be no symmetry between the growth of the memory costs and that
of the caller-saved register penalty.

In the case of the memory costs, the actual memory cost from the back end is
multiplied by the BB frequencies. The total allocation frequency and memory
cost are a sum over all uses of the register in a live range/BB.

In the example above, the register we're interested in is reg 104.

In create_insn_allocnos you can see that the memory costs for the register and
the total frequencies grow as follows:

Bad:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10
   ALLOCNO_FREQ 485, REF_FREQ_BB 485
   ALLOCNO_FREQ 989, REF_FREQ_BB 504

Good:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10  
   ALLOCNO_FREQ 495, REF_FREQ_BB 495
   ALLOCNO_FREQ 990, REF_FREQ_BB 495

Notice that in the bad case the total final frequency is actually lower.

The costs are created as BB_FREQ * mem-cost from the back end; in the case of
AArch64 that's BB_FREQ * 4.

So what we end up with in the cost calculations in scan_one_insn in ira-costs is:

Bad:

  add-cost 40, mem-cost 40
  add-cost 1940, mem-cost 1940
  add-cost 2016, mem-cost 3956

Good:

  add-cost 40, mem-cost 40    
  add-cost 1980, mem-cost 1980
  add-cost 1980, mem-cost 3960

Counter-intuitively, this means that a higher probability for the BB gives you
a lower frequency, which in turn gives you a lower cost for memory operations.

Finally, this gets topped off somewhere with another small amount (10 *
mem-cost, where 10 appears to be the ratio BB_FREQ_MAX / REG_FREQ_MAX), which
yields the register costs that can be seen in an IRA dump.

Bad:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,4995 FP_LO_REGS:50,4995
FP_REGS:50,4995 POINTER_AND_FP_REGS:50,4995 MEM:40,3996

Good:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,5000 FP_LO_REGS:50,5000
FP_REGS:50,5000 POINTER_AND_FP_REGS:50,5000 MEM:40,4000

So overall a change of 0.1% in probability caused a decrease of 0.1 %
(BB_FREQ_MAX / REG_FREQ_MAX) * memcost.

Now, based on the above, the base cost of using GENERAL_REGS in both cases is 0.

But in ira_tune_allocno_costs IRA applies a penalty to the cost of a register
if it is live across a call in the same BB. This penalty is weighted by the
CALL_FREQ.

Bad:

  CALL_FREQ 504, FREQ_FROM_BB 504

Good:

  CALL_FREQ 495, FREQ_FROM_BB 495

So going from the good case to the bad one, the BB_FREQ went down while the
CALL_FREQ went up: the BB_FREQ fell by 1, but the CALL_FREQ rose by 9.

The penalty applied is CALL_FREQ * (<total cost for a spill AND reload of the
register>), so in our case CALL_FREQ * (4 + 4).

So we end up with:

Bad:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4
   cost 4032, reg-cost 4032
   CALL_FREQ 504, ALLOCNO_FREQ 999

Good:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4  
   cost 3960, reg-cost 3960                              
   CALL_FREQ 495, ALLOCNO_FREQ 1000

Which gives us a final cost of:

Bad:

   Memory: 3956
   Register: 4032

Good:

   Memory: 3960
   Register: 3960
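
The final comparison can be reproduced with a small sketch. The memory costs
are taken directly from the dump; the register cost is just the call penalty,
since the base GENERAL_REGS cost is 0 in both cases:

```python
MEM_MOVE = (4, 4)  # ira_memory_move_cost in/out, per the dump

def call_penalty(call_freq):
    # ira_tune_allocno_costs: a pseudo live across a call pays
    # CALL_FREQ * (spill + reload) on top of its base register cost
    return call_freq * sum(MEM_MOVE)

# base GENERAL_REGS cost is 0 in both cases, so the penalty is the whole
# register cost; the memory costs come from the dumps above
bad_register,  bad_memory  = call_penalty(504), 3956
good_register, good_memory = call_penalty(495), 3960

assert (bad_register, good_register) == (4032, 3960)
assert bad_memory < bad_register     # bad: memory looks cheaper -> spill
assert good_memory == good_register  # good: a tie, the register can win
```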

This is all to say that the penalty grows far quicker than the BB frequency:
tiny changes in BB frequencies have a large effect on it.

This is why the RA ends up thinking it's cheaper to go through memory. It is
trying to apply a penalty to the registers to prevent their use, which is
understandable, but what isn't clear to me is why it doesn't take into account
whether the instruction is in a loop. It can be far cheaper to spill at the
call site instead.

ira_tune_allocno_costs does have a callback hook

  IRA_HARD_REGNO_ADD_COST_MULTIPLIER

that targets can use to tweak the penalty applied to registers that are live
through a call. The annoying thing about the hook is that the values you give
it are weighted by BB_FREQ and not CALL_FREQ, so you're not given enough
information to strike a balance between the growth of the CALL_FREQ and the
BB_FREQ.

It's done using

    cost += ((ira_memory_move_cost[mode][rclass][0]
              + ira_memory_move_cost[mode][rclass][1])
             * ALLOCNO_FREQ (a)
             * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
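
To see why this hook gives so little leverage, here is a simplified numeric
sketch of the formula, using the penalties and ALLOCNO_FREQ values from the
dumps; the helper name is mine, not IRA's:

```python
def tuned_register_cost(call_penalty, allocno_freq, multiplier):
    # mirrors the ira_tune_allocno_costs formula above: the target
    # multiplier is weighted by the allocno (BB-based) frequency,
    # not by CALL_FREQ
    mem_move = (4, 4)
    return call_penalty + sum(mem_move) * allocno_freq * multiplier // 2

# With the dumped frequencies, a constant per-regno multiplier shifts the
# good and bad cases almost identically, so it cannot rebalance the
# CALL_FREQ-driven gap between them.
gap0 = tuned_register_cost(4032, 999, 0) - tuned_register_cost(3960, 1000, 0)
gap1 = tuned_register_cost(4032, 999, 1) - tuned_register_cost(3960, 1000, 1)
assert (gap0, gap1) == (72, 68)
```

Each unit of the multiplier narrows the 72-unit gap by only 4, precisely
because the hook sees ALLOCNO_FREQ and never CALL_FREQ.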

I have attempted to provide a backend hook in AArch64 that applies an inverse
penalty to caller-saved registers. But because I am not given frequency
information I can only return a constant penalty, which is clearly incorrect
as it's specifically tuning for a loop in exchange2.

Using this hook I was able to regain only about 40% of the regression, with no
losses in SPECINT CPU2017.

But I think the ratio between the growth of the costs and the penalty is off.
This is evident from the fact that the regression exists on multiple targets.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-01-29 13:34 ` tnfchris at gcc dot gnu.org
@ 2021-02-05 12:02 ` tnfchris at gcc dot gnu.org
  2021-02-23  2:11 ` jiangning.liu at amperecomputing dot com
                   ` (40 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-02-05 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org
            Summary|IRA artificially creating   |[11 Regression] Bad
                   |spills due to BB            |interaction between IPA
                   |frequencies                 |frequences and IRA
                   |                            |resulting in spills due to
                   |                            |changes in BB frequencies

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Hi,

Since we are in stage 4, I'd like to put all our ducks in a row and see what
the options are at this point.

IRA, as you can imagine, is huge and quite complex, and the more I investigate
the problem, the more I realize that there isn't a spot fix for this issue. It
will require a lot more work in IRA, and understanding parts of it that I
don't fully understand yet.

But one thing is clear: there is a severe interaction between IPA predictions
and IRA under conditions where there is high register pressure *and* a
function call.

The problem is that the changes introduced in
g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 are local, i.e. they affect only
some BBs and not others, while any spot fix in IRA would be globally scoped.

I was investigating whether the issue could be solved by having IRA treat the
recursively inlined function in exchange2 as one region instead of doing live
range splitting. Using -fira-region=one does make a difference, but only a
small one, recovering about 33% of the regression. However, doing this has the
disadvantage that regions which previously did not count in the live range of
the call now do, so spilling regresses in those cases. This is why the flag
can recover only 33% of the regression: it introduces some of its own.

The second alternative I tried as a spot fix is to allow specifying a weight
for the CALL_FREQ, for use in situations of high register pressure and call
live ranges. The "hack" looks like this:

index 4fe019b2367..674e6ca7a48 100644
--- a/gcc/caller-save.c
+++ b/gcc/caller-save.c
@@ -425,6 +425,7 @@ setup_save_areas (void)
          || find_reg_note (insn, REG_NORETURN, NULL))
        continue;
       freq = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn));
+      freq = freq * (param_ira_call_freq_weight / 100.f);
       REG_SET_TO_HARD_REG_SET (hard_regs_to_save,
                               &chain->live_throughout);
       used_regs = insn_callee_abi (insn).full_reg_clobbers ();
diff --git a/gcc/ira-lives.c b/gcc/ira-lives.c
index 4ba29dcadf4..6e2699e5a7d 100644
--- a/gcc/ira-lives.c
+++ b/gcc/ira-lives.c
@@ -1392,7 +1392,7 @@ process_bb_node_lives (ira_loop_tree_node_t loop_tree_node)
                       it was saved on previous call in the same basic
                       block and the hard register was not mentioned
                       between the two calls.  */
-                   ALLOCNO_CALL_FREQ (a) += freq / 3;
+                   ALLOCNO_CALL_FREQ (a) += (freq * (param_ira_call_freq_weight / 100.0f));
diff --git a/gcc/params.opt b/gcc/params.opt
index cfed980a4d2..39d5cae9f31 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -321,6 +321,11 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.
 
+-param=ira-call-freq-weight=
+Common Joined UInteger Var(param_ira_call_freq_weight) Init(100) Param Optimization
+Scale to be applied to the weighting of the frequencies of allocations live across a call.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.

And if we look at the change in frequency between the good and bad cases, the
prediction changes by approximately 40%, so using
--param ira-call-freq-weight=40 recovers about 60% of the regression. The
issue this global change introduces is that IRA seems to start preferring
callee-saved registers. That is not in itself a problem, but at the boundary
of a region it will then emit moves from temporaries to callee-saved registers
to carry live values into the next region.

These moves are completely unneeded; enabling the late register renaming pass
(-frename-registers) removes them, and we then recover 66% of the regression.
But this is just a big hack. The obvious disadvantage, since again this is a
global change, is that it pushes all caller-saved registers to be spilled
before the function call. And indeed, before the recursive call there is now
a massive amount of spilling.

But it is something that would be "safe" to do at this point in the GCC
development cycle.

The last, and preferred, approach, if you agree Honza, is to revert
g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 for GCC 11 and commit it back first
thing in GCC 12. This would give us a release cycle to focus on the
interaction with IRA.

If we can fix it, we fix it; if we can't, we'll have to live with it, but it
gives us time to look into alternative optimizations.

A possibility would also be to introduce a flag that restores the GCC 10
behavior, which would keep the optimization for GCC 11 but again buy us time
to fix this properly in GCC 12. The downside is, well, there's a flag now. But
from looking at the patch this would be easy to do; it's just ignoring the
already-predicted check.

I would really appreciate some feedback from you two, Honza and Vlad.

For now I am also marking this as an 11 regression, as I want to reach some
kind of judgment on what the options are.

To summarize, the options I see, in order of preference, are:

1) temporarily revert g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92
2) Add a parameter to restore GCC 10 behavior
3) Add a partial workaround in IRA using the parameter
4) Do nothing and use -fira-region=one

In all cases we'll have to look at RA in GCC 12.

Thanks,
Tamar

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-02-05 12:02 ` [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in " tnfchris at gcc dot gnu.org
@ 2021-02-23  2:11 ` jiangning.liu at amperecomputing dot com
  2021-02-23 12:06 ` rguenth at gcc dot gnu.org
                   ` (39 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: jiangning.liu at amperecomputing dot com @ 2021-02-23  2:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #4 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
Hi Honza,

Do you see any other real-world problems if the patch
g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 is not applied?

If exchange2 is the only thing affected by this patch so far, then given the
big performance regression we have observed, it sounds like we need to provide
an IRA fix along with this patch, to avoid unexpected performance degradation
in the GCC 11 release vs. GCC 10.

Thanks,
-Jiangning

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-02-23  2:11 ` jiangning.liu at amperecomputing dot com
@ 2021-02-23 12:06 ` rguenth at gcc dot gnu.org
  2021-02-26 12:32 ` rguenth at gcc dot gnu.org
                   ` (38 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-23 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.0

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-02-23 12:06 ` rguenth at gcc dot gnu.org
@ 2021-02-26 12:32 ` rguenth at gcc dot gnu.org
  2021-11-28 19:07 ` [Bug rtl-optimization/98782] [11/12 " hubicka at gcc dot gnu.org
                   ` (37 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-26 12:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.0                        |12.0
           Priority|P3                          |P2

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-targeting to GCC 12.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-02-26 12:32 ` rguenth at gcc dot gnu.org
@ 2021-11-28 19:07 ` hubicka at gcc dot gnu.org
  2021-11-29  1:33 ` jiangning.liu at amperecomputing dot com
                   ` (36 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-11-28 19:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I am sorry for getting back to this late again. This stage1, Martin and I
spent some time improving the ipa-cp profile updating and looked again into
the realism of the profile. Also, the codegen has recently improved somewhat
due to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 and due to modref
propagation.

I still believe that trunk GCC should not carry a patch that intentionally
makes the profile unrealistic just to make IRA work better by accident, since
it does not seem to help anything real-world except this somewhat odd
benchmark. So I wonder if we can make the profile work better for IRA without
actually making it unrealistic and without tampering with the ipa-cp cloning
heuristics.

I added code that compares the guessed profile with feedback:
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585599.html
and also fixed/improved the code to dump stats about profile updates:
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585578.html

This gives a bit more of a handle on how realistic the profile is. The answer
is: not very, in general, but at least for basic blocks containing calls it is
not bad (we guess 0.9 while reality is 0.999). I am not sure how much better
we can do statically, since this is such a special case of backtracking.

Last week we also noticed that with -Ofast we inline the newly produced
clones into one another, which makes IRA's job a lot harder.  This is done
by -finline-functions-called-once, and we tend to inline blocks of 2 or 3
clones, leading to 18 or 27 nested loops in each.  Simply disabling this
optimization causes another performance hit.

I filed PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103454, and I think
we could teach the inliner not to inline functions called once at large loop
depths and restrict the large function growth here, since there are multiple
benchmarks that now degrade on this.

Worse yet, the heuristic for inlining functions called once is not very
smart: it depends on the order of cgraph_nodes in the linked list, which is
a bit random.

I wonder how the situation looks on AArch64?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-11-28 19:07 ` [Bug rtl-optimization/98782] [11/12 " hubicka at gcc dot gnu.org
@ 2021-11-29  1:33 ` jiangning.liu at amperecomputing dot com
  2021-11-29  6:59 ` tnfchris at gcc dot gnu.org
                   ` (35 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: jiangning.liu at amperecomputing dot com @ 2021-11-29  1:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #7 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
Without reverting commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we
still see the exchange2 performance issue on aarch64.  BTW, we have been
using -fno-inline-functions-called-once to get the best performance numbers
for exchange2.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-11-29  1:33 ` jiangning.liu at amperecomputing dot com
@ 2021-11-29  6:59 ` tnfchris at gcc dot gnu.org
  2021-12-03 11:44 ` hubicka at gcc dot gnu.org
                   ` (34 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-11-29  6:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> 
> I wonder how the situation looks on AArch64?

The situation didn't improve.  Up until the end of stage-1 we were seeing a
6% perf uplift from somewhere, which seems to have gone away now (in a
commit range with a non-IPA-related patch).

The major problem is still the spills.  After talking to Vlad, I took a
look at improving the Chaitin-Briggs heuristics for spilling in the
presence of calls, and at how IRA tries to improve the allocation by moving
spills along the call graph.

By improving these heuristics I was able to reduce the number of spills and
saw improvements on both x86 and AArch64, which brought us back to the old
numbers.

However, this same information is used by other areas such as register
preferences, and so I got a regression in shrink wrapping.  There's also an
issue where x86 seems to assign negative values to register costs to
indicate that it REALLY wants a particular register.  This normally works
because the penalty applied is usually large and cancels out the negative
cost.  But now the value stays negative, causing the register not to be
used at all.

To fix these I need to keep track of the penalties and the costs
separately, but I did not get time to finish that work before the end of
stage-1.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-11-29  6:59 ` tnfchris at gcc dot gnu.org
@ 2021-12-03 11:44 ` hubicka at gcc dot gnu.org
  2021-12-03 11:47 ` hubicka at gcc dot gnu.org
                   ` (33 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-03 11:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pthaugen at gcc dot gnu.org

--- Comment #9 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
*** Bug 96825 has been marked as a duplicate of this bug. ***

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2021-12-03 11:44 ` hubicka at gcc dot gnu.org
@ 2021-12-03 11:47 ` hubicka at gcc dot gnu.org
  2021-12-07 11:19 ` tnfchris at gcc dot gnu.org
                   ` (32 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-03 11:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Thanks for looking into this.  I was planning to try to contact Vladimir
about the IRA behaviour here, but there was always something else to work on
with higher priority.  I wonder if you could possibly attach the WIP patch?

Teaching the inliner not to inline this function as called-once is probably
just a matter of adding a limit capping loop depth.  I think that is a
meaningful heuristic, since inlining a very large function at a large loop
depth is probably not a good idea.  I have a patch for testing.

Do you know of other benchmarks where -fno-inline-functions-called-once
helps (and ideally why)?  We plan to look into roms and tramp3d, which also
regress.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2021-12-03 11:47 ` hubicka at gcc dot gnu.org
@ 2021-12-07 11:19 ` tnfchris at gcc dot gnu.org
  2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
                   ` (31 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-07 11:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #10)
> Thanks for looking into this.  I was planning to try to contact Vladimir
> about the IRA behaviour here, but there was always something else to work
> on with higher priority.  I wonder if you could possibly attach the WIP
> patch?
> 

I did exchange a few emails on the subject with him, and we agreed that the
best way forward would be to try to improve the heuristics instead of
changing the register allocator algorithm or handling live ranges
differently.  So the patches I've attached try to do just that.

They're git patches whose descriptions explain what they're trying to do.
With -fno-inline-functions-called-once on AArch64 they give about a 13%
improvement on exchange2 and no regressions anywhere else.  On x86 the
improvement last measured was 24%, with the number of spills reduced by
~90%.  The improvement on x86 is larger because AArch64 has more integer
registers, so we had more leeway there.

> Teaching the inliner not to inline this function as called-once is
> probably just a matter of adding a limit capping loop depth.  I think that
> is a meaningful heuristic, since inlining a very large function at a large
> loop depth is probably not a good idea.  I have a patch for testing.

That's great!  It would be nice not to need the flag anymore.  I tried to
see if I could address this from a regalloc point of view, but it didn't
look promising...

> 
> Do you know of other benchmarks where -fno-inline-functions-called-once
> helps (and ideally why)?  We plan to look into roms and tramp3d, which
> also regress.

Hmm, no, I can't say I do; we only use it for the SPECCPU intrate Fortran
benchmarks, so we haven't tracked the others with it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2021-12-07 11:19 ` tnfchris at gcc dot gnu.org
@ 2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
  2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
                   ` (30 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-07 11:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 51942
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51942&action=edit
0001-PATCH-1-2-GCC-reload-Weigh-available-callee-saves-ba.patch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
@ 2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
  2021-12-07 19:44 ` rsandifo at gcc dot gnu.org
                   ` (29 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-07 11:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #13 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 51943
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51943&action=edit
0002-PATCH-2-2-GCC-reload-Don-t-move-if-the-next-BB-has-m.patch

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
@ 2021-12-07 19:44 ` rsandifo at gcc dot gnu.org
  2021-12-07 23:52 ` tnfchris at gcc dot gnu.org
                   ` (28 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-07 19:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #14 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Thanks for the nice cut-down example.

In the original comment and the covering note to patch 1,
the highlighted spilled value is the final (terminating)
value of foo + 1024.  Let's start with the SPILLED=0.51
case and consider that register ("R") in isolation without
changing anything else.
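The attached reduced testcase isn't reproduced in this thread, but from the
description (a 1024-iteration loop, a call on one arm taken with probability
SPILLED, and the terminating value foo + 1024 live across the call) it has
roughly this shape.  This is a hedged reconstruction, not the actual
attachment; `bar` and the `sel` selector array are illustrative stand-ins:

```c
/* Illustrative stand-in for the called function; noinline keeps it a
   real call, so caller-saved registers die across it.  */
__attribute__((noinline)) void bar (void) {}

int
sum (int *foo, const int *sel)
{
  int s = 0;
  int *end = foo + 1024;	/* "R": loop-invariant terminating value */
  for (int *p = foo; p < end; p++)
    {
      if (sel[p - foo])		/* taken with probability ~SPILLED */
	bar ();			/* R must survive this call */
      s += *p;
    }
  return s;			/* every iteration compares p against R */
}
```

With SPILLED near 0.5, the allocator must decide whether `end` lives in a
callee-saved register, a caller-saved register saved/restored around the
call, or memory.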

If we spill R to memory (as for the current SPILLED=0.51 code),
we generate:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters
- a load of R inside the loop with freq 0.49 * loop iters

If we force R to be allocated a call-clobbered register instead
of being spilled (and changing nothing else, via a hack to
ira-color.c:improve_allocation) then we generate:

- a store of R inside the loop (before the call) with freq 0.51 * loop iters
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

So the in-loop cost of the second (allocated) version is higher
than the in-loop cost of the first (spilled) version.  As the value
of SPILLED increases, the difference between the two also increases:
spilling R gets progressively “cheaper” than allocating a
call-clobbered register to R.

That is, we're trading two memory accesses in the call branch
(if we allocate R) against one memory access in both branches
(if we spill R).  As the call branch gets more likely,
the cost of doing two memory accesses there gets higher
relative to the cost of doing one memory access in both branches.
And that seems like the right behaviour in principle.

From that point of view, it doesn't look like the memory and register
costs of R are too wrong here.  The things being costed are the store
and load around the call (which do exist if we allocate a call-clobbered
register) and the loads at each use site (which do exist if we spill R).

Like Feng Xue says in comment 1, I think the main missed optimisation
opportunity here is that foo + 1024 is invariant, so if we allocate
a call-clobbered register, we could save R once outside the loop
and reload it after each call.  That would give:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

which is cheaper than both the current approaches.  We don't do that
optimisation yet though, so the current costing seems to reflect what we
currently generate.

I don't know how well the above translates to the original example
though.  Are some of the spilled values in exchange2 loop-invariant
as well?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2021-12-07 19:44 ` rsandifo at gcc dot gnu.org
@ 2021-12-07 23:52 ` tnfchris at gcc dot gnu.org
  2021-12-08  9:33 ` rsandifo at gcc dot gnu.org
                   ` (27 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-07 23:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> That is, we're trading two memory accesses in the call branch
> (if we allocate R) against one memory access in both branches
> (if we spill R).  As the call branch gets more likely,
> the cost of doing two memory accesses there gets higher
> relative to the cost of doing one memory access in both branches.
> And that seems like the right behaviour in principle.
> 
> From that point of view, it doesn't look like the memory and register
> costs of R are too wrong here.  The things being costed are the store
> and load around the call (which do exist if we allocate a call-clobbered
> register) and the loads at each use site (which do exist if we spill R).

Indeed, I don't think the heuristics are wrong, but CALL_FREQ grows much
quicker than BB_FREQ, and at smaller values they are a bit sensitive to any
change.  The edge probabilities can barely change while BB_FREQ changes
dramatically.

> 
> Like Feng Xue says in comment 1, I think the main missed optimisation
> opportunity here is that foo + 1024 is invariant, so if we allocate
> a call-clobbered register, we could save R once outside the loop
> and reload it after each call.  That would give:
> 
> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> 

Yes, that is the ideal solution, but also requires more changes to RA.
Instead I've chosen a middle ground here (same as yours but done in
ira_tune_allocno_costs instead), which is to store and load only inside
the loop, but to do so only in the BB which contains the call.

This is a major improvement over the current situation, because when you
have several nested loops where the value is invariant across a number of
them, you run into problems when each of these BBs naturally has very high
register pressure.

As you say:

> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> - a load of R inside the loop with freq 0.49 * loop iters

and if the loop has various BBs (like a long if/then/elseif/else chain) the
load has to happen in every BB in the loop.  That's why we get the large
number of spills we currently do.

By forcing the spill to happen only in the BB with the call inside the
loop, the other BBs are freed from all the loads.

> If we force R to be allocated a call-clobbered register instead
> of being spilled (and changing nothing else, via a hack to
> ira-color.c:improve_allocation) then we generate:
> 
> - a store of R inside the loop (before the call) with freq 0.51 * loop iters
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

I essentially did the same thing, but I think in a more conservative way.
When there is just a single call inside the entire loop, I force IRA to
assign the call-clobbered register if it needs it.  This removed the loads
with freq 0.49 but left the ones with freq 0.51.

I use call counts as the measure here, because with one call and multiple
BBs inside the live range you know that at most one BB will have a call,
and so the rest won't have any.

Essentially: if you have high register pressure, make only the part that
increases it pay for the spills.  As you say it's not perfect, but it's a
conservative improvement over the current situation.

> which is cheaper than both the current approaches.  We don't do that
> optimisation yet though, so the current costing seems to reflect what we
> currently generate.

In many (if not most) Arches stores are significantly cheaper than the loads
though. So the store before the call doesn't end up making that much of a
difference, but yes it adds up if you have many of them.

So indeed removing it is optimal, but that seems like a very hard thing to
do.  I would assume that the live range for the loop starts at the body of
the loop, so I imagine it's very hard to tell reload to spill outside of
the allocnos it's currently allocating for?

> 
> I don't know how well the above translates to the original example
> though.  Are some of the spilled values in exchange2 loop-invariant
> as well?

Yes, I believe so.  It's a bit hard for me to tell since the functions are
huge and have many nested loops... but in RTL the BBs are quite large as
well after the constprop and recursive inlining stuff.

But the behaviour is consistent with the minimal problem here.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2021-12-07 23:52 ` tnfchris at gcc dot gnu.org
@ 2021-12-08  9:33 ` rsandifo at gcc dot gnu.org
  2021-12-08 14:31 ` tnfchris at gcc dot gnu.org
                   ` (26 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-08  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #16 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #15)
> > From that point of view, it doesn't look like the memory and register
> > costs of R are too wrong here.  The things being costed are the store
> > and load around the call (which do exist if we allocate a call-clobbered
> > register) and the loads at each use site (which do exist if we spill R).
> 
> Indeed, I don't think the heuristics are wrong, but CALL_FREQ grows much
> quicker than BB_FREQ, and at smaller values they are a bit sensitive to
> any change.  The edge probabilities can barely change while BB_FREQ
> changes dramatically.
On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
count.  It shouldn't grow or shrink based on the value of SPILLED.
That's because every execution of the loop body involves exactly one
reference to r104.  SPILLED specifies the probability that that single
reference is the “call” use rather than the “non-call” use, but it doesn't
change the total number of references per iteration.

So I think the only reason we see the different ALLOCNO_FREQs in:

   ALLOCNO_FREQ 989, …

vs:

   ALLOCNO_FREQ 990, …

is round-off error.  If the values had more precision, I think we'd
have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

At one extreme, if SPILLED is very low (but if we nevertheless
continue to duplicate the common part of the loop body) then
storing and loading around the call becomes very cheap
(ALLOCNO_CALL_FREQ is much lower than the conceptually fixed
ALLOCNO_FREQ).  If SPILLED is very high (but again we continue
to duplicate the common part of the loop body) then storing and
loading around the call becomes very expensive.  So I guess the
question is: where is the cut-off?

Given that:

- the loop iterates 1024 times
- there is a single store outside the loop
- the target claims that loads and stores have equal cost

ISTM that the cut-off really is in the range [0.5, 0.51].  If the
value of SPILLED reflects the real probability at runtime then I think
IRA is doing the right thing, given the claimed costs of storing and
loading.

I guess the problems are:

- SPILLED is only a guess
- the aarch64 costs of stores and loads don't reflect reality

> > Like Feng Xue says in comment 1, I think the main missed optimisation
> > opportunity here is that foo + 1024 is invariant, so if we allocate
> > a call-clobbered register, we could save R once outside the loop
> > and reload it after each call.  That would give:
> > 
> > - a store of R outside the loop (low execution count)
> > - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> > 
> 
> Yes, that is the ideal solution, but also requires more changes to RA.
True :-)

> Instead I've chosen a middle ground here (same as yours but done in
> ira_tune_allocno_costs instead), which is to store and load only inside
> the loop, but to do so only in the BB which contains the call.
I don't think you were saying otherwise, but just FTR: I wasn't
proposing a solution, I was just describing a hack.  It seemed
to me like IRA was making the right decision for r104 in isolation,
for the given SPILLED value and target costs.  My hack to force
an allocation for r104 made things worse.

> > which is cheaper than both the current approaches.  We don't do that
> > optimisation yet though, so the current costing seems to reflect what we
> > currently generate.
> 
> In many (if not most) Arches stores are significantly cheaper than the loads
> though. So the store before the call doesn't end up making that much of a
> difference, but yes it adds up if you have many of them.
Yeah.  Could we fix the problem that way instead?  The only reason IRA is
treating loads and stores as equal cost is because aarch64 asked it to :-)

At least for the reduced testcase, ISTM that IRA is making the right choice
for the “loads and stores are equally expensive” assumption.  It also seems
to make the right choice for a “loads are more expensive than stores”
assumption.  It's up to the target to choose which is right.

So I think the reduced testcase is showing a target bug.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2021-12-08  9:33 ` rsandifo at gcc dot gnu.org
@ 2021-12-08 14:31 ` tnfchris at gcc dot gnu.org
  2021-12-08 15:02 ` rsandifo at gcc dot gnu.org
                   ` (25 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-08 14:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #17 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> count.  It shouldn't grow or shrink based on the value of SPILLED.
> That's because every execution of the loop body involves exactly one
> reference to r104.  SPILLED specifies the probability that that single
> reference is the “call” use rather than the “non-call” use, but it doesn't
> change the total number of references per iteration.
> 
> So I think the only reason we see the different ALLOCNO_FREQs in:
> 
>    ALLOCNO_FREQ 989, …
> 
> vs:
> 
>    ALLOCNO_FREQ 990, …
> 
> is round-off error.  If the values had more precision, I think we'd
> have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

Yeah, that's plausible.  As far as I can tell the frequencies are always
scaled by REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed
does an integer division.  The general problem is that the IPA frequencies
don't really seem to have any bounded range, so it always needs to scale.

So I think you're always going to have this error one way or another,
and it may or may not work to your advantage on any given program.

Maybe we need a way to be a bit more tolerant of this rounding error
instead?

> > Instead I've chosen a middle ground here (same as yours but done in
> > ira_tune_allocno_costs instead), which is to store and load only inside
> > the loop, but to do so only in the BB which contains the call.
> I don't think you were saying otherwise, but just FTR: I wasn't
> proposing a solution, I was just describing a hack.  It seemed
> to me like IRA was making the right decision for r104 in isolation,
> for the given SPILLED value and target costs.  My hack to force
> an allocation for r104 made things worse.

Ah ok, fair enough :)

> 
> > > which is cheaper than both the current approaches.  We don't do that
> > > optimisation yet though, so the current costing seems to reflect what we
> > > currently generate.
> > 
> > In many (if not most) Arches stores are significantly cheaper than the loads
> > though. So the store before the call doesn't end up making that much of a
> > difference, but yes it adds up if you have many of them.
> Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> treating loads and stores as equal cost is because aarch64 asked it to :-)

I tried a quick check and it does fix the testcase but not the benchmark,
which, thinking about it, is not entirely unexpected because x86 does
correctly model the store costs.

I can try fixing the costs properly and reducing again.  It looks like it
still thinks spilling to memory is cheaper than caller-save reloads.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2021-12-08 14:31 ` tnfchris at gcc dot gnu.org
@ 2021-12-08 15:02 ` rsandifo at gcc dot gnu.org
  2021-12-09 19:56 ` pthaugen at gcc dot gnu.org
                   ` (24 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-08 15:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #18 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #17)
> > On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> > ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> > count.  It shouldn't grow or shrink based on the value of SPILLED.
> > That's because every execution of the loop body involves exactly one
> > reference to r104.  SPILLED specifies the probability that that single
> > reference is the “call” use rather than the “non-call” use, but it doesn't
> > change the total number of references per iteration.
> > 
> > So I think the only reason we see the different ALLOCNO_FREQs in:
> > 
> >    ALLOCNO_FREQ 989, …
> > 
> > vs:
> > 
> >    ALLOCNO_FREQ 990, …
> > 
> > is round-off error.  If the values had more precision, I think we'd
> > have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.
> 
> Yeah, that's plausible.  As far as I can tell the frequencies are always
> scaled by REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed
> does an integer division.  The general problem is that the IPA frequencies
> don't really seem to have any bounded range, so it always needs to scale.
> 
> So I think you're always going to have this error one way or another,
> and it may or may not work to your advantage on any given program.
> 
> Maybe we need a way to be a bit more tolerant of this rounding error
> instead?
I guess the thing is: with the costs that the target is giving IRA,
there isn't much of a difference between the cost of the “0.5 way”
and the cost of the “0.51 way” for values of SPILLED around the
0.5 mark.

For spilling r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1                 store      store outside the loop when R is set
  1024×SPILLED      load       load inside the loop when R is used (call)
  1024×(1-SPILLED)  load       load inside the loop when R is used (non-call)

For allocating a call-clobbered register to r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1024×SPILLED      store      store before the call
  1024×SPILLED      load       load after the call

When loads and store have equal weight, that weight cancels out,
giving:

    1025  vs.  2048×SPILLED

So the costs are equal for SPILLED=1025/2048.  Above that spilling
is cheaper, below that allocating a call-clobbered register is
cheaper.

IRA has done well to find the exact tipping point (based on the
assumptions it has been told to use), but for values of SPILLED
around the tipping point, the two costs are very close.

So with the current cost model, we're not falling off a cliff
cost-wise when we hit the tipping point or rounding error.
The costs coming in give the impression that there isn't much
to choose between the two approaches when SPILLED is roughly
half-and-half.
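
To make the arithmetic concrete, here is a toy model of the two costings
(a sketch only, not IRA's actual code; the 1024 in-loop frequency and the
unit load/store weights are taken from the tables above):

```python
# Toy model of the spill-vs-caller-save costing discussed above.
# Assumes loads and stores have equal weight, as the aarch64 cost
# tables did at the time; 1024 is the in-loop block frequency.
from fractions import Fraction

def spill_cost(load, store, spilled):
    # Store once outside the loop, then reload at every use inside it,
    # whether the use is on the call path or the non-call path.
    return store + 1024 * spilled * load + 1024 * (1 - spilled) * load

def caller_save_cost(load, store, spilled):
    # Save before and restore after each (probability-weighted) call.
    return 1024 * spilled * (store + load)

# With load == store == 1 the costs reduce to 1025 vs 2048*SPILLED,
# so they meet exactly at SPILLED = 1025/2048 (~0.50049).
tip = Fraction(1025, 2048)
assert spill_cost(1, 1, tip) == caller_save_cost(1, 1, tip) == 1025

# Just below the tipping point, caller-saving wins, but only barely.
assert caller_save_cost(1, 1, Fraction(1, 2)) < spill_cost(1, 1, Fraction(1, 2))
```

which matches the 1025 vs 2048×SPILLED comparison above and shows how flat
the cost difference is near the tipping point.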

Now if SPILLED is completely off the mark, e.g. we say it has
a probability of 0.51 but the actual runtime probability is more
like 0.1, then clearly we're in trouble.  I'm not sure what we can
do about that though.  Perhaps if the edge frequencies (and derived
block frequencies) are pure guesses, we should weight based on
the approximate cost of the block instead.  I.e. the call block
gets a low frequency because calls are expensive, while the
non-call block gets a high frequency because it does simple
arithmetic.  Thus any new instructions added to the call block
are likely to have a proportionately lower effect than any new
instructions added to the non-call block.

I hope we don't have to do that though :-)

> > > > which is cheaper than both the current approaches.  We don't do that
> > > > optimisation yet though, so the current costing seems to reflect what we
> > > > currently generate.
> > > 
> > > In many (if not most) arches, stores are significantly cheaper than loads
> > > though.  So the store before the call doesn't end up making that much of a
> > > difference, but yes, it adds up if you have many of them.
> > Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> > treating loads and stores as equal cost is because aarch64 asked it to :-)
> 
> I tried a quick check and it does fix the testcase but not the benchmark,
> which is not entirely unexpected thinking about it, because x86 does
> correctly model the store costs.
Yeah, was afraid that would be the case.

> I can try fixing the costs correctly and try reducing again.  It looks
> like it still thinks spilling to memory is cheaper than caller-save
> reloads.
Thanks.  I think that would help.  The current aarch64 costs are likely
to distort things if we try to reduce using them.

IMO the initial reduction has been useful because it gives a nice
crystallised example of why the load/store cost ratio matters.
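
As a toy illustration of why that ratio matters (a sketch under the same
assumptions as the cost tables above, not IRA code): solving
store + 1024·load = 1024·SPILLED·(store + load) for SPILLED gives the
tipping point as a function of the load and store costs.

```python
# Hypothetical sketch: where the spill-vs-caller-save tipping point
# lands as the load/store cost ratio changes.  The 1024 in-loop
# frequency comes from the discussion above; this is not IRA code.
from fractions import Fraction

def tipping_point(load, store):
    # Solve: store + 1024*load == 1024*SPILLED*(store + load)
    return Fraction(store + 1024 * load, 1024 * (store + load))

# Equal load/store cost: the 1025/2048 point IRA found in this PR.
assert tipping_point(1, 1) == Fraction(1025, 2048)

# If stores cost half as much as loads, caller saves stay cheaper up
# to a noticeably higher call probability (~0.667 instead of ~0.500).
assert tipping_point(2, 1) > tipping_point(1, 1)
```

i.e. cheaper stores widen the range of call probabilities for which caller
saves beat spilling, which is why getting the store cost right can change
the allocator's decision.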

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2021-12-08 15:02 ` rsandifo at gcc dot gnu.org
@ 2021-12-09 19:56 ` pthaugen at gcc dot gnu.org
  2021-12-09 20:12 ` hubicka at gcc dot gnu.org
                   ` (23 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: pthaugen at gcc dot gnu.org @ 2021-12-09 19:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #19 from pthaugen at gcc dot gnu.org ---
I tried -fno-inline-functions-called-once and the patches on a Power9 system.
Following are the run times and spill counts (grep -c Spilling
exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase
when -fno-inline-functions-called-once is added, but obviously that additional
spill is on cold paths since execution time improves. Compiler options used are
"-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".

                                  time(sec)      Spill
base                                473           3284

no-inline-functions-called-once     370           5410

patches 1 & 2                       397            461

patches 1 & 2
+ no-inline-functions-called-once   299            870

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2021-12-09 19:56 ` pthaugen at gcc dot gnu.org
@ 2021-12-09 20:12 ` hubicka at gcc dot gnu.org
  2021-12-09 21:27 ` pthaugen at gcc dot gnu.org
                   ` (22 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-09 20:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #20 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no longer
need -fno-inline-functions-called-once

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2021-12-09 20:12 ` hubicka at gcc dot gnu.org
@ 2021-12-09 21:27 ` pthaugen at gcc dot gnu.org
  2021-12-10 11:36 ` hubicka at gcc dot gnu.org
                   ` (21 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: pthaugen at gcc dot gnu.org @ 2021-12-09 21:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #21 from pthaugen at gcc dot gnu.org ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Yes, I see that now with an updated trunk.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2021-12-09 21:27 ` pthaugen at gcc dot gnu.org
@ 2021-12-10 11:36 ` hubicka at gcc dot gnu.org
  2021-12-14 14:38 ` tnfchris at gcc dot gnu.org
                   ` (20 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-10 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #22 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
We discussed with Richard the option of biasing the IRA spilling heuristics to
spill into blocks containing calls (since they are slow anyway), bypassing some
of the guessed frequencies.

Thinking on it, perhaps I should note that I believe we seem to have an IRA
problem even with profile feedback (it used to be possible to get faster code
without -fprofile-use than with it), so the default heuristics seem to be
misbehaving somehow.  With the current guessed profile the frequency of the BB
containing the recursive call is about 0.9, while reality is about 0.99, which
is quite close.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2021-12-10 11:36 ` hubicka at gcc dot gnu.org
@ 2021-12-14 14:38 ` tnfchris at gcc dot gnu.org
  2021-12-14 14:40 ` hubicka at kam dot mff.cuni.cz
                   ` (19 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-14 14:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #23 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Awesome! thanks!

I wonder if we can get rid of the final magic parameter too; we run with
--param ipa-cp-unit-growth=80, which seems to have no more effect on
exchange, though still a slight effect on leela but a 12% gain in imagick.

It looks like without the parameter we lose constprop on MorphologyApply,
resulting in a much bigger function.  Do you want me to make a new ticket for
this?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2021-12-14 14:38 ` tnfchris at gcc dot gnu.org
@ 2021-12-14 14:40 ` hubicka at kam dot mff.cuni.cz
  2021-12-14 14:48 ` tnfchris at gcc dot gnu.org
                   ` (18 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-14 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #24 from hubicka at kam dot mff.cuni.cz ---
> Awesome! thanks!
> 
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.

Interesting - this is Martin Jambor's area but I do not think we were
aware of this.  I wonder how big a growth is really needed - 80% is quite
a lot, especially if imagemagick is already a big binary.  Is this about
an LTO or non-LTO build?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2021-12-14 14:40 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-14 14:48 ` tnfchris at gcc dot gnu.org
  2021-12-14 14:58 ` hubicka at kam dot mff.cuni.cz
                   ` (17 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-14 14:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #25 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to hubicka from comment #24)
> > Awesome! thanks!
> > 
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> Interesting - this is Martin Jambor's area but I do not think we were
> aware of this.  I wonder how big a growth is really needed - 80% is quite
> a lot, especially if imagemagick is already a big binary.  Is this about
> an LTO or non-LTO build?

It's with LTO; I'll see if non-LTO has the same benefit.  In terms of code size
it looks like it accounts for a 20% increase in binary size, but the hot
function shrinks approx 6x.  Of course this increase covers the entire binary,
and there are other hot functions like GetVirtualPixelsFromNexus that also
shrink significantly when specialized.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2021-12-14 14:48 ` tnfchris at gcc dot gnu.org
@ 2021-12-14 14:58 ` hubicka at kam dot mff.cuni.cz
  2021-12-14 15:07 ` tnfchris at gcc dot gnu.org
                   ` (16 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-12-14 14:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #26 from hubicka at kam dot mff.cuni.cz ---
> It's with LTO, I'll see if non-LTO has the same benefit.  In terms of code-size
> it looks like it accounts for a 20% increase for binary size, but the hot
> function shrinks approx 6x. But of course this increase covers the entire
> binaries and there are other hot functions like GetVirtualPixelsFromNexus that
> also shrink significantly when specialized.
Thanks.  I will check this.  We may need to tweak the cost model to first
specialize the functions that reduce size the most (by doing multiple passes
with a decreasing threshold).

It is really nice to see some real improvements from ipa-cp.  For years
the pass did almost nothing on the benchmarks I tested (spec, Firefox,
clang) but that has changed.  I also see measurable speedups on the clang
binary with ipa-cp on.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2021-12-14 14:58 ` hubicka at kam dot mff.cuni.cz
@ 2021-12-14 15:07 ` tnfchris at gcc dot gnu.org
  2021-12-14 15:08 ` tnfchris at gcc dot gnu.org
                   ` (15 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-14 15:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #27 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to hubicka from comment #26)
> > It's with LTO, I'll see if non-LTO has the same benefit.  In terms of code-size
> > it looks like it accounts for a 20% increase for binary size, but the hot
> > function shrinks approx 6x. But of course this increase covers the entire
> > binaries and there are other hot functions like GetVirtualPixelsFromNexus that
> > also shrink significantly when specialized.
>
> Thanks.  I will check this.  We may need to tweak the cost model to first
> specialize the functions that reduce size the most (by doing multiple passes
> with a decreasing threshold)

Thanks! It looks like it doesn't need LTO, just plain -Ofast is enough to get
the gain.

> 
> It is really nice to see some real improvmeents from ipa-cp. For years
> the pass did almost nothing on the benchmarks I tested (spec, Firefox,
> clang) but it has changed.  I also see measurable speedups on clang
> binary with ipa-cp on.

Indeed! Some of the gains from these cp's are quite nice!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2021-12-14 15:07 ` tnfchris at gcc dot gnu.org
@ 2021-12-14 15:08 ` tnfchris at gcc dot gnu.org
  2021-12-14 18:16 ` jamborm at gcc dot gnu.org
                   ` (14 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-14 15:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #28 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to pthaugen from comment #19)
> I tried -fno-inline-functions-called-once and the patches on a Power9
> system. Following are the run times and spill counts (grep -c Spilling
> exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase
> when -fno-inline-functions-called-once is added, but obviously that
> additional spill is on cold paths since execution time improves. Compiler
> options used are "-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".
> 
>                                   time(sec)      Spill
> base                                473           3284
> 
> no-inline-functions-called-once     370           5410
> 
> patches 1 & 2                       397            461
> 
> patches 1 & 2
> + no-inline-functions-called-once   299            870

Ah, thanks for testing!  So at least we know that all architectures are
suffering from the same underlying issue :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2021-12-14 15:08 ` tnfchris at gcc dot gnu.org
@ 2021-12-14 18:16 ` jamborm at gcc dot gnu.org
  2021-12-15 12:15 ` tnfchris at gcc dot gnu.org
                   ` (13 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-12-14 18:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #29 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #23)
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> This looks like without the parameter we lose constprop on MorphologyApply
> resulting in a much bigger function.  Do you want me to make a new ticket
> for this?

Indeed I did not know about this.  I think tracking it in a (separate) PR would
make sense.  Thanks a lot for the pointer!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2021-12-14 18:16 ` jamborm at gcc dot gnu.org
@ 2021-12-15 12:15 ` tnfchris at gcc dot gnu.org
  2021-12-20 18:06 ` rsandifo at gcc dot gnu.org
                   ` (12 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-12-15 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #30 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #29)
> (In reply to Tamar Christina from comment #23)
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> > 
> > This looks like without the parameter we lose constprop on MorphologyApply
> > resulting in a much bigger function.  Do you want me to make a new ticket
> > for this?
> 
> Indeed I did not know about this.  I think tracking it in a (separate) PR
> would make sense.  Thanks a lot for the pointer!

Thanks, I created a new ticket:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734.  The threshold seems to be
a lot lower than 80%; 20% already seems to be enough.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2021-12-15 12:15 ` tnfchris at gcc dot gnu.org
@ 2021-12-20 18:06 ` rsandifo at gcc dot gnu.org
  2021-12-31 17:28 ` rsandifo at gcc dot gnu.org
                   ` (11 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-20 18:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #31 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
I don't know how relevant this is to the exchange2 problem yet,
but for:

---------------------------------------------------------------------
#define PROB 0.95

struct L {
  int data;
  L *next;
  L *inner;
};

template<int N>
struct S {
  static __attribute__((always_inline)) void f(L *head, int inc) {
    while (head) {
      asm volatile ("// Loop %0" :: "i" (N));
      int subinc = head->data + inc;
      if (__builtin_expect_with_probability (bool(head->inner), 0, PROB))
        S<N-1>::f(head->inner, subinc);
      head->data = subinc;
      head = head->inner;
    }
  }
};

template<>
struct S<0> {
  static void f(L *, int) {
    asm volatile ("// foo" ::: "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7");
  }
};

void x(L *head) { S<10>::f(head, 1); }
---------------------------------------------------------------------

compiled on aarch64 with -O3, we always seem to spill in the outer
loops regardless of the value of PROB.  That is, the inner loops
always seem to get priority even if their bb frequencies are low.
I would have expected that moving PROB from 0.05 (inner loop very
likely to be executed) to 0.95 (inner loop very likely to be skipped)
would have moved the spills from the outer loops to the inner loops.

An unrelated issue on aarch64 is that IP0 and IP1 are explicitly
clobbered by call instructions, via CALL_INSN_FUNCTION_USAGE.
We have to do this because the linker is allowed to insert code
that clobbers those registers even if we “know” that the target
of the call doesn't clobber them.  However, clobbering the registers
that way prevents them from being used for values that are live
across the call, even if the call is very unlikely to be executed.
Hacking a fix for that does reduce the number of spills, but doesn't
get rid of the ones that matter in exchange2.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2021-12-20 18:06 ` rsandifo at gcc dot gnu.org
@ 2021-12-31 17:28 ` rsandifo at gcc dot gnu.org
  2022-01-04 22:26 ` pthaugen at gcc dot gnu.org
                   ` (10 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-31 17:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rsandifo at gcc dot gnu.org

--- Comment #32 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Created attachment 52102
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52102&action=edit
Alternative patch

This patch is a squash of several ira tweaks that together recover the
pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
yet (hence lack of comments and changelog).  It would be great to hear
whether/how it works on other targets though.

The patch bootstraps on aarch64-linux-gnu and x86_64-linux-gnu,
but there are some new scan-assembler failures that need looking at.

Quoting from the covering message:

The main changes are:

(1) Add ira_loop_border_costs to simplify spill cost calculations
    (NFC intended)

(2) Avoid freeing updated costs until the loop node has been fully
    allocated.  This in turn allows:

(3) Make improve_allocation work exclusively on updated costs,
    rather than using a mixture of updated and original costs.
    One reason this matters is that the register costs only make
    sense relative to the memory costs, so in some cases,
    a common register cost is subtracted from the updated memory cost
    instead of being added to each individual updated register cost.

(4) If a child allocno has a hard register conflict, allow the parent
    allocno to handle the conflict by spilling to memory throughout
    the child allocno's loop.  This carries the child allocno's full
    memory cost plus the cost of spilling to memory on entry to the
    loop and restoring it on exit, but this can still be cheaper
    than spilling the entire parent allocno.  In particular, it helps
    for allocnos that are live across a loop but not referenced
    within it, since the child allocno's memory cost is 0 in
    that case.

(5) Extend (4) to cases in which the child allocno is live across
    a call.  The parent then has a free choice between spilling
    call-clobbered registers around each call (as normal) or
    spilling them on entry to the loop, keeping the allocno in memory
    throughout the loop, and restoring them on exit from the loop.

(6) Detect “soft conflicts” in which:

    - one allocno (A1) is a cap whose (transitive) “real” allocno
      is A1'

    - A1' occurs in loop L1'

    - the other allocno (A2) is a non-cap allocno

    - the equivalent of A2 is live across L1' (hence the conflict)
      but has no references in L1'

    In this case we can spill A2 around L1' (or perhaps some parent
    loop) and reuse the same register for A1'.  A1 and A2 can then
    use the same hard register, provided that we make sure not to
    propagate A1's allocation to A1'.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2021-12-31 17:28 ` rsandifo at gcc dot gnu.org
@ 2022-01-04 22:26 ` pthaugen at gcc dot gnu.org
  2022-01-04 22:29 ` pthaugen at gcc dot gnu.org
                   ` (9 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: pthaugen at gcc dot gnu.org @ 2022-01-04 22:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #33 from pthaugen at gcc dot gnu.org ---
(In reply to rsandifo@gcc.gnu.org from comment #32)
> Created attachment 52102 [details]
> Alternative patch
> 
> This patch is a squash of several ira tweaks that together recover the
> pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> yet (hence lack of comments and changelog).  It would be great to hear
> whether/how it works on other targets though.

I tried the patch on a Power9 system. Execution time went from 371 seconds to
291.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2022-01-04 22:26 ` pthaugen at gcc dot gnu.org
@ 2022-01-04 22:29 ` pthaugen at gcc dot gnu.org
  2022-01-06 14:53 ` rsandifo at gcc dot gnu.org
                   ` (8 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: pthaugen at gcc dot gnu.org @ 2022-01-04 22:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #34 from pthaugen at gcc dot gnu.org ---
(In reply to pthaugen from comment #33)
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

What I should have included is that this is in line with, or even slightly
better than, the 2 patches posted by Tamar.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2022-01-04 22:29 ` pthaugen at gcc dot gnu.org
@ 2022-01-06 14:53 ` rsandifo at gcc dot gnu.org
  2022-01-10  1:29 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-01-06 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://gcc.gnu.org/piperma
                   |                            |il/gcc-patches/2022-January
                   |                            |/587761.html

--- Comment #35 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to pthaugen from comment #33)
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.
Thanks Pat!  Given that it seems to work OK on aarch64 and Power,
I've submitted a fixed version of the series for review:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2022-01-06 14:53 ` rsandifo at gcc dot gnu.org
@ 2022-01-10  1:29 ` crazylht at gmail dot com
  2022-01-11 10:14   ` Jan Hubicka
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
                   ` (6 subsequent siblings)
  44 siblings, 1 reply; 47+ messages in thread
From: crazylht at gmail dot com @ 2022-01-10  1:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #36 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to pthaugen from comment #33)
> (In reply to rsandifo@gcc.gnu.org from comment #32)
> > Created attachment 52102 [details]
> > Alternative patch
> > 
> > This patch is a squash of several ira tweaks that together recover the
> > pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> > yet (hence lack of comments and changelog).  It would be great to hear
> > whether/how it works on other targets though.
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

The patch also recovers 548.exchange2_r performance on ICX, thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2022-01-10  1:29 ` crazylht at gmail dot com
@ 2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-10 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #37 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:909a4b4764c4f270f09ccb2a950c91b21ed7b33a

commit r12-6412-g909a4b4764c4f270f09ccb2a950c91b21ed7b33a
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:07 2022 +0000

    ira: Add comments and fix move_spill_restore calculation

    This patch adds comments to describe each use of ira_loop_border_costs.
    I think this highlights that move_spill_restore was using the wrong cost
    in one case, which came from transposing [0] and [1] in the original
    (pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
    difference would only be noticeable on targets that distinguish between
    load and store costs.

    gcc/
            PR rtl-optimization/98782
            * ira-color.c (color_pass): Add comments to describe the spill
            costs.
            (move_spill_restore): Likewise.  Fix reversed calculation.

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-10 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #38 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:d54565d87ff79b882208dfb29af50232033c233d

commit r12-6413-gd54565d87ff79b882208dfb29af50232033c233d
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:07 2022 +0000

    ira: Add ira_subloop_allocnos_can_differ_p

    color_pass has two instances of the same code for propagating non-cap
    assignments from parent loops to subloops.  This patch adds a helper
    function for testing when such propagations are required for correctness
    and uses it to remove the duplicated code.

    A later patch will use this in ira-build.c too, which is why the
    function is exported to ira-int.h.

    No functional change intended.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_subloop_allocnos_can_differ_p): New function,
            extracted from...
            * ira-color.c (color_pass): ...here.

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-10 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #39 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:01f3e6a40e7202310abbeb41c345d325bd69554f

commit r12-6415-g01f3e6a40e7202310abbeb41c345d325bd69554f
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:08 2022 +0000

    ira: Consider modelling caller-save allocations as loop spills

    If an allocno A in an inner loop L spans a call, a parent allocno AP
    can choose to handle a call-clobbered/caller-saved hard register R
    in one of two ways:

    (1) save R before each call in L and restore R after each call
    (2) spill R to memory throughout L

    (2) can be cheaper than (1) in some cases, particularly if L does
    not reference A.

    Before the patch we always did (1).  The patch adds support for
    picking (2) instead, when it seems cheaper.  It builds on the
    earlier support for not propagating conflicts to parent allocnos.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_caller_save_cost): New function.
            (ira_caller_save_loop_spill_p): Likewise.
            * ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
            cheaper to spill a call-clobbered register throughout a loop rather
            than spill it around each individual call.  If so, treat all
            call-clobbered registers as conflicts and...
            (propagate_allocno_info): ...do not propagate call information
            from the child to the parent.
            * ira-color.c (move_spill_restore): Update accordingly.
            * ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.

    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-3.c: New test.

* [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (40 preceding siblings ...)
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
  2022-01-10 14:52 ` [Bug rtl-optimization/98782] [11 " rsandifo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 47+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-10 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #40 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:037cc0b4a6646cc86549247a3590215ebd5c4c43

commit r12-6416-g037cc0b4a6646cc86549247a3590215ebd5c4c43
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:09 2022 +0000

    ira: Handle "soft" conflicts between cap and non-cap allocnos

    This patch looks for allocno conflicts of the following form:

    - One allocno (X) is a cap allocno for some non-cap allocno X2.
    - X2 belongs to some loop L2.
    - The other allocno (Y) is a non-cap allocno.
    - Y is an ancestor of some allocno Y2 in L2.
    - Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
    - Y can use a different allocation from Y2.

    In this case, Y's register is live across L2 but is not used within it,
    whereas X's register is used only within L2.  The conflict is therefore
    only "soft", in that it can easily be avoided by spilling Y2 inside L2
    without affecting any insn references.

    In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too, with the
    callers then taking Y2's ALLOCNO_MEMORY_COST into account.  There would
    then be no "cliff edge" between a Y2 that has no references and a Y2 that
    has (say) a single cold reference.

    However, doing that isn't necessary for the PR and seems to give
    variable results in practice.  (fotonik3d_r improves slightly but
    namd_r regresses slightly.)  It therefore seemed better to start
    with the higher-value zero-reference case and see how things go.

    On top of the previous patches in the series, this fixes the exchange2
    regression seen in GCC 11.

    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_soft_conflict): Declare.
            * ira-color.c (max_soft_conflict_loop_depth): New constant.
            (ira_soft_conflict): New function.
            (spill_soft_conflicts): Likewise.
            (assign_hard_reg): Use them to handle the case described by
            the comment above ira_soft_conflict.
            (improve_allocation): Likewise.
            * ira.c (check_allocation): Allow allocnos with "soft" conflicts
            to share the same register.

    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-4.c: New test.

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (41 preceding siblings ...)
  2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-10 14:52 ` rsandifo at gcc dot gnu.org
  2022-01-11 10:14 ` hubicka at kam dot mff.cuni.cz
  2022-01-11 14:22 ` rsandifo at gcc dot gnu.org
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-01-10 14:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
            Summary|[11/12 Regression] Bad      |[11 Regression] Bad
                   |interaction between IPA     |interaction between IPA
                   |frequences and IRA          |frequences and IRA
                   |resulting in spills due to  |resulting in spills due to
                   |changes in BB frequencies   |changes in BB frequencies
             Status|ASSIGNED                    |RESOLVED

--- Comment #41 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Hopefully fixed on trunk, much too invasive to backport.

Thanks Vlad for the reviews.

* Re: [Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2022-01-10  1:29 ` crazylht at gmail dot com
@ 2022-01-11 10:14   ` Jan Hubicka
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Hubicka @ 2022-01-11 10:14 UTC (permalink / raw)
  To: crazylht at gmail dot com; +Cc: gcc-bugs

On Zen 2 and Zen 3 with -flto the speedup seems to be circa 12% for both -O2
and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.

Overall it seems a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example x293 reproduces
on both Kaby Lake and Zen as an 80-90% regression that may be easy to
track (the kernel is included in the testsuite). The same regression is not
seen on Zen 3, so it may be ISA-specific.

Finally, there seem to be relatively large code size savings on polyhedron
benchmarks today (8% on capacita).

Thanks a lot!

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (42 preceding siblings ...)
  2022-01-10 14:52 ` [Bug rtl-optimization/98782] [11 " rsandifo at gcc dot gnu.org
@ 2022-01-11 10:14 ` hubicka at kam dot mff.cuni.cz
  2022-01-11 14:22 ` rsandifo at gcc dot gnu.org
  44 siblings, 0 replies; 47+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-11 10:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #42 from hubicka at kam dot mff.cuni.cz ---
On Zen 2 and Zen 3 with -flto the speedup seems to be circa 12% for both -O2
and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.

Overall it seems a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example x293 reproduces
on both Kaby Lake and Zen as an 80-90% regression that may be easy to
track (the kernel is included in the testsuite). The same regression is not
seen on Zen 3, so it may be ISA-specific.

Finally, there seem to be relatively large code size savings on polyhedron
benchmarks today (8% on capacita).

Thanks a lot!

* [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies
  2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
                   ` (43 preceding siblings ...)
  2022-01-11 10:14 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-11 14:22 ` rsandifo at gcc dot gnu.org
  44 siblings, 0 replies; 47+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-01-11 14:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #43 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to hubicka from comment #42)
> I also see a 6.69% regression on x64 with -Ofast -march=native -flto
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
I can reproduce this with -Ofast -flto -march=znver3 (but not running
on a Zen 3).  It looks like it's due to g:d3ff7420e94 instead though
(sorry Andre!).  With a 3-iteration run, I see a 6.2% regression after
that revision compared with before it.

It would be great if someone more familiar than me with x86
could confirm the bisection though.

> and perhaps 3-5% on sphinx
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0
I'll look at these next.

> For non-spec benchmarks spec there is a regression on nbench
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
> There are also large changes in tsvc
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
> it may be noise since kernels are tiny, but for example x293 reproduces
> both on kabylake and zen by about 80-90% regression that may be easy to
> track (the kernel is included in the testsuite). Same regression is not
> seen on zen3, so may be an ISA specific or so.
To summarise what we discussed on IRC (for the record): it looks like
the s293 regression is in the noise, like you say.  I can't convince
GCC to generate different code before and after the IRA patches for that.
I haven't looked at the other tsvc tests yet.

end of thread, other threads:[~2022-01-11 14:22 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21 14:31 [Bug rtl-optimization/98782] New: IRA artificially creating spills due to BB frequencies tnfchris at gcc dot gnu.org
2021-01-21 14:35 ` [Bug rtl-optimization/98782] " jgreenhalgh at gcc dot gnu.org
2021-01-22 10:12 ` fxue at os dot amperecomputing.com
2021-01-29 13:34 ` tnfchris at gcc dot gnu.org
2021-02-05 12:02 ` [Bug rtl-optimization/98782] [11 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in " tnfchris at gcc dot gnu.org
2021-02-23  2:11 ` jiangning.liu at amperecomputing dot com
2021-02-23 12:06 ` rguenth at gcc dot gnu.org
2021-02-26 12:32 ` rguenth at gcc dot gnu.org
2021-11-28 19:07 ` [Bug rtl-optimization/98782] [11/12 " hubicka at gcc dot gnu.org
2021-11-29  1:33 ` jiangning.liu at amperecomputing dot com
2021-11-29  6:59 ` tnfchris at gcc dot gnu.org
2021-12-03 11:44 ` hubicka at gcc dot gnu.org
2021-12-03 11:47 ` hubicka at gcc dot gnu.org
2021-12-07 11:19 ` tnfchris at gcc dot gnu.org
2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
2021-12-07 11:21 ` tnfchris at gcc dot gnu.org
2021-12-07 19:44 ` rsandifo at gcc dot gnu.org
2021-12-07 23:52 ` tnfchris at gcc dot gnu.org
2021-12-08  9:33 ` rsandifo at gcc dot gnu.org
2021-12-08 14:31 ` tnfchris at gcc dot gnu.org
2021-12-08 15:02 ` rsandifo at gcc dot gnu.org
2021-12-09 19:56 ` pthaugen at gcc dot gnu.org
2021-12-09 20:12 ` hubicka at gcc dot gnu.org
2021-12-09 21:27 ` pthaugen at gcc dot gnu.org
2021-12-10 11:36 ` hubicka at gcc dot gnu.org
2021-12-14 14:38 ` tnfchris at gcc dot gnu.org
2021-12-14 14:40 ` hubicka at kam dot mff.cuni.cz
2021-12-14 14:48 ` tnfchris at gcc dot gnu.org
2021-12-14 14:58 ` hubicka at kam dot mff.cuni.cz
2021-12-14 15:07 ` tnfchris at gcc dot gnu.org
2021-12-14 15:08 ` tnfchris at gcc dot gnu.org
2021-12-14 18:16 ` jamborm at gcc dot gnu.org
2021-12-15 12:15 ` tnfchris at gcc dot gnu.org
2021-12-20 18:06 ` rsandifo at gcc dot gnu.org
2021-12-31 17:28 ` rsandifo at gcc dot gnu.org
2022-01-04 22:26 ` pthaugen at gcc dot gnu.org
2022-01-04 22:29 ` pthaugen at gcc dot gnu.org
2022-01-06 14:53 ` rsandifo at gcc dot gnu.org
2022-01-10  1:29 ` crazylht at gmail dot com
2022-01-11 10:14   ` Jan Hubicka
2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
2022-01-10 14:47 ` cvs-commit at gcc dot gnu.org
2022-01-10 14:52 ` [Bug rtl-optimization/98782] [11 " rsandifo at gcc dot gnu.org
2022-01-11 10:14 ` hubicka at kam dot mff.cuni.cz
2022-01-11 14:22 ` rsandifo at gcc dot gnu.org
