Am 13.12.19 um 13:45 schrieb Richard Sandiford: > Georg-Johann Lay writes: >> Am 11.12.19 um 18:55 schrieb Richard Sandiford: >>> Georg-Johann Lay writes: >>>> Hi, doesn't actually anybody know know to make memory more expensive >>>> than registers when it comes to allocating registers? >>>> >>>> Whatever I am trying for TARGET_MEMORY_MOVE_COST and >>>> TARGET_REGISTER_MOVE_COST, ira-costs.c always makes registers more >>>> expensive than mem and therefore allocates values to stack slots instead >>>> of keeping them in registers. >>>> >>>> Test case (for avr) is as simple as it gets: >>>> >>>> float func (float); >>>> >>>> float call (float f) >>>> { >>>> return func (f); >>>> } >>>> >>>> What am I missing? >>>> >>>> Johann >>>> >>>> >>>> Georg-Johann Lay schrieb: >>>>> Hi, >>>>> >>>>> I am trying to track down a code bloat issue and am stuck because I do >>>>> not understand IRA's cost model. >>>>> >>>>> The test case is as simple as it gets: >>>>> >>>>> float func (float); >>>>> >>>>> float call (float f) >>>>> { >>>>> return func (f); >>>>> } >>>>> >>>>> IRA dump shows the following insns: >>>>> >>>>> >>>>> (insn 14 4 2 2 (set (reg:SF 44) >>>>> (reg:SF 22 r22 [ f ])) "bloat.c":4:1 85 {*movsf} >>>>> (expr_list:REG_DEAD (reg:SF 22 r22 [ f ]) >>>>> (nil))) >>>>> (insn 2 14 3 2 (set (reg/v:SF 43 [ f ]) >>>>> (reg:SF 44)) "bloat.c":4:1 85 {*movsf} >>>>> (expr_list:REG_DEAD (reg:SF 44) >>>>> (nil))) >>>>> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG) >>>>> (insn 6 3 7 2 (set (reg:SF 22 r22) >>>>> (reg/v:SF 43 [ f ])) "bloat.c":5:12 85 {*movsf} >>>>> (expr_list:REG_DEAD (reg/v:SF 43 [ f ]) >>>>> (nil))) >>>>> (call_insn/j 7 6 8 2 (parallel [ >>>>> >>>>> #14 sets pseudo 44 from arg register R22. >>>>> #2 moves it to pseudo 43 >>>>> #6 moves it to R22 as it prepares for call_insn #7. >>>>> >>>>> There are 2 allocnos and cost: >>>>> >>>>> Pass 0 for finding pseudo/allocno costs >>>>> >>>>> a1 (r44,l0) best NO_REGS, allocno NO_REGS >>>>> a0 (r43,l0) best NO_REGS, allocno NO_REGS >>>>> >>>>> a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 >>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 >>>>> a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 >>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 >>>>> >>>>> which is quite odd because MEM is way more expensive here than any REG. >>>>> >>>>> Okay, so let's boost the MEM cost (TARGET_MEMORY_MOVE_COST) by a factor >>>>> of 100: >>>>> >>>>> a1 (r44,l0) best NO_REGS, allocno NO_REGS >>>>> a0 (r43,l0) best NO_REGS, allocno NO_REGS >>>>> >>>>> a0(r43,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000 >>>>> LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000 >>>>> a1(r44,l0) costs: ADDW_REGS:3200000 SIMPLE_LD_REGS:3200000 >>>>> LD_REGS:3200000 NO_LD_REGS:3200000 GENERAL_REGS:3200000 MEM:801000 >>>>> >>>>> What??? The REG costs are 100 times higher, and stille higher that the >>>>> MEM costs. What the heck is going on? >>>>> >>>>> Setting TARGET_REGISTER_MOVE_COST and also TARGET_MEMORY_MOVE_COST to 0 >>>>> yiels: >>>>> >>>>> a0(r43,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0 >>>>> GENERAL_REGS:0 MEM:0 >>>>> a1(r44,l0) costs: ADDW_REGS:0 SIMPLE_LD_REGS:0 LD_REGS:0 NO_LD_REGS:0 >>>>> GENERAL_REGS:0 MEM:0 >>>>> >>>>> as expected, i.e. there is no other hidden source of costs considered by >>>>> IRA. And even TARGET_REGISTER_MOVE_COST = 0 and >>>>> TARGET_MEMORY_MOVE_COST = original gives: >>>>> >>>>> a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 >>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 >>>>> a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 >>>>> NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 >>>>> >>>>> How the heck do I tell ira-costs that registers are way cheaper than MEM? >>> >>> I think this is coming from: >>> >>> /* FIXME: Ideally, the following test is not needed. >>> However, it turned out that it can reduce the number >>> of spill fails. AVR and it's poor endowment with >>> address registers is extreme stress test for reload. */ >>> >>> if (GET_MODE_SIZE (mode) >= 4 >>> && regno >= REG_X) >>> return false; >> >> This was introduced to "fix" unable to find a register to spill ICE. >> >> What I do not understand is that the code with long (which is SImode on >> avr) is fine: >> >> long lunc (long); >> >> long callL (long f) >> { >> return lunc (f); >> } >> >> callL: >> rjmp lunc ; 7 [c=24 l=1] call_value_insn/3 > > This is due to differences in the way that lower-subreg.c lowers > SF moves vs. SI moves. For SI it generates pure QI moves and so > gets rid of the SI entirely. For SF it still builds the QI values > back into an SF: > > || (!SCALAR_INT_MODE_P (dest_mode) > && !resolve_reg_p (dest) > && !resolve_subreg_p (dest))) > > I imagine this is because non-int modes are held in FPRs rather than > GPRs on most targets, but TBH I'm not sure. I couldn't see a comment > that explains the above decision. > > With -fno-split-wide-types I see the same RA behaviour for both SI and SF > (i.e. both spill to memory). > >>> in avr_hard_regno_mode_ok. This forbids SFmode in r26+ and means that >>> moves between pointer registers and general registers have the highest >>> possible cost (65535) to prevent them for being used for SFmode. So: >>> >>> ira_register_move_cost[SFmode][POINTER_REGS][GENERAL_REGS] = 65535; >>> >>> The costs for union classes are the maximum (worst-case) cost of >>> for each subclass, so this means that: >>> >>> ira_register_move_cost[SFmode][GENERAL_REGS][GENERAL_REGS] = 65535; >>> >>> as well. >> >> This means that, when there is an expensive class (because it only >> contains one register for example), > > Having one register doesn't automatically make it expensive. > E.g. there's only one "c" register on x86, but it's not more expensive > than other registers because of that. > > Move costs aren't a good way of deterring unnecessary uses of small > classes. The costs should just describe the actual size or speed > overhead of moving the register. > >> then it will blow the cost of GENERAL_REGS to crazy values no matter >> what? > > Yeah. This is because (with the above intended use of costs) the > worst-case cost of a superclass X can't be less than the worst-case cost > of one of its subclasses Y. If the RA decides to allocate an X, it might > get unlucky and be forced to use a register in Y. > > If a class X - Y exists then it won't be affected by the Y costs. > So taking Y's cost into account when calculating X's cost means that > the RA will prefer X - Y over X, which is exactly what making Y > expensive should achieve. > > FWIW, that's why I suggested seeing what would happen if you added a new > class for GENERAL_REGS - POINTER_REGS. Hi, I have tried that. Didn't fix the issue. For reference, the test case compiled with -Os -fsplit-wide-types-early: float func (float); float call (float f) { return func (f); } with the attached delta. It's still the case that there are 16 superfluous instructions, dead store, setup of frame even though not needed. call: push r28 ; 17 [c=4 l=1] pushqi1/0 push r29 ; 18 [c=4 l=1] pushqi1/0 ; SP -= 4 ; 22 [c=4 l=2] *addhi3_sp rcall . rcall . in r28,__SP_L__ ; 23 [c=4 l=2] *movhi/7 in r29,__SP_H__ /* prologue: function */ /* frame size = 4 */ /* stack size = 6 */ .L__stack_usage = 6 std Y+1,r22 ; 14 [c=4 l=4] *movsf/3 std Y+2,r23 std Y+3,r24 std Y+4,r25 /* epilogue start */ ; SP += 4 ; 34 [c=4 l=4] *addhi3_sp pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop r29 ; 35 [c=4 l=1] popqi pop r28 ; 36 [c=4 l=1] popqi rjmp func ; 7 [c=24 l=1] call_value_insn/3 In the IRA dump, it's still the case that REGs are consistently more expensive than MEM: Pass 0 for finding pseudo/allocno costs a1 (r44,l0) best NO_REGS, allocno NO_REGS a0 (r43,l0) best NO_REGS, allocno NO_REGS a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000 a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 NO_POINTER_REGS:32000 MEM:9000 Pass 1 for finding pseudo/allocno costs r44: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS r43: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS a0(r43,l0) costs: ADDW_REGS:48000 SIMPLE_LD_REGS:48000 LD_REGS:48000 NO_LD_REGS:48000 NO_POINTER_REGS:48000 MEM:17000 a1(r44,l0) costs: ADDW_REGS:40008 SIMPLE_LD_REGS:40008 LD_REGS:40008 NO_LD_REGS:40008 NO_POINTER_REGS:40008 MEM:17000 Johann p.s.: Also added a reg-class for the intersection of R0..r25 and r24..r31. Don't know if that's a requirement for regclass layout though. >> What's also strange is that the register allocator would not need to >> allocate a register at all: The incoming parameter comes in SI:22 and >> is just be passed through to the callee, which also receives the value >> in SI:22. Why would one move that value to memory? Even if memory was >> cheaper, moving the value to mem just to load it again to the same >> register is not very sensible... because in almost any case, /no/ >> instruction is cheaper than /some/ instructions? > > Earlier passes could perhaps propagate the pseudo registers away in very > simple cases like this. It would be a very special-case optimisation > though. If there was anything other than "move register X to register X" > between the calls, getting rid of the pseudo registers before RA could > introduce spill failures. > > combine's to blame for the fact that we have two pseudo registers rather > than one. See the comments about the avr-elf results in: > > https://gcc.gnu.org/ml/gcc-patches/2019-11/msg02150.html > > for more details. > >>> Removing the code above fixes it. If you don't want to do that, an >>> alternative might be to add a class for r0-r25 (but I've not tested that). >> >> Is there a way that it would use a similar path like SImode? > > AFAICT the SI and SF costs are the same. The difference is coming > from -fsplit-wide-types rather than RA. > > Thanks, > Richard >