From: "olegendo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/55212] [SH] Switch to LRA
Date: Sun, 21 Dec 2014 12:36:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55212

--- Comment #91 from Oleg Endo ---
Re: [RFC PATCH 9/9] [SH] Split QI/HImode load/store via r0

(In reply to Kazumoto Kojima from comment #83)
> Created attachment 33992 [details]
> a patch for the issue c#77
>
> Interestingly, this reduces the total text size of CSiBE test ~0.04%
> at -O2 even for the trunk i.e. with the old reload.

I've checked this change to prepare_move_operands without LRA on trunk
r218988, to see whether it should be enabled for non-LRA.  I can confirm
the -1140 bytes / -0.04% on the CSiBE set.

However, as mentioned in comment #82, it results in unnecessary zero
extensions before other logic/arithmetic insns because combine doesn't
(want to) see through the R0 hardreg.  Unnecessary sign/zero extensions
are actually a separate topic (e.g. PR 53987).  If a good sign/zero
extension elimination pass were in place, this wouldn't be an issue here.

I've tried disabling the prepare_move_operands change and instead adding
the following splitters, which run after combine and before RA:

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest")
        (sign_extend:SI (match_operand:QIHI 1 "displacement_mem_operand")))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (sign_extend:SI (match_dup 1)))
   (set (match_dup 0) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

(define_split
  [(set (match_operand:QIHI 0 "arith_reg_dest")
        (match_operand:QIHI 1 "displacement_mem_operand"))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:QIHI R0_REG) (match_dup 1))
   (set (match_dup 0) (reg:QIHI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

With these two splitters for mem loads I get exactly the same -1140 bytes
/ -0.04% on the CSiBE set.

The simple test case

int test_tst (unsigned char* x, int y, int z)
{
  return x[4] ? y : z;
}

does not contain the extu.b insn anymore, but instead we get this:

        mov.b   @(4,r4),r0
        mov     r0,r1
        tst     r1,r1      << should be: tst r0,r0
        bf      .L4
        mov     r6,r5
.L4:
        rts
        mov     r5,r0

Other cases of new unnecessary zero-extension insns show up e.g. in
jpeg-6b/jdcoefct.s.
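As an aside, to illustrate the mechanics: before RA, the first splitter
turns the single sign-extending load in test_tst into four insns, which
in pseudo-asm terms would be roughly the following (register numbers are
hypothetical; when R0 is not live around the load, the RA is expected to
drop the save/restore copies, which yields the code shown above):

        mov     r0,r2           ! save current R0 contents in a fresh pseudo
        mov.b   @(4,r4),r0      ! QImode displacement load, must go through R0
        mov     r0,r1           ! copy the loaded value out of R0
        mov     r2,r0           ! restore R0 from the pseudo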
In linux-2.4.23-pre3-testplatform/arch/testplatform/kernel/signal.s some
mov reg,reg insns end up as:

        extu.b  r0,r0
        mov.b   r0,@(1,r8)
        mov     r9,r0
        shlr16  r0
        extu.b  r0,r0
        mov.b   r0,@(2,r8)
        mov     r9,r0
        shlr16  r0
        shlr8   r0
        mov.b   r0,@(3,r8)
        extu.b  r1,r0
        mov.b   r0,@(4,r8)
        mov     r1,r0
        ....

Those can be wallpapered with peepholes, though (a rough sketch follows
at the end of this comment).

I've also tried using the splitters instead of the prepare_move_operands
hunk with LRA.  But then I get spill errors on QI/HImode stores with
displacement addressing.

I've also tried removing the prepare_move_operands hunk with LRA.
Compared with trunk no-lra I get:

  sum:  3368431 -> 3378515  +10084 / +0.299368 %

And LRA + the prepare_move_operands hunk vs. trunk no-lra is:

  sum:  3368431 -> 3376507  +8076 / +0.239756 %

Doing this kind of pre-RA R0 pre-allocation seems to result in better RA
choices w.r.t. commutative insns such as addition.  After all, maybe it's
worth trying out an SH-specific R0 pre-allocation pass that offloads some
of the RA decisions.  Of course it will not be able to solve issues that
arise when spill code using QI/HImode mem accesses is generated during
the RA/spilling process.

R0 is the most difficult candidate, but I've also seen reports about FPU
code ICEing due to FR0 spill failures when there are lots of
(interdependent?) FMAC insns at -O3 (e.g. FP matrix multiplication).
Another register (class) of which there is only one on SH would be the
MACH/MACL pair.  Currently MACH/MACL are fixed hardregs.  Early
experiments with allowing MACH/MACL to be used by the RA and adding the
MAC insn showed some problems (see PR 53949 #c3).
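For the record, the kind of wallpaper peephole I have in mind, shown here
for the redundant copy + tst sequence from test_tst above, would be
something along these lines.  This is only an untested sketch and assumes
a comparison-with-zero pattern of this shape in sh.md:

;; mov rX,rY / tst rY,rY -> tst rX,rX when rY dies in the comparison.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
        (match_operand:SI 1 "arith_reg_operand"))
   (set (reg:SI T_REG)
        (eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && peep2_reg_dead_p (2, operands[0])"
  [(set (reg:SI T_REG)
        (eq:SI (match_dup 1) (const_int 0)))])

The extu.b cases in signal.s would need analogous patterns keyed on the
QImode displacement stores.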