From: "olegendo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/55212] [SH] Switch to LRA
Date: Sun, 21 Dec 2014 12:36:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55212

--- Comment #91 from Oleg Endo ---
Re: [RFC PATCH 9/9] [SH] Split QI/HImode load/store via r0

(In reply to Kazumoto Kojima from comment #83)
> Created attachment 33992 [details]
> a patch for the issue c#77
>
> Interestingly, this reduces the total text size of CSiBE test ~0.04%
> at -O2 even for the trunk i.e. with the old reload.

I've checked this change to prepare_move_operands without LRA on trunk
r218988, to see whether it should be enabled for non-LRA.  I can confirm
the -1140 bytes / -0.04% on the CSiBE set.

However, as mentioned in comment #82, it results in unnecessary zero
extensions before other logic/arithmetic insns because combine doesn't
(want to) see through the R0 hardreg.  Unnecessary sign/zero extensions
are actually a separate topic (e.g. PR 53987).  If a good sign/zero
extension elimination pass were in place, this wouldn't be an issue here.

I've tried disabling the prepare_move_operands change and instead adding
the following splitters, which run after combine and before RA:

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest")
        (sign_extend:SI (match_operand:QIHI 1 "displacement_mem_operand")))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (sign_extend:SI (match_dup 1)))
   (set (match_dup 0) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

(define_split
  [(set (match_operand:QIHI 0 "arith_reg_dest")
        (match_operand:QIHI 1 "displacement_mem_operand"))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:QIHI R0_REG) (match_dup 1))
   (set (match_dup 0) (reg:QIHI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

With these two splitters for mem loads I get exactly the same -1140 bytes
/ -0.04% on the CSiBE set.

The simple test case

int test_tst (unsigned char* x, int y, int z)
{
  return x[4] ? y : z;
}

does not contain the extu.b insn anymore, but instead we get this:

        mov.b   @(4,r4),r0
        mov     r0,r1
        tst     r1,r1      << should be: tst r0,r0
        bf      .L4
        mov     r6,r5
.L4:
        rts
        mov     r5,r0

Other cases of new unnecessary zero-extension insns show up e.g. in
jpeg-6b/jdcoefct.s.
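As an aside, to illustrate the mechanics: before RA, the first splitter
turns the single sign-extending load in test_tst into four insns, which
in pseudo-asm terms would be roughly the following (register numbers are
hypothetical; when R0 is not live around the load, the RA is expected to
drop the save/restore copies, which yields the code shown above):

        mov     r0,r2           ! save current R0 contents in a fresh pseudo
        mov.b   @(4,r4),r0      ! QImode displacement load, must go through R0
        mov     r0,r1           ! copy the loaded value out of R0
        mov     r2,r0           ! restore R0 from the pseudo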
In linux-2.4.23-pre3-testplatform/arch/testplatform/kernel/signal.s some
mov reg,reg insns end up as:

        extu.b  r0,r0
        mov.b   r0,@(1,r8)
        mov     r9,r0
        shlr16  r0
        extu.b  r0,r0
        mov.b   r0,@(2,r8)
        mov     r9,r0
        shlr16  r0
        shlr8   r0
        mov.b   r0,@(3,r8)
        extu.b  r1,r0
        mov.b   r0,@(4,r8)
        mov     r1,r0
        ....

Those can be wallpapered with peepholes, though (a rough sketch follows
at the end of this comment).

I've also tried using the splitters instead of the prepare_move_operands
hunk with LRA.  But then I get spill errors on QI/HImode stores with
displacement addressing.

I've also tried removing the prepare_move_operands hunk with LRA.
Compared with trunk no-lra I get:

  sum:  3368431 -> 3378515  +10084 / +0.299368 %

And LRA + the prepare_move_operands hunk vs. trunk no-lra is:

  sum:  3368431 -> 3376507  +8076 / +0.239756 %

Doing this kind of pre-RA R0 pre-allocation seems to result in better RA
choices w.r.t. commutative insns such as addition.  After all, maybe it's
worth trying out an SH-specific R0 pre-allocation pass that offloads some
of the RA decisions.  Of course it will not be able to solve issues that
arise when spill code using QI/HImode mem accesses is generated during
the RA/spilling process.

R0 is the most difficult candidate, but I've also seen reports about FPU
code ICEing due to FR0 spill failures when there are lots of
(interdependent?) FMAC insns at -O3 (e.g. FP matrix multiplication).
Another register (class) of which there is only one on SH would be the
MACH/MACL pair.  Currently MACH/MACL are fixed hardregs.  Early
experiments with allowing MACH/MACL to be used by the RA and adding the
MAC insn showed some problems (see PR 53949 #c3).
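For the record, the kind of wallpaper peephole I have in mind, shown here
for the redundant copy + tst sequence from test_tst above, would be
something along these lines.  This is only an untested sketch and assumes
a comparison-with-zero pattern of this shape in sh.md:

;; mov rX,rY / tst rY,rY -> tst rX,rX when rY dies in the comparison.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
        (match_operand:SI 1 "arith_reg_operand"))
   (set (reg:SI T_REG)
        (eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && peep2_reg_dead_p (2, operands[0])"
  [(set (reg:SI T_REG)
        (eq:SI (match_dup 1) (const_int 0)))])

The extu.b cases in signal.s would need analogous patterns keyed on the
QImode displacement stores.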