[Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves
@ 2021-02-06  0:20 brian.grayson at sifive dot com
  2021-02-06  0:57 ` [Bug target/98981] " wilson at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-06  0:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

            Bug ID: 98981
           Summary: gcc-10.2 for RISC-V has extraneous register moves
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: brian.grayson at sifive dot com
  Target Milestone: ---

gcc is inserting an unnecessary register-register move for a simple max-style
operation:

int a[256], b[256];
int32_t find_max_i32() {
  int32_t xme = 0, sc=0;
  for (int32_t i = 0; i < 100; i++) {
    if ((sc=a[i]+b[i]) > xme) xme=sc;
  }
  return xme;
}

This is from the SPECint2006 benchmark HMMER, in P7Viterbi(), hence the
variable names sc and xme from the original source.

Under these flags:
-march=rv64imafdc -mcmodel=medany -mabi=lp64d -O3

I get this disassembly for the loop:

.L5:
  lw  a5,0(a4)
  lw  a2,0(a3)
  addi  a4,a4,4
  addi  a3,a3,4
  addw  a2,a5,a2
  mv  a5,a2  <--- unnecessary move
  bge a2,a0,.L4
  mv  a5,a0
.L4:
  sext.w  a0,a5
  bne a4,a1,.L5

If the addw targets a5, and the bge compares a5 to a0, the mv could be removed.
In fact, if the variable types are changed to int64_t, that's exactly what
happens:

.L13:
  ld  a5,0(a4)
  ld  a2,0(a3)
  addi  a4,a4,8
  addi  a3,a3,8
  add a5,a5,a2
  bgeu  a0,a5,.L12
  mv  a0,a5
.L12:
  bne a4,a1,.L13

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves
  2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
@ 2021-02-06  0:57 ` wilson at gcc dot gnu.org
  2021-02-06  2:29 ` kito at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: wilson at gcc dot gnu.org @ 2021-02-06  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

Jim Wilson <wilson at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilson at gcc dot gnu.org

--- Comment #1 from Jim Wilson <wilson at gcc dot gnu.org> ---
The extra move instruction is a side effect of how the riscv64 toolchain
handles 32-bit arithmetic.  We lie to the compiler and tell it that we have
instructions that produce 32-bit results.  In fact, we only have instructions
that produce 64-bit sign-extended 32-bit results.  The lie means that the RTL
has some insns with SImode output and some instructions with DImode outputs,
and sometimes we end up with nop moves to convert between the modes.  In this
case, it is peephole2 after regalloc that notices a SImode add followed by a
sign-extend, and converts it to a sign-extending 32-bit add followed by a move,
but can't eliminate the move because we already did register allocation.

This same problem is also why we get the unnecessary sext after the label, as
peephole can't fix that.

This problem has been on my todo list for a few years, and I have ideas of how
to fix it, but I have no idea when I will have time to try to fix it. I did
document it for the RISC-V International Code Speed Optimization task group.
https://github.com/riscv/riscv-code-speed-optimization/blob/main/projects/gcc-optimizations.adoc
This one is the first one in the list.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves
  2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
  2021-02-06  0:57 ` [Bug target/98981] " wilson at gcc dot gnu.org
@ 2021-02-06  2:29 ` kito at gcc dot gnu.org
  2021-02-06  5:46 ` wilson at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: kito at gcc dot gnu.org @ 2021-02-06  2:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

Kito Cheng <kito at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kito at gcc dot gnu.org

--- Comment #2 from Kito Cheng <kito at gcc dot gnu.org> ---
Here is a quick patch for fix part of this issue, it seems like because our
cost model is inprecise, but I guess I need run benchmark to make sure the
performance and code size didn't get any regression. 

find_max_i32:
        lui     a4,%hi(.LANCHOR0)
        addi    a4,a4,%lo(.LANCHOR0)
        addi    a3,a4,1024
        addi    a6,a4,400
        li      a0,0
.L3:
        lw      a5,0(a4)
        lw      a2,0(a3)
        addi    a4,a4,4
        addi    a3,a3,4
        addw    a1,a5,a2
        addw    a5,a5,a2
        bge     a1,a0,.L2
        mv      a5,a0
.L2:
        sext.w  a0,a5
        bne     a4,a6,.L3
        ret



Patch:
diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
index d489717b2a5..b8c9f7200ce 100644
--- a/gcc/config/riscv/riscv.c
+++ b/gcc/config/riscv/riscv.c
@@ -1879,6 +1879,15 @@ riscv_rtx_costs (rtx x, machine_mode mode, int
outer_code, int opno ATTRIBUTE_UN
        }
       /* Fall through.  */
     case SIGN_EXTEND:
+      if (TARGET_64BIT && !REG_P (XEXP (x, 0)))
+       {
+         int code = GET_CODE (XEXP (x, 0));
+         if (code == PLUS || code == MINUS || code == NEG || code == MULT)
+           {
+             *total = COSTS_N_INSNS (1);
+             return true;
+           }
+       }
       *total = riscv_extend_cost (XEXP (x, 0), GET_CODE (x) == ZERO_EXTEND);
       return false;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves
  2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
  2021-02-06  0:57 ` [Bug target/98981] " wilson at gcc dot gnu.org
  2021-02-06  2:29 ` kito at gcc dot gnu.org
@ 2021-02-06  5:46 ` wilson at gcc dot gnu.org
  2021-02-18  6:28 ` wilson at gcc dot gnu.org
  2021-02-19  0:42 ` wilson at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: wilson at gcc dot gnu.org @ 2021-02-06  5:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #3 from Jim Wilson <wilson at gcc dot gnu.org> ---
I suppose cost model problems could explain why combine didn't do the
optimization.  I didn't have a chance to look at that.

I still think there is a fundmental problem with how we represent SImode
operations, but again cost model problems could explain why my experiments to
fix that didn't work as expected.  I probably didn't look at that when I was
experimenting with riscv.md changes.

Your patch does look useful, but setting cost to 1 for MULT is wrong, and would
be just as wrong for DIV.  That is OK for PLUS, MINUS, and NEG though.  I think
a better option is to set *total = 0 and return false.  That gives no extra
cost to the sign extend, and recurs to get the proper cost for the operation
underneath.  That would work for MUL and DIV.  I found code in the rs6000 port
that does this.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves
  2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
                   ` (2 preceding siblings ...)
  2021-02-06  5:46 ` wilson at gcc dot gnu.org
@ 2021-02-18  6:28 ` wilson at gcc dot gnu.org
  2021-02-19  0:42 ` wilson at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: wilson at gcc dot gnu.org @ 2021-02-18  6:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #4 from Jim Wilson <wilson at gcc dot gnu.org> ---
With this testcase

extern void sub2 (void);

void
sub (int *i, int *j)
{
  int k = *i + 1;
  *j = k;
  if (k == 0)
    sub2 ();
}

Compiling without the riscv_rtx_cost patch, I get
        lw      a5,0(a0)
        addiw   a4,a5,1
        sw      a4,0(a1)
        beq     a4,zero,.L4
compiling with the riscv_rtx_cost patch, I get
        lw      a5,0(a0)
        addiw   a4,a5,1
        sw      a4,0(a1)
        addiw   a5,a5,1
        beq     a5,zero,.L4

The problem here is that we have RTL
(insn 9 7 10 2 (set (reg:SI 76)
        (plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
            (const_int 1 [0x1]))) "tmp.c":6:7 3 {addsi3}
     (expr_list:REG_DEAD (reg:DI 78 [ *i_4(D) ])
        (nil)))
(insn 10 9 11 2 (set (reg/v:DI 73 [ k ])
        (sign_extend:DI (reg:SI 76))) "tmp.c":6:7 90 {extendsidi2}
     (nil))
with the SImode 76 used for the sw and the DImode 73 used for the beq.  With
the riscv_rtx_cost change, which makes the sign_extend add cheap, combine folds
the add into the sign-extend to get
(insn 9 7 10 2 (set (reg:SI 76)
        (plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
            (const_int 1 [0x1]))) "tmp.c":6:7 3 {addsi3}
     (nil))
(insn 10 9 11 2 (set (reg/v:DI 73 [ k ])
        (sign_extend:DI (plus:SI (subreg:SI (reg:DI 78 [ *i_4(D) ]) 0)
                (const_int 1 [0x1])))) "tmp.c":6:7 5 {*addsi3_extended}
     (expr_list:REG_DEAD (reg:DI 78 [ *i_4(D) ])
        (nil)))
and now we have two adds which is wrong.  Without the patch, combine does
nothing, then ree manages to merge the sign_extend into the add and subseg the
sw, so we end up with only the one addw and no explicit sign extend.

My take on this is that we never should have generated the SImode add in the
first place, because we don't actually have that instruction.  If we would have
generated the sign-extended add at rtl generation time, and subreged the
result, then we would have gotten the right result.  In theory.

So I think the riscv_rtx_cost change is useful, but only if combined with the
rtl generation change I proposed in comment 1.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/98981] gcc-10.2 for RISC-V has extraneous register moves
  2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
                   ` (3 preceding siblings ...)
  2021-02-18  6:28 ` wilson at gcc dot gnu.org
@ 2021-02-19  0:42 ` wilson at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: wilson at gcc dot gnu.org @ 2021-02-19  0:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98981

--- Comment #5 from Jim Wilson <wilson at gcc dot gnu.org> ---
Neither of the two patches I mentioned in comment 1 can fix the problem by
themselves, as we still have a mix of SImode and DImode operations.

I looked at REE.  It doesn't work because there is more than one reaching def. 
But even if it did work, I don't think it would completely solve the problem
because it runs after register allocation and hence won't be able to remove
move instructions.

To get the best result, we need the register allocator to take two registers
with different modes with overlapping live ranges, and realize that they can be
allocated to the same hard reg because the overlapping uses are
non-conflicting.  I haven't tried looking at the register allocator, but it
doesn't seem like a good way to try to solve the problem.

We have an inconvenient mix of SImdoe and DImode because we don't have SImode
compare and branch instructions.  That requires sign extending 32-bit values to
64-bit to compare them, which then results in the sign extend and register
allocation optimization issues.  it is unlikely that 32-bit compare and branch
instructions will be added to the ISA though.

One useful thing I noticed is that the program is doing a max operation, and
the B extension adds a max instruction.  Having one instruction instead of a
series of instructions including a branch to compute max makes the optimization
issues easier, and gcc does give the right result in this case.  Using a
compiler with B support I get
        lw      a4,0(a5)
        lw      a2,0(a3)
        addi    a5,a5,4
        addi    a3,a3,4
        addw    a4,a4,a2
        max     a0,a4,a0
        bne     a5,a1,.L2
which is good code with the extra moves and sign-extends removed.  So I have a
workaround of sorts, but only if you have the B extension.

The -mtune=sifive-7-series can support conditional move via macro fusion, I was
hopeful that this would work as well as max, but unfortunately the sign-extend
that was removed in the max case does't get removed in the conditional move
case.  Also, the conditional move is 2-address, and the register allocator ends
up needing a reload, which gives us the unwanted mv again.  So the code in this
case is the same as without the option.  I didn't check to see if this is
fixable.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-02-19  0:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-06  0:20 [Bug target/98981] New: gcc-10.2 for RISC-V has extraneous register moves brian.grayson at sifive dot com
2021-02-06  0:57 ` [Bug target/98981] " wilson at gcc dot gnu.org
2021-02-06  2:29 ` kito at gcc dot gnu.org
2021-02-06  5:46 ` wilson at gcc dot gnu.org
2021-02-18  6:28 ` wilson at gcc dot gnu.org
2021-02-19  0:42 ` wilson at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).