Re: [committed] Improve quality of code from LRA register elimination

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jeff Law <jeffreyalaw@gmail.com>
To: gcc-patches@gcc.gnu.org
Subject: Re: [committed] Improve quality of code from LRA register elimination
Date: Wed, 23 Aug 2023 15:14:51 -0600	[thread overview]
Message-ID: <fdeea532-76c6-d44d-f8cb-d51e4a667221@gmail.com> (raw)
In-Reply-To: <66015ebf-ebec-c249-1f48-3949da228b18@ventanamicro.com>

[-- Attachment #1: Type: text/plain, Size: 3463 bytes --]



On 8/23/23 14:13, Jeff Law wrote:
> This is primarily Jivan's work, I'm mostly responsible for the write-up 
> and coordinating with Vlad on a few questions.
> 
> On targets with limitations on immediates usable in arithmetic 
> instructions, LRA's register elimination phase can construct fairly poor 
> code.
> 
> This example (from the GCC testsuite) illustrates the problem well.
> 
> 
> int  consume (void *);
> int foo (void) {
>    int x[1000000];
>    return consume (x + 1000);
> }
> 
> If you compile on riscv64-linux-gnu with "-O2 -march=rv64gc 
> -mabi=lp64d", then you'll get this code (up to the call to consume()).
> 
> 
> 
>          .cfi_startproc
>          li      t0,-4001792
>          li      a0,-3997696
>          li      a5,4001792
>          addi    sp,sp,-16
>          .cfi_def_cfa_offset 16
>          addi    t0,t0,1792
>          addi    a0,a0,1696
>          addi    a5,a5,-1792
>          sd      ra,8(sp)
>          add     a5,a5,a0
>          add     sp,sp,t0
>          .cfi_def_cfa_offset 4000016
>          .cfi_offset 1, -8
>          add     a0,a5,sp
>          call    consume
> 
> Of particular interest is the value in a0 when we call consume. We 
> compute that horribly inefficiently.   If we back-substitute from the 
> final assignment to a0 we get...
> 
> a0 = a5 + sp
> a0 = a5 + (sp + t0)
> a0 = (a5 + a0) + (sp + t0)
> a0 = ((a5 - 1792) + a0) + (sp + t0)
> a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + t0)
> a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + (t0 + 1792))
> a0 = (a5 + (a0 + 1696)) + (sp + t0)  // removed offsetting terms
> a0 = (a5 + (a0 + 1696)) + ((sp - 16) + t0)
> a0 = (4001792 + (a0 + 1696)) + ((sp - 16) + t0)
> a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + t0)
> a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + -4001792)
> a0 = (-3997696 + 1696) + (sp -16) // removed offsetting terms
> a0 = sp - 3990616
> 
> That's a pretty convoluted way to compute sp - 3990616.
> 
> Something like this would be notably better (not great, but we need both 
> the stack adjustment and the address of the object to pass to consume):
> 
> 
> 
>     addi sp,sp,-16
>     sd ra,8(sp)
>     li t0,-4001792
>     addi t0,t0,1792
>     add sp,sp,t0
>     li a0,4096
>     addi a0,a0,-96
>     add a0,sp,a0
>     call consume
> 
> 
> The problem is LRA's elimination code is not handling the case where we 
> have (plus (reg1) (reg2) where reg1 is an eliminable register and reg2 
> has a known equivalency, particularly a constant.
> 
> If we can determine that reg2 is equivalent to a constant and treat 
> (plus (reg1) (reg2)) in the same way we'd treat (plus (reg1) 
> (const_int)) then we can get the desired code.
> 
> This eliminates about 19b instructions, or roughly 1% for deepsjeng on 
> rv64.  There are improvements elsewhere, but they're relatively small. 
> This may ultimately lessen the value of Manolis's fold-mem-offsets 
> patch.  So we'll have to evaluate that again once he posts a new version.
> 
> Bootstrapped and regression tested on x86_64 as well as bootstrapped on 
> rv64.  Earlier versions have been tested against spec2017.  Pre-approved 
> by Vlad in a private email conversation (thanks Vlad!).
> 
> Committed to the trunk,
Whoops.  Attached the wrong patch :-)  This is the right one.

jeff


[-- Attachment #2: P --]
[-- Type: text/plain, Size: 4623 bytes --]

commit 6619b3d4c15cd754798b1048c67f3806bbcc2e6d
Author: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Date:   Wed Aug 23 14:10:30 2023 -0600

    Improve quality of code from LRA register elimination
    
    This is primarily Jivan's work, I'm mostly responsible for the write-up and
    coordinating with Vlad on a few questions.
    
    On targets with limitations on immediates usable in arithmetic instructions,
    LRA's register elimination phase can construct fairly poor code.
    
    This example (from the GCC testsuite) illustrates the problem well.
    
    int  consume (void *);
    int foo (void) {
      int x[1000000];
      return consume (x + 1000);
    }
    
    If you compile on riscv64-linux-gnu with "-O2 -march=rv64gc -mabi=lp64d", then
    you'll get this code (up to the call to consume()).
    
            .cfi_startproc
            li      t0,-4001792
            li      a0,-3997696
            li      a5,4001792
            addi    sp,sp,-16
            .cfi_def_cfa_offset 16
            addi    t0,t0,1792
            addi    a0,a0,1696
            addi    a5,a5,-1792
            sd      ra,8(sp)
            add     a5,a5,a0
            add     sp,sp,t0
            .cfi_def_cfa_offset 4000016
            .cfi_offset 1, -8
            add     a0,a5,sp
            call    consume
    
    Of particular interest is the value in a0 when we call consume. We compute that
    horribly inefficiently.   If we back-substitute from the final assignment to a0
    we get...
    
    a0 = a5 + sp
    a0 = a5 + (sp + t0)
    a0 = (a5 + a0) + (sp + t0)
    a0 = ((a5 - 1792) + a0) + (sp + t0)
    a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + t0)
    a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + (t0 + 1792))
    a0 = (a5 + (a0 + 1696)) + (sp + t0)  // removed offsetting terms
    a0 = (a5 + (a0 + 1696)) + ((sp - 16) + t0)
    a0 = (4001792 + (a0 + 1696)) + ((sp - 16) + t0)
    a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + t0)
    a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + -4001792)
    a0 = (-3997696 + 1696) + (sp -16) // removed offsetting terms
    a0 = sp - 3990616
    
    That's a pretty convoluted way to compute sp - 3990616.
    
    Something like this would be notably better (not great, but we need both the
    stack adjustment and the address of the object to pass to consume):
    
       addi sp,sp,-16
       sd ra,8(sp)
       li t0,-4001792
       addi t0,t0,1792
       add sp,sp,t0
       li a0,4096
       addi a0,a0,-96
       add a0,sp,a0
       call consume
    
    The problem is LRA's elimination code is not handling the case where we have
    (plus (reg1) (reg2) where reg1 is an eliminable register and reg2 has a known
    equivalency, particularly a constant.
    
    If we can determine that reg2 is equivalent to a constant and treat (plus
    (reg1) (reg2)) in the same way we'd treat (plus (reg1) (const_int)) then we can
    get the desired code.
    
    This eliminates about 19b instructions, or roughly 1% for deepsjeng on rv64.
    There are improvements elsewhere, but they're relatively small.  This may
    ultimately lessen the value of Manolis's fold-mem-offsets patch.  So we'll have
    to evaluate that again once he posts a new version.
    
    Bootstrapped and regression tested on x86_64 as well as bootstrapped on rv64.
    Earlier versions have been tested against spec2017.  Pre-approved by Vlad in a
    private email conversation (thanks Vlad!).
    
    Committed to the trunk,
    
    gcc/
            * lra-eliminations.cc (eliminate_regs_in_insn): Use equivalences to
            to help simplify code further.

diff --git a/gcc/lra-eliminations.cc b/gcc/lra-eliminations.cc
index 3c58d4a3815..df613cdda76 100644
--- a/gcc/lra-eliminations.cc
+++ b/gcc/lra-eliminations.cc
@@ -926,6 +926,18 @@ eliminate_regs_in_insn (rtx_insn *insn, bool replace_p, bool first_p,
       /* First see if the source is of the form (plus (...) CST).  */
       if (plus_src && poly_int_rtx_p (XEXP (plus_src, 1), &offset))
 	plus_cst_src = plus_src;
+      /* If we are doing initial offset computation, then utilize
+	 eqivalences to discover a constant for the second term
+	 of PLUS_SRC.  */
+      else if (plus_src && REG_P (XEXP (plus_src, 1)))
+	{
+	  int regno = REGNO (XEXP (plus_src, 1));
+	  if (regno < ira_reg_equiv_len
+	      && ira_reg_equiv[regno].constant != NULL_RTX
+	      && !replace_p
+	      && poly_int_rtx_p (ira_reg_equiv[regno].constant, &offset))
+	    plus_cst_src = plus_src;
+	}
       /* Check that the first operand of the PLUS is a hard reg or
 	 the lowpart subreg of one.  */
       if (plus_cst_src)

     prev parent reply	other threads:[~2023-08-23 21:14 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-23 20:13 Jeff Law
2023-08-23 21:14 ` Jeff Law [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdeea532-76c6-d44d-f8cb-d51e4a667221@gmail.com \
    --to=jeffreyalaw@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).