```public inbox for gcc-patches@gcc.gnu.org```
```* [committed] Improve quality of code from LRA register elimination
@ 2023-08-23 20:13 Jeff Law
2023-08-23 21:14 ` Jeff Law
From: Jeff Law @ 2023-08-23 20:13 UTC (permalink / raw)
To: gcc-patches; +Cc: Jivan Hakobyan, Vladimir Makarov

[-- Attachment #1: Type: text/plain, Size: 2992 bytes --]

This is primarily Jivan's work; I'm mostly responsible for the write-up
and coordinating with Vlad on a few questions.

On targets with limitations on immediates usable in arithmetic
instructions, LRA's register elimination phase can construct fairly poor
code.

This example (from the GCC testsuite) illustrates the problem well.

int consume (void *);
int foo (void) {
  int x[1000000];
  return consume (x + 1000);
}

If you compile on riscv64-linux-gnu with "-O2 -march=rv64gc
-mabi=lp64d", then you'll get this code (up to the call to consume()).

        .cfi_startproc
        li      t0,-4001792
        li      a0,-3997696
        li      a5,4001792
        .cfi_def_cfa_offset 16
        sd      ra,8(sp)
        .cfi_def_cfa_offset 4000016
        .cfi_offset 1, -8
        call    consume

Of particular interest is the value in a0 when we call consume. We
compute that horribly inefficiently.   If we back-substitute from the
final assignment to a0 we get...

a0 = a5 + sp
a0 = a5 + (sp + t0)
a0 = (a5 + a0) + (sp + t0)
a0 = ((a5 - 1792) + a0) + (sp + t0)
a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + t0)
a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + (t0 + 1792))
a0 = (a5 + (a0 + 1696)) + (sp + t0)  // removed offsetting terms
a0 = (a5 + (a0 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (a0 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + -4001792)
a0 = (-3997696 + 1696) + (sp - 16)  // removed offsetting terms
a0 = sp - 3996016

That's a pretty convoluted way to compute sp - 3996016.

Something like this would be notably better (not great, but we need both
the stack adjustment and the address of the object to pass to consume):

    sd ra,8(sp)
    li t0,-4001792
    li a0,4096
    call consume

The problem is that LRA's elimination code does not handle the case where
we have (plus (reg1) (reg2)), where reg1 is an eliminable register and reg2
has a known equivalence, particularly a constant.

If we can determine that reg2 is equivalent to a constant and treat
(plus (reg1) (reg2)) in the same way we'd treat (plus (reg1)
(const_int)) then we can get the desired code.

This eliminates about 19 billion instructions, or roughly 1%, for deepsjeng
on rv64.  There are improvements elsewhere, but they're relatively small.
This may ultimately lessen the value of Manolis's fold-mem-offsets
patch, so we'll have to evaluate that again once he posts a new version.

Bootstrapped and regression tested on x86_64 as well as bootstrapped on
rv64.  Earlier versions have been tested against spec2017.  Pre-approved
by Vlad.

Committed to the trunk,

Jeff

[-- Attachment #2: P --]
[-- Type: text/plain, Size: 1114 bytes --]

commit 47f95bc4be4eb14730ab3eaaaf8f6e71fda47690
Author: Raphael Moreira Zinsly <rzinsly@ventanamicro.com>
Date:   Tue Aug 22 11:37:04 2023 -0600

RISC-V: Add multiarch support on riscv-linux-gnu

This adds multiarch support to the RISC-V port so that bootstraps work with
Debian out-of-the-box.  Without this patch the stage1 compiler is unable to
find headers/libraries when building the stage1 runtime.

This is functionally (and possibly textually) equivalent to Debian's fix for
the same problem.

gcc/

diff --git a/gcc/config/riscv/t-linux b/gcc/config/riscv/t-linux
index 216d2776a18..a6f64f88d25 100644
--- a/gcc/config/riscv/t-linux
+++ b/gcc/config/riscv/t-linux
@@ -1,3 +1,5 @@
 # Only XLEN and ABI affect Linux multilib dir names, e.g. /lib32/ilp32d/
 MULTILIB_DIRNAMES := $(patsubst rv32%,lib32,$(patsubst rv64%,lib64,$(MULTILIB_DIRNAMES)))
 MULTILIB_OSDIRNAMES := $(patsubst lib%,../lib%,$(MULTILIB_DIRNAMES))
+
+MULTIARCH_DIRNAME := $(call if_multiarch,$(firstword $(subst -, ,$(target)))-linux-gnu)

```* Re: [committed] Improve quality of code from LRA register elimination
2023-08-23 20:13 [committed] Improve quality of code from LRA register elimination Jeff Law
@ 2023-08-23 21:14 ` Jeff Law
0 siblings, 0 replies; 2+ messages in thread
From: Jeff Law @ 2023-08-23 21:14 UTC (permalink / raw)
To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3463 bytes --]

On 8/23/23 14:13, Jeff Law wrote:
> Committed to the trunk,
Whoops.  Attached the wrong patch :-)  This is the right one.

jeff

[-- Attachment #2: P --]
[-- Type: text/plain, Size: 4623 bytes --]

commit 6619b3d4c15cd754798b1048c67f3806bbcc2e6d
Author: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Date:   Wed Aug 23 14:10:30 2023 -0600

Improve quality of code from LRA register elimination

This is primarily Jivan's work, I'm mostly responsible for the write-up and
coordinating with Vlad on a few questions.

On targets with limitations on immediates usable in arithmetic instructions,
LRA's register elimination phase can construct fairly poor code.

This example (from the GCC testsuite) illustrates the problem well.

int consume (void *);
int foo (void) {
  int x[1000000];
  return consume (x + 1000);
}

If you compile on riscv64-linux-gnu with "-O2 -march=rv64gc -mabi=lp64d", then
you'll get this code (up to the call to consume()).

        .cfi_startproc
        li      t0,-4001792
        li      a0,-3997696
        li      a5,4001792
        .cfi_def_cfa_offset 16
        sd      ra,8(sp)
        .cfi_def_cfa_offset 4000016
        .cfi_offset 1, -8
        call    consume

Of particular interest is the value in a0 when we call consume. We compute that
horribly inefficiently.   If we back-substitute from the final assignment to a0
we get...

a0 = a5 + sp
a0 = a5 + (sp + t0)
a0 = (a5 + a0) + (sp + t0)
a0 = ((a5 - 1792) + a0) + (sp + t0)
a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + t0)
a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + (t0 + 1792))
a0 = (a5 + (a0 + 1696)) + (sp + t0)  // removed offsetting terms
a0 = (a5 + (a0 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (a0 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + t0)
a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + -4001792)
a0 = (-3997696 + 1696) + (sp - 16)  // removed offsetting terms
a0 = sp - 3996016

That's a pretty convoluted way to compute sp - 3996016.
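
As a sanity check on the back-substitution, the steps can be replayed in a few lines of C.  This is only a sketch: the incoming sp is taken as 0 so every value is an offset from it, the register updates are read off the substitution steps, and the names replay_prologue and regs_after_prologue are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: register values after the prologue,
   expressed as offsets from the incoming stack pointer.  */
typedef struct {
    int64_t sp;
    int64_t a0;
} regs_after_prologue;

/* Replay the arithmetic in execution order (the reverse of the
   back-substitution), treating the incoming sp as 0.  */
regs_after_prologue replay_prologue(void)
{
    regs_after_prologue r;
    int64_t sp = 0;
    int64_t t0 = -4001792;   /* li t0,-4001792 */
    int64_t a0 = -3997696;   /* li a0,-3997696 */
    int64_t a5 = 4001792;    /* li a5,4001792  */

    sp -= 16;                /* sp = sp - 16 */
    assert(sp == -16);       /* agrees with .cfi_def_cfa_offset 16 */
    t0 += 1792;              /* t0 = t0 + 1792 */
    a0 += 1696;              /* a0 = a0 + 1696 */
    a5 -= 1792;              /* a5 = a5 - 1792 */
    a5 += a0;                /* a5 = a5 + a0 */
    sp += t0;                /* sp = sp + t0 */
    assert(sp == -4000016);  /* agrees with .cfi_def_cfa_offset 4000016 */
    a0 = a5 + sp;            /* a0 = a5 + sp */

    r.sp = sp;
    r.a0 = a0;
    return r;
}
```

Replaying the steps this way gives a0 = sp - 3996016 relative to the incoming sp, which is 4000 bytes above the adjusted sp.  That is consistent with x + 1000 when x starts at the adjusted sp and int is 4 bytes.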

Something like this would be notably better (not great, but we need both the
stack adjustment and the address of the object to pass to consume):

    sd ra,8(sp)
    li t0,-4001792
    li a0,4096
    call consume

The problem is that LRA's elimination code does not handle the case where we
have (plus (reg1) (reg2)), where reg1 is an eliminable register and reg2 has a
known equivalence, particularly a constant.

If we can determine that reg2 is equivalent to a constant and treat (plus
(reg1) (reg2)) in the same way we'd treat (plus (reg1) (const_int)) then we can
get the desired code.
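
The constants in these sequences follow from the 12-bit signed immediate of RISC-V's addi: a larger value has to be split into a 4 KiB-aligned part and a small remainder.  A minimal sketch of that split is below; split_imm12 and imm_split are hypothetical names, and this is not GCC's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a value split as hi + lo.  */
typedef struct {
    int64_t hi;   /* multiple of 4096, loadable with li/lui */
    int64_t lo;   /* fits the signed 12-bit addi immediate */
} imm_split;

/* Split V so that hi + lo == V, hi is 4 KiB-aligned and
   lo lies in [-2048, 2047].  Assumes two's complement integers.  */
imm_split split_imm12(int64_t v)
{
    imm_split s;
    s.hi = (v + 0x800) & ~(int64_t)0xfff;
    s.lo = v - s.hi;
    assert(s.lo >= -2048 && s.lo <= 2047);
    return s;
}
```

For example, 4000 splits into 4096 and -96, which is where li a0,4096 comes from, and -4000000 splits into -4001792 and 1792.  Once the elimination offset and the equivalent constant are folded at compile time, only the folded sum needs to be split this way.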

This eliminates about 19 billion instructions, or roughly 1%, for deepsjeng on
rv64.  There are improvements elsewhere, but they're relatively small.  This may
ultimately lessen the value of Manolis's fold-mem-offsets patch, so we'll have
to evaluate that again once he posts a new version.

Bootstrapped and regression tested on x86_64 as well as bootstrapped on rv64.
Earlier versions have been tested against spec2017.  Pre-approved by Vlad.

Committed to the trunk,

gcc/
	* lra-eliminations.cc (eliminate_regs_in_insn): Use equivalences to
	help simplify code further.

diff --git a/gcc/lra-eliminations.cc b/gcc/lra-eliminations.cc
index 3c58d4a3815..df613cdda76 100644
--- a/gcc/lra-eliminations.cc
+++ b/gcc/lra-eliminations.cc
@@ -926,6 +926,18 @@ eliminate_regs_in_insn (rtx_insn *insn, bool replace_p, bool first_p,
       /* First see if the source is of the form (plus (...) CST).  */
       if (plus_src && poly_int_rtx_p (XEXP (plus_src, 1), &offset))
 	plus_cst_src = plus_src;
+      /* If we are doing initial offset computation, then utilize
+	 equivalences to discover a constant for the second term
+	 of PLUS_SRC.  */
+      else if (plus_src && REG_P (XEXP (plus_src, 1)))
+	{
+	  int regno = REGNO (XEXP (plus_src, 1));
+	  if (regno < ira_reg_equiv_len
+	      && ira_reg_equiv[regno].constant != NULL_RTX
+	      && !replace_p
+	      && poly_int_rtx_p (ira_reg_equiv[regno].constant, &offset))
+	    plus_cst_src = plus_src;
+	}
       /* Check that the first operand of the PLUS is a hard reg or
 	 the lowpart subreg of one.  */
       if (plus_cst_src)
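
Stripped of GCC's RTL types, the new check amounts to a guarded table lookup.  A toy model follows; toy_equiv and second_operand_offset are hypothetical stand-ins rather than GCC APIs, and a plain 64-bit integer replaces the poly_int offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of an equivalence table:
   CONSTANT is NULL when the register has no known constant.  */
typedef struct {
    const int64_t *constant;
} toy_equiv;

/* Return true and store the constant in *OFFSET when register REGNO
   has a known constant equivalence and we are not in the replacement
   pass.  Mirrors the shape of the new guard, not its exact semantics.  */
bool second_operand_offset(const toy_equiv *equiv, size_t equiv_len,
                           size_t regno, bool replace_p, int64_t *offset)
{
    assert(offset != NULL);
    if (regno < equiv_len
        && equiv[regno].constant != NULL
        && !replace_p) {
        *offset = *equiv[regno].constant;
        return true;
    }
    return false;
}
```

The !replace_p test reflects the comment in the patch: the lookup only applies while the initial offset computation is being done.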
