This is primarily Jivan's work; I'm mostly responsible for the write-up and coordinating with Vlad on a few questions.

On targets with limitations on the immediates usable in arithmetic instructions, LRA's register elimination phase can construct fairly poor code. This example (from the GCC testsuite) illustrates the problem well:

  int consume (void *);
  int foo (void)
  {
    int x[1000000];
    return consume (x + 1000);
  }

If you compile on riscv64-linux-gnu with "-O2 -march=rv64gc -mabi=lp64d", then you'll get this code (up to the call to consume()):

        .cfi_startproc
        li      t0,-4001792
        li      a0,-3997696
        li      a5,4001792
        addi    sp,sp,-16
        .cfi_def_cfa_offset 16
        addi    t0,t0,1792
        addi    a0,a0,1696
        addi    a5,a5,-1792
        sd      ra,8(sp)
        add     a5,a5,a0
        add     sp,sp,t0
        .cfi_def_cfa_offset 4000016
        .cfi_offset 1, -8
        add     a0,a5,sp
        call    consume

Of particular interest is the value in a0 when we call consume. We compute that horribly inefficiently. If we back-substitute from the final assignment to a0, we get:

  a0 = a5 + sp
  a0 = a5 + (sp + t0)
  a0 = (a5 + a0) + (sp + t0)
  a0 = ((a5 - 1792) + a0) + (sp + t0)
  a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + t0)
  a0 = ((a5 - 1792) + (a0 + 1696)) + (sp + (t0 + 1792))
  a0 = (a5 + (a0 + 1696)) + (sp + t0)                        // removed offsetting terms
  a0 = (a5 + (a0 + 1696)) + ((sp - 16) + t0)
  a0 = (4001792 + (a0 + 1696)) + ((sp - 16) + t0)
  a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + t0)
  a0 = (4001792 + (-3997696 + 1696)) + ((sp - 16) + -4001792)
  a0 = (-3997696 + 1696) + (sp - 16)                         // removed offsetting terms
  a0 = sp - 3996016

That's a pretty convoluted way to compute sp - 3996016.
Something like this would be notably better (not great, but we need both the stack adjustment and the address of the object to pass to consume):

        addi    sp,sp,-16
        sd      ra,8(sp)
        li      t0,-4001792
        addi    t0,t0,1792
        add     sp,sp,t0
        li      a0,4096
        addi    a0,a0,-96
        add     a0,sp,a0
        call    consume

The problem is that LRA's elimination code does not handle the case where we have (plus (reg1) (reg2)) where reg1 is an eliminable register and reg2 has a known equivalence, particularly a constant. If we can determine that reg2 is equivalent to a constant and treat (plus (reg1) (reg2)) the same way we'd treat (plus (reg1) (const_int)), then we can get the desired code.

This eliminates about 19 billion instructions, or roughly 1%, for deepsjeng on rv64. There are improvements elsewhere, but they're relatively small.

This may ultimately lessen the value of Manolis's fold-mem-offsets patch, so we'll have to evaluate that again once he posts a new version.

Bootstrapped and regression tested on x86_64, as well as bootstrapped on rv64. Earlier versions have been tested against spec2017.

Pre-approved by Vlad in a private email conversation (thanks Vlad!).

Committed to the trunk,
Jeff