* LRA vs reload on powerpc: 2 extra FAILs that are actually improvements?
@ 2013-11-02 22:49 Steven Bosscher
2013-11-04 14:16 ` David Edelsohn
0 siblings, 1 reply; 3+ messages in thread
From: Steven Bosscher @ 2013-11-02 22:49 UTC (permalink / raw)
To: GCC Patches; +Cc: Vladimir Makarov, David Edelsohn
Hello,
Today's powerpc64-linux gcc has 2 extra failures with -mlra vs. reload
(i.e. svn unpatched).
(I'm excluding guality failure differences here because there are too
many of them that seem to fail at random after minimal changes
anywhere in the compiler...).
Test results are posted here:
reload: http://gcc.gnu.org/ml/gcc-testresults/2013-11/msg00128.html
lra: http://gcc.gnu.org/ml/gcc-testresults/2013-11/msg00129.html
The new failures and total score is as follows (+=lra, -=reload):
+FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times stwbrx 6
+FAIL: gcc.target/powerpc/pr58330.c scan-assembler-not stwbrx
=== gcc Summary ===
-# of expected passes 97887
-# of unexpected failures 536
+# of expected passes 97903
+# of unexpected failures 538
# of unexpected successes 38
# of expected failures 244
-# of unsupported tests 1910
+# of unsupported tests 1892
The failure of pr53199.c is because of different instruction selection
for bswap. Test case is reduced to just one function:
/* { dg-options "-O2 -mcpu=power6 -mavoid-indexed-addresses" } */
long long
reg_reverse (long long x)
{
return __builtin_bswap64 (x);
}
Reload left vs. LRA right:
reg_reverse: reg_reverse:
srdi 8,3,32 | addi 8,1,-16
rlwinm 7,3,8,0xffffffff | srdi 10,3,32
rlwinm 9,8,8,0xffffffff | addi 9,8,4
rlwimi 7,3,24,0,7 | stwbrx 3,0,8
rlwimi 7,3,24,16,23 | stwbrx 10,0,9
rlwimi 9,8,24,0,7 | ld 3,-16(1)
rlwimi 9,8,24,16,23 <
sldi 7,7,32 <
or 7,7,9 <
mr 3,7 <
blr blr
This same difference is responsible for the failure of pr58330.c which
also uses __builtin_bswap64().
The difference in RTL for the test case is this (after reload vs. after LRA):
- 11: {%7:DI=bswap(%3:DI);clobber %8:DI;clobber %9:DI;clobber %10:DI;}
- 20: %3:DI=%7:DI
+ 20: %8:DI=%1:DI-0x10
+ 21: %8:DI=%8:DI // stupid no-op move
+ 11: {[%8:DI]=bswap(%3:DI);clobber %9:DI;clobber %10:DI;clobber scratch;}
+ 19: %3:DI=[%1:DI-0x10]
So LRA believes going through memory is better than using a register,
even though obviously there are plenty registers available.
What LRA does:
Creating newreg=129
Removing SCRATCH in insn #11 (nop 2)
Creating newreg=130
Removing SCRATCH in insn #11 (nop 3)
Creating newreg=131
Removing SCRATCH in insn #11 (nop 4)
// at this point the insn would be a bswapdi2_64bit:
// 11: {%3:DI=bswap(%3:DI);clobber r129;clobber r130;clobber r131;}
// cost calculation for the insn alternatives:
0 Early clobber: reject++
1 Non-pseudo reload: reject+=2
1 Spill pseudo in memory: reject+=3
2 Scratch win: reject+=2
3 Scratch win: reject+=2
4 Scratch win: reject+=2
alt=0,overall=18,losers=1,rld_nregs=0
0 Non-pseudo reload: reject+=2
0 Spill pseudo in memory: reject+=3
0 Non input pseudo reload: reject++
2 Scratch win: reject+=2
3 Scratch win: reject+=2
alt=1,overall=16,losers=1,rld_nregs=0
Staticly defined alt reject+=12
0 Early clobber: reject++
2 Scratch win: reject+=2
3 Scratch win: reject+=2
4 Scratch win: reject+=2
0 Conflict early clobber reload: reject--
alt=2,overall=24,losers=1,rld_nregs=0
Choosing alt 1 in insn 11: (0) Z (1) r (2) &b (3) &r (4)
X {*bswapdi2_64bit}
Change to class BASE_REGS for r129
Change to class GENERAL_REGS for r130
Creating newreg=132 from oldreg=3, assigning class NO_REGS to r132
Change to class NO_REGS for r131
11: {r132:DI=bswap(%3:DI);clobber r129:DI;clobber r130:DI;clobber r131:DI;}
REG_UNUSED r131:DI
REG_UNUSED r130:DI
REG_UNUSED r129:DI
LRA selects alternative 1 (Z,r,&b,&r,X) which seems to be the right
choice, from looking at the constraints. Reload selects alternative 2
which is slightly^2 discouraged: (??&r,r,&r,&r,&r).
Is this an improvement or a regression? If it's an improvement then
these two test cases should be adjusted :-)
Ciao!
Steven
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: LRA vs reload on powerpc: 2 extra FAILs that are actually improvements?
2013-11-02 22:49 LRA vs reload on powerpc: 2 extra FAILs that are actually improvements? Steven Bosscher
@ 2013-11-04 14:16 ` David Edelsohn
2013-12-01 5:55 ` Alan Modra
0 siblings, 1 reply; 3+ messages in thread
From: David Edelsohn @ 2013-11-04 14:16 UTC (permalink / raw)
To: Steven Bosscher, Michael Meissner; +Cc: GCC Patches, Vladimir Makarov
Hi, Steven
Thanks for investigating this. This presumably was the reason that
Vlad changed the constraint modifier for that pattern in his patch for
LRA. I don't think that using memory is an improvement, but Mike is
the best person to comment.
Thanks, David
On Sat, Nov 2, 2013 at 6:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> Hello,
>
> Today's powerpc64-linux gcc has 2 extra failures with -mlra vs. reload
> (i.e. svn unpatched).
>
> (I'm excluding guality failure differences here because there are too
> many of them that seem to fail at random after minimal changes
> anywhere in the compiler...).
>
> Test results are posted here:
> reload: http://gcc.gnu.org/ml/gcc-testresults/2013-11/msg00128.html
> lra: http://gcc.gnu.org/ml/gcc-testresults/2013-11/msg00129.html
>
> The new failures and total score is as follows (+=lra, -=reload):
> +FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times stwbrx 6
> +FAIL: gcc.target/powerpc/pr58330.c scan-assembler-not stwbrx
>
> === gcc Summary ===
>
> -# of expected passes 97887
> -# of unexpected failures 536
> +# of expected passes 97903
> +# of unexpected failures 538
> # of unexpected successes 38
> # of expected failures 244
> -# of unsupported tests 1910
> +# of unsupported tests 1892
>
>
> The failure of pr53199.c is because of different instruction selection
> for bswap. Test case is reduced to just one function:
>
> /* { dg-options "-O2 -mcpu=power6 -mavoid-indexed-addresses" } */
> long long
> reg_reverse (long long x)
> {
> return __builtin_bswap64 (x);
> }
>
> Reload left vs. LRA right:
> reg_reverse: reg_reverse:
> srdi 8,3,32 | addi 8,1,-16
> rlwinm 7,3,8,0xffffffff | srdi 10,3,32
> rlwinm 9,8,8,0xffffffff | addi 9,8,4
> rlwimi 7,3,24,0,7 | stwbrx 3,0,8
> rlwimi 7,3,24,16,23 | stwbrx 10,0,9
> rlwimi 9,8,24,0,7 | ld 3,-16(1)
> rlwimi 9,8,24,16,23 <
> sldi 7,7,32 <
> or 7,7,9 <
> mr 3,7 <
> blr blr
>
> This same difference is responsible for the failure of pr58330.c which
> also uses __builtin_bswap64().
>
> The difference in RTL for the test case is this (after reload vs. after LRA):
> - 11: {%7:DI=bswap(%3:DI);clobber %8:DI;clobber %9:DI;clobber %10:DI;}
> - 20: %3:DI=%7:DI
> + 20: %8:DI=%1:DI-0x10
> + 21: %8:DI=%8:DI // stupid no-op move
> + 11: {[%8:DI]=bswap(%3:DI);clobber %9:DI;clobber %10:DI;clobber scratch;}
> + 19: %3:DI=[%1:DI-0x10]
>
> So LRA believes going through memory is better than using a register,
> even though obviously there are plenty registers available.
>
> What LRA does:
> Creating newreg=129
> Removing SCRATCH in insn #11 (nop 2)
> Creating newreg=130
> Removing SCRATCH in insn #11 (nop 3)
> Creating newreg=131
> Removing SCRATCH in insn #11 (nop 4)
> // at this point the insn would be a bswapdi2_64bit:
> // 11: {%3:DI=bswap(%3:DI);clobber r129;clobber r130;clobber r131;}
> // cost calculation for the insn alternatives:
> 0 Early clobber: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Spill pseudo in memory: reject+=3
> 2 Scratch win: reject+=2
> 3 Scratch win: reject+=2
> 4 Scratch win: reject+=2
> alt=0,overall=18,losers=1,rld_nregs=0
> 0 Non-pseudo reload: reject+=2
> 0 Spill pseudo in memory: reject+=3
> 0 Non input pseudo reload: reject++
> 2 Scratch win: reject+=2
> 3 Scratch win: reject+=2
> alt=1,overall=16,losers=1,rld_nregs=0
> Staticly defined alt reject+=12
> 0 Early clobber: reject++
> 2 Scratch win: reject+=2
> 3 Scratch win: reject+=2
> 4 Scratch win: reject+=2
> 0 Conflict early clobber reload: reject--
> alt=2,overall=24,losers=1,rld_nregs=0
> Choosing alt 1 in insn 11: (0) Z (1) r (2) &b (3) &r (4)
> X {*bswapdi2_64bit}
> Change to class BASE_REGS for r129
> Change to class GENERAL_REGS for r130
> Creating newreg=132 from oldreg=3, assigning class NO_REGS to r132
> Change to class NO_REGS for r131
> 11: {r132:DI=bswap(%3:DI);clobber r129:DI;clobber r130:DI;clobber r131:DI;}
> REG_UNUSED r131:DI
> REG_UNUSED r130:DI
> REG_UNUSED r129:DI
>
> LRA selects alternative 1 (Z,r,&b,&r,X) which seems to be the right
> choice, from looking at the constraints. Reload selects alternative 2
> which is slightly^2 discouraged: (??&r,r,&r,&r,&r).
>
> Is this an improvement or a regression? If it's an improvement then
> these two test cases should be adjusted :-)
>
> Ciao!
> Steven
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: LRA vs reload on powerpc: 2 extra FAILs that are actually improvements?
2013-11-04 14:16 ` David Edelsohn
@ 2013-12-01 5:55 ` Alan Modra
0 siblings, 0 replies; 3+ messages in thread
From: Alan Modra @ 2013-12-01 5:55 UTC (permalink / raw)
To: David Edelsohn, Michael Meissner
Cc: Steven Bosscher, GCC Patches, Vladimir Makarov
> On Sat, Nov 2, 2013 at 6:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> > The failure of pr53199.c is because of different instruction selection
> > for bswap. Test case is reduced to just one function:
[snip]
> > Is this an improvement or a regression? If it's an improvement then
> > these two test cases should be adjusted :-)
As David said, going through memory is bad, we get a load-hit-store
flush. Definitely a regression on power7. Does anyone know why the
bswapdi2_64bit r,r alternative is disparaged? Seems like it has been
that way since the orginal mainline commit.
int main (void)
{
int i;
long ret = 0;
long tmp1, tmp2, tmp3;
for (i = 0; i < 1000000000; i++)
#if MEM == 1
/* From pr53199.c reg_reverse, -mlra -mcpu=power6 -mtune=power7. */
__asm__ __volatile__ ("\
addi %1,1,-16\n\
srdi %3,%0,32\n\
li %2,4\n\
stwbrx %0,0,%1\n\
stwbrx %3,%2,%1\n\
ld %0,-16(1)" : "+r" (ret), "=&b" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#elif MEM == 2
/* From pr53199.c reg_reverse, -mlra -mcpu=power6. */
__asm__ __volatile__ ("\
addi %1,1,-16\n\
srdi %3,%0,32\n\
addi %2,%1,4\n\
stwbrx %0,0,%1\n\
stwbrx %3,0,%2\n\
ld %0,-16(1)" : "+r" (ret), "=&b" (tmp1), "=&b" (tmp2), "=&r" (tmp3));
#elif MEM == 3
/* From pr53199.c reg_reverse, -mlra -mcpu=power7. */
__asm__ __volatile__ ("\
std %0,-16(1)\n\
addi %1,1,-16\n\
ldbrx %0,0,%1\n" : "+r" (ret), "=&b" (tmp1));
#else
__asm__ __volatile__ ("\
srdi %1,%0,32\n\
rlwinm %2,%0,8,0xffffffff\n\
rlwinm %3,%1,8,0xffffffff\n\
rlwimi %2,%0,24,0,7\n\
rlwimi %2,%0,24,16,23\n\
rlwimi %3,%1,24,0,7\n\
rlwimi %3,%1,24,16,23\n\
sldi %2,%2,32\n\
or %2,%2,%3\n\
mr %0,%2" : "+r" (ret), "=&r" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#endif
return ret;
}
/*
amodra@bns:~> gcc -O2 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m3.096s
user 0m3.089s
sys 0m0.001s
amodra@bns:~> time ./a.out
real 0m3.096s
user 0m3.094s
sys 0m0.002s
amodra@bns:~> gcc -O2 -DMEM=1 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m12.661s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> gcc -O2 -DMEM=2 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m12.660s
user 0m12.657s
sys 0m0.004s
amodra@bns:~> gcc -O2 -DMEM=3 bswap_mem.c
amodra@bns:~> time ./a.out
real 0m10.279s
user 0m10.276s
sys 0m0.003s
amodra@bns:~> time ./a.out
real 0m10.279s
user 0m10.276s
sys 0m0.003s
I also looked at the register version and -DMEM=1 case with power7
simulators finding that the register version had a delay of 12 cycles
from completion of the first instruction to completion of the last.
The -DMEM=1 case had a corresponding delay of 49 cycles, which matches
the loop timing above quite well.
*/
--
Alan Modra
Australia Development Lab, IBM
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2013-12-01 5:55 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-02 22:49 LRA vs reload on powerpc: 2 extra FAILs that are actually improvements? Steven Bosscher
2013-11-04 14:16 ` David Edelsohn
2013-12-01 5:55 ` Alan Modra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).