public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* PING: PR/17236, improve long long multiply on x86 (middle-end)
@ 2008-02-20  8:55 Paolo Bonzini
  2008-02-21 16:00 ` Richard Guenther
  2008-03-05 15:40 ` Ian Lance Taylor
  0 siblings, 2 replies; 9+ messages in thread
From: Paolo Bonzini @ 2008-02-20  8:55 UTC (permalink / raw)
  To: GCC Patches

http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00995.html

Thanks!

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-02-20  8:55 PING: PR/17236, improve long long multiply on x86 (middle-end) Paolo Bonzini
@ 2008-02-21 16:00 ` Richard Guenther
  2008-02-21 16:54   ` Paolo Bonzini
  2008-03-05 15:40 ` Ian Lance Taylor
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Guenther @ 2008-02-21 16:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Patches

2008/2/20 Paolo Bonzini <bonzini@gnu.org>:
> http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00995.html

The combine trick is obvious, but I'd like to know whether the two
other tweaks have an impact on other targets.  Does the combine
trick alone help?  Can you try crafting a testcase somehow (I realize
this might be difficult, but maybe counting the number of moves
works out in a target dependent test).

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-02-21 16:00 ` Richard Guenther
@ 2008-02-21 16:54   ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2008-02-21 16:54 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

Richard Guenther wrote:
> 2008/2/20 Paolo Bonzini <bonzini@gnu.org>:
>> http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00995.html
> 
> The combine trick is obvious, but I'd like to know whether the two
> other tweaks have an impact on other targets.  Does the combine
> trick alone help?  Can you try crafting a testcase somehow (I realize
> this might be difficult, but maybe counting the number of moves
> works out in a target dependent test).

No, all three are necessary.  Each is a tiny step further: with the 
regclass.c hunk GCC "considers" using %eax, but without the 
local-alloc.c hunk that class is already taken by another pseudo.  The 
combine.c hunk is actually the least necessary of all; without it, GCC 
just generates one more memory access, but not more spills.

The regclass.c hunk is also pretty obvious -- it is just a trick to 
improve coalescing between a register that dies and one that has a 
duplicate constraint with the dying register (a la regmove).  Note that 
if the register does not die, the behavior is the same as without the 
patch (alt_cost += 2).

Most RISC processors would probably be unaffected by the changes, since 
they have plenty of registers and no small register classes.  sh could 
be an, ehm, interesting target to test.

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-02-20  8:55 PING: PR/17236, improve long long multiply on x86 (middle-end) Paolo Bonzini
  2008-02-21 16:00 ` Richard Guenther
@ 2008-03-05 15:40 ` Ian Lance Taylor
  2008-03-06  8:29   ` Paolo Bonzini
  1 sibling, 1 reply; 9+ messages in thread
From: Ian Lance Taylor @ 2008-03-05 15:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Patches

Paolo Bonzini <bonzini@gnu.org> writes:

> http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00995.html
> 
> 2007-12-20  Paolo Bonzini  <bonzini@gnu.org>
> 
> 	* combine.c (simplify_set): Try to canonicalize SETS with
> 	swap_commutative_operands_with_target.
> 	* optabs.c (swap_commutative_operands_with_target): Move...
> 	* rtlanal.c (swap_commutative_operands_with_target): ... here.
> 	* rtl.h (swap_commutative_operands_with_target): Add prototype.

This part of the patch is OK.


> 	* regclass.c (record_reg_classes): Favor the class of the destination
> 	operand if the source dies in this instruction.

I don't see anything here which prevent pp->cost[classes[j]] from
going negative.  This patch is OK if you add a check that the value is
greater than zero before you decrement it.


> 	* local-alloc.c (QTY_CMP_PRI): Bump priority of registers with a
> 	small suggested class.

You need to update the comments.  And if you read the comments, you
will see that the current formula is also used in global.c, in
allocno_compare.  Actually the two formulas are very close but not
quite identical; for some reason they differ in how they handle the
size of the pseudo.

This change is going to give a big bump to registers whose preferred
class has only one or two registers, such as the HI/LO registers on
MIPS.  It's hard to know whether the effect will be beneficial.

What happens if you only use size of register class as a tie-breaker
for quantities which are otherwise equal?

In general I can see the advantage to allocating registers in smaller
register classes first, especially when there is an overlap between
classes.  But I'm not yet convinced that this is the best way to do
so.

Thanks.

Ian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-03-05 15:40 ` Ian Lance Taylor
@ 2008-03-06  8:29   ` Paolo Bonzini
  2008-03-06 17:19     ` Ian Lance Taylor
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2008-03-06  8:29 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: GCC Patches


> This change is going to give a big bump to registers whose preferred
> class has only one or two registers, such as the HI/LO registers on
> MIPS.

I don't really think so.  For MIPS, it is going to favor registers whose 
  preferred class is MD*_REG, over those whose preferred class is 
LO_AND_GR_REGS.

But on MIPS in particular, I doubt regclass will give a preferred 
register class of MD*_REG to anything, because moves from and to LO/HI 
use starred constraints to avoid this case.  My regalloc-fu is low, but 
IIUC this means that only reload can create moves from/to the 
multiplication result registers.

That said, I wonder if it might actually make sense (no, I didn't think 
it that way, but I ask) if we do this change only in local-alloc.c?

> What happens if you only use size of register class as a tie-breaker
> for quantities which are otherwise equal?

I'm worried that in general it won't work, because priorities are rarely 
exactly equal (mostly because of the live range).  I'm almost sure this 
happens for the PR testcase, but I don't have data at hand.

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-03-06  8:29   ` Paolo Bonzini
@ 2008-03-06 17:19     ` Ian Lance Taylor
  2008-03-06 17:49       ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Lance Taylor @ 2008-03-06 17:19 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Patches

Paolo Bonzini <bonzini@gnu.org> writes:

> > This change is going to give a big bump to registers whose preferred
> > class has only one or two registers, such as the HI/LO registers on
> > MIPS.
> 
> I don't really think so.  For MIPS, it is going to favor registers
> whose preferred class is MD*_REG, over those whose preferred class is
> LO_AND_GR_REGS.
> 
> But on MIPS in particular, I doubt regclass will give a preferred
> register class of MD*_REG to anything, because moves from and to LO/HI
> use starred constraints to avoid this case.  My regalloc-fu is low,
> but IIUC this means that only reload can create moves from/to the
> multiplication result registers.

The point is that in general even machines with large numbers of
registers have classes with few registers.  Your proposal may
significantly change the register allocator's behaviour on those
machines.  It may be fine.  It may not.  I don't know.  Can you
convince me?

> That said, I wonder if it might actually make sense (no, I didn't
> think it that way, but I ask) if we do this change only in
> local-alloc.c?

I think you need to make an argument for why that would be better.


> > What happens if you only use size of register class as a tie-breaker
> > for quantities which are otherwise equal?
> 
> I'm worried that in general it won't work, because priorities are
> rarely exactly equal (mostly because of the live range).  I'm almost
> sure this happens for the PR testcase, but I don't have data at hand.

Let's find out.


I'm OK with making this sort of change without a clear argument, but
only if you're prepared to do performance testing on three or more
primary platforms.  I'm not OK with making this sort of change with
neither an argument nor performance testing.  Right now, as far as I
know, you just have a single test case which improves.  Experience
suggests that other test cases will get worse.  I don't know which
effect will dominate.

Ian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-03-06 17:19     ` Ian Lance Taylor
@ 2008-03-06 17:49       ` Paolo Bonzini
  2008-03-07  0:53         ` Ian Lance Taylor
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2008-03-06 17:49 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: GCC Patches

>> That said, I wonder if it might actually make sense (no, I didn't
>> think it that way, but I ask) if we do this change only in
>> local-alloc.c?
> 
> I think you need to make an argument for why that would be better.

Because local-alloc.c only allocates registers with a relatively short 
live range, and bumping the priority of smaller register classes would 
be less disrupting.

> I'm OK with making this sort of change without a clear argument, but
> only if you're prepared to do performance testing on three or more
> primary platforms.  I'm not OK with making this sort of change with
> neither an argument nor performance testing.  Right now, as far as I
> know, you just have a single test case which improves.

I already had SPEC on x86 and it was neutral, except that half of SPECfp 
was perturbated by the machine load.  I should be able to rerun it 
sometime next week.  Running i386 and x86_64 would not be very different 
(the RA problems are the same -- %eax/%edx for MUL/DIV and %cl for shift 
counts) and I don't have access to other platforms.

So, I will try to audit other back-ends like I did for MIPS.  Before 
that, I will see what IRA does; maybe we don't need at all the patch.

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-03-06 17:49       ` Paolo Bonzini
@ 2008-03-07  0:53         ` Ian Lance Taylor
  2008-03-07  8:31           ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Lance Taylor @ 2008-03-07  0:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Patches

Paolo Bonzini <bonzini@gnu.org> writes:

>> I'm OK with making this sort of change without a clear argument, but
>> only if you're prepared to do performance testing on three or more
>> primary platforms.  I'm not OK with making this sort of change with
>> neither an argument nor performance testing.  Right now, as far as I
>> know, you just have a single test case which improves.
>
> I already had SPEC on x86 and it was neutral, except that half of
> SPECfp was perturbated by the machine load.  I should be able to rerun
> it sometime next week.  Running i386 and x86_64 would not be very
> different (the RA problems are the same -- %eax/%edx for MUL/DIV and
> %cl for shift counts) and I don't have access to other platforms.

I think you do need some performance results on other platforms.
There are people who can help you get them.

I'd also like to know if you can use the size of the register class as
a tie-breaker.  That kind of patch would be OK.

Ian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: PING: PR/17236, improve long long multiply on x86 (middle-end)
  2008-03-07  0:53         ` Ian Lance Taylor
@ 2008-03-07  8:31           ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2008-03-07  8:31 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: GCC Patches

Ian Lance Taylor wrote:
> Paolo Bonzini <bonzini@gnu.org> writes:
> 
>>> I'm OK with making this sort of change without a clear argument, but
>>> only if you're prepared to do performance testing on three or more
>>> primary platforms.  I'm not OK with making this sort of change with
>>> neither an argument nor performance testing.  Right now, as far as I
>>> know, you just have a single test case which improves.
>> I already had SPEC on x86 and it was neutral, except that half of
>> SPECfp was perturbated by the machine load.  I should be able to rerun
>> it sometime next week.  Running i386 and x86_64 would not be very
>> different (the RA problems are the same -- %eax/%edx for MUL/DIV and
>> %cl for shift counts) and I don't have access to other platforms.
> 
> I think you do need some performance results on other platforms.
> There are people who can help you get them.

IRA branch already gets the register allocation part right.  I'll commit 
only the parts you approved, the local-alloc.c tweak will remain on hold 
until we see if IRA gets into 4.4 (as I hope).

Thanks!

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-03-07  8:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-20  8:55 PING: PR/17236, improve long long multiply on x86 (middle-end) Paolo Bonzini
2008-02-21 16:00 ` Richard Guenther
2008-02-21 16:54   ` Paolo Bonzini
2008-03-05 15:40 ` Ian Lance Taylor
2008-03-06  8:29   ` Paolo Bonzini
2008-03-06 17:19     ` Ian Lance Taylor
2008-03-06 17:49       ` Paolo Bonzini
2008-03-07  0:53         ` Ian Lance Taylor
2008-03-07  8:31           ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).