Re: RFC: Simplify rules for ctz/clz patterns and RTL

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
@ 2007-08-13 17:11 Joern Rennecke
       [not found] ` <46C34C06.3000103@codesourcery.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Joern Rennecke @ 2007-08-13 17:11 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

> The score, sh and sparc instructions may or may not display canonical
> behavior; their ports do not define CLZ_DEFINED_VALUE_AT_ZERO and I was
> not able to find documentation of the relevant instruction.

The operation the nsb instruction of the SHmedia instruction set performs
is 'count number of sign bit copies'.

For the 32 bit ffs, first a mask for the least significant bit is computed.
Thus, the upper 32 bit of the 64 bit registers are all zero, so we merely
have to subtract the result of nsb from 63.
So this is more or less what you describe as canonical behaviour.

For the 64 bit ffs, the input is shifted right by one to obtain an unsigned
value.  This has the effect that the value obtained for zero input is the
same as the one for one input, which is compensated for with a conditional
move.  In this sense, for 64 bit operation you have 'non-canonical'
behaviour.

The ARC700 has a NORM instruction, which again counts the number of
sign bit copies.  There is a variant NORM.F which sets the N flag if the
input is negative.

Thus,
      __asm__ ("norm.f\t%0,%1\n\tmov.mi\t%0,-1" : "=r" (c_) : "r" (x) : "cc");
performs a 'canonical' count leading zeros operation, except that you have
to add one to the result.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
       [not found] ` <46C34C06.3000103@codesourcery.com>
@ 2007-08-15 22:53   ` Joern Rennecke
  0 siblings, 0 replies; 19+ messages in thread
From: Joern Rennecke @ 2007-08-15 22:53 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

On Wed, Aug 15, 2007 at 11:55:02AM -0700, Zack Weinberg wrote:
> Joern Rennecke wrote:
> >The operation the nsb instruction of the SHmedia instruction set performs
> >is 'count number of sign bit copies'.
> >[...]
> 
> It sounds like the SH should probably be lumped in with the x86 as not 
> doing "canonical behavior".  Conveniently enough for my grand plan, it 
> already uses an UNSPEC for the actual instruction :-)
> 
> What is the result of the instruction for (64-bit) all-bits-zero or 
> all-bits-one?  64?

No, it is 63.  There is one essential sign bit and 63 more copies.

> Assuming so, it occurs to me that the result of an 
> unsigned clz() on any negative 64-bit value will be zero; thus, you 
> could get a "canonical" clz out of nsb by doing (pseudo-assembly)
> 
> 	mov	result, 0
> 	cmp/pz	arg
> 	bf	1f
> 	nsb	result, arg
> 1:

We are talking about SHmedia code here.  cmp/pz and bf are not SHmedia
instructions.  Loading a zero into result would be movi 0,result .

If you want to special-case the negative input, that would be:

 shari	arg,31,tmp
 nsb	arg,result
 cmvne	tmp,tmp,result
 addi	result,1,result

> Similarly, the x & (x-1) operation

It's x ^ (x-1) (xor) or x &~(x-1) (andc)

> used to set up for ctz/ffs in terms 
> of clz will leave the high bit set *only* for x == 0x8000 0000 0000 
> 0000; which can be tested for as x == (x&(x-1)) and the nsb skipped.
> 
> Would these sequences be slower than the current logic?

currently we have for ffs:

 addi	arg,-1,tmp
 xor	arg,tmp,tmp
 shlri  tmp,1,tmp
 nsb	tmp,tmp
 addi	tmp,-64,tmp
 cmveq	arg,r63,tmp
 sub	r63,tmp,result

Using the above sequence, we get the more register-hungry:

 addi	arg,-1,tmp
 xor	arg,tmp,tmp
 shari	tmp,31,tmp2
 nsb	tmp,tmp
 cmvne	tmp2,tmp2,tmp
 addi	tmp,1-64,tmp
 sub	r63,tmp,result

you propose:

 pt	after_nsb,trtmp
 addi	arg,-1,tmp
 andc	arg,tmp,tmp
 movi	-1,tmp2
 beq	arg,tmp,trtmp
 nsb	tmp,tmp2
after_nsb:
 addi	tmp2,-63,tmp
 sub	r63,tmp,result

or is that:

 pt	after_nsb,trtmp
 addi	arg,-1,tmp
 xor	arg,tmp,tmp
 bgt	r63,tmp,trtmp
 nsb	tmp,tmp
after_nsb:
 addi	tmp,-63,tmp
 sub	r63,tmp,result
 
At any rate, the introduction of the branch makes the code worse.

But for the ARC, it would make an interesting shortcut.  Although norm can't
be conditionalized, we can use the -1 from the xor to save on a long
immediate for 32 bit ffs.

 sub_s	tmp,arg,1
 xor.f	tmp,tmp,arg ; for -Os this can be xor_s and
 norm	result,tmp  ; then norm.f produces the flag.
 mov.mi	result,tmp
 rsub	result,31,result

> >The ARC700 has a NORM instruction, which again counts the number of
> >sign bit copies.  There is a variant NORM.F which sets the N flag if the
> >input is negative.
> 
> Sorry, I don't recognize the ARC700 - which GCC back end is that?

It belongs in config/arc ; however, proper ARC700 support is not in the
FSF mainline yet.
We are working on it.

> It 
> might be worth teaching optabs.c about sign-bit-count operations, but 
> only if we have more than one architecture that can use it.

The NORM instruction is also available as an optional extension operation
for ARCtangent-A5 and ARC600.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 21:24               ` Segher Boessenkool
@ 2007-08-15 21:54                 ` David Edelsohn
  0 siblings, 0 replies; 19+ messages in thread
From: David Edelsohn @ 2007-08-15 21:54 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Andrew Pinski, gcc, Zack Weinberg

>>>>> Segher Boessenkool writes:

>> Yes, but do we even create POPCOUNT rtx if the insn isn't
>> supported?  Wouldn't we expand or create libcall early?

Segher> I don't know, there's only one way to find out... :-)

	I did check.  Didn't you?

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 21:09             ` David Edelsohn
@ 2007-08-15 21:24               ` Segher Boessenkool
  2007-08-15 21:54                 ` David Edelsohn
  0 siblings, 1 reply; 19+ messages in thread
From: Segher Boessenkool @ 2007-08-15 21:24 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Andrew Pinski, gcc, Zack Weinberg

>>> I think the cost would be something like:
>>> +    case POPCOUNT:
>>> +      *total = COSTS_N_INSNS (3);
>>> +      return false;
>
> Segher> Is that the cost when using popcountb?  It is a lot more
> Segher> expensive when that instruction isn't available (like on
> Segher> most current machines).
>
> 	Yes, but do we even create POPCOUNT rtx if the insn isn't
> supported?  Wouldn't we expand or create libcall early?

I don't know, there's only one way to find out... :-)


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 20:33           ` Segher Boessenkool
@ 2007-08-15 21:09             ` David Edelsohn
  2007-08-15 21:24               ` Segher Boessenkool
  0 siblings, 1 reply; 19+ messages in thread
From: David Edelsohn @ 2007-08-15 21:09 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Andrew Pinski, gcc, Zack Weinberg

>>>>> Segher Boessenkool writes:

>> I think the cost would be something like:
>> +    case POPCOUNT:
>> +      *total = COSTS_N_INSNS (3);
>> +      return false;

Segher> Is that the cost when using popcountb?  It is a lot more
Segher> expensive when that instruction isn't available (like on
Segher> most current machines).

	Yes, but do we even create POPCOUNT rtx if the insn isn't
supported?  Wouldn't we expand or create libcall early?

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 20:04         ` David Edelsohn
@ 2007-08-15 20:33           ` Segher Boessenkool
  2007-08-15 21:09             ` David Edelsohn
  0 siblings, 1 reply; 19+ messages in thread
From: Segher Boessenkool @ 2007-08-15 20:33 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Andrew Pinski, gcc, Zack Weinberg

> 	I think the cost would be something like:

> +    case POPCOUNT:
> +      *total = COSTS_N_INSNS (3);
> +      return false;

Is that the cost when using popcountb?  It is a lot more
expensive when that instruction isn't available (like on
most current machines).

The rest (i.e. CLZ, CTZ) looks good to me.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 19:03       ` Zack Weinberg
  2007-08-15 19:53         ` David Edelsohn
@ 2007-08-15 20:04         ` David Edelsohn
  2007-08-15 20:33           ` Segher Boessenkool
  1 sibling, 1 reply; 19+ messages in thread
From: David Edelsohn @ 2007-08-15 20:04 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Andrew Pinski, Segher Boessenkool, gcc

	I think the cost would be something like:

Index: rs6000.c
===================================================================
--- rs6000.c	(revision 127484)
+++ rs6000.c	(working copy)
@@ -20292,10 +20292,15 @@
 	*total += COSTS_N_INSNS (2);
       return false;
 
+    case CTZ:
     case FFS:
       *total = COSTS_N_INSNS (4);
       return false;
 
+    case POPCOUNT:
+      *total = COSTS_N_INSNS (3);
+      return false;
+
     case NOT:
       if (outer_code == AND || outer_code == IOR || outer_code == XOR)
 	{
@@ -20305,6 +20310,7 @@
       /* FALLTHRU */
 
     case AND:
+    case CLZ:
     case IOR:
     case XOR:
     case ZERO_EXTRACT:

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 18:44   ` Zack Weinberg
  2007-08-15 18:56     ` Andrew Pinski
  2007-08-15 19:37     ` Jan Hubicka
@ 2007-08-15 19:56     ` Segher Boessenkool
  2 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2007-08-15 19:56 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

>> I suppose you're using (assuming 32-bit)
>>     ctz(x) := 31 - clz(x & -x)
>> now, which gives -1 for 0; and the version you're looking for is
>>     ctz(x) := 32 - clz(~x & (x-1))
>> which gives 32 for 0.
>
> Thanks!  That's, unfortunately, one more instruction, although I guess 
> a lot of chips have "a & ~b" as one operation.

Yes, it's exactly the same cost on PowerPC, and on most other
RISC architectures.

> It looks like ~x & (x-1) turns any number into 000...111... where the 
> boundary between zeroes and ones lies at the lowest 1 in the original.

Exactly.  "To the right of the lowest 1".

> Is popcount really slow on PowerPC?  (Compared to clz?)  Ideally one 
> would choose between the two expansions based on RTL costs, but the 
> only  architectures it matters for are i386 and powerpc, and neither 
> of them define the cost of either clz or popcount.

Andrew answered this already.  Adding clz/popcount to the cost
tables seems like a good idea, yes.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 19:03       ` Zack Weinberg
@ 2007-08-15 19:53         ` David Edelsohn
  2007-08-15 20:04         ` David Edelsohn
  1 sibling, 0 replies; 19+ messages in thread
From: David Edelsohn @ 2007-08-15 19:53 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Andrew Pinski, Segher Boessenkool, gcc

>>>>> Zack Weinberg writes:

Zack> Makes sense.  I don't suppose I could persuade you to teach rs6000 
Zack> RTX_COSTS about clz and popcount...?

	Sure.  It's not that difficult to add to the table.

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 18:44   ` Zack Weinberg
  2007-08-15 18:56     ` Andrew Pinski
@ 2007-08-15 19:37     ` Jan Hubicka
  2007-08-15 19:56     ` Segher Boessenkool
  2 siblings, 0 replies; 19+ messages in thread
From: Jan Hubicka @ 2007-08-15 19:37 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Segher Boessenkool, gcc

> 
> Is popcount really slow on PowerPC?  (Compared to clz?)  Ideally one 
> would choose between the two expansions based on RTL costs, but the only 
>  architectures it matters for are i386 and powerpc, and neither of them 
> define the cost of either clz or popcount.

Of course adding a popcount/clz cost into i386 cost tables is easy and
probably most correct thing to do :)

Honza
> 
> zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
@ 2007-08-15 19:27 Zack Weinberg
  0 siblings, 0 replies; 19+ messages in thread
From: Zack Weinberg @ 2007-08-15 19:27 UTC (permalink / raw)
  To: gcc

Joern Rennecke wrote:
>> The score, sh and sparc instructions may or may not display canonical
>> behavior; their ports do not define CLZ_DEFINED_VALUE_AT_ZERO and I was
>> not able to find documentation of the relevant instruction.
> 
> The operation the nsb instruction of the SHmedia instruction set performs
> is 'count number of sign bit copies'.
> [...]

It sounds like the SH should probably be lumped in with the x86 as not
doing "canonical behavior".  Conveniently enough for my grand plan, it
already uses an UNSPEC for the actual instruction :-)

What is the result of the instruction for (64-bit) all-bits-zero or
all-bits-one?  64?  Assuming so, it occurs to me that the result of an
unsigned clz() on any negative 64-bit value will be zero; thus, you
could get a "canonical" clz out of nsb by doing (pseudo-assembly)

	mov	result, 0
	cmp/pz	arg
	bf	1f
	nsb	result, arg
1:

Similarly, the x & (x-1) operation used to set up for ctz/ffs in terms
of clz will leave the high bit set *only* for x == 0x8000 0000 0000
0000; which can be tested for as x == (x&(x-1)) and the nsb skipped.

Would these sequences be slower than the current logic?

> The ARC700 has a NORM instruction, which again counts the number of
> sign bit copies.  There is a variant NORM.F which sets the N flag if the
> input is negative.

Sorry, I don't recognize the ARC700 - which GCC back end is that?  It
might be worth teaching optabs.c about sign-bit-count operations, but
only if we have more than one architecture that can use it.

zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 18:56     ` Andrew Pinski
@ 2007-08-15 19:03       ` Zack Weinberg
  2007-08-15 19:53         ` David Edelsohn
  2007-08-15 20:04         ` David Edelsohn
  0 siblings, 2 replies; 19+ messages in thread
From: Zack Weinberg @ 2007-08-15 19:03 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Segher Boessenkool, gcc

Andrew Pinski wrote:
> On 8/15/07, Zack Weinberg <zack@codesourcery.com> wrote:
>> Is popcount really slow on PowerPC?  (Compared to clz?)
> popcount is really popcount in bytes and then you do a multiple to get
> the real popcount.  This is why it is slower than count leading zeros.
>  Also popcount does not exist in most powerpc's while count leading
> zeros exist in all.

Makes sense.  I don't suppose I could persuade you to teach rs6000 
RTX_COSTS about clz and popcount...?

zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-15 18:44   ` Zack Weinberg
@ 2007-08-15 18:56     ` Andrew Pinski
  2007-08-15 19:03       ` Zack Weinberg
  2007-08-15 19:37     ` Jan Hubicka
  2007-08-15 19:56     ` Segher Boessenkool
  2 siblings, 1 reply; 19+ messages in thread
From: Andrew Pinski @ 2007-08-15 18:56 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Segher Boessenkool, gcc

On 8/15/07, Zack Weinberg <zack@codesourcery.com> wrote:
> Is popcount really slow on PowerPC?  (Compared to clz?)
popcount is really popcount in bytes and then you do a multiple to get
the real popcount.  This is why it is slower than count leading zeros.
 Also popcount does not exist in most powerpc's while count leading
zeros exist in all.

Thanks,
Andrew Pinski

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-12 14:00 ` Segher Boessenkool
  2007-08-13  0:13   ` Segher Boessenkool
@ 2007-08-15 18:44   ` Zack Weinberg
  2007-08-15 18:56     ` Andrew Pinski
                       ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Zack Weinberg @ 2007-08-15 18:44 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc

Segher Boessenkool wrote:
>> * I would like to do the same for __builtin_ctz, but there is a catch.
>> The synthetic ctz sequence in terms of popcount (as presently
>> implemented by ia64.md, and potentially usable for at least i386 and
>> rs6000 as well if moved to optabs.c) produces the canonical behavior at
>> zero, but the synthetic sequence in terms of clz (as presently
>> implemented by optabs.c) produces the value -1 at zero.
>
> I suppose you're using (assuming 32-bit)
> 
>     ctz(x) := 31 - clz(x & -x)
> 
> now, which gives -1 for 0; and the version you're looking for is
> 
>     ctz(x) := 32 - clz(~x & (x-1))
> 
> which gives 32 for 0.

Thanks!  That's, unfortunately, one more instruction, although I guess a 
lot of chips have "a & ~b" as one operation.

> What does the popcount version look like?  Never seen that before,
> but I think it will be really expensive on PowerPC.

   ctz(x) := popcount(~x & (x-1))

Just the same thing as your version of the ctz-as-clz operation, but 
without the final adjustment.  It looks like ~x & (x-1) turns any number 
into 000...111... where the boundary between zeroes and ones lies at the 
lowest 1 in the original.

Is popcount really slow on PowerPC?  (Compared to clz?)  Ideally one 
would choose between the two expansions based on RTL costs, but the only 
  architectures it matters for are i386 and powerpc, and neither of them 
define the cost of either clz or popcount.

zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-12 14:00 ` Segher Boessenkool
@ 2007-08-13  0:13   ` Segher Boessenkool
  2007-08-15 18:44   ` Zack Weinberg
  1 sibling, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2007-08-13  0:13 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc, Zack Weinberg

> I suppose you're using (assuming 32-bit)
>
> 	ctz(x) := 31 - clz(x & -x)
>
> now, which gives -1 for 0; and the version you're looking for is
>
> 	ctz(x) := 32 - clz(~x & (x-1))
>
> which gives 32 for 0.
>
> (Straight from the venerable PowerPC Compiler Writer's Guide, btw).
>
> What does the popcount version look like?  Never seen that before,
> but I think it will be really expensive on PowerPC.

Never mind, too trivial, given the second of the above.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-11  6:45 Zack Weinberg
  2007-08-11 10:54 ` Richard Kenner
@ 2007-08-12 14:00 ` Segher Boessenkool
  2007-08-13  0:13   ` Segher Boessenkool
  2007-08-15 18:44   ` Zack Weinberg
  1 sibling, 2 replies; 19+ messages in thread
From: Segher Boessenkool @ 2007-08-12 14:00 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

> * I would like to do the same for __builtin_ctz, but there is a catch.
> The synthetic ctz sequence in terms of popcount (as presently
> implemented by ia64.md, and potentially usable for at least i386 and
> rs6000 as well if moved to optabs.c) produces the canonical behavior at
> zero, but the synthetic sequence in terms of clz (as presently
> implemented by optabs.c) produces the value -1 at zero.  I have not 
> been
> able to think of any refinement to that sequence that would reliably
> produce GET_MODE_BITSIZE(mode) at zero in an efficient manner.
> Furthermore, -1 is the value most convenient for implementing ffs in
> terms of ctz.  Opinions and/or clever bit manipulation hacks would be
> much appreciated.

I suppose you're using (assuming 32-bit)

	ctz(x) := 31 - clz(x & -x)

now, which gives -1 for 0; and the version you're looking for is

	ctz(x) := 32 - clz(~x & (x-1))

which gives 32 for 0.

(Straight from the venerable PowerPC Compiler Writer's Guide, btw).

What does the popcount version look like?  Never seen that before,
but I think it will be really expensive on PowerPC.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-11 10:54 ` Richard Kenner
@ 2007-08-11 18:02   ` Zack Weinberg
  0 siblings, 0 replies; 19+ messages in thread
From: Zack Weinberg @ 2007-08-11 18:02 UTC (permalink / raw)
  To: Richard Kenner; +Cc: gcc

Richard Kenner wrote:
>> * Since no one uses it, we rip out all support for the ffs pattern and
>> expression.
> 
> There's an ffs builtin!  How do we know who uses it?

I am not proposing to remove the built-in (i.e. the language visible 
__builtin_ffs() function); only the RTL expression (ffs:MODE ...) and 
the named machine description pattern "ffs<mode>".

> Moreover, expmed uses it as an option in expanding some comparisons.

I see no code in expmed.c that uses either the RTL expression or the 
named pattern.  It does use the optab, but (to reiterate) use of the 
ffs_optab never generates RTL that uses (ffs:MODE).

zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC: Simplify rules for ctz/clz patterns and RTL
  2007-08-11  6:45 Zack Weinberg
@ 2007-08-11 10:54 ` Richard Kenner
  2007-08-11 18:02   ` Zack Weinberg
  2007-08-12 14:00 ` Segher Boessenkool
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Kenner @ 2007-08-11 10:54 UTC (permalink / raw)
  To: zack; +Cc: gcc

> * Since no one uses it, we rip out all support for the ffs pattern and
> expression.

There's an ffs builtin!  How do we know who uses it?

Moreover, expmed uses it as an option in expanding some comparisons.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RFC: Simplify rules for ctz/clz patterns and RTL
@ 2007-08-11  6:45 Zack Weinberg
  2007-08-11 10:54 ` Richard Kenner
  2007-08-12 14:00 ` Segher Boessenkool
  0 siblings, 2 replies; 19+ messages in thread
From: Zack Weinberg @ 2007-08-11  6:45 UTC (permalink / raw)
  To: gcc

During development of the patch I just posted for double-word clz, I
went through all the back ends and audited their use of the bit-scan
named patterns and RTL.  It appears to me that our current handling of
C[LT]Z_DEFINED_VALUE_AT_ZERO is much more complicated than it needs to
be, and also that between my patch and Sandra's earlier patch for
synthetic ctz/ffs, we have an opportunity to delete a bunch of code from
the back ends.

In this message, I'll use the word "instruction" when I am talking about
an actual hardware operation on a particular architecture; the word
"pattern" when I am talking about a named define_insn or define_expand
in a machine description; and the word "expression" when I am talking
about RTL.  The word "port" refers to the GCC back-end for a particular
CPU architecture.

There are eleven ports that make use of an clz instruction.  That use is
not necessarily in a clz pattern or with clz expressions - some only
define ffs patterns, and some use UNSPECs.  This is mostly irrelevant to
what I want to talk about, though.

   alpha arm i386 m68k mips rs6000 s390 score sh sparc xtensa

Of these, the majority have instructions that, when the input is zero,
write to the output a value equal to the number of bits in the input
(i.e. GET_MODE_BITSIZE of the mode of the input).  I'll refer to this as
canonical behavior.  Furthermore, these ports set
CLZ_DEFINED_VALUE_AT_ZERO to reflect that fact.

   alpha arm m68k mips rs6000 s390 xtensa

The score, sh and sparc instructions may or may not display canonical
behavior; their ports do not define CLZ_DEFINED_VALUE_AT_ZERO and I was
not able to find documentation of the relevant instruction.

i386, as is well known, has a clz instruction that does not write a
predictable value to the output when the input is zero, and so correctly
does not define CLZ_DEFINED_VALUE_AT_ZERO.  (Actually, when TARGET_ABM
is true, we are using a new instruction that *does* display canonical
behavior, and my aforementioned patch sets CLZ_DEFINED_VALUE_AT_ZERO to
reflect that; but again this is mostly irrelevant.)

No port needs CLZ_DEFINED_VALUE_AT_ZERO to be a tristate.  Either both
or neither of the clz pattern and the clz expression produce a defined
value at zero.

No port defines CLZ_DEFINED_VALUE_AT_ZERO to set the 'val' argument to
anything other than GET_MODE_BITSIZE (mode).  [Some of them hardcode the
constant instead of using that expression.]

----

There are two ports that make use of a ctz instruction:

   alpha i386

alpha's instruction displays canonical behavior; i386's instruction does
not write a predictable value to the output when the input is zero
(TARGET_ABM does not help here).  Both ports have correct definitions or
non-definitions of CTZ_DEFINED_VALUE_AT_ZERO.

In addition, four ports define ctz patterns that expand to
multi-instruction sequences.

   arm ia64 rs6000 xtensa

Of these, all except ia64 are presently redundant with the generic
expander Sandra added to optabs.c.  ia64 generates a different sequence
involving a popcount instruction; it would be easy enough to add that to
optabs.c.

CTZ_DEFINED_VALUE_AT_ZERO is not defined by all four ports, but could
be.  Those that define it, do so correctly.

No port needs CTZ_DEFINED_VALUE_AT_ZERO to be a tristate.  There are
three cases: both the pattern and the expression have a defined value at
zero (alpha); neither the pattern nor the expression has a defined value
at zero (i386); the pattern has a defined value at zero and the
expression is never emitted so its value at zero is moot (arm, rs6000,
xtensa, ia64).

If optabs.c were taught to synthesize ctz in terms of popcount, the arm,
rs6000, xtensa, and ia64 definitions of ctz patterns could all be
removed.  There would then be no port that defined
CTZ_DEFINED_VALUE_AT_ZERO to set 'val' to anything other than
GET_MODE_BITSIZE (mode).  [My patch removes the arm ctz pattern.  rs6000
and xtensa could be removed now.]

----

There is no port that makes use of an ffs instruction.  However, there
are nine architectures that define ffs patterns.

   alpha arm i386 ia64 rs6000 score sh sparc xtensa

All except ia64's are redundant with optabs.c after Sandra's patch plus
my patch.  ia64's would be redundant if the aforementioned popcount
sequence were added to optabs.c.

There is no port that uses the ffs expression.  ffs always has a defined
value at zero, so there is no FFS_DEFINED_VALUE_AT_ZERO macro nor any
need for one.

----

The machine-independent uses of C[LT]Z_DEFINED_VALUE_AT_ZERO are quite
limited:

  * builtins.c (fold_builtin_bitop): Uses them to determine the value of
  __builtin_clz* and __builtin_ctz* for a zero argument.  Interestingly,
  if the macros are false for a given mode, it folds the builtins as if
  they displayed canonical behavior.

  * optabs.c: Uses them in strategies for expanding ctz and ffs.

  * rtlanal.c (nonzero_bits1): Uses them to decide what bits can be
  nonzero in the result of a clz or ctz expression.

  * simplify-rtx.c: Uses them when propagating zeroes into clz and ctz
  expressions.  Appears to fake the canonical behavior in all
  cases if the macro returns false.

----

Given all of the above, I propose that we simplify our support for bit
scan instructions as follows:

* We do add the popcount expansion of ctz to optabs.c, and we remove the
  ctz and ffs patterns from all targets (except i386 and alpha's ctz).
This IMO is a no-brainer.

* Since no one uses it, we rip out all support for the ffs pattern and
expression.

* We redefine the clz and ctz expressions to have semantics consistent
with canonical hardware behavior, unconditionally.  Ports with
instructions inconsistent with that rule must use UNSPECs instead.  It
is my hope and expectation that this only affects the i386.  However, it
might also affect score, sh, and sparc.  I'd appreciate it if the
maintainers of those ports could report the behavior of the relevant
instructions.

* We also redefine __builtin_clz to have those semantics unconditionally.

* I would like to do the same for __builtin_ctz, but there is a catch.
The synthetic ctz sequence in terms of popcount (as presently
implemented by ia64.md, and potentially usable for at least i386 and
rs6000 as well if moved to optabs.c) produces the canonical behavior at
zero, but the synthetic sequence in terms of clz (as presently
implemented by optabs.c) produces the value -1 at zero.  I have not been
able to think of any refinement to that sequence that would reliably
produce GET_MODE_BITSIZE(mode) at zero in an efficient manner.
Furthermore, -1 is the value most convenient for implementing ffs in
terms of ctz.  Opinions and/or clever bit manipulation hacks would be
much appreciated.

* The sole remaining use of C[LT]Z_DEFINED_VALUE_AT_ZERO is then in
optabs.c.  We replace this with two targetm booleans, defaulting to
true, whose meaning is precisely 'all clz/ctz named patterns for this
target display canonical hardware behavior'.

zw

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-08-15 21:54 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-13 17:11 RFC: Simplify rules for ctz/clz patterns and RTL Joern Rennecke
     [not found] ` <46C34C06.3000103@codesourcery.com>
2007-08-15 22:53   ` Joern Rennecke
  -- strict thread matches above, loose matches on Subject: below --
2007-08-15 19:27 Zack Weinberg
2007-08-11  6:45 Zack Weinberg
2007-08-11 10:54 ` Richard Kenner
2007-08-11 18:02   ` Zack Weinberg
2007-08-12 14:00 ` Segher Boessenkool
2007-08-13  0:13   ` Segher Boessenkool
2007-08-15 18:44   ` Zack Weinberg
2007-08-15 18:56     ` Andrew Pinski
2007-08-15 19:03       ` Zack Weinberg
2007-08-15 19:53         ` David Edelsohn
2007-08-15 20:04         ` David Edelsohn
2007-08-15 20:33           ` Segher Boessenkool
2007-08-15 21:09             ` David Edelsohn
2007-08-15 21:24               ` Segher Boessenkool
2007-08-15 21:54                 ` David Edelsohn
2007-08-15 19:37     ` Jan Hubicka
2007-08-15 19:56     ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).