[RFA] [target/87369] Prefer "bit" over "bfxil"

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFA] [target/87369] Prefer "bit" over "bfxil"
@ 2018-12-07 15:52 Jeff Law
  2018-12-07 17:31 ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Law @ 2018-12-07 15:52 UTC (permalink / raw)
  To: gcc-patches, Richard Earnshaw, James Greenhalgh

[-- Attachment #1: Type: text/plain, Size: 905 bytes --]

As I suggested in the BZ, this patch rejects constants with  just the
high bit set for the recently added "bfxil" pattern.  As a result we'll
return to using "bit" for the test in the BZ.

I'm not versed enough in aarch64 performance tuning to know if "bit" is
actually a better choice than "bfxil".  "bit" results in better code for
the testcase, but that seems more a function of register allocation than
"bit" being inherently better than "bfxil".   Obviously someone with
more aarch64 knowledge needs to make a decision here.

My first iteration of the patch changed "aarch64_high_bits_all_ones_p".
We could still go that way too, though the name probably needs to change.

I've bootstrapped and regression tested on aarch64-linux-gnu and it
fixes the regression.  I've also bootstrapped aarch64_be-linux-gnu, but
haven't done any kind of regression tested on that platform.

OK for the trunk?

Jeff

[-- Attachment #2: P --]
[-- Type: text/plain, Size: 1056 bytes --]

	PR target/87369
	* config/aarch64/aarch64.md (aarch64_bfxil<mode>): Do not accept
	constant with just the high bit set.  That's better handled by
	the "bit" pattern.

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 88f66104db3..ad6822410c2 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -5342,9 +5342,11 @@
 		    (match_operand:GPI 3 "const_int_operand" "n, Ulc"))
 	    (and:GPI (match_operand:GPI 2 "register_operand" "0,r")
 		    (match_operand:GPI 4 "const_int_operand" "Ulc, n"))))]
-  "(INTVAL (operands[3]) == ~INTVAL (operands[4]))
-  && (aarch64_high_bits_all_ones_p (INTVAL (operands[3]))
-    || aarch64_high_bits_all_ones_p (INTVAL (operands[4])))"
+  "(INTVAL (operands[3]) == ~INTVAL (operands[4])
+    && ((aarch64_high_bits_all_ones_p (INTVAL (operands[3]))
+	 && popcount_hwi (INTVAL (operands[3])) != 1)
+        || (aarch64_high_bits_all_ones_p (INTVAL (operands[4]))
+	    && popcount_hwi (INTVAL (operands[4])) != 1)))"
   {
     switch (which_alternative)
     {

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFA] [target/87369] Prefer "bit" over "bfxil"
  2018-12-07 15:52 [RFA] [target/87369] Prefer "bit" over "bfxil" Jeff Law
@ 2018-12-07 17:31 ` Richard Earnshaw (lists)
  2018-12-07 18:01   ` Jeff Law
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Earnshaw (lists) @ 2018-12-07 17:31 UTC (permalink / raw)
  To: Jeff Law, gcc-patches, James Greenhalgh

On 07/12/2018 15:52, Jeff Law wrote:
> As I suggested in the BZ, this patch rejects constants with  just the
> high bit set for the recently added "bfxil" pattern.  As a result we'll
> return to using "bit" for the test in the BZ.
> 
> I'm not versed enough in aarch64 performance tuning to know if "bit" is
> actually a better choice than "bfxil".  "bit" results in better code for
> the testcase, but that seems more a function of register allocation than
> "bit" being inherently better than "bfxil".   Obviously someone with
> more aarch64 knowledge needs to make a decision here.
> 
> My first iteration of the patch changed "aarch64_high_bits_all_ones_p".
> We could still go that way too, though the name probably needs to change.
> 
> I've bootstrapped and regression tested on aarch64-linux-gnu and it
> fixes the regression.  I've also bootstrapped aarch64_be-linux-gnu, but
> haven't done any kind of regression tested on that platform.
> 
> 
> OK for the trunk?

The problem here is that the optimum solution depends on the register
classes involved and we don't know this during combine.  If we have
general register, then we want bfi/bfxil to be used; if we have a vector
register, then bit is preferable as it changes 3 inter-bank register
copies to a single inter-bank copy; and that copy might be hoisted out
of a loop.

For example, this case:

unsigned long
f (unsigned long a, unsigned long b)
{
  return (b & 0x7fffffffffffffff) | (a & 0x8000000000000000);
}

before your patch this expands to just a single bfxil instruction and
that's exactly what we'd want here.  With it, however, I'm now seeing

f:
        and     x1, x1, 9223372036854775807
        and     x0, x0, -9223372036854775808
        orr     x0, x1, x0
        ret

which seems to be even worse than gcc-8 where we got a bfi instruction.

Ultimately, the best solution here will probably depend on which we
think is more likely, copysign or the example I give above.

It might be that for copysign we'll need to expand initially to some
unspec that uses a register initialized with a suitable immediate, but
otherwise hides the operation from combine until after that has run,
thus preventing the compiler from doing the otherwise right thing.  We'd
lose in the (hopefully) rare case where the operands really were in
general registers, but otherwise win for the more common case where they
aren't.

R.

> 
> Jeff
> 
> 
> P
> 
> 	PR target/87369
> 	* config/aarch64/aarch64.md (aarch64_bfxil<mode>): Do not accept
> 	constant with just the high bit set.  That's better handled by
> 	the "bit" pattern.
> 
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 88f66104db3..ad6822410c2 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -5342,9 +5342,11 @@
>  		    (match_operand:GPI 3 "const_int_operand" "n, Ulc"))
>  	    (and:GPI (match_operand:GPI 2 "register_operand" "0,r")
>  		    (match_operand:GPI 4 "const_int_operand" "Ulc, n"))))]
> -  "(INTVAL (operands[3]) == ~INTVAL (operands[4]))
> -  && (aarch64_high_bits_all_ones_p (INTVAL (operands[3]))
> -    || aarch64_high_bits_all_ones_p (INTVAL (operands[4])))"
> +  "(INTVAL (operands[3]) == ~INTVAL (operands[4])
> +    && ((aarch64_high_bits_all_ones_p (INTVAL (operands[3]))
> +	 && popcount_hwi (INTVAL (operands[3])) != 1)
> +        || (aarch64_high_bits_all_ones_p (INTVAL (operands[4]))
> +	    && popcount_hwi (INTVAL (operands[4])) != 1)))"
>    {
>      switch (which_alternative)
>      {
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFA] [target/87369] Prefer "bit" over "bfxil"
  2018-12-07 17:31 ` Richard Earnshaw (lists)
@ 2018-12-07 18:01   ` Jeff Law
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff Law @ 2018-12-07 18:01 UTC (permalink / raw)
  To: Richard Earnshaw (lists), gcc-patches, James Greenhalgh

On 12/7/18 10:31 AM, Richard Earnshaw (lists) wrote:
> On 07/12/2018 15:52, Jeff Law wrote:
>> As I suggested in the BZ, this patch rejects constants with  just the
>> high bit set for the recently added "bfxil" pattern.  As a result we'll
>> return to using "bit" for the test in the BZ.
>>
>> I'm not versed enough in aarch64 performance tuning to know if "bit" is
>> actually a better choice than "bfxil".  "bit" results in better code for
>> the testcase, but that seems more a function of register allocation than
>> "bit" being inherently better than "bfxil".   Obviously someone with
>> more aarch64 knowledge needs to make a decision here.
>>
>> My first iteration of the patch changed "aarch64_high_bits_all_ones_p".
>> We could still go that way too, though the name probably needs to change.
>>
>> I've bootstrapped and regression tested on aarch64-linux-gnu and it
>> fixes the regression.  I've also bootstrapped aarch64_be-linux-gnu, but
>> haven't done any kind of regression tested on that platform.
>>
>>
>> OK for the trunk?
> 
> The problem here is that the optimum solution depends on the register
> classes involved and we don't know this during combine.  If we have
> general register, then we want bfi/bfxil to be used; if we have a vector
> register, then bit is preferable as it changes 3 inter-bank register
> copies to a single inter-bank copy; and that copy might be hoisted out
> of a loop.
Ugh.  Things are never simple, are they?

> 
> Ultimately, the best solution here will probably depend on which we
> think is more likely, copysign or the example I give above.
I'd tend to suspect we'd see more pure integer bit twiddling than the
copysign stuff.


> 
> It might be that for copysign we'll need to expand initially to some
> unspec that uses a register initialized with a suitable immediate, but
> otherwise hides the operation from combine until after that has run,
> thus preventing the compiler from doing the otherwise right thing.  We'd
> lose in the (hopefully) rare case where the operands really were in
> general registers, but otherwise win for the more common case where they
> aren't.
Could we have the bfxil pattern have an alternative that accepts vector
regs and generates bit in appropriate circumstances?

Hmm, maybe the other way around would be better.   Have the "bit"
pattern with a general register alternative that generates bfxil when
presented with general registers.

I would generally warn against hiding things in unspecs like you've
suggested above.  We're seeing cases where that's been on in the x86
backend and it's inhibiting optimizations in various places.

Thoughts?
Jeff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFA] [target/87369] Prefer "bit" over "bfxil"
  2018-12-07 18:48 Wilco Dijkstra
@ 2018-12-07 19:01 ` Jeff Law
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff Law @ 2018-12-07 19:01 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Earnshaw, James Greenhalgh; +Cc: GCC Patches, nd

On 12/7/18 11:48 AM, Wilco Dijkstra wrote:
> Hi,
> 
>>> Ultimately, the best solution here will probably depend on which we
>>> think is more likely, copysign or the example I give above.
>> I'd tend to suspect we'd see more pure integer bit twiddling than the
>> copysign stuff.
> 
> All we need to do is to clearly separate the integer and FP/SIMD cases.
> Copysign should always expand into a pattern that cannot generate
> integer instructions. This could be done by adding a bit/bif pattern with
> UNSPEC for the DI/SImode case or use V2DI/V2SI in the copysign
> expansion.
As I've noted, adding those unspecs is likely to get in the way of
things like CSE, combine, etc.

>  
>> Could we have the bfxil pattern have an alternative that accepts vector
>> regs and generates bit in appropriate circumstances?
> 
> We already do that in too many cases, and it only makes the problem
> worse since the register allocator cannot cost these patterns at all (let
> alone accurately). This is particularly bad when the expansions are
> wildly different and emit extra instructions which cannot be optimized
> after register allocation.
I'm not sure what you mean by it can't cost them.  Costs certainly
factor into the algorithms used by IRA/LRA.  But I would agree that it's
not particularly good at costing across register banks and modeling the
cost of reloads it'll have to generate if it needs to move a value from
one bank to another.

> 
> We simply need to make an early choice which register file to use.
GCC fundamentally isn't designed to do that.

> 
>> Hmm, maybe the other way around would be better.   Have the "bit"
>> pattern with a general register alternative that generates bfxil when
>> presented with general registers.
> 
> We already have that, and that's a very complex pattern which already
> results in inefficient integer code.
> 
> For the overlapping cases between bfi and bfxil the mid-end should really
> simplify one into the other to avoid having to have multiple MD patterns
> for equivalent expressions. This may solve the problem.
Well, the bfxil pattern is general enough to handle both, the problem is
it only works on one register file.

> 
>> I would generally warn against hiding things in unspecs like you've
>> suggested above.  We're seeing cases where that's been on in the x86
>> backend and it's inhibiting optimizations in various places.
> 
> In the general case we can't describe a clear preference for a specific
> register file without support for scalar vector types (eg. V1DI, V1SI) or
> having a way to set virtual register preferences at expand time.
I'm going to step away from this problem.  It looked like it might be
trackable, but there's clearly a lot more to it and someone with more
experience on aarch64 will have to run with it.

Patch withdrawn.

jeff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFA] [target/87369] Prefer "bit" over "bfxil"
@ 2018-12-07 18:48 Wilco Dijkstra
  2018-12-07 19:01 ` Jeff Law
  0 siblings, 1 reply; 5+ messages in thread
From: Wilco Dijkstra @ 2018-12-07 18:48 UTC (permalink / raw)
  To: Jeff Law, Richard Earnshaw, James Greenhalgh; +Cc: GCC Patches, nd

Hi,

>> Ultimately, the best solution here will probably depend on which we
>> think is more likely, copysign or the example I give above.
> I'd tend to suspect we'd see more pure integer bit twiddling than the
> copysign stuff.

All we need to do is to clearly separate the integer and FP/SIMD cases.
Copysign should always expand into a pattern that cannot generate
integer instructions. This could be done by adding a bit/bif pattern with
UNSPEC for the DI/SImode case or use V2DI/V2SI in the copysign
expansion.

> Could we have the bfxil pattern have an alternative that accepts vector
> regs and generates bit in appropriate circumstances?

We already do that in too many cases, and it only makes the problem
worse since the register allocator cannot cost these patterns at all (let
alone accurately). This is particularly bad when the expansions are
wildly different and emit extra instructions which cannot be optimized
after register allocation.

We simply need to make an early choice which register file to use.

> Hmm, maybe the other way around would be better.   Have the "bit"
> pattern with a general register alternative that generates bfxil when
> presented with general registers.

We already have that, and that's a very complex pattern which already
results in inefficient integer code.

For the overlapping cases between bfi and bfxil the mid-end should really
simplify one into the other to avoid having to have multiple MD patterns
for equivalent expressions. This may solve the problem.

> I would generally warn against hiding things in unspecs like you've
> suggested above.  We're seeing cases where that's been on in the x86
> backend and it's inhibiting optimizations in various places.

In the general case we can't describe a clear preference for a specific
register file without support for scalar vector types (eg. V1DI, V1SI) or
having a way to set virtual register preferences at expand time.

Wilco

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-12-07 19:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-07 15:52 [RFA] [target/87369] Prefer "bit" over "bfxil" Jeff Law
2018-12-07 17:31 ` Richard Earnshaw (lists)
2018-12-07 18:01   ` Jeff Law
2018-12-07 18:48 Wilco Dijkstra
2018-12-07 19:01 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).