public inbox for gcc-patches@gcc.gnu.org
* [PATCH v2 0/2] Series of patches to fix PR106594
@ 2023-03-09 12:08 Richard Sandiford
  2023-03-09 12:09 ` [PATCH v2 1/2] combine: Split code out of make_compound_operation_int Richard Sandiford
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-03-09 12:08 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

This series of patches fixes PR106594, an aarch64 regression in which
we fail to combine an extension into an address.  The first patch just
refactors code.  The second patch contains the actual fix.

The cover note for the second patch describes the problem and the fix.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard

* [PATCH v2 1/2] combine: Split code out of make_compound_operation_int
  2023-03-09 12:08 [PATCH v2 0/2] Series of patches to fix PR106594 Richard Sandiford
@ 2023-03-09 12:09 ` Richard Sandiford
  2023-03-31 11:06   ` Segher Boessenkool
  2023-03-09 12:10 ` [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
  2023-03-27 12:14 ` [PATCH v2 0/2] Series of patches to fix PR106594 Richard Sandiford
  2 siblings, 1 reply; 10+ messages in thread
From: Richard Sandiford @ 2023-03-09 12:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

This patch just splits some code out of make_compound_operation_int
into a new function called make_compound_operation_and.  It is a
prerequisite for the fix for PR106594.

It might (or might not) make sense to put more of the existing
"and" handling into the new function, so that the subreg+lshiftrt
case can be handled through recursion rather than duplication.
But that's certainly not necessary to fix the bug, so is at
best stage 1 material.

No behavioural change intended.
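
For reference, the code being moved recognises (roughly) the following
shape, where R and MASK are just placeholders for this sketch:

  (and:DI (subreg:DI (reg:SI R) 0) (const_int MASK))
    -->
  (zero_extend:DI (reg:SI R))

provided that every SImode bit is either set in MASK or known (via
nonzero_bits) to be zero in R.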

gcc/
	* combine.cc (make_compound_operation_and): New function,
	split out from...
	(make_compound_operation_int): ...here.
---
 gcc/combine.cc | 84 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 30 deletions(-)

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 053879500b7..7d446d02cb4 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7952,6 +7952,56 @@ extract_left_shift (scalar_int_mode mode, rtx x, int count)
   return 0;
 }
 \f
+/* A subroutine of make_compound_operation_int.  Try to combine an outer
+   AND of X and MASK with a partnering inner operation to form a compound
+   operation.  Return the new X on success, otherwise return null.
+
+   MODE is the mode of X.  IN_CODE is as for make_compound_operation.
+   NEXT_CODE is the value of IN_CODE that should be used for (recursive)
+   calls to make_compound_operation.  */
+
+static rtx
+make_compound_operation_and (scalar_int_mode mode, rtx x,
+			     unsigned HOST_WIDE_INT mask,
+			     rtx_code in_code, rtx_code next_code)
+{
+  switch (GET_CODE (x))
+    {
+    case SUBREG:
+      /* If the operand is a paradoxical subreg of a register or memory
+	 and MASK (limited to the smaller mode) has only zero bits where
+	 the sub expression has known zero bits, this can be expressed as
+	 a zero_extend.  */
+      {
+	rtx sub = XEXP (x, 0);
+	machine_mode sub_mode = GET_MODE (sub);
+	int sub_width;
+	if ((REG_P (sub) || MEM_P (sub))
+	    && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
+	    && sub_width < GET_MODE_PRECISION (mode))
+	  {
+	    unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
+	    unsigned HOST_WIDE_INT submask;
+
+	    /* The shifted AND constant with all the known zero
+	       bits set.  */
+	    submask = mask | ~nonzero_bits (sub, sub_mode);
+	    if ((submask & mode_mask) == mode_mask)
+	      {
+		rtx new_rtx = make_compound_operation (sub, next_code);
+		return make_extraction (mode, new_rtx, 0, 0, sub_width,
+					1, 0, in_code == COMPARE);
+	      }
+	  }
+	break;
+      }
+
+    default:
+      break;
+    }
+  return NULL_RTX;
+}
+
 /* Subroutine of make_compound_operation.  *X_PTR is the rtx at the current
    level of the expression and MODE is its mode.  IN_CODE is as for
    make_compound_operation.  *NEXT_CODE_PTR is the value of IN_CODE
@@ -8184,36 +8234,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
 				   make_compound_operation (XEXP (x, 0),
 							    next_code),
 				   i, NULL_RTX, 1, 1, 0, 1);
-
-      /* If the one operand is a paradoxical subreg of a register or memory and
-	 the constant (limited to the smaller mode) has only zero bits where
-	 the sub expression has known zero bits, this can be expressed as
-	 a zero_extend.  */
-      else if (GET_CODE (XEXP (x, 0)) == SUBREG)
-	{
-	  rtx sub;
-
-	  sub = XEXP (XEXP (x, 0), 0);
-	  machine_mode sub_mode = GET_MODE (sub);
-	  int sub_width;
-	  if ((REG_P (sub) || MEM_P (sub))
-	      && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
-	      && sub_width < mode_width)
-	    {
-	      unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
-	      unsigned HOST_WIDE_INT mask;
-
-	      /* original AND constant with all the known zero bits set */
-	      mask = UINTVAL (XEXP (x, 1)) | (~nonzero_bits (sub, sub_mode));
-	      if ((mask & mode_mask) == mode_mask)
-		{
-		  new_rtx = make_compound_operation (sub, next_code);
-		  new_rtx = make_extraction (mode, new_rtx, 0, 0, sub_width,
-					     1, 0, in_code == COMPARE);
-		}
-	    }
-	}
-
+      else
+	new_rtx = make_compound_operation_and (mode, XEXP (x, 0),
+					       UINTVAL (XEXP (x, 1)),
+					       in_code, next_code);
       break;
 
     case LSHIFTRT:
-- 
2.25.1


* [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594]
  2023-03-09 12:08 [PATCH v2 0/2] Series of patch to fix PR106594 Richard Sandiford
  2023-03-09 12:09 ` [PATCH v2 1/2] combine: Split code out of make_compound_operation_int Richard Sandiford
@ 2023-03-09 12:10 ` Richard Sandiford
  2023-03-31 12:16   ` Segher Boessenkool
  2023-03-27 12:14 ` [PATCH v2 0/2] Series of patches to fix PR106594 Richard Sandiford
  2 siblings, 1 reply; 10+ messages in thread
From: Richard Sandiford @ 2023-03-09 12:10 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

g:c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f uses nonzero_bits
information to convert sign_extends into zero_extends.
That change is semantically correct in itself, but for the
testcase in the PR, it leads to a series of unfortunate events,
as described below.

We try to combine:

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
      REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
      REG_DEAD r116:DI
      REG_EQUAL [r116:DI*0x4+`constellation_64qam']

which previously succeeded, giving:

(set (reg:SI 117 [ constellation_64qam[_5] ])
    (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
                (const_int 4 [0x4]))
            (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))

However, nonzero_bits can tell that only the low 6 bits of r115
can be nonzero.  The commit above therefore converts the sign_extend
to a zero_extend.  Using the same nonzero_bits information, we then
"expand" the zero_extend to:

  (and:DI (subreg:DI (reg:SI r115) 0)
          (const_int 63))

Substituting into the mult gives the unsimplified expression:

  (mult:DI (and:DI (subreg:DI (reg:SI r115) 0)
                   (const_int 63))
           (const_int 4))

The simplification rules for mult convert this to an ashift by 2.
Then, this rule in simplify_shift_const_1:

	  /* If we have (shift (logical)), move the logical to the outside
	     to allow it to possibly combine with another logical and the
	     shift to combine with another shift.  This also canonicalizes to
	     what a ZERO_EXTRACT looks like.  Also, some machines have
	     (and (shift)) insns.  */

moves the shift inside the "and", so that the expression becomes:

  (and:DI (ashift:DI (subreg:DI (reg:SI r115) 0)
                     (const_int 2))
          (const_int 252))

We later recanonicalise to a mult (since this is an address):

  (and:DI (mult:DI (subreg:DI (reg:SI r115) 0)
                   (const_int 4))
          (const_int 252))

But we fail to transform this back to the natural substitution:

  (mult:DI (zero_extend:DI (reg:SI r115))
           (const_int 4))

There are several other cases in which make_compound_operation
needs to look more than one level down in order to complete a
compound operation.  For example:

(a) the ashiftrt handling uses extract_left_shift to look through
    things like logic ops in order to find a partnering ashift
    operation

(b) the "and" handling looks through subregs, xors and iors
    to find a partnering lshiftrt

This patch takes the same approach for mult.
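
As a sanity check on the identity used in the MULT case below,

  (and (mult X 2**N1) N2) == (mult (and X (lshiftrt N2 N1)) 2**N1)

here is a small standalone C program (an illustration only, not part of
the patch) that brute-forces it for all 8-bit values of X and N2 and
all shift amounts N1:

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* The identity, in 8-bit modular arithmetic, with the
       multiplication by 2**N1 written as a left shift.  */
    for (unsigned int n1 = 0; n1 < 8; n1++)
      for (unsigned int n2 = 0; n2 < 256; n2++)
        for (unsigned int x = 0; x < 256; x++)
          {
            uint8_t lhs = (uint8_t) (x << n1) & n2;
            uint8_t rhs = (uint8_t) ((x & (n2 >> n1)) << n1);
            assert (lhs == rhs);
          }
    return 0;
  }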

gcc/
	PR rtl-optimization/106594
	* combine.cc (make_compound_operation_and): Look through
	multiplications by a power of two.

gcc/testsuite/
	* gcc.target/aarch64/pr106594.c: New test.
---
 gcc/combine.cc                              | 17 +++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr106594.c | 20 ++++++++++++++++++++
 2 files changed, 37 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 7d446d02cb4..36d04ad6703 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7996,6 +7996,23 @@ make_compound_operation_and (scalar_int_mode mode, rtx x,
 	break;
       }
 
+    case MULT:
+      /* Recurse through a power of 2 multiplication (as can be found
+	 in an address), using the relationship:
+
+	 (and (mult X 2**N1) N2) == (mult (and X (lshiftrt N2 N1)) 2**N1).  */
+      if (CONST_INT_P (XEXP (x, 1))
+	  && pow2p_hwi (INTVAL (XEXP (x, 1))))
+	{
+	  int shift = exact_log2 (INTVAL (XEXP (x, 1)));
+	  rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
+						 mask >> shift, in_code,
+						 next_code);
+	  if (sub)
+	    return gen_rtx_MULT (mode, sub, XEXP (x, 1));
+	}
+      break;
+
     default:
       break;
     }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
new file mode 100644
index 00000000000..beda8e050a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */
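
(For reference, one way to run just this new test from an existing
aarch64 build directory, assuming the usual DejaGnu setup, is
"make check-gcc RUNTESTFLAGS=aarch64.exp=pr106594.c".)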
-- 
2.25.1


* Re: [PATCH v2 0/2] Series of patches to fix PR106594
  2023-03-09 12:08 [PATCH v2 0/2] Series of patches to fix PR106594 Richard Sandiford
  2023-03-09 12:09 ` [PATCH v2 1/2] combine: Split code out of make_compound_operation_int Richard Sandiford
  2023-03-09 12:10 ` [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
@ 2023-03-27 12:14 ` Richard Sandiford
  2 siblings, 0 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-03-27 12:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher

Ping

https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613640.html

Richard Sandiford <richard.sandiford@arm.com> writes:
> This series of patches fixes PR106594, an aarch64 regression in which
> we fail to combine an extension into an address.  The first patch just
> refactors code.  The second patch contains the actual fix.
>
> The cover note for the second patch describes the problem and the fix.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>
> Richard

* Re: [PATCH v2 1/2] combine: Split code out of make_compound_operation_int
  2023-03-09 12:09 ` [PATCH v2 1/2] combine: Split code out of make_compound_operation_int Richard Sandiford
@ 2023-03-31 11:06   ` Segher Boessenkool
  2023-03-31 12:13     ` Richard Sandiford
  0 siblings, 1 reply; 10+ messages in thread
From: Segher Boessenkool @ 2023-03-31 11:06 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

Hi!

On Thu, Mar 09, 2023 at 12:09:59PM +0000, Richard Sandiford wrote:
> This patch just splits some code out of make_compound_operation_int
> into a new function called make_compound_operation_and.  It is a
> prerequisite for the fix for PR106594.
> 
> It might (or might not) make sense to put more of the existing
> "and" handling into the new function, so that the subreg+lshiftrt
> case can be handled through recursion rather than duplication.
> But that's certainly not necessary to fix the bug, so is at
> best stage 1 material.
> 
> No behavioural change intended.

That doesn't sound as if you are very sure about things.  I'll just
pretend it says "no functional changes" :-)

(*Is* this a pure refactoring?)


Segher

* Re: [PATCH v2 1/2] combine: Split code out of make_compound_operation_int
  2023-03-31 11:06   ` Segher Boessenkool
@ 2023-03-31 12:13     ` Richard Sandiford
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-03-31 12:13 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Thu, Mar 09, 2023 at 12:09:59PM +0000, Richard Sandiford wrote:
>> This patch just splits some code out of make_compound_operation_int
>> into a new function called make_compound_operation_and.  It is a
>> prerequisite for the fix for PR106594.
>> 
>> It might (or might not) make sense to put more of the existing
>> "and" handling into the new function, so that the subreg+lshiftrt
>> case can be handled through recursion rather than duplication.
>> But that's certainly not necessary to fix the bug, so is at
>> best stage 1 material.
>> 
>> No behavioural change intended.
>
> That doesn't sound as if you are very sure about things.  I'll just
> pretend it says "no functional changes" :-)

Heh.  Stock phrase borrowed from LLVM.  I'm not going to sign off with
"this patch contains no bugs". :-)

> (*Is* this a pure refactoring?)

Yeah, this patch is a pure refactoring.  It's the follow-on 2/2 part
that does something useful.

Thanks,
Richard

* Re: [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594]
  2023-03-09 12:10 ` [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
@ 2023-03-31 12:16   ` Segher Boessenkool
  2023-03-31 14:06     ` Richard Sandiford
  0 siblings, 1 reply; 10+ messages in thread
From: Segher Boessenkool @ 2023-03-31 12:16 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On Thu, Mar 09, 2023 at 12:10:51PM +0000, Richard Sandiford wrote:
> g:c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f uses nonzero_bits
> information to convert sign_extends into zero_extends.
> That change is semantically correct in itself, but for the
> testcase in the PR, it leads to a series of unfortunate events,
> as described below.

That patch also does other much more obviously beneficial transforms.
These things should not have been mashed together.  Smaller more atomic
patches are much nicer to handle :-/

> We try to combine:
> 
> Trying 24 -> 25:
>    24: r116:DI=sign_extend(r115:SI)
>       REG_DEAD r115:SI
>    25: r117:SI=[r116:DI*0x4+r118:DI]
>       REG_DEAD r116:DI
>       REG_EQUAL [r116:DI*0x4+`constellation_64qam']
> 
> which previously succeeded, giving:
> 
> (set (reg:SI 117 [ constellation_64qam[_5] ])
>     (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
>                 (const_int 4 [0x4]))
>             (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
> 
> However, nonzero_bits can tell that only the low 6 bits of r115
> can be nonzero.  The commit above therefore converts the sign_extend
> to a zero_extend.  Using the same nonzero_bits information, we then
> "expand" the zero_extend to:
> 
>   (and:DI (subreg:DI (reg:SI r115) 0)
>           (const_int 63))

This is more expensive already?!  An "and" usually costs a real machine
instruction, while subregs are usually free (just notational).  The
"and" can not easily be optimised away either (after combine we only
have less accurate nonzero_bits, to start with).

> Substituting into the mult gives the unsimplified expression:
> 
>   (mult:DI (and:DI (subreg:DI (reg:SI r115) 0)
>                    (const_int 63))
>            (const_int 4))
> 
> The simplification rules for mult convert this to an ashift by 2.
> Then, this rule in simplify_shift_const_1:
> 
> 	  /* If we have (shift (logical)), move the logical to the outside
> 	     to allow it to possibly combine with another logical and the
> 	     shift to combine with another shift.  This also canonicalizes to
> 	     what a ZERO_EXTRACT looks like.  Also, some machines have
> 	     (and (shift)) insns.  */
> 
> moves the shift inside the "and", so that the expression becomes:
> 
>   (and:DI (ashift:DI (subreg:DI (reg:SI r115) 0)
>                      (const_int 2))
>           (const_int 252))
> 
> We later recanonicalise to a mult (since this is an address):
> 
>   (and:DI (mult:DI (subreg:DI (reg:SI r115) 0)
>                    (const_int 4))
>           (const_int 252))
> 
> But we fail to transform this back to the natural substitution:
> 
>   (mult:DI (zero_extend:DI (reg:SI r115))
>            (const_int 4))
> 
> There are several other cases in which make_compound_operation
> needs to look more than one level down in order to complete a
> compound operation.  For example:
> 
> (a) the ashiftrt handling uses extract_left_shift to look through
>     things like logic ops in order to find a partnering ashift
>     operation
> 
> (b) the "and" handling looks through subregs, xors and iors
>     to find a partnering lshiftrt
> 
> This patch takes the same approach for mult.
> 
> gcc/
> 	PR rtl-optimization/106594
> 	* combine.cc (make_compound_operation_and): Look through
> 	multiplications by a power of two.
> 
> gcc/testsuite/
> 	* gcc.target/aarch64/pr106594.c: New test.
> ---
>  gcc/combine.cc                              | 17 +++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/pr106594.c | 20 ++++++++++++++++++++
>  2 files changed, 37 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 7d446d02cb4..36d04ad6703 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -7996,6 +7996,23 @@ make_compound_operation_and (scalar_int_mode mode, rtx x,
>  	break;
>        }
>  
> +    case MULT:
> +      /* Recurse through a power of 2 multiplication (as can be found
> +	 in an address), using the relationship:
> +
> +	 (and (mult X 2**N1) N2) == (mult (and X (lshiftrt N2 N1)) 2**N1).  */
> +      if (CONST_INT_P (XEXP (x, 1))
> +	  && pow2p_hwi (INTVAL (XEXP (x, 1))))
> +	{
> +	  int shift = exact_log2 (INTVAL (XEXP (x, 1)));
> +	  rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
> +						 mask >> shift, in_code,
> +						 next_code);
> +	  if (sub)
> +	    return gen_rtx_MULT (mode, sub, XEXP (x, 1));
> +	}
> +      break;
> +
>      default:
>        break;
>      }
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
> new file mode 100644
> index 00000000000..beda8e050a5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2" } */
> +
> +extern const int constellation_64qam[64];
> +
> +void foo(int nbits,
> +         const char *p_src,
> +         int *p_dst) {
> +
> +  while (nbits > 0U) {
> +    char first = *p_src++;
> +
> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
> +
> +    *p_dst++ = constellation_64qam[index1];
> +
> +    nbits--;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */

This looks pretty good (thanks!), but as always it needs testing on more
architectures, showing it doesn't hurt things.  It should be beneficial,
but it is not unlikely to hurt other existing backends, and we are not
in stage 1 (we are in stage 4 even!)

Do you have any such proof / indication / anything?  I can start
some run but it takes a day (or two), and I cannot start it until next
week.

Do you have plans to make combine not do this insane "and" thing at all?
Or to attack the compound operation stuff head on?


Segher

* Re: [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594]
  2023-03-31 12:16   ` Segher Boessenkool
@ 2023-03-31 14:06     ` Richard Sandiford
  2023-03-31 14:35       ` Segher Boessenkool
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Sandiford @ 2023-03-31 14:06 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Thu, Mar 09, 2023 at 12:10:51PM +0000, Richard Sandiford wrote:
>> g:c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f uses nonzero_bits
>> information to convert sign_extends into zero_extends.
>> That change is semantically correct in itself, but for the
>> testcase in the PR, it leads to a series of unfortunate events,
>> as described below.
>
> That patch also does other much more obviously beneficial transforms.
> These things should not have been mashed together.  Smaller more atomic
> patches are much nicer to handle :-/
>
>> We try to combine:
>> 
>> Trying 24 -> 25:
>>    24: r116:DI=sign_extend(r115:SI)
>>       REG_DEAD r115:SI
>>    25: r117:SI=[r116:DI*0x4+r118:DI]
>>       REG_DEAD r116:DI
>>       REG_EQUAL [r116:DI*0x4+`constellation_64qam']
>> 
>> which previously succeeded, giving:
>> 
>> (set (reg:SI 117 [ constellation_64qam[_5] ])
>>     (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
>>                 (const_int 4 [0x4]))
>>             (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
>> 
>> However, nonzero_bits can tell that only the low 6 bits of r115
>> can be nonzero.  The commit above therefore converts the sign_extend
>> to a zero_extend.  Using the same nonzero_bits information, we then
>> "expand" the zero_extend to:
>> 
>>   (and:DI (subreg:DI (reg:SI r115) 0)
>>           (const_int 63))
>
> This is more expensive already?!  An "and" usually costs a real machine
> instruction, while subregs are usually free (just notational).  The
> "and" can not easily be optimised away either (after combine we only
> have less accurate nonzero_bits, to start with).

Well, it's an "and" in place of a zero_extend rather than an "and"
in place of a subreg.  So it is one real instruction for another
real instruction.  But yeah, it's not exactly a simplification.

The "we" here is expand_compound_operation, via simplify_and_const_int.
So it's the usual thing: expand_compound_operation creates something
that is perhaps more expensive, then after simplification,
make_compound_operation is supposed to collapse these "expanded"
perhaps-more-expensive forms back.  And it's the last bit that
goes wrong.
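
To spell out the round trip for this particular testcase (all of the
forms below are taken from the patch description):

  (zero_extend:DI (reg:SI r115))
    --> expand_compound_operation -->
  (and:DI (subreg:DI (reg:SI r115) 0) (const_int 63))
    --> substitution into the mult + simplification -->
  (and:DI (mult:DI (subreg:DI (reg:SI r115) 0) (const_int 4))
          (const_int 252))
    --> make_compound_operation (with patch 2/2) -->
  (mult:DI (zero_extend:DI (reg:SI r115)) (const_int 4))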

>> Substituting into the mult gives the unsimplified expression:
>> 
>>   (mult:DI (and:DI (subreg:DI (reg:SI r115) 0)
>>                    (const_int 63))
>>            (const_int 4))
>> 
>> The simplification rules for mult convert this to an ashift by 2.
>> Then, this rule in simplify_shift_const_1:
>> 
>> 	  /* If we have (shift (logical)), move the logical to the outside
>> 	     to allow it to possibly combine with another logical and the
>> 	     shift to combine with another shift.  This also canonicalizes to
>> 	     what a ZERO_EXTRACT looks like.  Also, some machines have
>> 	     (and (shift)) insns.  */
>> 
>> moves the shift inside the "and", so that the expression becomes:
>> 
>>   (and:DI (ashift:DI (subreg:DI (reg:SI r115) 0)
>>                      (const_int 2))
>>           (const_int 252))
>> 
>> We later recanonicalise to a mult (since this is an address):
>> 
>>   (and:DI (mult:DI (subreg:DI (reg:SI r115) 0)
>>                    (const_int 4))
>>           (const_int 252))
>> 
>> But we fail to transform this back to the natural substitution:
>> 
>>   (mult:DI (zero_extend:DI (reg:SI r115))
>>            (const_int 4))
>> 
>> There are several other cases in which make_compound_operation
>> needs to look more than one level down in order to complete a
>> compound operation.  For example:
>> 
>> (a) the ashiftrt handling uses extract_left_shift to look through
>>     things like logic ops in order to find a partnering ashift
>>     operation
>> 
>> (b) the "and" handling looks through subregs, xors and iors
>>     to find a partnering lshiftrt
>> 
>> This patch takes the same approach for mult.
>> 
>> gcc/
>> 	PR rtl-optimization/106594
>> 	* combine.cc (make_compound_operation_and): Look through
>> 	multiplications by a power of two.
>> 
>> gcc/testsuite/
>> 	* gcc.target/aarch64/pr106594.c: New test.
>> ---
>>  gcc/combine.cc                              | 17 +++++++++++++++++
>>  gcc/testsuite/gcc.target/aarch64/pr106594.c | 20 ++++++++++++++++++++
>>  2 files changed, 37 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c
>> 
>> diff --git a/gcc/combine.cc b/gcc/combine.cc
>> index 7d446d02cb4..36d04ad6703 100644
>> --- a/gcc/combine.cc
>> +++ b/gcc/combine.cc
>> @@ -7996,6 +7996,23 @@ make_compound_operation_and (scalar_int_mode mode, rtx x,
>>  	break;
>>        }
>>  
>> +    case MULT:
>> +      /* Recurse through a power of 2 multiplication (as can be found
>> +	 in an address), using the relationship:
>> +
>> +	 (and (mult X 2**N1) N2) == (mult (and X (lshiftrt N2 N1)) 2**N1).  */
>> +      if (CONST_INT_P (XEXP (x, 1))
>> +	  && pow2p_hwi (INTVAL (XEXP (x, 1))))
>> +	{
>> +	  int shift = exact_log2 (INTVAL (XEXP (x, 1)));
>> +	  rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
>> +						 mask >> shift, in_code,
>> +						 next_code);
>> +	  if (sub)
>> +	    return gen_rtx_MULT (mode, sub, XEXP (x, 1));
>> +	}
>> +      break;
>> +
>>      default:
>>        break;
>>      }
>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
>> new file mode 100644
>> index 00000000000..beda8e050a5
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
>> @@ -0,0 +1,20 @@
>> +/* { dg-options "-O2" } */
>> +
>> +extern const int constellation_64qam[64];
>> +
>> +void foo(int nbits,
>> +         const char *p_src,
>> +         int *p_dst) {
>> +
>> +  while (nbits > 0U) {
>> +    char first = *p_src++;
>> +
>> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
>> +
>> +    *p_dst++ = constellation_64qam[index1];
>> +
>> +    nbits--;
>> +  }
>> +}
>> +
>> +/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */
>
> This looks pretty good (thanks!), but as always it needs testing on more
> architectures, showing it doesn't hurt things.  It should be beneficial,
> but it is not unlikely to hurt other existing backends, and we are not
> in stage 1 (we are in stage 4 even!)

Yeah, but it's fixing a GCC 13 regression (and an important one on aarch64).

> Do you have any such proof / indication / anything?  I can start
> some run but it takes a day (or two), and I cannot start it until next
> week.

This is an alternative presentation of the change that we discussed
a few weeks ago, and that you already tested:

  https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613486.html

The results seem to indicate that the patch had no effect on targets
other than aarch64.  [A good thing, IMO. :-)  The purpose of the
patch is to fix the aarch64 regression in a minimally invasive way.]

I tried building an aarch64 linux kernel locally with -Os
-fno-schedule-insns{,2}.  I saw code size improvements in 182 .os and a
regression in only one .o.  (I was comparing individual .os because it
makes isolation easier.)

The one regression was because the optimised version had fewer pseudos,
and so something that was previously allocated to x3 was allocated to x2.
This pseudo was initialised to 0, and a preceding stack protect
instruction had the known side effect (both before and after the patch)
of setting x3 to 0.  So with the x3 allocation, postreload was able to
remove the separate initialisation of x3 with 0, but with the x2
allocation it couldn't.

So for my kernel build it was a 182:1 split in favour of improvements,
with the one regression being a quirk rather than a true regression.

I also tried seeing what effect it had for aarch64-linux-gnu on
gcc.c-torture, gcc.dg and g++.dg.  All the changes were improvements
here too.  (The point of using these testsuites is that they contain
small, naturally isolated tests that can be quickly compared en masse,
rather than because the code is representative of a real workload.)

ArmRAL (the workload that showed up the problem originally)
also improved, as expected.

I'm confident that this is an improvement on aarch64.

> Do you have plans to make combine not do this insane "and" thing at all?
> Or to attack the compound operation stuff head on?

I don't have any plans to do that myself, but I agree it would be
better to get rid of the compound operation stuff if we can.
I can see why the expand/simplify/remake process seemed like
a nice idea, but in practice, there's just too much that can
go wrong.  And removing the compound stuff would go a long way
to making combine simplification more like fwprop simplification
(unlike now, where we effectively have two separate simplification
processes for things involving *_extends and *_extracts).

Thanks,
Richard

* Re: [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594]
  2023-03-31 14:06     ` Richard Sandiford
@ 2023-03-31 14:35       ` Segher Boessenkool
  2023-03-31 14:55         ` Richard Sandiford
  0 siblings, 1 reply; 10+ messages in thread
From: Segher Boessenkool @ 2023-03-31 14:35 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford

On Fri, Mar 31, 2023 at 03:06:41PM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Thu, Mar 09, 2023 at 12:10:51PM +0000, Richard Sandiford wrote:
> >>   (and:DI (subreg:DI (reg:SI r115) 0)
> >>           (const_int 63))
> >
> > This is more expensive already?!  An "and" usually costs a real machine
> > instruction, while subregs are usually free (just notational).  The
> > "and" can not easily be optimised away either (after combine we only
> > have less accurate nonzero_bits, to start with).
> 
> Well, it's an "and" in place of a zero_extend rather than an "and"
> in place of a subreg.  So it is one real instruction for another
> real instruction.  But yeah, it's not exactly a simplification.

We should be able to optimise away the zero_extend in many cases, and we
cannot do that to an "and" usually.

> The "we" here is expand_compound_operation, via simplify_and_const_int.
> So it's the usual thing: expand_compound_operation creates something
> that is perhaps more expensive, then after simplification,
> make_compound_operation is supposed to collapse these "expanded"
> perhaps-more-expensive forms back.  And it's the last bit that
> goes wrong.

Yes.  And that is even unavoidable.  The whole concept is a bad idea.
We should not create worse things hoping that we can turn them into
better things later again; we usually cannot make it as good anymore,
or even reasonably good :-(

Totally skipping this dance of course results in worse code as well, so
more work is needed.

> > This looks pretty good (thanks!), but as always it needs testing on more
> > architectures, showing it doesn't hurt things.  It should be beneficial,
> > but it is not unlikely to hurt other existing backends, and we are not
> > in stage 1 (we are in stage 4 even!)
> 
> Yeah, but it's fixing a GCC 13 regression (and an important one on aarch64).

That is not an excuse for (potentially) causing hundreds of new
regressions, some maybe even worse.

> > Do you have any such proof / indication / anything?  I can start
> > some run but it takes a day (or two), and I cannot start it until next
> > week.
> 
> This is an alternative presentation of the change that we discussed
> a few weeks ago, and that you already tested:
> 
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613486.html
> 
> The results seem to indicate that the patch had no effect on targets
> other than aarch64.  [A good thing, IMO. :-)  The purpose of the
> patch is to fix the aarch64 regression in a minimally invasive way.]

If it is the same (I don't agree at all fwiw) it has no effect on
aarch64 either, other than on your single testcase.  So that is a good
reason to NAK this patch outside of stage 1 already.

> I tried building an aarch64 linux kernel locally with -Os
> -fno-schedule-insns{,2}.

Completely unrealistic.  No one builds anything they care about the speed of
with -Os (and if people care for size, you still get better results
with -O2 + some other tunings), and disabling scheduling is disastrous.

> I saw code size improvements in 182 .os and a
> regression in only one .o.  (I was comparing individual .os because it
> makes isolation easier.)


> The one regression was because the optimised version had fewer pseudos,
> and so something that was previously allocated to x3 was allocated to x2.
> This pseudo was initialised to 0, and a preceding stack protect
> instruction had the known side effect (both before and after the patch)
> of setting x3 to 0.  So with the x3 allocation, postreload was able to
> remove the separate initialisation of x3 with 0, but with the x2
> allocation it couldn't.
> 
> So for my kernel build it was a 182:1 split in favour of improvements,
> with the one regression being a quirk rather than a true regression.

For a default configuration it gave 100.033% size, a non-trivial
regression.  But again, I don't think this code has the same effect.

> > Do you have plans to make combine not do this insane "and" thing at all?
> > Or to attack the compound operation stuff head on?
> 
> I don't have any plans to do that myself, but I agree it would be
> better to get rid of the compound operation stuff if we can.

Getting rid of it is easy enough, but that also loses wanted
functionality (even for archs that do not have any extract
functionality; these functions do more than that :-( ).

> I can see why the expand/simplify/remake process seemed like
> a nice idea, but in practice, there's just too much that can
> go wrong.

Yup.

> And removing the compound stuff would go a long way
> to making combine simplification more like fwprop simpliification
> (unlike now, where we effectively have two separate simplification
> processes for things involving *_extends and *_extracts).

This has nothing to do with simplification?  We just use
simplify-rtx (which started out as part of combine exclusively), just
like everything else.

All the other transforms that are combine-only are very useful to have
as well, as shown a year or so ago.  But those are not
"simplifications" :-)


Segher

* Re: [PATCH v2 2/2] combine: Try harder to form zero_extends [PR106594]
  2023-03-31 14:35       ` Segher Boessenkool
@ 2023-03-31 14:55         ` Richard Sandiford
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-03-31 14:55 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Fri, Mar 31, 2023 at 03:06:41PM +0100, Richard Sandiford wrote:
>> This is an alternative presentation of the change that we discussed
>> a few weeks ago, and that you already tested:
>> 
>>   https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613486.html
>> 
>> The results seem to indicate that the patch had no effect on targets
>> other than aarch64.  [A good thing, IMO. :-)  The purpose of the
>> patch is to fix the aarch64 regression in a minimally invasive way.]
>
> If it is the same (I don't agree at all fwiw)

Why don't you agree that it's behaviourally the same?  It's really just
a refactoring of my earlier patch.

> it has no effect on aarch64 either, other than on your single
> testcase.  So that is a good reason to NAK this patch outside of stage
> 1 already.

But if it only affected this single testcase, I wouldn't see
improvements in 182 kernel objects :-)

>> I tried building an aarch64 linux kernel locally with -Os
>> -fno-schedule-insns{,2}.
>
> Completely unrealistic.  No one builds anything they care for speed fo
> with -Os (and if people care for size, you still get better results
> with -O2 + some other tunings), and disabling scheduling is disastrous.

As discussed before, the point is to use these results to get an
idea of the scope of the change.  And if we're using size to measure
the scope, it makes sense to align the compiler's goals with that,
since we then need to spend less time filtering the results for
oddities like different RA leading to different code layout, etc.

If we want to measure speed, we should do that.  But I don't expect
a significant speed impact on the kernel either way (whether the
patch is good or bad).

Thanks,
Richard
