From: Richard Sandiford
To: Segher Boessenkool
Cc: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina, Roger Sayle, Jeff Law
Subject: Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
Date: Wed, 08 Mar 2023 11:58:51 +0000
In-Reply-To: <20230306233142.GR25951@gate.crashing.org> (Segher Boessenkool's message of "Mon, 6 Mar 2023 17:31:42 -0600")

Segher Boessenkool writes:
> Hi!
>
> On Mon, Mar 06, 2023 at 07:13:08PM +0000, Richard Sandiford wrote:
>> Segher Boessenkool writes:
>> > Most importantly, what makes you think this is a problem for aarch64
>> > only?  If it actually is, you can fix it in the aarch64 config!  Either
>> > with or without new hooks, whatever works best.
>>
>> The point is that I don't think it's a problem for AArch64 only.
>> I think it's a generic issue that should be solved in a generic way
>> (which is what the patch is trying to do).  The suggestion to restrict
>> it to AArch64 came from Jakub.
>>
>> The reason I'm pushing back against a hook is precisely because
>> I don't want to solve this in AArch64-specific code.
>
> But it is many times worse still to do it in target-specific magic code
> disguised as generic code :-(
>
> If there is no clear explanation why combine should do X, then it
> probably should not.
>
>> I'm not sure we would be talking about restricting this to AArch64
>> if the patch had been posted in stage 1.  If people are concerned
>> about doing this for all targets in stage 4 (which they seem to be),
>
> Not me, not in principle.  But it takes more time than we have left in
> stage 4 to handle this, even for only combine.  We should give the other
> target maintainers much longer as well.
>
>> I thought the #ifdef was the simplest way of addressing that concern.
>
> An #ifdef is a way of making a change that is not finished yet not hurt
> the other targets.  It still hurts generic development, which indirectly
> hurts all targets.

Seems like this might be moot anyway given that your results suggest
no impact on other targets.

>> And I don't think what the patch does is ad hoc.
>
> It is almost impossible to explain what it does and why that is a good
> thing, why it is what we want, what we should do here; and certainly not
> in a compact, terse, focused way.  It has all the hallmarks of ad hoc
> patches.
>
>> Reorganising the
>> expression in this way isn't something new.  extract_left_shift already
>> does a similar thing (and does it for all targets).
>
> That is not similar at all, no.
>
> /* See if X (of mode MODE) contains an ASHIFT of COUNT or more bits that
>    can be commuted with any other operations in X.  Return X without
>    that shift if so.  */
>
> If you can factor out a utility function like that, with an actual nice
> description like that, it would be a much more palatable patch.

OK, I've factored it out below.  Does the comment look OK?

As mentioned in the patch description below, I think some of the other
"and" handling could be moved here too, which should avoid a bit of
(existing) duplication.  But that isn't necessary to fix the bug.

On the code size results: as I mentioned on IRC yesterday, when I tried
building the linux kernel locally (with -Os -fno-schedule-insns{,2}),
I saw code size improvements in 182 .os and a regression in only one .o.
(I was comparing individual .os because it makes isolation easier.)

The one regression was because the optimised version had fewer pseudos,
and so something that was previously allocated to x3 was allocated to x2.
This pseudo was initialised to 0, and a preceding stack protect
instruction had the known side effect (both before and after the patch)
of setting x3 to 0.  So with the x3 allocation, postreload was able to
remove the separate initialisation of x3 with 0, but with the x2
allocation it couldn't.

I also tried seeing what effect it had on gcc.c-torture, gcc.dg and
g++.dg.  All the changes were improvements.  And the testcase in the
PR was from a real workload (ArmRAL).

If you can isolate the case you saw, I'll have a look.  But ISTM that
the change is a pretty clear win on aarch64.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.

Thanks,
Richard

----

Combine's approach to simplifying a pattern X is to:

(1) expand "compound" operations (such as extends and extracts)
    into multiple operations (such as shifts), to give a new
    expression X1

(2) simplify X1 in the normal way, and with some combine-specific
    extras, to give X2

(3) remake compound operations from X2, to give X3

For example, (1) expands sign_extend to an ashift/ashiftrt pair,
with the ashift being an immediate operand of the ashiftrt.  I'll call
ashiftrt the "outer" operation and ashift the "inner" operation below.

By design, (2) can perturb the structure of the expanded operations
in X1.  Sometimes it will rework them entirely.  But sometimes it will
keep the outer operations and inner operations, but in a slightly
different arrangement.  For example, the inner operation might no
longer be a direct operand of the outer operation.

Therefore, when make_compound_operation sees a potential outer
operation, it sometimes looks through suboperands to find the
potential inner operation.  Examples include:

(a) the ashiftrt handling, which uses extract_left_shift to find
    a partnering ashift operation

(b) the "and" handling, which looks through subregs, xors and iors
    to find a partnering lshiftrt.

The PR contains another case where we need this.  We have:

  (set (reg:DI R2) (sign_extend:DI (reg:SI R1)))
  ...
  (plus:DI (mult:DI (reg:DI R2) (const_int 4))
           (reg:DI R3))
  ...

which is a natural, direct combination on aarch64.

First, (1) expands the sign_extend into two operations.  It uses
R1's nonzero_bits information to determine that the sign extension
is equivalent to a zero extension:

      /* Convert sign extension to zero extension, if we know that the high
	 bit is not set, as this is easier to optimize.  It will be converted
	 back to cheaper alternative in make_extraction.  */

As I'll explain below, the problem is that this conversion back to
a cheaper alternative doesn't happen.

The zero extension is expanded into an "and" of a subreg.  Again,
the nonzero_bits information narrows the "and" mask from a full
SImode mask to a smaller constant (63).  So X1 contains:

  (mult:DI (and:DI (subreg:DI (reg:SI R1) 0)
                   (const_int 63))
           (const_int 4))

The simplification rules for mult convert this to an ashift by 2.
Then, this rule in simplify_shift_const_1:

	  /* If we have (shift (logical)), move the logical to the outside
	     to allow it to possibly combine with another logical and the
	     shift to combine with another shift.  This also canonicalizes to
	     what a ZERO_EXTRACT looks like.  Also, some machines have
	     (and (shift)) insns.  */

moves the shift inside the "and", so that X2 contains:

  (and:DI (ashift:DI (subreg:DI (reg:SI R1) 0)
                     (const_int 2))
          (const_int 252))

We later recanonicalise to a mult (since this is an address):

  (and:DI (mult:DI (subreg:DI (reg:SI R1) 0)
                   (const_int 4))
          (const_int 252))

But we fail to transform this back to the natural substitution:

  (mult:DI (sign_extend:DI (reg:SI R1))
           (const_int 4))

This patch extends the "look through suboperands" approach described
above to handle this case too.

It might (or might not) make sense to put more of the existing "and"
handling into the new function, so that the subreg+lshiftrt case can
be handled through recursion rather than duplication.  But that's
certainly not necessary to fix the bug, so is at best stage 1 material.

gcc/
	PR rtl-optimization/106594
	* combine.cc (make_compound_operation_int): Extend the AND to
	ZERO_EXTEND transformation so that it can handle an intervening
	multiplication by a power of two.

gcc/testsuite/
	* gcc.target/aarch64/pr106594.c: New test.
---
 gcc/combine.cc                              | 96 ++++++++++++++-------
 gcc/testsuite/gcc.target/aarch64/pr106594.c | 21 +++++
 2 files changed, 87 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 053879500b7..4533ec1de84 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7952,6 +7952,68 @@ extract_left_shift (scalar_int_mode mode, rtx x, int count)
   return 0;
 }
 
+/* A subroutine of make_compound_operation_int.  Try to combine an outer
+   AND of X and MASK with a partnering inner operation to form a compound
+   operation.  Return the new X on success, otherwise return null.
+   MODE is the mode of X.  */
+
+static rtx
+make_compound_operation_and (scalar_int_mode mode, rtx x,
+			     unsigned HOST_WIDE_INT mask,
+			     rtx_code in_code, rtx_code next_code)
+{
+  switch (GET_CODE (x))
+    {
+    case SUBREG:
+      /* If the operand is a paradoxical subreg of a register or memory
+	 and MASK (limited to the smaller mode) has only zero bits where
+	 the sub expression has known zero bits, this can be expressed as
+	 a zero_extend.  */
+      {
+	rtx sub = XEXP (x, 0);
+	machine_mode sub_mode = GET_MODE (sub);
+	int sub_width;
+	if ((REG_P (sub) || MEM_P (sub))
+	    && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
+	    && sub_width < GET_MODE_PRECISION (mode))
+	  {
+	    unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
+	    unsigned HOST_WIDE_INT submask;
+
+	    /* The shifted AND constant with all the known zero
+	       bits set.  */
+	    submask = mask | ~nonzero_bits (sub, sub_mode);
+	    if ((submask & mode_mask) == mode_mask)
+	      {
+		rtx new_rtx = make_compound_operation (sub, next_code);
+		return make_extraction (mode, new_rtx, 0, 0, sub_width,
+					1, 0, in_code == COMPARE);
+	      }
+	  }
+	break;
+      }
+
+    case MULT:
+      /* Recurse through a power of 2 multiplication (as can be found
+	 in an address).  */
+      if (CONST_INT_P (XEXP (x, 1))
+	  && pow2p_hwi (INTVAL (XEXP (x, 1))))
+	{
+	  int shift = exact_log2 (INTVAL (XEXP (x, 1)));
+	  rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
+						 mask >> shift, in_code,
+						 next_code);
+	  if (sub)
+	    return gen_rtx_MULT (mode, sub, XEXP (x, 1));
+	}
+      break;
+
+    default:
+      break;
+    }
+  return NULL_RTX;
+}
+
 /* Subroutine of make_compound_operation.  *X_PTR is the rtx at the current
    level of the expression and MODE is its mode.  IN_CODE is as for
    make_compound_operation.  *NEXT_CODE_PTR is the value of IN_CODE
@@ -8184,36 +8246,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
 				   make_compound_operation (XEXP (x, 0),
 							    next_code),
 				   i, NULL_RTX, 1, 1, 0, 1);
-
-      /* If the one operand is a paradoxical subreg of a register or memory and
-	 the constant (limited to the smaller mode) has only zero bits where
-	 the sub expression has known zero bits, this can be expressed as
-	 a zero_extend.  */
-      else if (GET_CODE (XEXP (x, 0)) == SUBREG)
-	{
-	  rtx sub;
-
-	  sub = XEXP (XEXP (x, 0), 0);
-	  machine_mode sub_mode = GET_MODE (sub);
-	  int sub_width;
-	  if ((REG_P (sub) || MEM_P (sub))
-	      && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
-	      && sub_width < mode_width)
-	    {
-	      unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
-	      unsigned HOST_WIDE_INT mask;
-
-	      /* original AND constant with all the known zero bits set */
-	      mask = UINTVAL (XEXP (x, 1)) | (~nonzero_bits (sub, sub_mode));
-	      if ((mask & mode_mask) == mode_mask)
-		{
-		  new_rtx = make_compound_operation (sub, next_code);
-		  new_rtx = make_extraction (mode, new_rtx, 0, 0, sub_width,
-					     1, 0, in_code == COMPARE);
-		}
-	    }
-	}
-
+      else
+	new_rtx = make_compound_operation_and (mode, XEXP (x, 0),
+					       UINTVAL (XEXP (x, 1)),
+					       in_code, next_code);
       break;
 
     case LSHIFTRT:
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
new file mode 100644
index 00000000000..f10d72665e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
@@ -0,0 +1,21 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */
+
-- 
2.25.1