public inbox for gcc-patches@gcc.gnu.org
* [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
@ 2023-03-04 18:32 Roger Sayle
  2023-03-04 22:17 ` Segher Boessenkool
From: Roger Sayle @ 2023-03-04 18:32 UTC
  To: 'GCC Patches'
  Cc: 'Tamar Christina', 'Richard Sandiford',
	'Segher Boessenkool'

This patch addresses PR rtl-optimization/106594, a P1 performance
regression affecting aarch64.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.

If someone (who can regression test this on aarch64) could take this
from here that would be much appreciated.  Thanks in advance.


2023-03-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR rtl-optimization/106594
        * combine.cc (expand_compound_operation): Don't expand/transform
        ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
        cheap.

Roger
--


diff --git a/gcc/combine.cc b/gcc/combine.cc
index 0538795..cf126c8 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7288,7 +7288,17 @@ expand_compound_operation (rtx x)
 	  && (STORE_FLAG_VALUE & ~GET_MODE_MASK (inner_mode)) == 0)
 	return SUBREG_REG (XEXP (x, 0));
 
+      /* If ZERO_EXTEND is cheap on this target, do nothing,
+	 i.e. don't attempt to convert it to a pair of shifts.  */
+      if (set_src_cost (x, mode, optimize_this_for_speed_p)
+	  <= COSTS_N_INSNS (1))
+	return x;
     }
+  /* Likewise, if SIGN_EXTEND is cheap, do nothing.  */
+  else if (GET_CODE (x) == SIGN_EXTEND
+	   && set_src_cost (x, mode, optimize_this_for_speed_p)
+	      <= COSTS_N_INSNS (1))
+    return x;
 
   /* If we reach here, we want to return a pair of shifts.  The inner
      shift is a left shift of BITSIZE - POS - LEN bits.  The outer
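
The transformation being disabled here rewrites an extension as a pair
of shifts.  A minimal standalone sketch of that equivalence for a
32-to-64-bit zero extension (plain C++ for illustration, not
GCC-internal code; the variable names are made up):

  #include <cassert>
  #include <cstdint>

  int main ()
  {
    /* (zero_extend:DI (reg:SI x)) versus the equivalent shift pair:
       an ashift by 32 followed by an lshiftrt by 32, both logical
       shifts on the unsigned 64-bit value.  */
    uint32_t x = 0xdeadbeef;
    uint64_t extended = (uint64_t) x;
    uint64_t shift_pair = ((uint64_t) x << 32) >> 32;
    assert (extended == shift_pair);
    return 0;
  }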


* Re: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-04 18:32 [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap Roger Sayle
@ 2023-03-04 22:17 ` Segher Boessenkool
  2023-03-05 19:28   ` Tamar Christina
From: Segher Boessenkool @ 2023-03-04 22:17 UTC
  To: Roger Sayle
  Cc: 'GCC Patches', 'Tamar Christina',
	'Richard Sandiford'

On Sat, Mar 04, 2023 at 06:32:15PM -0000, Roger Sayle wrote:
> This patch addresses PR rtl-optimization/106594, a P1 performance
> regression affecting aarch64.
> 
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.

It should be tested for performance everywhere else, too.

It can very easily result in worse code on some targets.  This kind of
thing really should be done in stage 1, not stage 4.

>         PR rtl-optimization/106594
>         * combine.cc (expand_compound_operation): Don't expand/transform
>         ZERO_EXTEND or SIGN_EXTEND on targets where rtx_cost claims they are
>         cheap.

That is not how combine works.  If the old code is more expensive than
what combine comes up with, it *should* transform it.  Magic cost
cutoffs are not okay anywhere in combine, either.

If expand_compound_operation and friends misbehave (not really an "if",
unfortunately), then please fix that, instead of randomly disabling
parts of combine?


Segher


* Re: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-04 22:17 ` Segher Boessenkool
@ 2023-03-05 19:28   ` Tamar Christina
  2023-03-05 19:56     ` Jeff Law
From: Tamar Christina @ 2023-03-05 19:28 UTC
  To: Segher Boessenkool, Roger Sayle; +Cc: 'GCC Patches', Richard Sandiford

The regression was reported during stage-1. A patch was provided during stage 1 and the discussions around combine stalled.

The regression for AArch64 needs to be fixed in GCC 13. The hit is too big just to "take".

So we need a way forward, even if it's stage-4.

Thanks,
Tamar



* Re: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-05 19:28   ` Tamar Christina
@ 2023-03-05 19:56     ` Jeff Law
  2023-03-05 20:43       ` Tamar Christina
  2023-03-06 12:47       ` [PATCH] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
From: Jeff Law @ 2023-03-05 19:56 UTC
  To: Tamar Christina, Segher Boessenkool, Roger Sayle
  Cc: 'GCC Patches', Richard Sandiford



On 3/5/23 12:28, Tamar Christina via Gcc-patches wrote:
> 
> The regression was reported during stage-1. A patch was provided during stage 1 and the discussions around combine stalled.
> 
> The regression for AArch64 needs to be fixed in GCC 13. The hit is too big just to "take".
> 
> So we need a way forward, even if it's stage-4.
Then it needs to be in a way that works within the design constraints of 
combine.

As Segher has indicated, using a magic constant to say "this is always 
cheap enough" isn't acceptable.  Furthermore, what this patch changes is 
combine's internal canonicalization of extensions into shift pairs.

So I think another path forward needs to be found.  I don't see hacking 
up expand_compound_operation as viable.

Jeff


* RE: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-05 19:56     ` Jeff Law
@ 2023-03-05 20:43       ` Tamar Christina
  2023-03-05 21:33         ` Segher Boessenkool
  2023-03-06 12:47       ` [PATCH] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
From: Tamar Christina @ 2023-03-05 20:43 UTC
  To: Jeff Law, Segher Boessenkool, Roger Sayle
  Cc: 'GCC Patches', Richard Sandiford

> 
> On 3/5/23 12:28, Tamar Christina via Gcc-patches wrote:
> >
> > The regression was reported during stage-1. A patch was provided during
> > stage 1 and the discussions around combine stalled.
> >
> > The regression for AArch64 needs to be fixed in GCC 13. The hit is too big just
> > to "take".
> >
> > So we need a way forward, even if it's stage-4.
> Then it needs to be in a way that works within the design constraints of
> combine.
> 
> As Segher has indicated, using a magic constant to say "this is always cheap
> enough" isn't acceptable.  Furthermore, what this patch changes is combine's
> internal canonicalization of extensions into shift pairs.
> 
> So I think another path forward needs to be found.  I don't see hacking up
> expand_compound_operation as viable.

I'm not arguing at all about the merits of the patch. My argument was about Segher saying he doesn't think this is a P1 regression or one that should be addressed in stage-4.

We noticed and reported the regression early on during stage-1.  So I'm unsure what else we should have done, and it's not right to wave off fixing it now; otherwise, what's the point in us filing bug reports?

Tamar.
> 
> Jeff


* Re: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-05 20:43       ` Tamar Christina
@ 2023-03-05 21:33         ` Segher Boessenkool
  2023-03-06 12:08           ` Segher Boessenkool
From: Segher Boessenkool @ 2023-03-05 21:33 UTC
  To: Tamar Christina
  Cc: Jeff Law, Roger Sayle, 'GCC Patches', Richard Sandiford

Hi!

On Sun, Mar 05, 2023 at 08:43:20PM +0000, Tamar Christina wrote:
> > On 3/5/23 12:28, Tamar Christina via Gcc-patches wrote:
> > > The regression was reported during stage-1. A patch was provided during
> > > stage 1 and the discussions around combine stalled.
> > >
> > > The regression for AArch64 needs to be fixed in GCC 13. The hit is too big just
> > > to "take".
> > >
> > > So we need a way forward, even if it's stage-4.
> > Then it needs to be in a way that works within the design constraints of
> > combine.
> > 
> > As Segher has indicated, using a magic constant to say "this is always cheap
> > enough" isn't acceptable.  Furthermore, what this patch changes is combine's
> > internal canonicalization of extensions into shift pairs.
> > 
> > So I think another path forward needs to be found.  I don't see hacking up
> > expand_compound_operation as viable.
> 
> I'm not arguing at all about the merits of the patch. My argument was about Segher saying he doesn't think this is a P1 regression or one that should be addressed in stage-4.

That is not what I said (in the PR).  I said:
  Either this should not be P1, or the proposed patch is taking
  completely the wrong direction.  P1 means there is a regression.
  There is no regression in combine, in fact the proposed patch would
  *cause* regressions on many targets!
(<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594#c13>)

> We noticed and reported the regression early on during stage-1.  So I'm unsure what else we should have done, and it's not right to wave off fixing it now; otherwise, what's the point in us filing bug reports?

Something that fixes the regression is of course welcome.  But something
that *causes* many regressions is not.  Something that makes
compound_operation stuff better on all targets is more than welcome as
well, but *better* on *all* targets, not regressing most.  This really
is stage 1 material most likely.  I have been chipping away at this for
years, I don't expect any trivial patch can help, and it certainly won't
solve most problems here.

Maybe a target hook for this is best.  But not a completely ad-hoc one,
something usable and maintainable please.  So, it should say we do not
want certain kinds of code (or what kind of code we want instead), and
it should not add magic to the bowels of basic passes, magic that just
happens to make the code of particular testcases look better on a
particular target.

Yes, *look* better: I have seen no proof or indication that this would
actually generate better code, not even on just aarch, let alone on the
majority of targets.  As I said I have a test running, you may be lucky
even :-)  It has to run for about six hours more and after that it needs
analysis still (a few more hours if it isn't obviously always better or
worse), so expect results tomorrow night at the earliest.


Segher


* Re: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-05 21:33         ` Segher Boessenkool
@ 2023-03-06 12:08           ` Segher Boessenkool
  2023-03-06 12:11             ` Tamar Christina
From: Segher Boessenkool @ 2023-03-06 12:08 UTC
  To: Tamar Christina
  Cc: Jeff Law, Roger Sayle, 'GCC Patches', Richard Sandiford

Hi!

On Sun, Mar 05, 2023 at 03:33:40PM -0600, Segher Boessenkool wrote:
> On Sun, Mar 05, 2023 at 08:43:20PM +0000, Tamar Christina wrote:
> Yes, *look* better: I have seen no proof or indication that this would

("looks", I cannot type, sorry)

> actually generate better code, not even on just aarch, let alone on the
> majority of targets.  As I said I have a test running, you may be lucky
> even :-)  It has to run for about six hours more and after that it needs
> analysis still (a few more hours if it isn't obviously always better or
> worse), so expect results tomorrow night at the earliest.

The results are in:

$ perl sizes.pl --percent C[12]
                    C1        C2
       alpha   7082243  100.066%
         arc   4207975  100.015%
         arm  11518624  100.008%
       arm64  24514565  100.067%
       armhf  16661684  100.098%
        csky   4031841  100.002%
        i386         0         0
        ia64  20354295  100.029%
        m68k   4394084  100.023%
  microblaze   6549965  100.014%
        mips  10684680  100.024%
      mips64   8171850  100.002%
       nios2   4356713  100.012%
    openrisc   5010570  100.003%
      parisc   8406294  100.002%
    parisc64         0         0
     powerpc  11104901   99.992%
   powerpc64  24532358  100.057%
 powerpc64le  21293219  100.062%
     riscv32   2028474  100.131%
     riscv64   9515453  100.120%
        s390  20519612  100.279%
          sh         0         0
     shnommu   1840960  100.012%
       sparc   5314422  100.004%
     sparc64   7964129   99.992%
      x86_64         0         0
      xtensa   2925723  100.070%


C1 is the original, C2 with your patch.  These numbers are the code
sizes of a Linux kernel, some defconfig for every arch.  This is a good
measure of how effective combine was.
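
(Reading the table: the C1 column is the absolute code size in bytes on
trunk, and each C2 entry is 100 * (size with the patch) / (size on
trunk), so values above 100% mean the patch increased code size, i.e.
made combine less effective, on that target.)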

The patch is a tiny win for sparc64 and classic powerpc32 only, but bad
everywhere else.  Look at that s390 number!  Or riscv, or most of the
arm variants (including aarch64).

Do you want me to look in detail what causes this regression on some
particular target, i.e. why we really still need the expand_compound
functionality there?

(Btw.  "0" means the target did not build.  For the x86 targets this is
just more -Werror madness that seeped in, it seems.  For parisc64 and sh
it is the choice of config.  Will fix.)


Segher


* RE: [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap.
  2023-03-06 12:08           ` Segher Boessenkool
@ 2023-03-06 12:11             ` Tamar Christina
From: Tamar Christina @ 2023-03-06 12:11 UTC
  To: Segher Boessenkool
  Cc: Jeff Law, Roger Sayle, 'GCC Patches', Richard Sandiford

> Hi!
> 
> On Sun, Mar 05, 2023 at 03:33:40PM -0600, Segher Boessenkool wrote:
> > On Sun, Mar 05, 2023 at 08:43:20PM +0000, Tamar Christina wrote:
> > Yes, *look* better: I have seen no proof or indication that this would
> 
> ("looks", I cannot type, sorry)
> 
> > actually generate better code, not even on just aarch, let alone on
> > the majority of targets.  As I said I have a test running, you may be
> > lucky even :-)  It has to run for about six hours more and after that
> > it needs analysis still (a few more hours if it isn't obviously always
> > better or worse), so expect results tomorrow night at the earliest.
> 
> The results are in:
> 
> $ perl sizes.pl --percent C[12]
>                     C1        C2
>        alpha   7082243  100.066%
>          arc   4207975  100.015%
>          arm  11518624  100.008%
>        arm64  24514565  100.067%
>        armhf  16661684  100.098%
>         csky   4031841  100.002%
>         i386         0         0
>         ia64  20354295  100.029%
>         m68k   4394084  100.023%
>   microblaze   6549965  100.014%
>         mips  10684680  100.024%
>       mips64   8171850  100.002%
>        nios2   4356713  100.012%
>     openrisc   5010570  100.003%
>       parisc   8406294  100.002%
>     parisc64         0         0
>      powerpc  11104901   99.992%
>    powerpc64  24532358  100.057%
>  powerpc64le  21293219  100.062%
>      riscv32   2028474  100.131%
>      riscv64   9515453  100.120%
>         s390  20519612  100.279%
>           sh         0         0
>      shnommu   1840960  100.012%
>        sparc   5314422  100.004%
>      sparc64   7964129   99.992%
>       x86_64         0         0
>       xtensa   2925723  100.070%
> 
> 
> C1 is the original, C2 with your patch.  These numbers are the code sizes of a
> Linux kernel, some defconfig for every arch.  This is a good measure of how
> effective combine was.
> 
> The patch is a tiny win for sparc64 and classic powerpc32 only, but bad
> everywhere else.  Look at that s390 number!  Or riscv, or most of the arm
> variants (including aarch64).
> 
> Do you want me to look in detail what causes this regression on some
> particular target, i.e. why we really still need the expand_compound
> functionality there?
> 

Hi,

Thanks for having a look! I think the Richards are exploring a different solution on the PR,
so I don't think it's worth looking at now (maybe in stage-1?).  Thanks for checking though!

I appreciate you all helping to get this fixed!

Kind Regards,
Tamar

> (Btw.  "0" means the target did not build.  For the x86 targets this is just more
> -Werror madness that seeped in, it seems.  For parisc64 and sh it is the choice
> of config.  Will fix.)
> 
> 
> Segher


* [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-05 19:56     ` Jeff Law
  2023-03-05 20:43       ` Tamar Christina
@ 2023-03-06 12:47       ` Richard Sandiford
  2023-03-06 13:58         ` Segher Boessenkool
From: Richard Sandiford @ 2023-03-06 12:47 UTC
  To: Jeff Law via Gcc-patches
  Cc: Tamar Christina, Segher Boessenkool, Roger Sayle, Jeff Law

Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On 3/5/23 12:28, Tamar Christina via Gcc-patches wrote:
>> 
>> The regression was reported during stage-1. A patch was provided during stage 1 and the discussions around combine stalled.
>> 
>> The regression for AArch64 needs to be fixed in GCC 13. The hit is too big just to "take".
>> 
>> So we need a way forward, even if it's stage-4.
> Then it needs to be in a way that works within the design constraints of 
> combine.
>
> As Segher has indicated, using a magic constant to say "this is always 
> cheap enough" isn't acceptable.  Furthermore, what this patch changes is 
> combine's internal canonicalization of extensions into shift pairs.
>
> So I think another path forward needs to be found.  I don't see hacking 
> up expand_compound_operation as viable.
>
> Jeff

How about the patch below?  Tested on aarch64-linux-gnu,
but I'll test on x86_64-linux-gnu and powerpc64le-linux-gnu
before committing.

-----------------------------

The PR contains a case where we want GCC to combine a sign_extend
into an address (which is something that GCC 12 could do).  After
substitution, the sign_extend goes through the usual
expand_compound_operation wrangler and, after taking nonzero_bits
into account, make_compound_operation is presented with:

  X1: (and (mult (subreg x) (const_int N2)) (const_int N1))

where:

(a) the subreg is paradoxical
(b) N2 is a power of 2
(c) all bits outside N1/N2 are known to be zero in x

This is equivalent to:

  X2: (mult (and (subreg x) (const_int N1/N2)) (const_int N2))

Given in this form, existing code would already use (c) to convert
the inner "and" to a zero_extend:

  (mult (zero_extend x) (const_int N2))

This patch makes the code handle X1 as well as X2.
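
As a standalone illustration (a sketch for this write-up, not part of
the patch; N1 = 252 and N2 = 4, so N1/N2 = 63, are the example values
that occur in the PR), the X1-to-X2 step is the masking identity
(x * 2**n) & m == ((x & (m >> n)) * 2**n), which holds for any x and m
when the multiplier is a power of two; condition (c) is what then lets
the inner "and" become a zero_extend:

  #include <cassert>
  #include <cstdint>

  int main ()
  {
    /* X1: (and (mult x 4) 252) versus X2: (mult (and x 63) 4).  */
    for (uint64_t x = 0; x < 1024; ++x)
      assert (((x * 4) & 252) == ((x & 63) * 4));
    return 0;
  }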

Logically, it would make sense to do the same for ASHIFT, which
would be the canonical form outside memory addresses.  However, it
seemed better to do the minimum possible, given the late stage in
the release cycle.

gcc/
	PR rtl-optimization/106594
	* combine.cc (make_compound_operation_int): Extend the AND to
	ZERO_EXTEND transformation so that it can handle an intervening
	multiplication by a power of two.

gcc/testsuite/
	* gcc.target/aarch64/pr106594.c: New test.
---
 gcc/combine.cc                              | 60 ++++++++++++++-------
 gcc/testsuite/gcc.target/aarch64/pr106594.c | 21 ++++++++
 2 files changed, 63 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 053879500b7..b45042bbafd 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -8188,28 +8188,52 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
       /* If the one operand is a paradoxical subreg of a register or memory and
 	 the constant (limited to the smaller mode) has only zero bits where
 	 the sub expression has known zero bits, this can be expressed as
-	 a zero_extend.  */
-      else if (GET_CODE (XEXP (x, 0)) == SUBREG)
-	{
-	  rtx sub;
+	 a zero_extend.
+
+	 Look also for the case where the operand is such a subreg that
+	 is multiplied by 2**N:
 
-	  sub = XEXP (XEXP (x, 0), 0);
-	  machine_mode sub_mode = GET_MODE (sub);
-	  int sub_width;
-	  if ((REG_P (sub) || MEM_P (sub))
-	      && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
-	      && sub_width < mode_width)
+	   (and (mult ... 2**N) M) --> (mult (and ... M>>N) 2**N) -> ...  */
+      else
+	{
+	  rtx y = XEXP (x, 0);
+	  rtx top = y;
+	  int shift = 0;
+	  if (GET_CODE (y) == MULT
+	      && CONST_INT_P (XEXP (y, 1))
+	      && pow2p_hwi (INTVAL (XEXP (y, 1))))
 	    {
-	      unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
-	      unsigned HOST_WIDE_INT mask;
+	      shift = exact_log2 (INTVAL (XEXP (y, 1)));
+	      y = XEXP (y, 0);
+	    }
+	  if (GET_CODE (y) == SUBREG)
+	    {
+	      rtx sub;
 
-	      /* original AND constant with all the known zero bits set */
-	      mask = UINTVAL (XEXP (x, 1)) | (~nonzero_bits (sub, sub_mode));
-	      if ((mask & mode_mask) == mode_mask)
+	      sub = XEXP (y, 0);
+	      machine_mode sub_mode = GET_MODE (sub);
+	      int sub_width;
+	      if ((REG_P (sub) || MEM_P (sub))
+		  && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
+		  && sub_width < mode_width)
 		{
-		  new_rtx = make_compound_operation (sub, next_code);
-		  new_rtx = make_extraction (mode, new_rtx, 0, 0, sub_width,
-					     1, 0, in_code == COMPARE);
+		  unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
+		  unsigned HOST_WIDE_INT mask;
+
+		  /* The shifted AND constant with all the known zero
+		     bits set.  */
+		  mask = ((UINTVAL (XEXP (x, 1)) >> shift)
+			  | ~nonzero_bits (sub, sub_mode));
+		  if ((mask & mode_mask) == mode_mask)
+		    {
+		      new_rtx = make_compound_operation (sub, next_code);
+		      new_rtx = make_extraction (mode, new_rtx,
+						 0, 0, sub_width,
+						 1, 0, in_code == COMPARE);
+		      if (shift)
+			new_rtx = gen_rtx_fmt_ee (GET_CODE (top), mode,
+						  new_rtx, XEXP (top, 1));
+		    }
 		}
 	    }
 	}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
new file mode 100644
index 00000000000..f10d72665e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
@@ -0,0 +1,21 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */
+
-- 
2.25.1



* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 12:47       ` [PATCH] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
@ 2023-03-06 13:58         ` Segher Boessenkool
  2023-03-06 15:08           ` Richard Sandiford
From: Segher Boessenkool @ 2023-03-06 13:58 UTC
  To: Jeff Law via Gcc-patches, Tamar Christina, Roger Sayle, Jeff Law,
	richard.sandiford

On Mon, Mar 06, 2023 at 12:47:06PM +0000, Richard Sandiford wrote:
> How about the patch below?

What about it?  What would make it any better than the previous?

Oh, and please do not send new patches in old threads :-(


Segher


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 13:58         ` Segher Boessenkool
@ 2023-03-06 15:08           ` Richard Sandiford
  2023-03-06 16:18             ` Jakub Jelinek
From: Richard Sandiford @ 2023-03-06 15:08 UTC
  To: Segher Boessenkool
  Cc: Jeff Law via Gcc-patches, Tamar Christina, Roger Sayle, Jeff Law

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Mon, Mar 06, 2023 at 12:47:06PM +0000, Richard Sandiford wrote:
>> How about the patch below?
>
> What about it?  What would make it any better than the previous?

It does what Jeff suggested in the quoted message: work within the existing
extract/make_compound_operation scheme rather than try to opt out of it.

Richard




* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 15:08           ` Richard Sandiford
@ 2023-03-06 16:18             ` Jakub Jelinek
  2023-03-06 16:34               ` Richard Sandiford
  2023-03-06 18:13               ` Segher Boessenkool
From: Jakub Jelinek @ 2023-03-06 16:18 UTC
  To: Segher Boessenkool, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law, richard.sandiford

On Mon, Mar 06, 2023 at 03:08:00PM +0000, Richard Sandiford via Gcc-patches wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Mon, Mar 06, 2023 at 12:47:06PM +0000, Richard Sandiford wrote:
> >> How about the patch below?
> >
> > What about it?  What would make it any better than the previous?
> 
> It does what Jeff suggested in the quoted message: work within the existing
> extract/make_compound_operation scheme rather than try to opt out of it.

That still feels like it could be risky in stage4, affecting various other
backends which would be expecting ANDs in their patterns instead of *_EXTEND, no?
So, at least we'd need something like Segher ran to test it on various
targets on Linux kernel (but would be really nice to get also i?86/x86_64).

If it were on the aarch64 side just one pattern, I'd suggest a pre-reload
splitter, but unfortunately the sign extends (and zero extends?) are handled
in the legitimate address hook.  Also, I see nonzero_bits only called in
rs6000's combine splitter and s390's canonicalize_comparison target hook,
nowhere else in the backends, so I think using it outside of the combiner
isn't desirable.

Could we have a target hook to canonicalize memory addresses for the combiner,
like the targetm.canonicalize_comparison hook we already have?

	Jakub



* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 16:18             ` Jakub Jelinek
@ 2023-03-06 16:34               ` Richard Sandiford
  2023-03-06 18:31                 ` Segher Boessenkool
  2023-03-06 22:58                 ` Segher Boessenkool
  2023-03-06 18:13               ` Segher Boessenkool
From: Richard Sandiford @ 2023-03-06 16:34 UTC
  To: Jakub Jelinek
  Cc: Segher Boessenkool, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law

Jakub Jelinek <jakub@redhat.com> writes:
> On Mon, Mar 06, 2023 at 03:08:00PM +0000, Richard Sandiford via Gcc-patches wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > On Mon, Mar 06, 2023 at 12:47:06PM +0000, Richard Sandiford wrote:
>> >> How about the patch below?
>> >
>> > What about it?  What would make it any better than the previous?
>> 
>> It does what Jeff suggested in the quoted message: work within the existing
>> extract/make_compound_operation scheme rather than try to opt out of it.
>
> That still feels like it could be risky in stage4, affecting various other
> backends which would be expecting ANDs in their patterns instead of *_EXTEND, no?
> So, at least we'd need something like Segher ran to test it on various
> targets on Linux kernel (but would be really nice to get also i?86/x86_64).
>
> If it were on the aarch64 side just one pattern, I'd suggest a pre-reload
> splitter, but unfortunately the sign extends (and zero extends?) are handled
> in the legitimate address hook.  Also, I see nonzero_bits only called in
> rs6000's combine splitter and s390's canonicalize_comparison target hook,
> nowhere else in the backends, so I think using it outside of the combiner
> isn't desirable.
>
> Could we have a target hook to canonicalize memory addresses for the combiner,
> like the targetm.canonicalize_comparison hook we already have?

I don't think a hook makes sense as a long-term design decision.
The canonicalisation we're doing here isn't logically AArch64-specific,
and in general, the less variation in RTL rules between targets, the better.

But if you mean adding target control as a GCC 13 hack, to avoid any
effect on other targets, then TBH, I'd prefer just sticking it in an
#ifdef GCC_AARCH64_H :-)  That makes it 100% clear that it's a
temporary hack to restrict the target impact rather than something
based on fundamentals.  We can then revisit for GCC 14.

Thanks,
Richard


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 16:18             ` Jakub Jelinek
  2023-03-06 16:34               ` Richard Sandiford
@ 2023-03-06 18:13               ` Segher Boessenkool
From: Segher Boessenkool @ 2023-03-06 18:13 UTC
  To: Jakub Jelinek
  Cc: Jeff Law via Gcc-patches, Tamar Christina, Roger Sayle, Jeff Law,
	richard.sandiford

On Mon, Mar 06, 2023 at 05:18:50PM +0100, Jakub Jelinek wrote:
> On Mon, Mar 06, 2023 at 03:08:00PM +0000, Richard Sandiford via Gcc-patches wrote:
> That still feels like it could be risky in stage4, affecting various other
> backends which would be expecting ANDs in their patterns instead of *_EXTEND, no?
> So, at least we'd need something like Segher ran to test it on various
> targets on Linux kernel (but would be really nice to get also i?86/x86_64).

It is running.  Still without x86 though, but I'll add that later
hopefully, also for the previous runs.

> If it were on the aarch64 side just one pattern, I'd suggest a pre-reload
> splitter, but unfortunately the sign extends (and zero extends?) are handled
> in the legitimate address hook.  Also, I see nonzero_bits only called in
> rs6000's combine splitter and s390's canonicalize_comparison target hook,
> nowhere else in the backends, so I think using it outside of the combiner
> isn't desirable.

nonzero_bits cannot be used in insn conditions.  This is a well-known
long-standing problem.

The problem is that it can give different output in the passes after
combine than it does in combine itself, since combine does more thorough
analysis.  This then causes insns generated in combine to no longer be
recognised later -> kaboom, ICE.

> Could we have a target hook to canonicalize memory addresses for the combiner,
> like the targetm.canonicalize_comparison hook we already have?

If it makes sense, sure.  And if it is implemented in a sensible spot.  It
has to stay maintainable :-)

Looking forward to a patch,


Segher


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 16:34               ` Richard Sandiford
@ 2023-03-06 18:31                 ` Segher Boessenkool
  2023-03-06 19:13                   ` Richard Sandiford
  2023-03-06 22:58                 ` Segher Boessenkool
From: Segher Boessenkool @ 2023-03-06 18:31 UTC
  To: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law, richard.sandiford

On Mon, Mar 06, 2023 at 04:34:59PM +0000, Richard Sandiford wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
> > Could we have a target hook to canonicalize memory addresses for the combiner,
> > like the targetm.canonicalize_comparison hook we already have?
> 
> I don't think a hook makes sense as a long-term design decision.
> The canonicalisation we're doing here isn't logically AArch64-specific,
> and in general, the less variation in RTL rules between targets, the better.

Some targets do not want all insanity allowed for other targets.  We
have quite a few examples of this already.  But of course a hook like
the proposed one can be abused a lot to do completely unrelated things.
We'll just have to trust target maintainers to have good taste and some
wisdom (because not everyone else looks at all target patches).  What
else is new :-)

> But if you mean adding target control as a GCC 13 hack, to avoid any
> effect on other targets, then TBH, I'd prefer just sticking it in an
> #ifdef GCC_AARCH64_H :-)

And I will NAK that for all the same reasons: it is unmaintainable, it
makes things harder instead of solving problems, it is a completely
ad-hoc code change.

> That makes it 100% clear that it's a
> temporary hack to restrict the target impact rather than something
> based on fundamentals.  We can then revisit for GCC 14.

And that will never happen, you know this as well as anyone else :-(

Most importantly, what makes you think this is a problem for aarch64
only?  If it actually is, you can fix it in the aarch64 config!  Either
with or without new hooks, whatever works best.


Segher


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 18:31                 ` Segher Boessenkool
@ 2023-03-06 19:13                   ` Richard Sandiford
  2023-03-06 23:31                     ` Segher Boessenkool
From: Richard Sandiford @ 2023-03-06 19:13 UTC
  To: Segher Boessenkool
  Cc: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Mon, Mar 06, 2023 at 04:34:59PM +0000, Richard Sandiford wrote:
>> Jakub Jelinek <jakub@redhat.com> writes:
>> > Could we have a target hook to canonicalize memory addresses for the combiner,
>> > like the targetm.canonicalize_comparison hook we already have?
>> 
>> I don't think a hook makes sense as a long-term design decision.
>> The canonicalisation we're doing here isn't logically AArch64-specific,
>> and in general, the less variation in RTL rules between targets, the better.
>
> Some targets do not want all insanity allowed for other targets.  We
> have quite a few examples of this already.  But of course a hook like
> the proposed one can be abused a lot to do completely unrelated things.
> We'll just have to trust target maintainers to have good taste and some
> wisdom (because not everyine else looks at all target patches).  What
> else is new :-)
>
>> But if you mean adding target control as a GCC 13 hack, to avoid any
>> effect on other targets, then TBH, I'd prefer just sticking it in an
>> #ifdef GCC_AARCH64_H :-)
>
> And I will NAK that for all the same reasons: it is unmaintainable, it
> makes things harder instead of solving problems, it is a completely
> ad-hoc code change.
>
>> That makes it 100% clear that it's a
>> temporary hack to restrict the target impact rather than something
>> based on fundamentals.  We can then revisit for GCC 14.
>
> And that will never happen, you know this as well as anyone else :-(
>
> Most importantly, what makes you think this is a problem for aarch64
> only?  If it actually is, you can fix it in the aarch64 config!  Either
> with or without new hooks, whatever works best.

The point is that I don't think it's a problem for AArch64 only.
I think it's a generic issue that should be solved in a generic way
(which is what the patch is trying to do).  The suggestion to restrict
it to AArch64 came from Jakub.

The reason I'm pushing back against a hook is precisely because
I don't want to solve this in AArch64-specific code.

I'm not sure we would be talking about restricting this to AArch64
if the patch had been posted in stage 1.  If people are concerned
about doing this for all targets in stage 4 (which they seem to be),
I thought the #ifdef was the simplest way of addressing that concern.
"Revisit for GCC 14" would be a case of removing the #ifdef in stage 1.

And I don't think what the patch does is ad hoc.  Reorganising the
expression in this way isn't something new.  extract_left_shift already
does a similar thing (and does it for all targets).

Thanks,
Richard


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 16:34               ` Richard Sandiford
  2023-03-06 18:31                 ` Segher Boessenkool
@ 2023-03-06 22:58                 ` Segher Boessenkool
From: Segher Boessenkool @ 2023-03-06 22:58 UTC
  To: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law, richard.sandiford

On Mon, Mar 06, 2023 at 04:34:59PM +0000, Richard Sandiford wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Mon, Mar 06, 2023 at 03:08:00PM +0000, Richard Sandiford via Gcc-patches wrote:
> >> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> > On Mon, Mar 06, 2023 at 12:47:06PM +0000, Richard Sandiford wrote:
> >> >> How about the patch below?
> >> >
> >> > What about it?  What would make it any better than the previous?
> >> 
> >> It does what Jeff suggested in the quoted message: work within the existing
> >> extract/make_compound_operation scheme rather than try to opt out of it.
> >
> > That still feels like it could be risky in stage4, affecting various other
> > backends which would be expecting ANDs in their patterns instead of *_EXTEND, no?
> > So, at least we'd need something like Segher ran to test it on various
> > targets on Linux kernel (but would be really nice to get also i?86/x86_64).
> >
> > If it were on the aarch64 side just one pattern, I'd suggest a pre-reload
> > splitter, but unfortunately the sign extends (and zero extends?) are handled
> > in the legitimate address hook.  Also, I see nonzero_bits only called in
> > rs6000's combine splitter and s390's canonicalize_comparison target hook,
> > nowhere else in the backends, so I think using it outside of the combiner
> > isn't desirable.
> >
> > Could we have a target hook to canonicalize memory addresses for the combiner,
> > like the targetm.canonicalize_comparison hook we already have?
> 
> I don't think a hook makes sense as a long-term design decision.
> The canonicalisation we're doing here isn't logically AArch64-specific,
> and in general, the less variation in RTL rules between targets, the better.

C1 is trunk, C2 is the previous patch, C3 is this one:

$ perl sizes.pl --percent C[123]
                    C1        C2        C3
       alpha   7082243  100.066%  100.000%
         arc   4207975  100.015%  100.000%
         arm  11518624  100.008%  100.000%
       arm64  24514565  100.067%  100.033%
       armhf  16661684  100.098%  100.000%
        csky   4031841  100.002%  100.000%
        i386         0         0         0
        ia64  20354295  100.029%  100.000%
        m68k   4394084  100.023%  100.000%
  microblaze   6549965  100.014%  100.000%
        mips  10684680  100.024%  100.000%
      mips64   8171850  100.002%  100.000%
       nios2   4356713  100.012%  100.000%
    openrisc   5010570  100.003%  100.000%
      parisc   8406294  100.002%  100.000%
    parisc64         0         0         0
     powerpc  11104901   99.992%  100.000%
   powerpc64  24532358  100.057%  100.000%
 powerpc64le  21293219  100.062%  100.000%
     riscv32   2028474  100.131%  100.000%
     riscv64   9515453  100.120%  100.000%
        s390  20519612  100.279%  100.000%
          sh         0         0         0
     shnommu   1840960  100.012%  100.000%
       sparc   5314422  100.004%  100.000%
     sparc64   7964129   99.992%  100.000%
      x86_64         0         0         0
      xtensa   2925723  100.070%  100.000%

It does absolutely nothing for all those other targets you say it is
beneficial for; and it is a net *negative* for aarch64 itself!


Segher


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 19:13                   ` Richard Sandiford
@ 2023-03-06 23:31                     ` Segher Boessenkool
  2023-03-08 11:58                       ` Richard Sandiford
From: Segher Boessenkool @ 2023-03-06 23:31 UTC
  To: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law, richard.sandiford

Hi!

On Mon, Mar 06, 2023 at 07:13:08PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > Most importantly, what makes you think this is a problem for aarch64
> > only?  If it actually is, you can fix it in the aarch64 config!  Either
> > with or without new hooks, whatever works best.
> 
> The point is that I don't think it's a problem for AArch64 only.
> I think it's a generic issue that should be solved in a generic way
> (which is what the patch is trying to do).  The suggestion to restrict
> it to AArch64 came from Jakub.
> 
> The reason I'm pushing back against a hook is precisely because
> I don't want to solve this in AArch64-specific code.

But it is many times worse still to do it in target-specific magic code
disguised as generic code :-(

If there is no clear explanation why combine should do X, then it
probably should not.

> I'm not sure we would be talking about restricting this to AArch64
> if the patch had been posted in stage 1.  If people are concerned
> about doing this for all targets in stage 4 (which they seem to be),

Not me, not in principle.  But it takes more time than we have left in
stage 4 to handle this, even for only combine.  We should give the other
target maintainers much longer as well.

> I thought the #ifdef was the simplest way of addressing that concern.

An #ifdef is a way of making a change that is not finished yet not hurt
the other targets.  It still hurts generic development, which indirectly
hurts all targets.

> And I don't think what the patch does is ad hoc.

It is almost impossible to explain what it does and why that is a good
thing, why it is what we want, what we should do here; and certainly not
in a compact, terse, focused way.  It has all the hallmarks of ad hoc
patches.

> Reorganising the
> expression in this way isn't something new.  extract_left_shift already
> does a similar thing (and does it for all targets).

That is not similar at all, no.

/* See if X (of mode MODE) contains an ASHIFT of COUNT or more bits that
   can be commuted with any other operations in X.  Return X without
   that shift if so.  */

If you can factor out a utility function like that, with an actual nice
description like that, it would be a much more palatable patch.


Segher


* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-06 23:31                     ` Segher Boessenkool
@ 2023-03-08 11:58                       ` Richard Sandiford
  2023-03-08 22:50                         ` Segher Boessenkool
From: Richard Sandiford @ 2023-03-08 11:58 UTC
  To: Segher Boessenkool
  Cc: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law

Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Mon, Mar 06, 2023 at 07:13:08PM +0000, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > Most importantly, what makes you think this is a problem for aarch64
>> > only?  If it actually is, you can fix it in the aarch64 config!  Either
>> > with or without new hooks, whatever works best.
>> 
>> The point is that I don't think it's a problem for AArch64 only.
>> I think it's a generic issue that should be solved in a generic way
>> (which is what the patch is trying to do).  The suggestion to restrict
>> it to AArch64 came from Jakub.
>> 
>> The reason I'm pushing back against a hook is precisely because
>> I don't want to solve this in AArch64-specific code.
>
> But it is many times worse still to do it in target-specific magic code
> disguised as generic code :-(
>
> If there is no clear explanation why combine should do X, then it
> probably should not.
>
>> I'm not sure we would be talking about restricting this to AArch64
>> if the patch had been posted in stage 1.  If people are concerned
>> about doing this for all targets in stage 4 (which they seem to be),
>
> Not me, not in principle.  But it takes more time than we have left in
> stage 4 to handle this, even for only combine.  We should give the other
> target maintainers much longer as well.
>
>> I thought the #ifdef was the simplest way of addressing that concern.
>
> An #ifdef is a way of making a change that is not finished yet not hurt
> the other targets.  It still hurts generic development, which indirectly
> hurts all targets.

Seems like this might be moot anyway given that your results
suggest no impact on other targets.

>> And I don't think what the patch does is ad hoc.
>
> It is almost impossible to explain what it does and why that is a good
> thing, why it is what we want, what we should do here; and certainly not
> in a compact, terse, focused way.  It has all the hallmarks of ad hoc
> patches.
>
>> Reorganising the
>> expression in this way isn't something new.  extract_left_shift already
>> does a similar thing (and does it for all targets).
>
> That is not similar at all, no.
>
> /* See if X (of mode MODE) contains an ASHIFT of COUNT or more bits that
>    can be commuted with any other operations in X.  Return X without
>    that shift if so.  */
>
> If you can factor out a utility function like that, with an actual nice
> description like that, it would be a much more palatable patch.

OK, I've factored it out below.  Does the comment look OK?

As mentioned in the patch description below, I think some of the other
"and" handling could be moved here too, which should avoid a bit of
(existing) duplication.  But that isn't necessary to fix the bug.

On the code size results: as I mentioned on IRC yesterday, when I tried
building the linux kernel locally (with -Os -fno-schedule-insns{,2}),
I saw code size improvements in 182 .os and a regression in only one .o.
(I was comparing individual .os because it makes isolation easier.)

The one regression was because the optimised version had fewer pseudos,
and so something that was previously allocated to x3 was allocated to x2.
This pseudo was initialised to 0, and a preceding stack protect
instruction had the known side effect (both before and after the patch)
of setting x3 to 0.  So with the x3 allocation, postreload was able to
remove the separate initialisation of x3 with 0, but with the x2
allocation it couldn't.

I also tried seeing what effect it had on gcc.c-torture, gcc.dg and
g++.dg.  All the changes were improvements.

And the testcase in the PR was from a real workload (ArmRAL).

If you can isolate the case you saw, I'll have a look.  But ISTM that
the change is a pretty clear win on aarch64.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.

Thanks,
Richard

----

Combine's approach to simplifying a pattern X is to:

(1) expand "compound" operations (such as extends and extracts)
    into multiple operations (such as shifts), to give a new
    expression X1

(2) simplify X1 in the normal way, and with some combine-specific
    extras, to give X2

(3) remake compound operations from X2, to give X3

For example, (1) expands sign_extend to an ashift/ashiftrt pair,
with the ashift being an immediate operand of the ashiftrt.
I'll call ashiftrt the "outer" operation and ashift the
"inner" operation below.

By design, (2) can perturb the structure of the expanded operations
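(A standalone sketch of that expansion, not GCC-internal code: for a
32-to-64-bit sign_extend, the inner ashift moves the sign bit to the
top and the outer ashiftrt brings it back, duplicating it on the way
down.  This relies on the arithmetic right shift that GCC provides for
signed integer types.)

  #include <cassert>
  #include <cstdint>

  int main ()
  {
    int32_t x = -123456;
    /* Shift the 32-bit pattern to the top of a 64-bit value, then
       shift it back arithmetically; the conversion of the large
       unsigned value to int64_t is two's complement on GCC.  */
    uint32_t bits = (uint32_t) x;
    int64_t shift_pair = (int64_t) ((uint64_t) bits << 32) >> 32;
    assert (shift_pair == (int64_t) x);
    return 0;
  }
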
in X1.  Sometimes it will rework them entirely.  But sometimes it
will keep the outer operations and inner operations, but in a
slightly different arrangement.  For example, the inner operation
might no longer be a direct operand of the outer operation.

Therefore, when make_compound_operation sees a potential outer
operation, it sometimes looks through suboperands to find the
potential inner operation.  Examples include:

(a) the ashiftrt handling, which uses extract_left_shift to find
    a partnering ashift operation

(b) the "and" handling, which looks through subregs, xors and iors
    to find a partnering lshiftrt.

The PR contains another case where we need this.  We have:

  (set (reg:DI R2) (sign_extend:DI (reg:SI R1)))
  ... (plus:DI (mult:DI (reg:DI R2) (const_int 4)) (reg:DI R3)) ...

which is a natural, direct combination on aarch64.

First, (1) expands the sign_extend into two operations.  It uses
R1's nonzero_bits information to determine that the sign extension
is equivalent to a zero extension:

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */

As I'll explain below, the problem is that this conversion back to a
cheaper alternative doesn't happen.

The zero extension is expanded into an "and" of a subreg.
Again, the nonzero_bits information narrows the "and" mask
from a full SImode mask to a smaller constant (63).  So X1
contains:

  (mult:DI (and:DI (subreg:DI (reg:SI R1) 0)
                   (const_int 63))
           (const_int 4))

The simplification rules for mult convert this to an ashift by 2.
Then, this rule in simplify_shift_const_1:

	  /* If we have (shift (logical)), move the logical to the outside
	     to allow it to possibly combine with another logical and the
	     shift to combine with another shift.  This also canonicalizes to
	     what a ZERO_EXTRACT looks like.  Also, some machines have
	     (and (shift)) insns.  */

moves the shift inside the "and", so that X2 contains:

  (and:DI (ashift:DI (subreg:DI (reg:SI R1) 0)
                     (const_int 2))
          (const_int 252))

We later recanonicalise to a mult (since this is an address):

  (and:DI (mult:DI (subreg:DI (reg:SI R1) 0)
                   (const_int 4))
          (const_int 252))

But we fail to transform this back to the natural substitution:

  (mult:DI (sign_extend:DI (reg:SI R1))
           (const_int 4))
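
(A standalone check, not GCC code: with only bits 0-5 of R1 possibly
set, which is exactly what nonzero_bits established above, the masked
form and the natural sign_extend form agree for every possible value:)

  #include <cassert>
  #include <cstdint>

  int main ()
  {
    /* nonzero_bits: only bits 0-5 of R1 can be set.  */
    for (uint32_t r1 = 0; r1 < 64; ++r1)
      {
        /* (mult (sign_extend:DI R1) 4) */
        int64_t natural = (int64_t) (int32_t) r1 * 4;
        /* (and (mult (subreg:DI R1) 4) 252) */
        uint64_t masked = ((uint64_t) r1 * 4) & 252;
        assert ((uint64_t) natural == masked);
      }
    return 0;
  }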

This patch extends the "look through suboperands" approach
described above to handle this case too.

It might (or might not) make sense to put more of the existing
"and" handling into the new function, so that the subreg+lshiftrt
case can be handled through recursion rather than duplication.
But that's certainly not necessary to fix the bug, so is at
best stage 1 material.

gcc/
	PR rtl-optimization/106594
	* combine.cc (make_compound_operation_int): Extend the AND to
	ZERO_EXTEND transformation so that it can handle an intervening
	multiplication by a power of two.

gcc/testsuite/
	* gcc.target/aarch64/pr106594.c: New test.
---
 gcc/combine.cc                              | 96 ++++++++++++++-------
 gcc/testsuite/gcc.target/aarch64/pr106594.c | 21 +++++
 2 files changed, 87 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 053879500b7..4533ec1de84 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7952,6 +7952,68 @@ extract_left_shift (scalar_int_mode mode, rtx x, int count)
   return 0;
 }
 \f
+/* A subroutine of make_compound_operation_int.  Try to combine an outer
+   AND of X and MASK with a partnering inner operation to form a compound
+   operation.  Return the new X on success, otherwise return null.
+   MODE is the mode of X.  */
+
+static rtx
+make_compound_operation_and (scalar_int_mode mode, rtx x,
+			     unsigned HOST_WIDE_INT mask,
+			     rtx_code in_code, rtx_code next_code)
+{
+  switch (GET_CODE (x))
+    {
+    case SUBREG:
+      /* If the operand is a paradoxical subreg of a register or memory
+	 and MASK (limited to the smaller mode) has only zero bits where
+	 the sub expression has known zero bits, this can be expressed as
+	 a zero_extend.  */
+      {
+	rtx sub = XEXP (x, 0);
+	machine_mode sub_mode = GET_MODE (sub);
+	int sub_width;
+	if ((REG_P (sub) || MEM_P (sub))
+	    && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
+	    && sub_width < GET_MODE_PRECISION (mode))
+	  {
+	    unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
+	    unsigned HOST_WIDE_INT submask;
+
+	    /* The shifted AND constant with all the known zero
+	       bits set.  */
+	    submask = mask | ~nonzero_bits (sub, sub_mode);
+	    if ((submask & mode_mask) == mode_mask)
+	      {
+		rtx new_rtx = make_compound_operation (sub, next_code);
+		return make_extraction (mode, new_rtx, 0, 0, sub_width,
+					1, 0, in_code == COMPARE);
+	      }
+	  }
+	break;
+      }
+
+    case MULT:
+      /* Recurse through a power of 2 multiplication (as can be found
+	 in an address).  */
+      if (CONST_INT_P (XEXP (x, 1))
+	  && pow2p_hwi (INTVAL (XEXP (x, 1))))
+	{
+	  int shift = exact_log2 (INTVAL (XEXP (x, 1)));
+	  rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
+						 mask >> shift, in_code,
+						 next_code);
+	  if (sub)
+	    return gen_rtx_MULT (mode, sub, XEXP (x, 1));
+	}
+      break;
+
+    default:
+      break;
+    }
+  return NULL_RTX;
+}
+
 /* Subroutine of make_compound_operation.  *X_PTR is the rtx at the current
    level of the expression and MODE is its mode.  IN_CODE is as for
    make_compound_operation.  *NEXT_CODE_PTR is the value of IN_CODE
@@ -8184,36 +8246,10 @@ make_compound_operation_int (scalar_int_mode mode, rtx *x_ptr,
 				   make_compound_operation (XEXP (x, 0),
 							    next_code),
 				   i, NULL_RTX, 1, 1, 0, 1);
-
-      /* If the one operand is a paradoxical subreg of a register or memory and
-	 the constant (limited to the smaller mode) has only zero bits where
-	 the sub expression has known zero bits, this can be expressed as
-	 a zero_extend.  */
-      else if (GET_CODE (XEXP (x, 0)) == SUBREG)
-	{
-	  rtx sub;
-
-	  sub = XEXP (XEXP (x, 0), 0);
-	  machine_mode sub_mode = GET_MODE (sub);
-	  int sub_width;
-	  if ((REG_P (sub) || MEM_P (sub))
-	      && GET_MODE_PRECISION (sub_mode).is_constant (&sub_width)
-	      && sub_width < mode_width)
-	    {
-	      unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (sub_mode);
-	      unsigned HOST_WIDE_INT mask;
-
-	      /* original AND constant with all the known zero bits set */
-	      mask = UINTVAL (XEXP (x, 1)) | (~nonzero_bits (sub, sub_mode));
-	      if ((mask & mode_mask) == mode_mask)
-		{
-		  new_rtx = make_compound_operation (sub, next_code);
-		  new_rtx = make_extraction (mode, new_rtx, 0, 0, sub_width,
-					     1, 0, in_code == COMPARE);
-		}
-	    }
-	}
-
+      else
+	new_rtx = make_compound_operation_and (mode, XEXP (x, 0),
+					       UINTVAL (XEXP (x, 1)),
+					       in_code, next_code);
       break;
 
     case LSHIFTRT:
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c b/gcc/testsuite/gcc.target/aarch64/pr106594.c
new file mode 100644
index 00000000000..f10d72665e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
@@ -0,0 +1,21 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw #?2\]} } } */
+
-- 
2.25.1



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-08 11:58                       ` Richard Sandiford
@ 2023-03-08 22:50                         ` Segher Boessenkool
  2023-03-09 10:18                           ` Richard Sandiford
  0 siblings, 1 reply; 21+ messages in thread
From: Segher Boessenkool @ 2023-03-08 22:50 UTC (permalink / raw)
  To: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law, richard.sandiford

On Wed, Mar 08, 2023 at 11:58:51AM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > An #ifdef is a way of making a change that is not finished yet not hurt
> > the other targets.  It still hurts generic development, which indirectly
> > hurts all targets.
> 
> Seems like this might be moot anyway given that your results
> suggest no impact on other targets.

Which means the patch does not do what it says it does.  It is a net
negative on the only target it did change code on, too.

If the patch did do what it promises it would be a (large!) net benefit,
and also on various other targets.

As it is, either the regression wasn't P1 at all, or the patch doesn't
fix the problem, or the problem only happens in unusual code (or vector
or float code).  Please explain what the regression is that you want
to solve.  With a compilable testcase etc., the usual.

> >> Reorganising the
> >> expression in this way isn't something new.  extract_left_shift already
> >> does a similar thing (and does it for all targets).
> >
> > That is not similar at all, no.
> >
> > /* See if X (of mode MODE) contains an ASHIFT of COUNT or more bits that
> >    can be commuted with any other operations in X.  Return X without
> >    that shift if so.  */
> >
> > If you can factor out a utility function like that, with an actual nice
> > description like that, it would be a much more palatable patch.
> 
> OK, I've factored it out below.  Does the comment look OK?

> As mentioned in the patch description below, I think some of the other
> "and" handling could be moved here too, which should avoid a bit of
> (existing) duplication.  But that isn't necessary to fix the bug.

And stage 1 material.  Like I still think the current patch is as well.

> On the code size results: as I mentioned on IRC yesterday, when I tried
> building the linux kernel locally (with -Os -fno-schedule-insns{,2}),
> I saw code size improvements in 182 .os and a regression in only one .o.
> (I was comparing individual .os because it makes isolation easier.)

Nothing here is about code size.  It is just a good stand-in to compare
the result of a change in combine with: almost all changes in generated
code are because combine could combine more (or fewer) instructions.

This is good if you just want to look at a table of numbers.  It will
often show something is obviously not good, or obviously good, and it
shows what targets are of extra interest.

You still need to look at the actual generated code to confirm things.
For example, with your previous patch on aarch64 part of the code size
increase is extra tail duplication (in the bpf interpreter), not a bad
thing.  Unfortunately that was only a small part of the code size
increase.

> And the testcase in the PR was from a real workload (ArmRAL).

What testcase?  Oh, the snippet in #c0?

This is not a good example if you want this to be P1, heh.

> If you can isolate the case you saw, I'll have a look.  But ISTM that
> the change is a pretty clear win on aarch64.

No, *you* have to show it is an improvement.  I have a day job as well,
you know.

> Combine's approach to simplifying a pattern X is to:
> 
> (1) expand "compound" operations (such as extends and extracts)
>     into multiple operations (such as shifts), to give a new
>     expression X1
> 
> (2) simplify X1 in the normal way, and with some combine-specific
>     extras, to give X2
> 
> (3) remake compound operations from X2, to give X3

This is not a good description of what really goes on.  This is only
what is done for some compound operations in some cases.  And then it
is a tiny part of what is done.  And yes, it sucks, but disabling it
causes regressions.

> For example, (1) expands sign_extend to an ashift/ashiftrt pair,
> with the ashift being an immediate operand of the ashiftrt.

It more often converts it to a zero_extend here, for example (a real
one, or what expand_compound_operation comes up with).

> By design, (2) can perturb the structure of the expanded operations
> in X1.

Where X1 is just an abstraction you use here, not something that
actually exists in combine.  Okay.

> Sometimes it will rework them entirely.  But sometimes it
> will keep the outer operations and inner operations, but in a
> slightly different arrangement.  For example, the inner operation
> might no longer be a direct operand of the outer operation.

What does that mean?  It will always still be a single expression.

What is a "direct operand"?  A subexpression?  But it always *is* one.

> The PR contains another case where we need this.  We have:
> 
>   (set (reg:DI R2) (sign_extend:DI (reg:SI R1)))
>   ... (plus:DI (mult:DI (reg:DI R2) (const_int 4)) (reg:DI R3)) ...
> 
> which is a natural, direct combination on aarch64.
> 
> First, (1) expands the sign_extend into two operations.  It uses
> R1's nonzero_bits information to determine that the sign extension
> is equivalent to a zero extension:
> 
>   /* Convert sign extension to zero extension, if we know that the high
>      bit is not set, as this is easier to optimize.  It will be converted
>      back to cheaper alternative in make_extraction.  */
> 
> As I'll explain below, the problem is that this conversion back to a
> cheaper alternative doesn't happen.
> 
> The zero extension is expanded into an "and" of a subreg.
> Again, the nonzero_bits information narrows the "and" mask
> from a full SImode mask to a smaller constant (63).  So X1
> contains:
> 
>   (mult:DI (and:DI (subreg:DI (reg:SI R1) 0)
>                    (const_int 63))
>            (const_int 4))
> 
> The simplification rules for mult convert this to an ashift by 2.
> Then, this rule in simplify_shift_const_1:
> 
> 	  /* If we have (shift (logical)), move the logical to the outside
> 	     to allow it to possibly combine with another logical and the
> 	     shift to combine with another shift.  This also canonicalizes to
> 	     what a ZERO_EXTRACT looks like.  Also, some machines have
> 	     (and (shift)) insns.  */
> 
> moves the shift inside the "and", so that X2 contains:
> 
>   (and:DI (ashift:DI (subreg:DI (reg:SI R1) 0)
>                      (const_int 2))
>           (const_int 252))
> 
> We later recanonicalise to a mult (since this is an address):

Yeah that is a big wart, one that causes problems on most targets.  But
nothing new ;-)

>   (and:DI (mult:DI (subreg:DI (reg:SI R1) 0)
>                    (const_int 4))
>           (const_int 252))
> 
> But we fail to transform this back to the natural substitution:
> 
>   (mult:DI (sign_extend:DI (reg:SI R1))
>            (const_int 4))

You call this "natural".  Is that a reasonable thing to do on aarch?
What MD patterns should I look at?

> 	* combine.cc (make_compound_operation_int): Extend the AND to
> 	ZERO_EXTEND transformation so that it can handle an intervening
> 	multiplication by a power of two.

If that is truly all it does, that sounds nice :-)

But the patch does more?  make_compound_operation_and is the obvious
example, a new function.

Oh, this is factored out from existing code?  Please do that as a
separate patch.  First the refactoring, then the (hopefully tiny!) one
with the actual changes.

(And send new patch series as a new mail thread please).


Segher

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
  2023-03-08 22:50                         ` Segher Boessenkool
@ 2023-03-09 10:18                           ` Richard Sandiford
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Sandiford @ 2023-03-09 10:18 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Jakub Jelinek, Jeff Law via Gcc-patches, Tamar Christina,
	Roger Sayle, Jeff Law

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Mar 08, 2023 at 11:58:51AM +0000, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > An #ifdef is a way of making a change that is not finished yet not hurt
>> > the other targets.  It still hurts generic development, which indirectly
>> > hurts all targets.
>> 
>> Seems like this might be moot anyway given that your results
>> suggest no impact on other targets.
>
> Which means the patch does not do what it says it does.  It is a net
> negative on the only target it did change code on, too.
>
> If the patch did do what it promises it would be a (large!) net benefit,
> and also on various other targets.

I'm not sure which promise you're referring to here.  The patch wasn't
supposed to be a big sweeping improvement to combine. :-)  It was just
supposed to fix the regression.

> As it is, either the regression wasn't P1 at all, or the patch doesn't
> fix the problem, or the problem only happens in unusual code (or vector
> or float code).  Please explain what the regression is you want to
> solve?  With a compilable testcase etc., the usual.

The testcase is the one from the patch and the PR, compiled at -O2
on aarch64-linux-gnu:

---------------------------------------------------------------------------
extern const int constellation_64qam[64];

void foo(int nbits,
         const char *p_src,
         int *p_dst) {

  while (nbits > 0U) {
    char first = *p_src++;

    char index1 = ((first & 0x3) << 4) | (first >> 4);

    *p_dst++ = constellation_64qam[index1];

    nbits--;
  }
}
---------------------------------------------------------------------------
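
(A note on the arithmetic, in case it helps: on aarch64 plain char is
unsigned, so (first & 0x3) << 4 occupies bits 4-5 and first >> 4
occupies bits 0-3.  index1 is therefore known to fit in 6 bits, which
is where the nonzero_bits-derived mask of 63 in the RTL below comes
from.)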

The regression occurred in c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f.
Before that patch, the loop body was:

.L3:
        ldrb    w3, [x1, x0]
        ubfiz   w4, w3, 4, 2
        orr     w3, w4, w3, lsr 4
        ldr     w3, [x6, w3, sxtw 2]     // <----
        str     w3, [x2, x0, lsl 2]
        add     x0, x0, 1
        cmp     x5, x0
        bne     .L3

After the patch it is:

.L3:
        ldrb    w0, [x1, x3]
        ubfiz   w4, w0, 4, 2             
        orr     w0, w4, w0, lsr 4
        sxtw    x0, w0                   // <----
        ldr     w0, [x6, x0, lsl 2]      // <----
        str     w0, [x2, x3, lsl 2]
        add     x3, x3, 1
        cmp     x5, x3
        bne     .L3

Before the patch:

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
      REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
      REG_DEAD r116:DI
      REG_EQUAL [r116:DI*0x4+`constellation_64qam']
Successfully matched this instruction:
(set (reg:SI 117 [ constellation_64qam[_5] ])
    (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
                (const_int 4 [0x4]))
            (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
allowing combination of insns 24 and 25
original costs 4 + 16 = 20
replacement cost 16
deferring deletion of insn with uid = 24.
modifying insn i3    25: r117:SI=[sign_extend(r115:SI)*0x4+r118:DI]
      REG_DEAD r115:SI
deferring rescan insn with uid = 25.

After the patch:

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
      REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
      REG_DEAD r116:DI
      REG_EQUAL [r116:DI*0x4+`constellation_64qam']
Failed to match this instruction:
(set (reg:SI 117 [ constellation_64qam[_5] ])
    (mem/u:SI (plus:DI (and:DI (mult:DI (subreg:DI (reg:SI 115) 0)
                    (const_int 4 [0x4]))
                (const_int 252 [0xfc]))
            (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
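
(The extra and:DI wrapper is the leftover mask from the expanded
zero_extend; presumably no aarch64 address pattern accepts a masked,
scaled index, which is why the match fails and the separate sxtw
survives in the loop above.)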

expand_compound_operation has the curious code (that Richard pointed
out in the PR):

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */
  if (GET_CODE (x) == SIGN_EXTEND
      && HWI_COMPUTABLE_MODE_P (mode)
      && ((nonzero_bits (XEXP (x, 0), inner_mode)
	   & ~(((unsigned HOST_WIDE_INT) GET_MODE_MASK (inner_mode)) >> 1))
	  == 0))
    {
      rtx temp = gen_rtx_ZERO_EXTEND (mode, XEXP (x, 0));
      rtx temp2 = expand_compound_operation (temp);

      /* Make sure this is a profitable operation.  */
      if (set_src_cost (x, mode, optimize_this_for_speed_p)
          > set_src_cost (temp2, mode, optimize_this_for_speed_p))
       return temp2;
      else if (set_src_cost (x, mode, optimize_this_for_speed_p)
               > set_src_cost (temp, mode, optimize_this_for_speed_p))
       return temp;
      else
       return x;
    }

So we only use the expanded version of zero_extend if it is strictly
cheaper than sign_extend.  Otherwise we make a choice between the
original *unexpanded* sign_extend and the original *unexpanded*
zero_extend, preferring to keep things as they are in the event of a tie.

That is, this code bypasses the normal expansion of sign_extends if
we can prove that the top bit of the input is clear.  If all costs are
equal, it will keep the unexpanded sign_extend.  This is what happened
in the testcase before c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f.
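
(To put hypothetical numbers on that: if the sign_extend, the
zero_extend and the expanded shift/mask form all cost
COSTS_N_INSNS (1) on some target, neither strictly-cheaper test fires
and the unexpanded sign_extend is returned unchanged.  The numbers are
made up; the point is just that ties preserve the status quo.)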

Among other things, c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f added
the following rule to the simplify-rtx.cc handling of SIGN_EXTEND:

      /* We can canonicalize SIGN_EXTEND (op) as ZERO_EXTEND (op) when
         we know the sign bit of OP must be clear.  */
      if (val_signbit_known_clear_p (GET_MODE (op),
				     nonzero_bits (op, GET_MODE (op))))
	return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));

This has the effect of overriding the cost check above and
preferring zero_extend over sign_extend as a general principle.
We then optimise the zero_extend as I described yesterday.

So the trigger for the regression is that we convert a sign_extend
to a zero_extend when we didn't previously.
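
(Schematically, using the insns above:

   (mult:DI (sign_extend:DI (reg:SI 115)) (const_int 4))
=> (mult:DI (zero_extend:DI (reg:SI 115)) (const_int 4))
=> (mult:DI (and:DI (subreg:DI (reg:SI 115) 0)
                    (const_int 63))
            (const_int 4))
=> (and:DI (mult:DI (subreg:DI (reg:SI 115) 0)
                    (const_int 4))
           (const_int 252))

where the second step uses the known-clear sign bit, the third expands
the zero_extend, and the last moves the shift inside the mask.
Nothing then recognises the final form as an extension again.)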

>> On the code size results: as I mentioned on IRC yesterday, when I tried
>> building the linux kernel locally (with -Os -fno-schedule-insns{,2}),
>> I saw code size improvements in 182 .os and a regression in only one .o.
>> (I was comparing individual .os because it makes isolation easier.)
>
> Nothing here is about code size.  It is just a good stand-in to compare
> the result of a change in combine with: almost all changes in generated
> code are because combine could combine more (or fewer) instructions.
>
> This is good if you just want to look at a table of numbers.  It will
> often show something is obviously not good, or obviously good, and it
> shows what targets are of extra interest.

Sure.  But that applies regardless of whether -O2 or -Os is used
to compile the tests.  In both cases we're using code size to get
a general feel for the scope of the changes.  The point of using
-Os is that the compiler's goals are aligned with the thing that
we're measuring, so the at-a-glance results should be less noisy.

With -O2 there are several ways in which a beneficial combine change
(that in itself reduces code size) can cause acceptable increases
in code size downstream, usually via different RA choices leading
to different post-RA code layout.

If there were an "-O2 up to and including combine, -Os thereafter"
option, then I'd use that instead :-)

> You still need to look at the actual generated code to confirm things.
> For example, with your previous patch on aarch64 part of the code size
> increase is extra tail duplication (in the bpf interpreter), not a bad
> thing.  Unfortunately that was only a small part of the code size
> increase.

OK.  IMO, the benefit of using -Os is that you don't have to weed
out things like tail duplication by hand.

>> And the testcase in the PR was from a real workload (ArmRAL).
>
> What testcase?  Oh the snippet in #c0?
>
> This is not a good example if you want this to be P1, heh.

What kind of example would you need?  The point of the testcase
was to provide a minimal reproducer.

>> If you can isolate the case you saw, I'll have a look.  But ISTM that
>> the change is a pretty clear win on aarch64.
>
> No, *you* have to show it is an improvement.  I have a day job as well,
> you know.

Sorry, I wasn't trying to demand that you spend time isolating the
code size regression you'd seen.  But it seemed like you were treating
your results as a blocker for the patch.  It's difficult for me to explain
the results if I can't reproduce them.

Like I say, I've tried compiling a linux kernel locally and the results
from the patch were overwhelmingly positive.

>> Sometimes it will rework them entirely.  But sometimes it
>> will keep the outer operations and inner operations, but in a
>> slightly different arrangement.  For example, the inner operation
>> might no longer be a direct operand of the outer operation.
>
> What does that mean?  It will always still be a single expression.
>
> What is a "direct operand"?  A subexpression?  But it always *is* one.

I was using "A is a direct operand of B" to mean:

  XEXP (B, N) == A for some N or
  XVECEXP (B, N, M) == A for some N and M
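
(Concretely: in (plus:DI (mult:DI (reg:DI R2) (const_int 4))
(reg:DI R3)), the mult is a direct operand of the plus, while
(reg:DI R2) is a subexpression of the plus but not a direct operand
of it.)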

>> 	* combine.cc (make_compound_operation_int): Extend the AND to
>> 	ZERO_EXTEND transformation so that it can handle an intervening
>> 	multiplication by a power of two.
>
> If that is truly all it does, that sounds nice :-)

Gah.  I got so wrapped up in doing a new write-up that I completely
forgot to update the changelog :-)  So yes, this was still for the
original patch, sorry.

> Oh, this is factored out from existing code?  Please do that as a
> separate patch.  First the refactoring, than the (hopefully tiny!) one
> with the actual changes.
>
> (And send new patch series as a new mail thread please).

OK, will do.

Richard

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2023-03-04 18:32 [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap Roger Sayle
2023-03-04 22:17 ` Segher Boessenkool
2023-03-05 19:28   ` Tamar Christina
2023-03-05 19:56     ` Jeff Law
2023-03-05 20:43       ` Tamar Christina
2023-03-05 21:33         ` Segher Boessenkool
2023-03-06 12:08           ` Segher Boessenkool
2023-03-06 12:11             ` Tamar Christina
2023-03-06 12:47       ` [PATCH] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
2023-03-06 13:58         ` Segher Boessenkool
2023-03-06 15:08           ` Richard Sandiford
2023-03-06 16:18             ` Jakub Jelinek
2023-03-06 16:34               ` Richard Sandiford
2023-03-06 18:31                 ` Segher Boessenkool
2023-03-06 19:13                   ` Richard Sandiford
2023-03-06 23:31                     ` Segher Boessenkool
2023-03-08 11:58                       ` Richard Sandiford
2023-03-08 22:50                         ` Segher Boessenkool
2023-03-09 10:18                           ` Richard Sandiford
2023-03-06 22:58                 ` Segher Boessenkool
2023-03-06 18:13               ` Segher Boessenkool
