Re: [PATCH] combine: Try harder to form zero_extends [PR106594]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Sandiford <richard.sandiford@arm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Jakub Jelinek <jakub@redhat.com>,
	 Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org>,
	 Tamar Christina <Tamar.Christina@arm.com>,
	 Roger Sayle <roger@nextmovesoftware.com>,
	 Jeff Law <jeffreyalaw@gmail.com>
Subject: Re: [PATCH] combine: Try harder to form zero_extends [PR106594]
Date: Thu, 09 Mar 2023 10:18:14 +0000	[thread overview]
Message-ID: <mpt5yba2pp5.fsf@arm.com> (raw)
In-Reply-To: <20230308225015.GT25951@gate.crashing.org> (Segher Boessenkool's message of "Wed, 8 Mar 2023 16:50:15 -0600")

Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Wed, Mar 08, 2023 at 11:58:51AM +0000, Richard Sandiford wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > An #ifdef is a way of making a change that is not finished yet not hurt
>> > the other targets.  It still hurts generic development, which indirectly
>> > hurts all targets.
>> 
>> Seems like this might be moot anyway given that your results
>> suggest no impact on other targets.
>
> Which means the patch does not do what it says it does.  It is a net
> negative on the only target it did change code on, too.
>
> If the patch did do what it promises it would be a (large!) net benefit,
> and also on various other targets.

I'm not sure which promise you're referring to here.  The patch wasn't
supposed to be a big sweeping improvement to combine. :-)  It was just
supposed to fix the regression.

> As it is, either the regression wasn't P1 at all, or the patch doesn't
> fix the problem, or the problem only happens in unusual code (or vector
> or float code).  Please explain what the regression is you want to
> solve?  With a compilable testcase etc., the usual.

The testcase is the one from the patch and the PR, compiled at -O2
on aarch64-linux-gnu:

---------------------------------------------------------------------------
extern const int constellation_64qam[64];

void foo(int nbits,
         const char *p_src,
         int *p_dst) {

  while (nbits > 0U) {
    char first = *p_src++;

    char index1 = ((first & 0x3) << 4) | (first >> 4);

    *p_dst++ = constellation_64qam[index1];

    nbits--;
  }
}
---------------------------------------------------------------------------

The regression occurred in c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f.
Before that patch, the loop body was:

.L3:
        ldrb    w3, [x1, x0]
        ubfiz   w4, w3, 4, 2
        orr     w3, w4, w3, lsr 4
        ldr     w3, [x6, w3, sxtw 2]     // <----
        str     w3, [x2, x0, lsl 2]
        add     x0, x0, 1
        cmp     x5, x0
        bne     .L3

After the patch it is:

.L3:
        ldrb    w0, [x1, x3]
        ubfiz   w4, w0, 4, 2             
        orr     w0, w4, w0, lsr 4
        sxtw    x0, w0                   // <----
        ldr     w0, [x6, x0, lsl 2]      // <----
        str     w0, [x2, x3, lsl 2]
        add     x3, x3, 1
        cmp     x5, x3
        bne     .L3

Before the patch:

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
      REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
      REG_DEAD r116:DI
      REG_EQUAL [r116:DI*0x4+`constellation_64qam']
Successfully matched this instruction:
(set (reg:SI 117 [ constellation_64qam[_5] ])
    (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
                (const_int 4 [0x4]))
            (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
allowing combination of insns 24 and 25
original costs 4 + 16 = 20
replacement cost 16
deferring deletion of insn with uid = 24.
modifying insn i3    25: r117:SI=[sign_extend(r115:SI)*0x4+r118:DI]
      REG_DEAD r115:SI
deferring rescan insn with uid = 25.

After the patch:

Trying 24 -> 25:
   24: r116:DI=sign_extend(r115:SI)
      REG_DEAD r115:SI
   25: r117:SI=[r116:DI*0x4+r118:DI]
      REG_DEAD r116:DI
      REG_EQUAL [r116:DI*0x4+`constellation_64qam']
Failed to match this instruction:
(set (reg:SI 117 [ constellation_64qam[_5] ])
    (mem/u:SI (plus:DI (and:DI (mult:DI (subreg:DI (reg:SI 115) 0)
                    (const_int 4 [0x4]))
                (const_int 252 [0xfc]))
            (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))

expand_compound_operation has the curious code (that Richard pointed
out in the PR):

  /* Convert sign extension to zero extension, if we know that the high
     bit is not set, as this is easier to optimize.  It will be converted
     back to cheaper alternative in make_extraction.  */
  if (GET_CODE (x) == SIGN_EXTEND
      && HWI_COMPUTABLE_MODE_P (mode)
      && ((nonzero_bits (XEXP (x, 0), inner_mode)
	   & ~(((unsigned HOST_WIDE_INT) GET_MODE_MASK (inner_mode)) >> 1))
	  == 0))
    {
      rtx temp = gen_rtx_ZERO_EXTEND (mode, XEXP (x, 0));
      rtx temp2 = expand_compound_operation (temp);

      /* Make sure this is a profitable operation.  */
      if (set_src_cost (x, mode, optimize_this_for_speed_p)
          > set_src_cost (temp2, mode, optimize_this_for_speed_p))
       return temp2;
      else if (set_src_cost (x, mode, optimize_this_for_speed_p)
               > set_src_cost (temp, mode, optimize_this_for_speed_p))
       return temp;
      else
       return x;
    }

So we only use the expanded version of zero_extend if it is strictly
cheaper than sign_extend.  Otherwise we make a choice between the
original *unexpanded* sign_extend and the original *unexpanded*
zero_extend, preferring to keep things as they are in the event of a tie.

That is, this code bypasses the normal expansion of sign_extends if
we can prove that the top bit of the input is clear.  If all costs are
equal, it will keep the unexpanded sign_extend.  This is what happened
in the testcase before c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f.

Among other things, c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f added
the following rule to the simplify-rtx.cc handling of SIGN_EXTEND:

      /* We can canonicalize SIGN_EXTEND (op) as ZERO_EXTEND (op) when
         we know the sign bit of OP must be clear.  */
      if (val_signbit_known_clear_p (GET_MODE (op),
				     nonzero_bits (op, GET_MODE (op))))
	return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));

This has the effect of overriding the cost check above and
preferring zero_extend over sign_extend as a general principle.
We then optimise the zero_extend as I described yesterday.

So the trigger for the regression is that we convert a sign_extend
to a zero_extend when we didn't previously.

>> On the code size results: as I mentioned on IRC yesterday, when I tried
>> building the linux kernel locally (with -Os -fno-schedule-insns{,2}),
>> I saw code size improvements in 182 .os and a regression in only one .o.
>> (I was comparing individual .os because it makes isolation easier.)
>
> Nothing here is about code size.  It is just a good stand-in to compare
> the result of a change in combine with: almost all changes in generated
> code are because combine could combine more (or fewer) instructions.
>
> This is good if you just want to look at a table of numbers.  It will
> often show something is obviously not good, or obviously good, and it
> shows what targets are of extra interest.

Sure.  But that applies regardless of whether -O2 or -Os is used
to compile the tests.  In both cases we're using code size to get
a general feel for the scope of the changes.  The point of using
-Os is that the compiler's goals are aligned with the thing that
we're measuring, so the at-a-glance results should be less noisy.

With -O2 there are several ways in which a beneficial combine change
(that in itself reduces code size) can cause acceptable increases
in code size downstream, usually via different RA choices leading
to different post-RA code layout.

If there was an "-O2 up to and including combine, -Os thereafter" then I'd
use that instead :-)

> You still need to look at the actual generated code to confirm things.
> For example, with your previous patch on aarch64 part of the code size
> increase is extra tail duplication (in the bpf interpreter), not a bad
> thing.  Unfortunately that was only a small part of the code size
> increase.

OK.  IMO, the benefit of using -Os is that you don't have to weed
out things like tail duplication by hand.

>> And the testcase in the PR was from a real workload (ArmRAL).
>
> What testcase?  Oh the snippet in #c0?
>
> This is not a good example if you want this to be P1, heh.

What kind of example would you need?  The point of the testcase
was to provide a minimal reproducer.

>> If you can isolate the case you saw, I'll have a look.  But ISTM that
>> the change is a pretty clear win on aarch64.
>
> No, *you* have to show it is an improvement.  I have a day job as well,
> you know.

Sorry, I wasn't trying to demand that you spend time isolating the
code size regression you'd seen.  But it seemed like you were treating
your results as a blocker for the patch.  It's difficult for me to explain
the results if I can't reproduce them.

Like I say, I've tried compiling a linux kernel locally and the results
from the patch were overwhelmingly positive.

>> Sometimes it will rework them entirely.  But sometimes it
>> will keep the outer operations and inner operations, but in a
>> slightly different arrangement.  For example, the inner operation
>> might no longer be a direct operand of the outer operation.
>
> What does that mean?  It will always still be a single expression.
>
> What is a "direct operand"?  A subexpression?  But it always *is* one.

I was using "A is a direct operand of B" to mean:

  XEXP (B, N) == A for some N or
  XVECEXP (B, N, M) == A for some N and M

>> 	* combine.cc (make_compound_operation_int): Extend the AND to
>> 	ZERO_EXTEND transformation so that it can handle an intervening
>> 	multiplication by a power of two.
>
> If that is truly all it does, that sounds nice :-)

Gah.  I got so wrapped up in doing a new write-up that I completely
forgot to update the changelog :-)  So yes, this was still for the
original patch, sorry.

> Oh, this is factored out from existing code?  Please do that as a
> separate patch.  First the refactoring, than the (hopefully tiny!) one
> with the actual changes.
>
> (And send new patch series as a new mail thread please).

OK, will do.

Richard

next prev parent reply	other threads:[~2023-03-09 10:18 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-04 18:32 [PATCH] PR rtl-optimization/106594: Preserve zero_extend in combine when cheap Roger Sayle
2023-03-04 22:17 ` Segher Boessenkool
2023-03-05 19:28   ` Tamar Christina
2023-03-05 19:56     ` Jeff Law
2023-03-05 20:43       ` Tamar Christina
2023-03-05 21:33         ` Segher Boessenkool
2023-03-06 12:08           ` Segher Boessenkool
2023-03-06 12:11             ` Tamar Christina
2023-03-06 12:47       ` [PATCH] combine: Try harder to form zero_extends [PR106594] Richard Sandiford
2023-03-06 13:58         ` Segher Boessenkool
2023-03-06 15:08           ` Richard Sandiford
2023-03-06 16:18             ` Jakub Jelinek
2023-03-06 16:34               ` Richard Sandiford
2023-03-06 18:31                 ` Segher Boessenkool
2023-03-06 19:13                   ` Richard Sandiford
2023-03-06 23:31                     ` Segher Boessenkool
2023-03-08 11:58                       ` Richard Sandiford
2023-03-08 22:50                         ` Segher Boessenkool
2023-03-09 10:18                           ` Richard Sandiford [this message]
2023-03-06 22:58                 ` Segher Boessenkool
2023-03-06 18:13               ` Segher Boessenkool

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=mpt5yba2pp5.fsf@arm.com \
    --to=richard.sandiford@arm.com \
    --cc=Tamar.Christina@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=jeffreyalaw@gmail.com \
    --cc=roger@nextmovesoftware.com \
    --cc=segher@kernel.crashing.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).