RE: [PATCH 1/3]middle-end: Add the ability to let the target decide the method of argument promotions.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, "rguenther@suse.de" <rguenther@suse.de>,
	"jeffreyalaw@gmail.com" <jeffreyalaw@gmail.com>
Subject: RE: [PATCH 1/3]middle-end: Add the ability to let the target decide the method of argument promotions.
Date: Tue, 17 May 2022 07:55:20 +0000	[thread overview]
Message-ID: <VI1PR08MB532557F949E64669B671E999FFCE9@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <mptfsl9whir.fsf@arm.com>

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, May 16, 2022 5:48 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> jeffreyalaw@gmail.com
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, May 16, 2022 2:24 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >> jeffreyalaw@gmail.com
> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target
> >> decide the method of argument promotions.
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Monday, May 16, 2022 1:18 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >> >> jeffreyalaw@gmail.com
> >> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >> >> target decide the method of argument promotions.
> >> >>
> >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >>> -----Original Message-----
> >> >> >>> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> >>> Sent: Monday, May 16, 2022 12:36 PM
> >> >> >>> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> >> rguenther@suse.de;
> >> >> >>> jeffreyalaw@gmail.com
> >> >> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >> >> >>> target decide the method of argument promotions.
> >> >> >>>
> >> >> >>> Tamar Christina <tamar.christina@arm.com> writes:
> >> >> >>> > Hi All,
> >> >> >>> >
> >> >> >>> > Some targets require function parameters to be promoted to a
> >> >> >>> > different type on expand time because the target may not
> >> >> >>> > have native instructions to work on such types.  As an
> >> >> >>> > example the
> >> >> >>> > AArch64 port does not have native instructions working on
> >> >> >>> > integer
> >> >> >>> > 8- or 16-bit values.  As such it promotes every parameter of
> >> >> >>> > these
> >> >> types to 32-bits.
> >> >> >>>
> >> >> >>> This doesn't seem specific to parameters though.  It applies
> >> >> >>> to any
> >> >> >>> 8- or 16-bit variable.  E.g.:
> >> >> >>>
> >> >> >>> #include <stdint.h>
> >> >> >>> uint8_t foo(uint32_t x, uint32_t y) {
> >> >> >>>     uint8_t z = x != 0 ? x : y;
> >> >> >>>     return z + 1;
> >> >> >>> }
> >> >> >>>
> >> >> >>> generates:
> >> >> >>>
> >> >> >>> foo:
> >> >> >>>         cmp     w0, 0
> >> >> >>>         and     w1, w1, 255
> >> >> >>>         and     w0, w0, 255
> >> >> >>>         csel    w0, w1, w0, eq
> >> >> >>>         add     w0, w0, 1
> >> >> >>>         ret
> >> >> >>>
> >> >> >>> So I think the new behaviour is really a modification of the
> >> >> >>> PROMOTE_MODE behaviour rather than the
> >> >> PROMOTE_FUNCTION_MODE behaviour.
> >> >> >>>
> >> >> >>> FWIW, I agree with Richard that it would be better not to add
> >> >> >>> a new
> >> >> hook.
> >> >> >>> I think we're really making PROMOTE_MODE choose between
> >> >> SIGN_EXTEND,
> >> >> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
> >> >> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
> >> >> choice.
> >> >> >>
> >> >> >> Ah, I hadn't realized this also applied to locals.. ok I can
> >> >> >> modify PROMOTE_MODE then, but I also need the actual
> SSA_NAME
> >> and
> >> >> >> not just
> >> >> the type so will have to pass this along.
> >> >> >>
> >> >> >> From a practical point of view.. the actual hook however is
> >> >> >> implemented by 34 targets, would I need to CC maintainers for
> >> >> >> each of them or would global maintainer approval suffice for
> >> >> >> these mostly
> >> >> mechanical changes?
> >> >> >
> >> >> > Yeah, single approval should be enough mechanical changes.  It
> >> >> > would be good to do the interface change and mechanical target
> >> >> > changes as a separate prepatch if possible though.
> >> >> >
> >> >> > I'm not sure about passing the SSA name to the target though, or
> >> >> > the way that the aarch64 hook uses the info.  It looks like a
> >> >> > single cold comparison could defeat the optimisation for hot code.
> >> >
> >> > I'm not sure I follow why the likelihood of the comparison matters
> >> > in this
> >> case..
> >> > I'll expand on it below..
> >>
> >> I meant the likelihood that the comparison is executed at all, not
> >> which outcome is more likely.  E.g. suppose the only comparison
> >> occurs on a failure path that eventually calls abort, and that there
> >> are other paths (without comparisons of the same value) that would
> >> benefit from the any-extend optimisation.  We'd prioritise the cold
> >> comparison over optimising the other
> >> (hot) code.
> >>
> >> I'm just suspicious of heuristics along the lines of “don't do X if
> >> there is a single instance of Y”. :-)
> >
> > I'm probably very dense here sorry.. but if there's
> >
> > 1 use: the zero extend gets pushed down into the branch which needs it.
> >
> > i.e. in:
> >
> > extern void foo ();
> > extern void bar ();
> >
> > uint8_t f (uint8_t a, uint8_t b)
> > {
> >   if (b) {
> >     if (a)
> >       foo ();
> >     else
> >       return f (a, b);
> >   } else {
> >       bar ();
> >   }
> >   return b;
> > }
> >
> > The zero extend of a is only done in the true branch for if (b).
> > Secondly the zero extended form is the basis for all other patterns we
> > form, such as ands, which is the combination of the zero extend and
> compare.
> >
> > 2 uses, both live:
> >
> > extern void foo ();
> > extern void bar (uint8_t);
> >
> > uint8_t f (uint8_t a, uint8_t b)
> > {
> >   if (b) {
> >     if (a)
> >       foo ();
> >     else
> >       return f (a, b);
> >   } else {
> >       bar (a);
> >   }
> >   return b;
> > }
> >
> > In which case the extend of a is done before the if (b) and only the
> > extended values used.
> >
> > Even if you had multiple cold/unused branches, I struggle to see any
> > case where the any-extend would be better.  Reload must keep the value
> live as it's a param. Either you:
> >
> > 1. have enough registers to keep the value live, in which case, instead of
> doing a "mov" to
> >      To copy the value and then later an AND or TST, it's better to just do an
> and instead of the mov.
> >       You keep the same number of registers live but in the best case you
> have 1 instruction less, and
> >        The worse case you have 0 instructions more.
> > 2. You don't have enough registers to keep the value live, in which case the
> zero extended value is
> >      Still better because on the reload it can simply use ldrb ..., cbz as we use
> the load for an implicit
> >      zero extend. Which is still better than ldrb ..., tst, cbnz for an any-extend.
> >
> > So I am genuinely struggling to see a case where any-extend is better
> > for comparison. And the only reason I am singling out comparisons is
> > because in GIMPLE integer constants don't get an explicit promotion to
> > int.  Otherwise I wouldn't have needed to as it would have always required
> an extend here.
> 
> IIUC, you're talking about cases involving multiple comparisons.  I was instead
> talking about the case where there is 1 cold comparison that doesn't benefit
> from any-extend and multiple hot operations (not comparisons) that do
> benefit.  The patch then seemed to avoid any-extend because of the cold
> comparison.
> 
> E.g. does the patch avoid the AND in:
> 
> #include <stdint.h>
> uint8_t foo(uint8_t x, int y) {
>     if (y) {
>         printf("Foo %d\n", x ? 1 : 2);
>         __builtin_abort ();
>     }
>     return x + 1;
> }
> 
> ?

Morning,

It does actually, it generates:

foo:
        cbnz    w1, .L9
        add     w0, w0, 1
        ret
.L9:
        tst     w0, 255
        stp     x29, x30, [sp, -16]!
        cset    w1, ne
        add     w1, w1, 1
        mov     x29, sp
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        bl      printf
        bl      abort
        .size   foo, .-foo

Now I will admit that this isn't because of a grand master design, but
purely because the patch works around the cases seen in SPEC.  In those
cases the comparisons in question were floated out of the if statement.

The heuristic in patch 2/3 allows this because it only looks for compares in
gimple assigns whereas in this case the compare is in the Gimple cond
directly.

> 
> >> >> > If we do try to make the decision based on uses at expand time,
> >> >> > it might be better for the analysis to be in target-independent
> >> >> > code, with help from the target to decide where extensions are
> >> >> > cheap.  It still feels a bit hacky though.
> >> >
> >> > I thought about it but can't see most target having this problem. I
> >> > did go with an optimistic heuristics. There are of course various
> >> > ways to defeat it but looking through the corpus of code I don't
> >> > see any but the simple cases in practice. (more below).
> >> >
> >> >> >
> >> >> > What stops us from forming cbz/cbnz when the extension is done
> >> >> > close to the comparison (from the comment in 2/3)?  If we can
> >> >> > solve that, could we simply do an any-extend all the time, and
> >> >> > treat removing redundant extensions as a global availability problem?
> >> >>
> >> >
> >> > In such cases there's no gain from doing the extension at all, e.g.
> >> > and w0, w0, 255
> >> > cmp w0, 0
> >> > b.eq .Lfoo
> >> >
> >> > will be optimized to
> >> >
> >> > tst w0, 0xff
> >> > b.ne .Lfoo
> >> >
> >> > already.
> >> >
> >> > In RTL the problem occurs when you have nested control flow like
> >> > nested if and switch statements The example in 2/3 was intended to
> >> > show that before what we'd do is
> >> >
> >> > and w22, w0, 255
> >> > .... <code that clobbers cc and caller saves> <switch1> cbz w22,
> >> > .Lfoo1 ....
> >> > <switch2>
> >> > cbz w22, .Lfoo2
> >> >
> >> > If we have a single comparison we already sink the zero_extend today.
> >> >
> >> > Now if we instead any-extend w0 we end up with:
> >> >
> >> > mov w22, w0
> >> > .... <code that clobbers cc and caller saves> <switch1> tst w22,
> >> > 0xff b.ne .Lfoo1 ....
> >> > <switch2>
> >> > tst w22, 0xff
> >> > b.ne .Lfoo2
> >> >
> >> > So you get an additional tst before each branch. You also can't
> >> > perform the
> >> tst higher since CC is clobbered.
> >> > The cbz is nice because the zero extend doesn't use CC of course
> >> > and so having the value live allows us to optimize The branch.
> >>
> >> Once the cbz has been formed (in combine), where does the
> >> optimisation of it happen?
> >
> > There's no real "optimization". Combine combines the cmp 0 and br
> > leaving the AND behind.  Because of the live range required for the
> > value reload must copy it away from a caller save.  It chooses to move it to
> w22 in this case.
> >
> > and w0, w0, 255
> > mov w22, w0
> >
> > this simply gets simplified into and w22, w0, 255 by a zero extending move
> pattern.
> > The only optimization here is when the pattern isn't single use, it's simply
> not moved/folded.
> >
> > The only options available to combine are
> >
> > cmp, br = tst + br (in the case of a subreg where it can't tell what
> > the top bits are) and, cmp, br = ands + br (if value is single use)
> > cmp, br = cbz (in the case it knows that the top bits are 0).
> >
> > If we emit a zero extend both operations above are possible, and we
> > emit them depending on value being single use or not.  If we emit a
> > paradoxical subreg, we never form cbz unless the value comes from an
> operation where GIMPLE has maintained C semantics.
> >
> > But I am probably missing something.. so I'll just make the changes
> > and see where we land 😊
> 
> No, I agree/was agreeing with the description of the combine behaviour.
> I guess I just misunderstood what you meant by “the cbz is nice because the
> zero extend doesn't use CC of course and so having the value live allows us
> to optimize the branch”.
> 
> >> > I don't think branch likeliness matters here as you must keep w22
> >> > live in both cases somehow. In the SPECCPU 2017 Benchmark perlbench
> >> > (which uses a lot of nested switches) this pattern is responsible
> >> > for an extra 0.3%
> >> codesize increase which the approach in 2/3 prevents.
> >> >
> >> >> (which would run after combine)
> >> >>
> >> >> >
> >> >> > What kind of code do we emit when do an extension just before an
> >> >> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say,
> >> >> > then it should be safe to do the extension directly into R:
> >> >> >
> >> >> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
> >> >>
> >> >> Oops, that should of course be:
> >> >>
> >> >>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
> >> >>
> >> >> > which avoids the problem of having two values live at once (the
> >> >> > zero-extended value and the any-extended value)
> >
> > I'm assuming R here is the hardreg which has the parameter? In which
> > case wouldn't the subreg be folded away? I.e you end up with
> >
> > (set (reg:SI R) (zero_extend:SI (reg:QI R)))
> 
> No, R is the pseudo that holds the DECL_RTL (for both VAR_DECLs and
> PARM_DECLs).
> 
> > ? But that SET isn’t paradoxical, we wouldn't generate it.
> >
> > We generate for e.g.:
> >
> > #include <stdint.h>
> >
> > uint16_t f8 (uint8_t xr, uint8_t xc){
> >     return (uint8_t)(xr * xc);
> > }
> >
> > (insn 9 6 10 2 (set (reg:HI 101)
> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
> (nil))
> (insn 10 9 11 2 (set (reg:HI 102)
> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
> (nil))
> (insn 11 10 12 2 (set (reg:SI 103)
> (mult:SI (subreg:SI (reg:HI 101) 0)
> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> (nil))
> >
> > Out of expand. The paradoxical subreg isn't generated at all out of
> > expand unless it's needed. It does keep the original params around as
> unused:
> >
> > (insn 2 7 4 2 (set (reg:QI 97)
> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
> (nil))
> (insn 4 2 3 2 (set (reg:QI 99)
> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
> (nil))
> >
> > And the paradoxical subreg is moved into the first operation requiring it:
> >
> > (insn 11 10 12 2 (set (reg:SI 103)
> (mult:SI (subreg:SI (reg:HI 101) 0)
> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> (nil))
> 
> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc registers would
> be SIs and the DECL_RTLs would be QI subregs of those SI regs.
> I think that might work better, for the reasons above.  (That is, whenever we
> need the register in extended form, we can simply extend the existing reg
> rather than create a new one.)

Ah, I see, no, I explicitly avoid this. When doing the type promotions I tell it that
size of the copies of xr and xc is still the original size, e.g. QI (i.e. I don't change 97 and 99).
This is different from what we do with extends where 97 and 99 *would* be changed.

The reason is that if I make this SI the compiler thinks it knows the value of all the bits
in the register which led to various miscompares as it thinks it can use the SI value directly.

This happens because again the xr and xc are hard regs. So having 97 be

(set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))

gets folded to an incorrect

(set (reg:SI 97) (reg:SI 0 x0 [ xr ]))

And now 97 is free to be used without any zero extension, as 97 on it's own is an invalid RTX.

So I have to keep the intermediate copy QI mode, after which the RTX optimizations
being done during expand generates the forms above.

> 
> I think that's where confusion was coming from.
> 
> > In any case, I'm still not totally sure what the objection here is.
> > Afaik, compares need to be treated specially because in GIMPLE they
> > already are.  Afaik, C integer promotion rules state that in the
> > comparison 0 should have been promoted to an integer constant of rank
> > int and so the comparison itself should have been done as integer.
> > i.e. extended.  And most of our patterns are based around this.
> >
> > Gimple however doesn't do this, the comparison is done in the rank of
> > the variable and there is no explicit conversion.  This happened to be
> > fixed up before during the forced promotion.  So to me the heuristic
> > doesn't seem to be that crazy..
> 
> I guess I still don't see the distinction.  C says the same thing about
> +, -, >>, etc.  And gimple is free to do those operations in narrow
> +types
> if it wants to, and if that doesn't change the semantics.  (Not that gimple
> always does them in narrow types.  But it is valid gimple.)
> 
> The optimisation problem doesn't come from C or gimple semantics, but
> from the fact that AArch64 (unlike x86 say) doesn't have byte add, byte
> compare, byte right shift, etc.  We therefore need to promote 8-bit and 16-
> bit operations to 32 bits first.
> 
> For add, subtract, multiply, left shift, and logic ops, we can avoid defining the
> upper bits of the inputs when we do these extensions, because the upper
> bits of the inputs don't affect the useful bits of the result.  But for
> comparisons, right shifts, and divides, etc., we do need to extend.
> 
> AIUI, the comparison case is special because (for AArch64-specific reasons),
> we prefer extend + cbz to tst + branch, especially when the extend can be
> shared.

Agreed, so I think we agree on this 😊 I guess the disagreement is where this
should be done. I'll admit that the testcase above works by coincidence. But if
we don't do it during expand time, the only place I can think of to introduce
the zero extends is to add various patterns to do an early split of any-extend
+ cmp.  But wouldn't that be more fragile? At least at expand time all comparisons
are tcc_comparisons. 

Kind regards,
Tamar
> 
> Thanks,
> Richard

next prev parent reply	other threads:[~2022-05-17  7:55 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-13 17:11 Tamar Christina
2022-05-13 17:11 ` [PATCH 2/3]AArch64 Promote function arguments using a paradoxical subreg when beneficial Tamar Christina
2022-10-27  3:15   ` Andrew Pinski
2022-10-28  9:57     ` Tamar Christina
2022-05-13 17:12 ` [PATCH 3/3]AArch64 Update the testsuite to remove xfails Tamar Christina
2022-05-16  6:31 ` [PATCH 1/3]middle-end: Add the ability to let the target decide the method of argument promotions Richard Biener
2022-05-16  8:26   ` Tamar Christina
2022-05-16 11:36 ` Richard Sandiford
2022-05-16 11:49   ` Tamar Christina
2022-05-16 12:14     ` Richard Sandiford
2022-05-16 12:18       ` Richard Sandiford
2022-05-16 13:02         ` Tamar Christina
2022-05-16 13:24           ` Richard Sandiford
2022-05-16 15:29             ` Tamar Christina
2022-05-16 16:48               ` Richard Sandiford
2022-05-17  7:55                 ` Tamar Christina [this message]
2022-05-17  9:03                   ` Richard Sandiford
2022-05-17 17:45                     ` Tamar Christina
2022-05-18  7:49                       ` Richard Sandiford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR08MB532557F949E64669B671E999FFCE9@VI1PR08MB5325.eurprd08.prod.outlook.com \
    --to=tamar.christina@arm.com \
    --cc=Richard.Sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jeffreyalaw@gmail.com \
    --cc=nd@arm.com \
    --cc=rguenther@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).