[Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order
@ 2021-02-07 18:07 ktkachov at gcc dot gnu.org
  2021-02-08  9:20 ` [Bug rtl-optimization/98986] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2021-02-07 18:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

            Bug ID: 98986
           Summary: Try matching both orders of commutative RTX operations
                    when there is no canonical order
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

The motivating aarch64 testcase is this:

#include <arm_neon.h>
int32x4_t
foo (int16x4_t a, int16x4_t b)
{
  int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));

  return vmull_s16 (tmp, a);
}

int32x4_t
foo2 (int16x4_t a, int16x4_t b)
{
  int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));

  return vmull_s16 (a, tmp);
}

Both functions should generate the widening-mult-by-lane form:
        smull   v0.4s, v0.4h, v1.h[3]   // 13   [c=16 l=4] 
aarch64_vec_smult_lane_v4hi

However only the second function foo2 manages to match it.
We have a pattern for this in aarch64-simd.md:
(define_insn "aarch64_vec_<su>mult_lane<Qlane>"
  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
        (mult:<VWIDE>
          (ANY_EXTEND:<VWIDE>
            (match_operand:<VCOND> 1 "register_operand" "w"))
          (ANY_EXTEND:<VWIDE>
            (vec_duplicate:<VCOND>
              (vec_select:<VEL>
                (match_operand:VDQHS 2 "register_operand" "<vwx>")
                (parallel [(match_operand:SI 3 "immediate_operand" "i")]))))))]
  "TARGET_SIMD"
  {
    operands[3] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[3]));
    return "<su>mull\\t%0.<Vwtype>, %1.<Vcondtype>, %2.<Vetype>[%3]";
  }
  [(set_attr "type" "neon_mul_<Vetype>_scalar_long")]
)

For foo combine tries and fails to match the vec_select in the first arm of the
mult:
(set (reg:V4SI 93 [ <retval> ])
    (mult:V4SI (sign_extend:V4SI (vec_duplicate:V4HI (vec_select:HI (reg:V4HI
99)
                    (parallel:V4HI [
                            (const_int 3 [0x3])
                        ]))))
        (sign_extend:V4SI (reg:V4HI 98))))

Unfortunately, due to the sign_extends on both arm of the mult there is no
canonical order for these expressions as both arms of the MULT are RTX_UNARY
expressions and swap_commutative_operands_p doesn't try to swap them around.
I guess we can work around this by adding more patterns in the backend to match
the two different orders we can get in this situation, but we've got 
so many similar patterns in the backend...

Do you think it's feasible to get recog or combine to try out both permutations
of such commutative operations when matching without blowing up compile time?
Any other ideas for resolving this are welcome

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
@ 2021-02-08  9:20 ` rguenth at gcc dot gnu.org
  2021-02-08 11:00 ` segher at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-08  9:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |11.0
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |segher at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Clearly making one order canonical would make sense (subreg, extend, lowpart of
a reg should be singled out from other random exprs of the same kind).

Doesn't help if existing backends do not follow the new canonical order.

On tree/GIMPLE we end up "trying" quite some permutes but on statically
generated decision trees (gimple/generic-match.c) which makes it expensive
on the compiler code-size side but not so expensive on the compile-time side.

(so pre-scanning target .mds and auto-generating parts of combine might
even speed it up ...)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
  2021-02-08  9:20 ` [Bug rtl-optimization/98986] " rguenth at gcc dot gnu.org
@ 2021-02-08 11:00 ` segher at gcc dot gnu.org
  2021-02-10 12:23 ` rsandifo at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-08 11:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
I agree it makes sense to have the one arm with vec_duplicate first in the
canonical order.  Problem is that this is deep in the arms, but it can be
done of course.

Autogenerating part of combine?  Nonononono please.  Or, what part do you
mean?  Something in rtx-simplify would make sense, and something in recog
would make a *lot* of sense.  For the latter, we probably want some more
syntax in the machine description, things like % are too restrictive (and
that is really only meant for RA).  For example, a common pattern is the
sum of three things, which has no good way of expressing right now.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
  2021-02-08  9:20 ` [Bug rtl-optimization/98986] " rguenth at gcc dot gnu.org
  2021-02-08 11:00 ` segher at gcc dot gnu.org
@ 2021-02-10 12:23 ` rsandifo at gcc dot gnu.org
  2021-02-10 12:27 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-02-10 12:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Yeah, I think making the canonicalisation rules go deep into
the operands has scalability problems.

FWIW, another similar thing I've wanted in the past is to try
recognising multiple possible constants in an (and X (const_int N))
when X is known to have some bits clear.  Often we try to make N contain
as few bits as possible, but that can give worse results than a fuller mask.

I had a WIP patch for this and binary order thing.  It used a wrapper
around recog (in recog.c) to apply useful-looking variations.
Of course, trying multiple variations has scalability issues too,
so it would need some kind of limiter.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-02-10 12:23 ` rsandifo at gcc dot gnu.org
@ 2021-02-10 12:27 ` rguenther at suse dot de
  2021-02-10 16:53 ` segher at gcc dot gnu.org
  2021-02-10 17:02 ` segher at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenther at suse dot de @ 2021-02-10 12:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

--- Comment #4 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 10 Feb 2021, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986
> 
> rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rsandifo at gcc dot gnu.org
> 
> --- Comment #3 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> Yeah, I think making the canonicalisation rules go deep into
> the operands has scalability problems.
> 
> FWIW, another similar thing I've wanted in the past is to try
> recognising multiple possible constants in an (and X (const_int N))
> when X is known to have some bits clear.  Often we try to make N contain
> as few bits as possible, but that can give worse results than a fuller mask.
> 
> I had a WIP patch for this and binary order thing.  It used a wrapper
> around recog (in recog.c) to apply useful-looking variations.
> Of course, trying multiple variations has scalability issues too,
> so it would need some kind of limiter.

So this is where the "autogenerated" part comes in.  We should have
an idea what might be useful and what isn't even worth trying by
looking at the machine description (which might require exposing
costs in such form for this case of constants).

For commutative operands maybe recog itself can be relaxed and
accept the insn with the "wrong" commutation (or fix it up
itself) for example.  Or maybe genrecog can magically emit
commutated variants (like genmatch does for :c annotated
expression branches).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-02-10 12:27 ` rguenther at suse dot de
@ 2021-02-10 16:53 ` segher at gcc dot gnu.org
  2021-02-10 17:02 ` segher at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-10 16:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

--- Comment #5 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #3)
> FWIW, another similar thing I've wanted in the past is to try
> recognising multiple possible constants in an (and X (const_int N))
> when X is known to have some bits clear.  Often we try to make N contain
> as few bits as possible, but that can give worse results than a fuller mask.

This could be done in the target machine description, where it makes a lot more
sense to do anyway, *if* nonzero_bits was generally usable there.  I have Plans
for that for GCC 12, but don't depend on it please :-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/98986] Try matching both orders of commutative RTX operations when there is no canonical order
  2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-02-10 16:53 ` segher at gcc dot gnu.org
@ 2021-02-10 17:02 ` segher at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-10 17:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986

--- Comment #6 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #4)
> So this is where the "autogenerated" part comes in.  We should have
> an idea what might be useful and what isn't even worth trying by
> looking at the machine description (which might require exposing
> costs in such form for this case of constants).
> 
> For commutative operands maybe recog itself can be relaxed and
> accept the insn with the "wrong" commutation (or fix it up
> itself) for example.  Or maybe genrecog can magically emit
> commutated variants (like genmatch does for :c annotated
> expression branches).

We could probably derive what things in an RTL expression are commutative (even
if there are many quantities in play), but only allowing the canonical forms in
that is a daunting task.  Something like :c could help; we already have % in
RTL,
but we need more general than that (examples: a+b+c and a*b+c*d should both be
handled some way, since such cases (structure, not necessarily those exact ops)
happen a lot in practice.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-02-10 17:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-07 18:07 [Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order ktkachov at gcc dot gnu.org
2021-02-08  9:20 ` [Bug rtl-optimization/98986] " rguenth at gcc dot gnu.org
2021-02-08 11:00 ` segher at gcc dot gnu.org
2021-02-10 12:23 ` rsandifo at gcc dot gnu.org
2021-02-10 12:27 ` rguenther at suse dot de
2021-02-10 16:53 ` segher at gcc dot gnu.org
2021-02-10 17:02 ` segher at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).