[Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves
@ 2022-01-16 12:19 tnfchris at gcc dot gnu.org
  2022-01-17  2:20 ` [Bug rtl-optimization/104049] " pinskia at gcc dot gnu.org
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-01-16 12:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

            Bug ID: 104049
           Summary: [12 Regression] vec_select to subreg lowering causes
                    superfluous moves
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

Consider:

int test (uint8_t *p, uint32_t t[1][1], int n) {

  int sum = 0;
  uint32_t a0;
  for (int i = 0; i < 4; i++, p++)
    t[i][0] = p[0];

  for (int i = 0; i < 4; i++) {
    {
      int t0 = t[0][i] + t[0][i];
      a0 = t0;
    };
    sum += a0;
  }
  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

Which after the reduction gets SLP'd used to generate at -O3

        addv    s0, v0.4s
        fmov    w0, s0
        lsr     w1, w0, 16
        add     w0, w1, w0, uxth
        lsr     w0, w0, 1

which was pretty good. However in GCC 12 we now generate worse code:

        addv    s0, v0.4s
        fmov    w0, s0
        fmov    w1, s0
        and     w0, w0, 65535
        add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1

Notice the double transfer of the same value.

This is because at the RTL level the original mov becomes a vec_select

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 117)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

which previously stayed as a vec_select and the RA would use this pattern for
the w -> r move.

Now however this vec_select gets transformed into a subreg 0, which causes
combine to push the subreg into each instruction using reg 102.

(insn 21 18 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:SI 120)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))

and because these operations don't exist on the w side, reload is forced to
materialized many duplicate moves from w -> r.  So every operation that gets
the subreg pushed into it for which we don't have an operation for on the w
side gets an extra move.

Aside from that, we seem to lose that the & can be folded into the subreg by
simply truncating the subreg from SI to HI and zero extending that out.

A different reproducer is

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here
which is better than GCC 11.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
@ 2022-01-17  2:20 ` pinskia at gcc dot gnu.org
  2022-01-18 13:48 ` rguenth at gcc dot gnu.org
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-17  2:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-01-17
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=104039
   Target Milestone|---                         |12.0
     Ever confirmed|0                           |1
           Keywords|                            |ra

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This might be the same issue as PR 104039 really, though there we have a
paradoxical subreg.

The problem is the register allocator really really does not handle subreg in a
decent way.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
  2022-01-17  2:20 ` [Bug rtl-optimization/104049] " pinskia at gcc dot gnu.org
@ 2022-01-18 13:48 ` rguenth at gcc dot gnu.org
  2022-01-18 17:03 ` vmakarov at gcc dot gnu.org
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-18 13:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
We need to understand the issue at least.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
  2022-01-17  2:20 ` [Bug rtl-optimization/104049] " pinskia at gcc dot gnu.org
  2022-01-18 13:48 ` rguenth at gcc dot gnu.org
@ 2022-01-18 17:03 ` vmakarov at gcc dot gnu.org
  2022-01-18 17:39 ` tnfchris at gcc dot gnu.org
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-01-18 17:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #3 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> We need to understand the issue at least.

I think that it is not an RA problem.

IRA assigns quite reasonable registers.  LRA just generates 2 reloads for this
test, one for insn *add_lsr_si which has only one alternative and one for insn
andsi3 which needs reload insns for any alternative and LRA in this case
chooses the best one.

I guess the problem of the code generation regression is in some recent changes
of combiner or most probably aarch64 machine dependent code directing the
combiner (as Tamar wrote).

It would be nice if somebody bisected and found what commit resulted in the
regression.

As for double transfer of the value, it could be removed by inheritance in LRA
but it is impossible as an input reload pseudo got the same hard register (in
LRA assignment subpass) as one of the insn output pseudo (the assignment was
done in IRA) and the reloaded value is still used in subsequent insn.  
Unfortunately it can happen as RA can not make allocation and code selection
optimally in general case.

  Some coordination between LRA-assignment subpass and LRA-inheritance subpass
could help to avoid the double transfer but right now I have no idea how to do
this.  It is also dangerous to implement such coordination at this stage as
LRA-inheritance sub-pass is very complicated.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-01-18 17:03 ` vmakarov at gcc dot gnu.org
@ 2022-01-18 17:39 ` tnfchris at gcc dot gnu.org
  2022-01-19  3:28 ` pinskia at gcc dot gnu.org
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-01-18 17:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Vladimir Makarov from comment #3)
> (In reply to Richard Biener from comment #2)
> > We need to understand the issue at least.
> 
> I think that it is not an RA problem.
> 
> IRA assigns quite reasonable registers.  LRA just generates 2 reloads for
> this test, one for insn *add_lsr_si which has only one alternative and one
> for insn andsi3 which needs reload insns for any alternative and LRA in this
> case chooses the best one.
> 
> I guess the problem of the code generation regression is in some recent
> changes of combiner or most probably aarch64 machine dependent code
> directing the combiner (as Tamar wrote).

Aside from the duplicate move, the new code is actually better. It's better
that
it combined the shift in the add instead of the zero extend.  So that part is
fine.

> 
> It would be nice if somebody bisected and found what commit resulted in the
> regression.

It's caused by g:8695bf78dad1a42636775843ca832a2f4dba4da3 which is a general
change.
subreg 0 is the canonical RTL for the operation so this now folds to it more
often.

> 
> As for double transfer of the value, it could be removed by inheritance in
> LRA but it is impossible as an input reload pseudo got the same hard
> register (in LRA assignment subpass) as one of the insn output pseudo (the
> assignment was done in IRA) and the reloaded value is still used in
> subsequent insn.   Unfortunately it can happen as RA can not make allocation
> and code selection optimally in general case.
> 
>   Some coordination between LRA-assignment subpass and LRA-inheritance
> subpass could help to avoid the double transfer but right now I have no idea
> how to do this.  It is also dangerous to implement such coordination at this
> stage as LRA-inheritance sub-pass is very complicated.

That's fair if there's no easy fix for 12.  The benefits of the subreg change
outweigh
this downside.  So I'm ok with fixing it in 13.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-01-18 17:39 ` tnfchris at gcc dot gnu.org
@ 2022-01-19  3:28 ` pinskia at gcc dot gnu.org
  2022-01-19  3:30 ` pinskia at gcc dot gnu.org
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-19  3:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The regression showing up is the same as PR 104059, that is 
r12-5358-g32221357007666124409ec3ee0d3a1cf263ebc9e .

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-01-19  3:28 ` pinskia at gcc dot gnu.org
@ 2022-01-19  3:30 ` pinskia at gcc dot gnu.org
  2022-01-19  3:39 ` [Bug target/104049] " pinskia at gcc dot gnu.org
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-19  3:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---

Before my patch we had on the gimple level:

  vect_sum_26.13_38 = VIEW_CONVERT_EXPR<vector(4) int>(vect__7.9_48);
  _36 = (vector(4) unsigned int) vect_sum_26.13_38;
  _35 = .REDUC_PLUS (_36);
  _34 = (int) _35;
  _44 = .REDUC_PLUS (vect__7.9_48);
  _20 = _34 & 65535;
  _10 = (unsigned int) _20;
  _12 = _44 >> 16;
  _13 = _10 + _12;
  _14 = _13 >> 1;
  _23 = (int) _14;

After we get:

  _43 = .REDUC_PLUS (vect__7.11_47);
  _37 = _43 & 65535;
  _12 = _43 >> 16;
  _13 = _12 + _37;
  _14 = _13 >> 1;
  _23 = (int) _14;

Which is obviously better really. So there is nothing to be done on the gimple
level really. the only thing I can think of is how reducative plus is
implemented.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2022-01-19  3:30 ` pinskia at gcc dot gnu.org
@ 2022-01-19  3:39 ` pinskia at gcc dot gnu.org
  2022-02-01 11:26 ` tnfchris at gcc dot gnu.org
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-01-19  3:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Hmm,
;; _43 = .REDUC_PLUS (vect__7.11_47);

(insn 23 22 24 (set (reg:V4SI 112)
        (unspec:V4SI [
                (reg:V4SI 103 [ vect__7.11 ])
            ] UNSPEC_ADDV)) -1
     (nil))

(insn 24 23 0 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 112)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

Could we just represent REDUC_PLUS as:

(insn 23 22 24 (set (reg:SI 112)
        (unspec:SI [
                (reg:V4SI 103 [ vect__7.11 ])
            ] UNSPEC_ADDV_1)) -1
     (nil))

And then I don't think we will have the issue with the vec_select and subreg
really.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2022-01-19  3:39 ` [Bug target/104049] " pinskia at gcc dot gnu.org
@ 2022-02-01 11:26 ` tnfchris at gcc dot gnu.org
  2022-03-25 12:00 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-02-01 11:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #7)
> Hmm,
> ;; _43 = .REDUC_PLUS (vect__7.11_47);
> 
> (insn 23 22 24 (set (reg:V4SI 112)
>         (unspec:V4SI [
>                 (reg:V4SI 103 [ vect__7.11 ])
>             ] UNSPEC_ADDV)) -1
>      (nil))
> 
> (insn 24 23 0 (set (reg:SI 102 [ _43 ])
>         (vec_select:SI (reg:V4SI 112)
>             (parallel [
>                     (const_int 0 [0])
>                 ]))) -1
>      (nil))
> 
> Could we just represent REDUC_PLUS as:
> 
> (insn 23 22 24 (set (reg:SI 112)
>         (unspec:SI [
>                 (reg:V4SI 103 [ vect__7.11 ])
>             ] UNSPEC_ADDV_1)) -1
>      (nil))
> 
> And then I don't think we will have the issue with the vec_select and subreg
> really.

Hmm that does look like the RTL in the optab is wrong, The V2F optab is right
but the one with the mode iterator looks wrong.. I'll give it a shot.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2022-02-01 11:26 ` tnfchris at gcc dot gnu.org
@ 2022-03-25 12:00 ` jakub at gcc dot gnu.org
  2022-03-25 14:00 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-03-25 12:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?
I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer the
same problem when doing anything similar?
E.g. with -O2:

typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
}

The assembly difference is then:
-       fmov    w0, s0
-       lsr     w1, w0, 16
-       add     w0, w1, w0, uxth
+       umov    w0, v0.h[0]
+       fmov    w1, s0
+       add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1
        ret
Dunno how costly on aarch64 is Neon -> GPR register move.
Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2022-03-25 12:00 ` jakub at gcc dot gnu.org
@ 2022-03-25 14:00 ` jakub at gcc dot gnu.org
  2022-03-25 14:42 ` tnfchris at gcc dot gnu.org
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-03-25 14:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Anyway, a difference at *.combine time is:
(insn 19 18 22 2 (set (reg:SI 103 [ _35 ])
        (vec_select:SI (reg:V4SI 121)
            (parallel [
                    (const_int 0 [0])
                ]))) 2489 {aarch64_get_lanev4si}
     (expr_list:REG_DEAD (reg:V4SI 121)
        (nil)))
(note 22 19 23 2 NOTE_INSN_DELETED)
(insn 23 22 24 2 (set (reg:SI 125)
        (lshiftrt:SI (reg:SI 103 [ _35 ])
            (const_int 16 [0x10]))) "pr104049.c":16:54 694
{*aarch64_lshr_sisd_or_int_si3}
     (nil))
(insn 24 23 25 2 (set (reg:SI 126)
        (plus:SI (zero_extend:SI (subreg:HI (reg:SI 103 [ _35 ]) 0))
            (reg:SI 125))) "pr104049.c":16:33 220 {*add_zero_extendhi_si}
     (expr_list:REG_DEAD (reg:SI 103 [ _35 ])
        (expr_list:REG_DEAD (reg:SI 125)
            (nil))))
vs.
(insn 21 20 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "pr104049.c":16:33 500 {andsi3}
     (nil))
(insn 22 21 23 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "pr104049.c":16:33 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:V4SI 117)
        (expr_list:REG_DEAD (reg:SI 120)
            (nil))))
Appart from the (reg:SI 103 [ _35 ]) vs. (subreg:SI (reg:V4SI 117) 0)
difference from RA POV I don't see much a difference between the two,
both can be register allocated in a way that pseudo 103 or (subreg:SI (reg:V4SI
117) 0) is used just once.  In the first sequence by choosing e.g. x0 for 103
and 126, x1 for 125, in the second by using x1 for 120 and x0 for 121 and for
reloading of the (subreg:SI (reg:V4SI 117) 0) in all the spots, so trying to
prefer one or another at combine time doesn't look useful (the reason why the
first one isn't used is likely because (subreg:HI (reg:V4SI 117) 0) is
rejected).  But IRA instead decides to put both 120 and 121 into x0 and LRA
chooses x0 for reloading of the subreg in the first insn and x1 for the second
insn.
Could we fix this up in postreload or later?
(insn 35 18 21 2 (set (reg:SI 0 x0 [125])
        (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
     (nil))
(insn 21 35 36 2 (set (reg:SI 0 x0 [120])
        (and:SI (reg:SI 0 x0 [125])
            (const_int 65535 [0xffff]))) "pr104049.c":16:33 500 {andsi3}
     (nil))
(insn 36 21 22 2 (set (reg:SI 1 x1 [126])
        (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
     (nil))
(insn 22 36 28 2 (set (reg:SI 0 x0 [121])
        (plus:SI (lshiftrt:SI (reg:SI 1 x1 [126])
                (const_int 16 [0x10]))
            (reg:SI 0 x0 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
     (nil))
transformation into:
(insn 35 18 21 2 (set (reg:SI 0 x0 [125])
        (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
     (nil))
(insn 21 35 22 2 (set (reg:SI 0 x1 [120])
        (and:SI (reg:SI 0 x0 [125])
            (const_int 65535 [0xffff]))) "pr104049.c":16:33 500 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 0 x0 [121])
        (plus:SI (lshiftrt:SI (reg:SI 1 x0 [126])
                (const_int 16 [0x10]))
            (reg:SI 0 x1 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
     (nil))
?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2022-03-25 14:00 ` jakub at gcc dot gnu.org
@ 2022-03-25 14:42 ` tnfchris at gcc dot gnu.org
  2022-03-25 14:50 ` tnfchris at gcc dot gnu.org
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-03-25 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?

I think it's still a good idea as it fixes a bigger problem (unneeded SIMD
partial extracts) and makes it easier to write RTL as you don't have to deal
with both VEC_SELECT and subregs.  So having one canonical form is better.

> I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
> lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer
> the same problem when doing anything similar?

Yes but I think the problem is in how we do the transfers to start with. While
looking at this issue I noticed that the SIMD <-> genreg transfers for sizes
where we don't have an exact register for on the genreg (i.e. 8-bit and 16-bit)
are suboptimal (even before this) in a number of cases already and dealing with
that  underlying problem first is better, so I postponed it to GCC 13.

That is to say, even

typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (unsigned int)sum >> 16;
}

is suboptimal.

> E.g. with -O2:
> 
> typedef int V __attribute__((vector_size (4 * sizeof (int))));
> 
> int
> test (V a)
> {
>   int sum = a[0];
>   return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
> }
> 
> The assembly difference is then:
> -	fmov	w0, s0
> -	lsr	w1, w0, 16
> -	add	w0, w1, w0, uxth
> +	umov	w0, v0.h[0]
> +	fmov	w1, s0
> +	add	w0, w0, w1, lsr 16
>  	lsr	w0, w0, 1
>  	ret
> Dunno how costly on aarch64 is Neon -> GPR register move.
> Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?

The answer is quite uarch specific, but in general fmov w0, s0; mov w1, w0 is
cheaper, that said for the sequence you pasted above the it's really a bit of a
wash.

The old codegen has a longer dependency chain and needed both a shift and
zero-extend,

The new codegen removes the zero extends and folds the shift into the add but
adds a transfer so it about cancels out.

Ideally we'd want here:

        umov    w0, v0.h[0]
        umov    w1, v0.h[1]
        add     w0, w0, w1
        lsr     w0, w0, 1
        ret

where the shift and the zero extend are gone and the moves could be done in
parallel.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2022-03-25 14:42 ` tnfchris at gcc dot gnu.org
@ 2022-03-25 14:50 ` tnfchris at gcc dot gnu.org
  2022-03-25 14:52 ` tnfchris at gcc dot gnu.org
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-03-25 14:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> Could we fix this up in postreload or later?
> (insn 35 18 21 2 (set (reg:SI 0 x0 [125])
>         (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>      (nil))
> (insn 21 35 36 2 (set (reg:SI 0 x0 [120])
>         (and:SI (reg:SI 0 x0 [125])
>             (const_int 65535 [0xffff]))) "pr104049.c":16:33 500 {andsi3}
>      (nil))
> (insn 36 21 22 2 (set (reg:SI 1 x1 [126])
>         (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>      (nil))
> (insn 22 36 28 2 (set (reg:SI 0 x0 [121])
>         (plus:SI (lshiftrt:SI (reg:SI 1 x1 [126])
>                 (const_int 16 [0x10]))
>             (reg:SI 0 x0 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
>      (nil))
>
> transformation into:
> (insn 35 18 21 2 (set (reg:SI 0 x0 [125])
>         (reg:SI 32 v0 [117])) "pr104049.c":16:33 52 {*movsi_aarch64}
>      (nil))
> (insn 21 35 22 2 (set (reg:SI 0 x1 [120])
>         (and:SI (reg:SI 0 x0 [125])
>             (const_int 65535 [0xffff]))) "pr104049.c":16:33 500 {andsi3}
>      (nil))
> (insn 22 21 28 2 (set (reg:SI 0 x0 [121])
>         (plus:SI (lshiftrt:SI (reg:SI 1 x0 [126])
>                 (const_int 16 [0x10]))
>             (reg:SI 0 x1 [120]))) "pr104049.c":16:33 211 {*add_lsr_si}
>      (nil))
> ?

Yeah I think so, normally -frename-registers would have done it but it doesn't
find the opportunity in this case because of the zero extend which clobbers x0
before the second extract. So I think purely the instruction scheduling causes
a problem here.  So it's probably better to clean it up pre-reload if possible.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2022-03-25 14:50 ` tnfchris at gcc dot gnu.org
@ 2022-03-25 14:52 ` tnfchris at gcc dot gnu.org
  2022-04-01 11:29 ` rsandifo at gcc dot gnu.org
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-03-25 14:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #13 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
That said, I'll wait for Richard S to respond, but I don't think this is a P1
any longer, we know why it can't be done during reload and neither sequences
are really significantly better/worse.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2022-03-25 14:52 ` tnfchris at gcc dot gnu.org
@ 2022-04-01 11:29 ` rsandifo at gcc dot gnu.org
  2022-04-04 14:06 ` tnfchris at gcc dot gnu.org
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2022-04-01 11:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #14 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
FWIW, I agree with Vlad that this isn't an RA problem.  Some aarch64
instruction patterns are accepting operands that will inevitably
require a reload.

In principle we could tighten the predicates so that we reject
these kinds of subreg for operands that only allow GPRs (not FPRs).
But that just shifts the problem elsewhere.  Some of these patterns
do support w->w, so the net effect would be to lose out on some useful
combinations without fixing the actual problem.

I guess we should use an STV-like pass to do vector vs. scalar
instruction selection, with that pass being the one that forces
separate moves when (subreg:xI (reg:VnxI R) 0) occurs in a
“scalar” instruction.  Would be good to make it relatively
target-independent though.

So IMO we should fix RTL representation problem that Andrew pointed
out in comment 7 as the P1 fix, then accept the other cases as a P2
regression caused by bigger improvements elsewhere.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2022-04-01 11:29 ` rsandifo at gcc dot gnu.org
@ 2022-04-04 14:06 ` tnfchris at gcc dot gnu.org
  2022-04-07  7:29 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-04-04 14:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #14)
> 
> So IMO we should fix RTL representation problem that Andrew pointed
> out in comment 7 as the P1 fix, then accept the other cases as a P2
> regression caused by bigger improvements elsewhere.

Alright, mine then.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2022-04-04 14:06 ` tnfchris at gcc dot gnu.org
@ 2022-04-07  7:29 ` cvs-commit at gcc dot gnu.org
  2022-04-07  7:32 ` tnfchris at gcc dot gnu.org
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-04-07  7:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #16 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:024edf08959e9c1d5022901e6c4e5cbaa5b6c8d5

commit r12-8045-g024edf08959e9c1d5022901e6c4e5cbaa5b6c8d5
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Apr 7 08:27:53 2022 +0100

    AArch64: Fix left fold sum reduction RTL patterns [PR104049]

    As the discussion in the PR pointed out the RTL we have for the REDUC_PLUS
    patterns are wrong.  The UNSPECs are modelled as returning a vector and
then
    in an expand pattern we emit a vec_select of the 0th element to get the
scalar.

    This is incorrect as the instruction itself already only returns a single
scalar
    and by declaring it returns a vector it allows combine to push in a subreg
into
    the pattern, which causes reload to make duplicate moves.

    This patch corrects this by removing the weird indirection and making the
RTL
    pattern model the correct semantics of the instruction immediately.

    gcc/ChangeLog:

            PR target/104049
            * config/aarch64/aarch64-simd.md
            (aarch64_reduc_plus_internal<mode>): Fix RTL and rename to...
            (reduc_plus_scal_<mode>): ... This.
            (reduc_plus_scal_v4sf): Moved.
            (aarch64_reduc_plus_internalv2si): Fix RTL and rename to...
            (reduc_plus_scal_v2si): ... This.

    gcc/testsuite/ChangeLog:

            PR target/104049
            * gcc.target/aarch64/vadd_reduc-1.c: New test.
            * gcc.target/aarch64/vadd_reduc-2.c: New test.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2022-04-07  7:29 ` cvs-commit at gcc dot gnu.org
@ 2022-04-07  7:32 ` tnfchris at gcc dot gnu.org
  2023-04-26  6:55 ` [Bug target/104049] [12/13/14 " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-04-07  7:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.0                        |13.0
           Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org
           Priority|P1                          |P2

--- Comment #17 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
fixed the immediate regression will handle the general codegen issue in GCC 13

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12/13/14 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2022-04-07  7:32 ` tnfchris at gcc dot gnu.org
@ 2023-04-26  6:55 ` rguenth at gcc dot gnu.org
  2023-07-27  9:22 ` rguenth at gcc dot gnu.org
  2024-02-21  6:37 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-04-26  6:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|13.0                        |13.2

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.1 is being released, retargeting bugs to GCC 13.2.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12/13/14 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2023-04-26  6:55 ` [Bug target/104049] [12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-27  9:22 ` rguenth at gcc dot gnu.org
  2024-02-21  6:37 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-27  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|13.2                        |13.3

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.2 is being released, retargeting bugs to GCC 13.3.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/104049] [12/13/14 Regression] vec_select to subreg lowering causes superfluous moves
  2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2023-07-27  9:22 ` rguenth at gcc dot gnu.org
@ 2024-02-21  6:37 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-21  6:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2022-01-17 00:00:00         |2024-2-20
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #20 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Testcase in comment #9 is still an issue on the trunk.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-02-21  6:37 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-16 12:19 [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves tnfchris at gcc dot gnu.org
2022-01-17  2:20 ` [Bug rtl-optimization/104049] " pinskia at gcc dot gnu.org
2022-01-18 13:48 ` rguenth at gcc dot gnu.org
2022-01-18 17:03 ` vmakarov at gcc dot gnu.org
2022-01-18 17:39 ` tnfchris at gcc dot gnu.org
2022-01-19  3:28 ` pinskia at gcc dot gnu.org
2022-01-19  3:30 ` pinskia at gcc dot gnu.org
2022-01-19  3:39 ` [Bug target/104049] " pinskia at gcc dot gnu.org
2022-02-01 11:26 ` tnfchris at gcc dot gnu.org
2022-03-25 12:00 ` jakub at gcc dot gnu.org
2022-03-25 14:00 ` jakub at gcc dot gnu.org
2022-03-25 14:42 ` tnfchris at gcc dot gnu.org
2022-03-25 14:50 ` tnfchris at gcc dot gnu.org
2022-03-25 14:52 ` tnfchris at gcc dot gnu.org
2022-04-01 11:29 ` rsandifo at gcc dot gnu.org
2022-04-04 14:06 ` tnfchris at gcc dot gnu.org
2022-04-07  7:29 ` cvs-commit at gcc dot gnu.org
2022-04-07  7:32 ` tnfchris at gcc dot gnu.org
2023-04-26  6:55 ` [Bug target/104049] [12/13/14 " rguenth at gcc dot gnu.org
2023-07-27  9:22 ` rguenth at gcc dot gnu.org
2024-02-21  6:37 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).