From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id BE8103858D28; Sun, 16 Jan 2022 12:19:08 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BE8103858D28
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/104049] New: [12 Regression] vec_select to
 subreg lowering causes superfluous moves
Date: Sun, 16 Jan 2022 12:19:08 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 keywords bug_severity priority component assigned_to reporter
 target_milestone cf_gcctarget
Message-ID: <bug-104049-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 16 Jan 2022 12:19:08 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

            Bug ID: 104049
           Summary: [12 Regression] vec_select to subreg lowering causes
                    superfluous moves
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

Consider:

#include <stdint.h>

int test (uint8_t *p, uint32_t t[1][1], int n) {

  int sum = 0;
  uint32_t a0;
  for (int i = 0; i < 4; i++, p++)
    t[i][0] = p[0];

  for (int i = 0; i < 4; i++) {
    {
      int t0 = t[0][i] + t[0][i];
      a0 = t0;
    };
    sum += a0;
  }
  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

After the reduction gets SLP'd, this used to generate the following at -O3:

        addv    s0, v0.4s
        fmov    w0, s0
        lsr     w1, w0, 16
        add     w0, w1, w0, uxth
        lsr     w0, w0, 1

which was pretty good. However in GCC 12 we now generate worse code:

        addv    s0, v0.4s
        fmov    w0, s0
        fmov    w1, s0
        and     w0, w0, 65535
        add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1

Notice the double transfer of the same value.

This is because at the RTL level the original mov becomes a vec_select

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 117)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

which previously stayed as a vec_select and the RA would use this pattern for
the w -> r move.

Now however this vec_select gets transformed into a subreg 0, which causes
combine to push the subreg into each instruction using reg 102.

(insn 21 18 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:SI 120)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))

and because these operations don't exist on the w side, reload is forced to
materialize many duplicate moves from w -> r.  So every operation that gets
the subreg pushed into it, and for which we don't have an equivalent on the w
side, gets an extra move.
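
For reference, the equivalence that the vec_select -> subreg lowering relies on
can be sketched in portable C with the GCC vector extension.  This is only an
illustration, not what the compiler does internally: the function names are
made up for this sketch, and the subreg is modelled as a reinterpretation of
the vector's bytes at offset 0 through memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: build a four-lane vector from scalars.  */
v4si make_v4si (int32_t a, int32_t b, int32_t c, int32_t d)
{
  v4si v = { a, b, c, d };
  return v;
}

/* Models (vec_select:SI (reg:V4SI) (parallel [0])): select lane 0.  */
int32_t lane0_select (v4si v)
{
  return v[0];
}

/* Models (subreg:SI (reg:V4SI) 0): read the first four bytes of the
   vector as an SI value.  Going through memcpy is an assumption of
   this sketch; it stands in for the register reinterpretation.  */
int32_t lane0_lowpart (v4si v)
{
  int32_t r;
  memcpy (&r, &v, sizeof r);
  return r;
}

/* Hypothetical self-check: both forms must yield the same value.  */
void check_lane0_forms_agree (void)
{
  v4si v = make_v4si (10, 20, 30, 40);
  assert (lane0_select (v) == lane0_lowpart (v));
}
```

Both forms name the same value, so the lowering itself is valid.  The problem
described above is only that the subreg form can be duplicated into every user,
whereas the vec_select form kept a single w -> r transfer.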

Aside from that, we seem to lose that the & can be folded into the subreg by
simply truncating the subreg from SI to HI and zero extending that out.
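
The folding mentioned here rests on a simple identity, which can be checked in
plain C.  This is a minimal sketch with function names invented for the
example; it only shows that masking with 0xffff equals truncating to 16 bits
and zero extending, and says nothing about how combine would implement it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask form, as it appears in the and:SI with const_int 65535.  */
uint32_t low16_mask (uint32_t sum)
{
  return sum & 0xffffu;
}

/* Truncate SI to HI, then zero extend back out to SI.  */
uint32_t low16_trunc_zext (uint32_t sum)
{
  return (uint32_t) (uint16_t) sum;
}

/* Hypothetical self-check over a few sample values.  */
void check_low16_forms_agree (void)
{
  static const uint32_t samples[] =
    { 0u, 1u, 0xffffu, 0x10000u, 0x12345678u, 0xffffffffu };

  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (low16_mask (samples[i]) == low16_trunc_zext (samples[i]));
}
```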

A different reproducer is

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here,
which is better than GCC 11.