From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/104049] New: [12 Regression] vec_select to subreg lowering causes superfluous moves
Date: Sun, 16 Jan 2022 12:19:08 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

            Bug ID: 104049
           Summary: [12 Regression] vec_select to subreg lowering causes
                    superfluous moves
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

Consider:

#include <stdint.h>

int
test (uint8_t *p, uint32_t t[1][1], int n)
{
  int sum = 0;
  uint32_t a0;
  for (int i = 0; i < 4; i++, p++)
    t[i][0] = p[0];

  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = t[0][i] + t[0][i];
        a0 = t0;
      };
      sum += a0;
    }
  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

After the reduction is SLP-vectorized, this used to generate the following
at -O3:

        addv    s0, v0.4s
        fmov    w0, s0
        lsr     w1, w0, 16
        add     w0, w1, w0, uxth
        lsr     w0, w0, 1

which was pretty good.  However, in GCC 12 we now generate worse code:

        addv    s0, v0.4s
        fmov    w0, s0
        fmov    w1, s0
        and     w0, w0, 65535
        add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1

Notice the double transfer of the same value.  This is because at the RTL
level the original mov becomes a vec_select

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 117)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

which previously stayed as a vec_select, and the RA would use this pattern
for the w -> r move.  Now, however, this vec_select gets transformed into a
subreg 0, which causes combine to push the subreg into each instruction
using reg 102.

(insn 21 18 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:SI 120)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))

and because these operations don't exist on the w side, reload is forced to
materialize many duplicate moves from w -> r.  So every instruction that
gets the subreg pushed into it, and for which we don't have an equivalent
operation on the w side, gets an extra move.

Aside from that, we seem to lose the fact that the & can be folded into the
subreg by simply truncating the subreg from SI to HI and zero extending
that out.
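To make the last point concrete, here is a minimal C sketch of the identity
behind that fold: masking a 32-bit value with 0xffff gives the same result
as truncating it to 16 bits and zero extending it back.  The function names
are made up for illustration and do not correspond to anything in GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask form: what the source expression (sum & 0xffff) computes.  */
static uint32_t
low16_by_mask (uint32_t x)
{
  return x & 0xffffu;
}

/* Truncate-and-extend form: read only the low 16 bits (an HImode view
   of the value), then zero extend back to 32 bits.  */
static uint32_t
low16_by_truncation (uint32_t x)
{
  return (uint32_t) (uint16_t) x;
}

/* Hypothetical self-check: both forms agree on a few boundary values.  */
static void
check_low16_forms (void)
{
  static const uint32_t samples[] =
    { 0u, 1u, 0xffffu, 0x10000u, 0x12345678u, 0xffffffffu };

  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (low16_by_mask (samples[i]) == low16_by_truncation (samples[i]));
}
```

This only demonstrates the arithmetic equivalence at the C level; whether
and where the RTL passes should perform the fold is a separate question.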
A different reproducer is

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int
bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here,
which is better than what GCC 11 generates.
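For checking the expected result of bar on a host without NEON, the
following is a portable scalar sketch.  It assumes that vaddvq_s32 behaves
as a wrapping 32-bit sum of the four lanes; add_across_model and
fold_halves are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: models vaddvq_s32 as a 32-bit sum of the four lanes that
   wraps on overflow.  Unsigned arithmetic is used so the wrap is well
   defined in C.  */
static uint32_t
add_across_model (const int32_t lanes[4])
{
  uint32_t sum = 0;

  for (int i = 0; i < 4; i++)
    sum += (uint32_t) lanes[i];
  return sum;
}

/* Scalar tail of bar: low 16 bits of the sum plus its high 16 bits.
   The result is at most 0xffff + 0xffff, so it always fits in an int.  */
static int
fold_halves (uint32_t sum)
{
  return (int) ((uint16_t) (sum & 0xffff) + (sum >> 16));
}
```

With lanes {1, 2, 3, 4} the model sums to 10 and fold_halves returns 10;
with a sum of 0x12345678 it returns 0x5678 + 0x1234 = 0x68ac.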