From: "law at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/108041] New: ivopts results in extra instruction in simple loop
Date: Fri, 09 Dec 2022 22:45:01 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108041

Bug ID: 108041
Summary: ivopts results in extra instruction in simple loop
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
CC: rzinsly at ventanamicro dot com
Target Milestone: ---

ivopts seems to make a bit of a mess out of this code resulting in the loop having an
unnecessary instruction.  Compile with rv64 -O2:

typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

void marc_arcs(network_t* net)
{
  while (net->full_groups < 0) {
    net->full_groups = net->nr_group + net->full_groups;
    net->max_elems--;
  }
}

After slp1 we have this loop:

;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # net__max_elems_lsm.4_5 = PHI <_4(7), net__max_elems_lsm.4_16(3)>
  _2 = _1 + _13;
  _4 = net__max_elems_lsm.4_5 + -1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6

Of particular interest is the max_elems computation into _4.  We accumulate it in the loop, then do the final store after the loop (thank you LSM!).

After ivopts we have:

;;   basic block 3, loop depth 0
;;    pred:       2
  _1 = net_8(D)->nr_group;
  net__max_elems_lsm.4_16 = net_8(D)->max_elems;
  _22 = net__max_elems_lsm.4_16 + -1;
  ivtmp.10_21 = (unsigned long) _22;
;;    succ:       4

;;   basic block 4, loop depth 1
;;    pred:       7
;;                3
  # _13 = PHI <_2(7), _11(3)>
  # ivtmp.10_3 = PHI <ivtmp.10_18(7), ivtmp.10_21(3)>
  _2 = _1 + _13;
  _4 = (long int) ivtmp.10_3;
  ivtmp.10_18 = ivtmp.10_3 - 1;
  if (_2 < 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 5>; [11.00%]
;;    succ:       7
;;                5

;;   basic block 7, loop depth 1
;;    pred:       4
  goto <bb 4>; [100.00%]
;;    succ:       4

;;   basic block 5, loop depth 0
;;    pred:       4
  # _12 = PHI <_2(4)>
  # _17 = PHI <_4(4)>
  net_8(D)->full_groups = _12;
  net_8(D)->max_elems = _17;
;;    succ:       6

Note the introduction of the IV and its relationship to _4.  Essentially we compute both in the loop even though _4 is always one greater than the IV.  Worse yet, the IV is only used to compute _4!
And since they differ by 1, we actually compute both and keep them alive, resulting in this final code for rv64:

.L3:
        add     a5,a5,a2
        mv      a3,a4
        addi    a4,a4,-1
        blt     a5,zero,.L3
        sd      a5,8(a0)
        sd      a3,16(a0)

Note how we had to "stash away" the value of a4 before the decrement so that we could store it after the loop.

The induction variable doesn't really buy us anything in this loop -- it's actively harmful.  Not using the IV would probably be best.  Second best would be to realize that _4 (aka a3) can be derived from the IV (a4) after the loop by adding 1.
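
To make the "second best" option concrete, here is a rough source-level model of it.  This is only a sketch, not what GCC emits: marc_arcs_derived and its local names (fg, step, iv) are made up for illustration, and the do/while shape is assumed to mirror the loop in the dump, with the entry test done once up front.  Only the decremented counter is carried through the loop, and the value to store is recovered afterwards by adding 1.

```c
typedef struct network
{
  long nr_group, full_groups, max_elems;
} network_t;

/* The loop as written in the report. */
void marc_arcs(network_t *net)
{
  while (net->full_groups < 0) {
    net->full_groups = net->nr_group + net->full_groups;
    net->max_elems--;
  }
}

/* Hypothetical hand-written model of the "second best" shape:
   keep a single counter live in the loop and derive the stored
   value from it after the loop.  Names are illustrative only. */
void marc_arcs_derived(network_t *net)
{
  long fg = net->full_groups;
  if (fg >= 0)
    return;                       /* loop body never runs */

  long step = net->nr_group;      /* plays the role of _1 */
  /* Plays the role of ivtmp: starts at max_elems - 1. */
  unsigned long iv = (unsigned long) net->max_elems - 1;

  do {
    fg += step;                   /* _2 = _1 + _13 */
    iv -= 1;                      /* the only counter update in the loop */
  } while (fg < 0);

  net->full_groups = fg;
  net->max_elems = (long) (iv + 1);   /* derive _4 from the IV by adding 1 */
}
```

With nr_group = 3, full_groups = -10 and max_elems = 100, both functions run four iterations and leave full_groups = 2 and max_elems = 96.  The loop body of the second version has no counterpart to the mv, at the cost of one add after the loop.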