From: "linkw at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/100328] New: IRA doesn't model dup num constraint well
Date: Thu, 29 Apr 2021 06:41:51 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

            Bug ID: 100328
           Summary: IRA doesn't model dup num constraint well
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

source: function LBM_performStreamCollideTRT in SPEC2017 519.lbm_r

This issue was exposed by O2 vectorization enablement evaluation on 519.lbm_r.

baseline option: -O2 -mcpu=power9 -ffast-math
test option: -O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap

With the test option the ratio degrades by 1.66% against the baseline (by
1.74% without the very-cheap cost model).

The hotspot LBM_performStreamCollideTRT isn't vectorized at all, but the
if-conversion pre-pass of vectorization exposes the issue. If-conversion uses
the newly copied loop as the scalar version after loop versioning, so once
vectorization fails we end up with one loop that differs slightly from the
original. The difference mainly comes from:

1) Different basic block placement. For this function, the fall through BB and
the branch BB are switched. The reason is that the BBs of the newly copied loop
are adjusted in dom_order, while the idom insertion order changes when the idom
is set during copying. Anyway, it's acceptable.

2) SSA name differences. The newly copied loop can reuse some discarded
SSA_names, so the gimple canonicalization of commutative operands changes some
operand orders.

I did some hacking to filter out the fall through/branch BB difference; the gap
becomes smaller but does not disappear. The remaining differences in gimple are
the operand orders mentioned above. The differences in the assembly file are
some different insn choices, mainly on fma style insns. One remarkable
difference is the number of register copies:

  fmr + xxlor: 16 (baseline) vs 21 (test respecting fall through)

In this function, there are many FMA style expressions (27 FMA, 19 FMS, 11
FNMA).
Their VSX_REG versions are destructive, and the define_insns look like:

(define_insn "*nfma4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
        (neg:SFDF
         (fma:SFDF
          (match_operand:SFDF 1 "gpc_reg_operand" ",wa,wa")
          (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
          (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa"))))]
  "TARGET_HARD_FLOAT"
  "@
   fnmadd %0,%1,%2,%3
   xsnmaddap %x0,%x1,%x2
   xsnmaddmp %x0,%x1,%x3"
  [(set_attr "type" "fp")
   (set_attr "isa" "*,,")])

(define_insn "*fms4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
        (fma:SFDF
         (match_operand:SFDF 1 "gpc_reg_operand" ",wa,wa")
         (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
         (neg:SFDF (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa"))))]
  ...

(define_insn "*fma4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
        (fma:SFDF
         (match_operand:SFDF 1 "gpc_reg_operand" "%,wa,wa")
         (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
         (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa")))]
  ...

The 1st alternative uses class FLOAT_REG, which is a subset of VSX_REG (there
are 64 VSX registers in total and the fp registers share the first 32), so in
most cases the preferred rclass for these insns is VSX_REG.

Assume we have the expression:

  FMA A,B,C,D

If these four registers are all different, the insn cannot match the
alternatives that have a duplicated number constraint. It then has to use the
remaining (1st) alternative, and if any of the registers is not one of the low
32 vsx registers (so it can't fit in fp), we have to generate a register copy
from a vsx register (high number vsx reg) to an fp register (low number vsx
reg).

How does the commutative operand order affect this?

IRA tries to create copies for register coalescing. For the FMA expression
above, assuming both B and C are dead at the current insn, it creates copies
for A/B and A/C. Later, during thread forming, if A/B and A/C have the same
freq, the copy with the lower copy number comes first.
This means the operand order can affect how we form the thread: a different
pulled-in allocno will probably produce a different conflict set, which further
affects the global thread forming and the final assignment.

But I think the root cause is that when we create copies for these fma style
insns, IRA doesn't fully consider the duplicate number constraint. For example,
for FMS, if operands 1, 2 and 3 are all dead, both 2 and 3 should take higher
priority in the copy queue.

I noticed that there is a function ira_get_dup_out_num, which is meant to
create this kind of copy, but the code below appears to refuse to create one if
there is an alternative with a valid regclass that is not likely to be spilled.

        default:
          {
            enum constraint_num cn = lookup_constraint (str);
            enum reg_class cl = reg_class_for_constraint (cn);
            if (cl != NO_REGS
                && !targetm.class_likely_spilled_p (cl))
              goto fail;
            ...

Is there some particular reason for this behavior?