From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id E648F394FC21; Thu, 29 Apr 2021 06:41:51 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E648F394FC21
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/100328] New: IRA doesn't model dup num
 constraint well
Date: Thu, 29 Apr 2021 06:41:51 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: linkw at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-100328-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Apr 2021 06:41:52 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

            Bug ID: 100328
           Summary: IRA doesn't model dup num constraint well
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

source: function LBM_performStreamCollideTRT in SPEC2017 519.lbm_r

This issue was exposed while evaluating the enablement of vectorization at
-O2 on 519.lbm_r.

baseline option: -O2 -mcpu=power9 -ffast-math

test option: -O2 -mcpu=power9 -ffast-math -ftree-vectorize
             -fvect-cost-model=very-cheap

With the test option, the ratio degrades by 1.66% against the baseline (by
1.74% without the very-cheap cost model).

The hotspot LBM_performStreamCollideTRT isn't vectorized at all, but the
if-conversion pre-pass of vectorization exposes the issue. After loop
versioning, if-conversion uses the newly copied loop as the scalar version;
once vectorization fails, we end up with one loop that differs slightly from
the original one.

The difference mainly comes from:

1) Different basic block placement. For this function, the fall-through BB
and the branch BB are switched. The reason is that the BBs of the newly
copied loop are arranged in dom_order, while the idom insertion order changes
when the idom is set during copying. Anyway, it's acceptable.

2) SSA name differences. The newly copied loop can reuse some discarded
SSA_names, so the gimple canonicalization of commutative operands changes the
order of some operands.
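
To illustrate point 2, here is a toy Python model, not GCC code. It assumes
that a commutative operation is canonicalized by putting the operand with the
lower SSA version number first; the names and version numbers are made up.

```python
def canonicalize(a, b):
    """Order two (name, version) operands by ascending SSA version.

    Assumption: canonical form puts the lower version number first.
    """
    return (a, b) if a[1] <= b[1] else (b, a)


# Original loop: x_10 and y_20 keep their source order.
original = canonicalize(("x", 10), ("y", 20))

# Copied loop: the copy of x gets a fresh version (35), while the copy of y
# happens to reuse a discarded version (7), so the operands are swapped.
copied = canonicalize(("x", 35), ("y", 7))

print([name for name, _ in original])  # ['x', 'y']
print([name for name, _ in copied])    # ['y', 'x']
```

The computation is the same in both loops, but later passes see the operands
in a different order.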

I hacked the compiler to filter out the fall-through/branch BB difference;
the gap becomes smaller, but some of it remains. The remaining differences in
gimple are the operand orders mentioned above. The differences in the assembly
file are some different insn choices, mainly on FMA-style insns. One
remarkable difference is the number of register copies:

  fmr + xxlor: 16 (baseline) vs 21 (test respecting fall through)

In this function, there are many FMA-style expressions (27 FMA, 19 FMS, 11
FNMA). The VSX_REG versions of these insns are destructive, and the
define_insns look like:

(define_insn "*nfma<mode>4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
        (neg:SFDF
         (fma:SFDF
          (match_operand:SFDF 1 "gpc_reg_operand" "<Ff>,wa,wa")
          (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
          (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa"))))]
  "TARGET_HARD_FLOAT"
  "@
   fnmadd<s> %0,%1,%2,%3
   xsnmadda<sd>p %x0,%x1,%x2
   xsnmaddm<sd>p %x0,%x1,%x3"
  [(set_attr "type" "fp")
   (set_attr "isa" "*,<Fisa>,<Fisa>")])

(define_insn "*fms<mode>4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
        (fma:SFDF
         (match_operand:SFDF 1 "gpc_reg_operand" "<Ff>,wa,wa")
         (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
         (neg:SFDF (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa"))))]
...

(define_insn "*fma<mode>4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
        (fma:SFDF
          (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
          (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
          (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
...


Since the 1st alternative uses class FLOAT_REG, which is a subset of VSX_REG
(VSX_REG has 64 registers in total and fp shares the first 32), in most cases
the preferred rclass for these insns is VSX_REG. Assume we have the
expression:

  FMA A,B,C,D

If these four registers are all different, the insn cannot match the
alternatives with the duplicated number constraint. If it then falls back to
the remaining (1st) alternative and one of the registers isn't among the low
32 VSX registers (so it can't fit in fp), we have to generate a register copy
from the VSX register (high-numbered VSX reg) to an fp register (low-numbered
VSX reg).
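
The matching rules above can be sketched in Python. This is only a model of
the three alternatives shown in the define_insns, not GCC code; the function
name and the numbering of hard registers as 0..63 are assumptions made for
illustration.

```python
FPR = range(0, 32)   # FLOAT_REG: the low 32 VSX registers
VSX = range(0, 64)   # VSX_REG: all 64 registers


def matching_alternatives(a, b, c, d):
    """Return the alternatives satisfied by hard registers a, b, c, d
    assigned to operands 0..3 of an FMA-style insn."""
    ops = (a, b, c, d)
    alts = []
    if all(r in FPR for r in ops):              # "<Ff>,<Ff>,<Ff>,<Ff>"
        alts.append(0)
    if all(r in VSX for r in ops) and d == a:   # "wa,wa,wa,0"
        alts.append(1)
    if all(r in VSX for r in ops) and c == a:   # "wa,wa,0,wa"
        alts.append(2)
    return alts


print(matching_alternatives(1, 2, 3, 4))     # [0]
print(matching_alternatives(40, 2, 3, 40))   # [1]
print(matching_alternatives(40, 2, 40, 4))   # [2]
print(matching_alternatives(40, 2, 3, 4))    # []
```

The last call is the problematic case: four distinct registers with one in
the high half match no alternative, so a register copy is needed.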

How does the commutative operand order affect this?

IRA tries to create copies for register coalescing. For the FMA expression
above, assuming both B and C are dead at the current insn, it will have copies
for A/B and A/C. Later, when it forms threads, if A/B and A/C have the same
freq, the copy with the lower number comes first. This means the operand order
can affect how we form the thread: a different pulled-in allocno will probably
produce a different conflict set, which further affects the global thread
forming and the final assignment.
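
A small Python model of this effect follows. It is a simplified sketch, not
IRA's implementation: copies are (num, freq, x, y) tuples, they are visited by
descending freq and then ascending copy number, and two threads are merged
unless some pair of their members conflicts. It assumes B and C conflict with
each other because both are live before the insn.

```python
def form_threads(copies, conflicts):
    """Greedy thread forming over (num, freq, x, y) copies."""
    thread_of = {}
    for num, freq, x, y in sorted(copies, key=lambda c: (-c[1], c[0])):
        tx = thread_of.setdefault(x, frozenset([x]))
        ty = thread_of.setdefault(y, frozenset([y]))
        if tx == ty:
            continue
        # Refuse to merge if any member of one thread conflicts with
        # any member of the other.
        if any(frozenset((p, q)) in conflicts for p in tx for q in ty):
            continue
        merged = tx | ty
        for member in merged:
            thread_of[member] = merged
    return thread_of


conflicts = {frozenset(("B", "C"))}

# Copy numbers follow operand order, so swapping B and C swaps the numbers.
order_bc = [(0, 100, "A", "B"), (1, 100, "A", "C")]
order_cb = [(0, 100, "A", "C"), (1, 100, "A", "B")]

print(sorted(form_threads(order_bc, conflicts)["A"]))  # ['A', 'B']
print(sorted(form_threads(order_cb, conflicts)["A"]))  # ['A', 'C']
```

With equal freq, the copy created first wins, so A ends up in a thread with
whichever of B and C came first in the operand order.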

But I think the root cause is that when we create copies for these FMA-style
insns, IRA doesn't fully consider the duplicate number constraint. For
example, for FMS, if operands 1, 2 and 3 are all dead, both 2 and 3 should
take higher priority in the copy queue.
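
As a rough sketch of the ordering I have in mind (Python, hypothetical helper
names, not a patch): find the input operands that have a matching constraint
naming the output in some alternative, and put them ahead of the other dead
operands.

```python
def tied_operands(constraints, out=0):
    """Return input operand numbers that have an alternative which is a
    matching ("dup num") constraint naming output operand `out`."""
    tied = []
    for opno, cons in sorted(constraints.items()):
        alts = cons.lstrip("%").split(",")
        if str(out) in alts:
            tied.append(opno)
    return tied


# Input operand constraints of *fms<mode>4_fpr as quoted above.
fms = {1: "<Ff>,wa,wa", 2: "<Ff>,wa,0", 3: "<Ff>,0,wa"}

dead = [1, 2, 3]
tied = tied_operands(fms)
# Operands tied to the output come first; ties keep operand order.
copy_order = sorted(dead, key=lambda op: (op not in tied, op))

print(tied)        # [2, 3]
print(copy_order)  # [2, 3, 1]
```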

I noticed that there is a function, ira_get_dup_out_num, which is meant to
create this kind of copy, but the code below appears to refuse to create one
if there is an alternative whose regclass is valid and not likely to be
spilled.

              default:
                {
                  enum constraint_num cn = lookup_constraint (str);
                  enum reg_class cl = reg_class_for_constraint (cn);
                  if (cl != NO_REGS
                      && !targetm.class_likely_spilled_p (cl))
                    goto fail;

                 ...

Is there some particular reason for this behavior?
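
For reference, the effect of the quoted check on the FMS operands can be
modelled in Python as below. This models only the quoted fragment plus the
handling of matching digits; it ignores everything else the real function
does. The constraint table and the empty likely-spilled set are assumptions.

```python
# Hypothetical table: constraint -> register class name.
REG_CLASS = {"<Ff>": "FLOAT_REGS", "wa": "VSX_REGS"}
# Assumption: neither class is reported as likely spilled.
LIKELY_SPILLED = set()


def dup_out_num(constraint):
    """Return the operand number an input is tied to, or -1 if the
    modelled check rejects it."""
    original = -1
    for alt in constraint.lstrip("%").split(","):
        if alt.isdigit():
            if original not in (-1, int(alt)):
                return -1      # tied to two different operands
            original = int(alt)
        else:
            cl = REG_CLASS.get(alt)
            if cl is not None and cl not in LIKELY_SPILLED:
                return -1      # corresponds to the quoted "goto fail"
    return original


print(dup_out_num("<Ff>,wa,0"))   # -1
print(dup_out_num("<Ff>,0,wa"))   # -1
print(dup_out_num("0"))           # 0
```

Under these assumptions operands 2 and 3 of FMS are both rejected, because
each has alternatives with an ordinary register class next to the "0"
alternative.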