From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by sourceware.org (Postfix, from userid 48) id 3A6C53857353; Tue, 10 May 2022 20:37:38 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3A6C53857353
From: "bergner at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/105556] New: RA assigns an MMA vector input operand to vs0-vs31 causing an MMA accumulator to be spilled
Date: Tue, 10 May 2022 20:37:38 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: bergner at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list
X-List-Received-Date: Tue, 10 May 2022 20:37:38 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105556

            Bug ID: 105556
           Summary: RA assigns an MMA vector input operand to vs0-vs31 causing
                    an MMA accumulator to be spilled
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bergner at gcc dot gnu.org
  Target Milestone: ---

With current trunk and GCC 12, the MMA optimized dgemm kernel in OpenBLAS is
seeing a performance regression compared to GCC 11 and GCC 10.
The problem is that the core loop in dgemm uses 8 accumulator variables, which
want to use all 8 accumulator registers. Using the 8 accumulators means we
should not use the vs0 thru vs31 vector registers for the MMA instruction's
normal vector input operands. However with trunk and GCC 12, the register
allocator is assigning one vector input to one of the vs0-vs31 registers
leading us to spill one of the accumulators and that causes a bad performance
loss. The trunk and GCC 12 asm for the core loop looks like:

.L5:
        lxvp 0,0(10)
        lxv 40,0(9)
        addi 10,10,64
        addi 9,9,64
        lxv 41,-48(9)
        lxv 42,-32(9)
        lxv 43,-16(9)
        lxvp 2,32(1)
        lxvp 32,-32(10)
        xvf64gerpp 4,0,40
        xvf64gerpp 6,0,41
        xvf64gerpp 3,0,42
        xvf64gerpp 2,0,43
        lxvp 0,64(1)
        xvf64gerpp 5,32,40
        xvf64gerpp 7,32,41
        xvf64gerpp 1,32,42
        xxmtacc 0
        xvf64gerpp 0,32,43
        xxmfacc 0
        stxvp 2,32(1)
        stxvp 0,64(1)
        bdnz .L5

Note the use of vs0 in the MMA instructions which forces the spilling of ACC0.
The "better" GCC 11 and GCC 10 code looks like:

.L5:
        lxvp 44,0(10)
        lxvp 32,32(10)
        addi 9,9,64
        addi 10,10,64
        lxv 39,-64(9)
        lxv 40,-48(9)
        lxv 41,-32(9)
        lxv 42,-16(9)
        xvf64gerpp 4,44,39
        xvf64gerpp 5,32,39
        xvf64gerpp 6,44,40
        xvf64gerpp 7,32,40
        xvf64gerpp 3,44,41
        xvf64gerpp 1,32,41
        xvf64gerpp 2,44,42
        xvf64gerpp 0,32,42
        bdnz .L5
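
To make the register overlap concrete, here is a small Python sketch (not part of GCC or OpenBLAS) that scans the xvf64gerpp lines from the two loops in this report and flags any vector input that lands on a VSR backing a live accumulator. It assumes the Power ISA 3.1 layout in which accumulator ACCn overlays vs(4n) through vs(4n+3), and that xvf64gerpp reads a VSR pair as its first input and a single VSR as its second. The helper names `acc_of_vsr` and `find_conflicts` are made up for illustration, and treating every targeted accumulator as live for the whole loop is a simplification.

```python
import re

# Assumption (Power ISA 3.1 MMA layout): accumulator ACCn overlays VSRs
# 4n..4n+3, so ACC0-ACC7 together cover vs0-vs31.
def acc_of_vsr(vsr):
    """Return the accumulator number a VSR overlaps, or None for vs32-vs63."""
    return vsr // 4 if vsr < 32 else None

GER = re.compile(r"xvf64gerpp\s+(\d+),(\d+),(\d+)")

def find_conflicts(asm):
    """Return sorted (acc, vsr) pairs where an input VSR overlaps a live ACC.

    Simplification: every accumulator that appears as an xvf64gerpp target
    is treated as live across the whole loop body.
    """
    ops = [tuple(int(g) for g in m.groups()) for m in GER.finditer(asm)]
    live = {acc for acc, _, _ in ops}
    conflicts = set()
    for _, xa, xb in ops:
        # First input is a VSR pair (xa, xa+1); second is a single VSR (xb).
        for vsr in (xa, xa + 1, xb):
            acc = acc_of_vsr(vsr)
            if acc in live:
                conflicts.add((acc, vsr))
    return sorted(conflicts)

# xvf64gerpp lines copied from the trunk / GCC 12 loop in this report.
GCC12_LOOP = """
xvf64gerpp 4,0,40
xvf64gerpp 6,0,41
xvf64gerpp 3,0,42
xvf64gerpp 2,0,43
xvf64gerpp 5,32,40
xvf64gerpp 7,32,41
xvf64gerpp 1,32,42
xvf64gerpp 0,32,43
"""

# xvf64gerpp lines copied from the GCC 11 / GCC 10 loop in this report.
GCC11_LOOP = """
xvf64gerpp 4,44,39
xvf64gerpp 5,32,39
xvf64gerpp 6,44,40
xvf64gerpp 7,32,40
xvf64gerpp 3,44,41
xvf64gerpp 1,32,41
xvf64gerpp 2,44,42
xvf64gerpp 0,32,42
"""

if __name__ == "__main__":
    print("GCC 12:", find_conflicts(GCC12_LOOP))  # [(0, 0), (0, 1)]
    print("GCC 11:", find_conflicts(GCC11_LOOP))  # []
```

For the GCC 12 loop the sketch reports vs0 and vs1 (the pair loaded by "lxvp 0,0(10)") as overlapping ACC0, which matches the xxmtacc/xxmfacc and stack traffic around ACC0 in that listing. The GCC 11 loop keeps all inputs in vs32 and above, so nothing is flagged.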