From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 4D2583858426; Wed,  8 Feb 2023 03:11:31 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4D2583858426
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1675825891;
	bh=IfLfP9oMMDseDvM6jns72Xv3sTCStWFE/9M3dFJQspU=;
	h=From:To:Subject:Date:From;
	b=rCbG6vWbLgDlX6qxGVBOKSHRC45qDK9/c8rXeBQIHRSTYvloc6sRk/gPs928jjsGo
	 R6HLxiu1gD4w/QtKL7k2VVl8NnuuF+wrtslvBZ4YpOlzafMbx72jI4CXfjV4Y7E7OS
	 1knN+Jy1u43MhFzZyG8wXjhEr49HSO1TU4vGlxDA=
From: "crazylht at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/108707] New: suboptimal allocation with same
 memory op for many different instructions.
Date: Wed, 08 Feb 2023 03:11:30 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: crazylht at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-108707-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108707

            Bug ID: 108707
           Summary: suboptimal allocation with same memory op for many
                    different instructions.
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
  Target Milestone: ---

#include <immintrin.h>

void
foo (__m512* pv, float* __restrict ps, int n, __m512* pdest,
     __m512* p1, __m512* p2, __m512* p3)
{
    __m512 a = _mm512_setzero_ps ();
    __m512 b = a;
    __m512 c = a;
    for (int i = 0; i != n; i++)
    {
        a = _mm512_fmadd_ps (p1[i], pv[i], a);
        b = _mm512_fmadd_ps (p2[i], pv[i], b);
        c = _mm512_fmadd_ps (p3[i], pv[i], c);
    }
    pdest[0] = a;
    pdest[1] = b;
    pdest[2] = c;
}

g++ -O2 -mavx512f -S

got

.L3:
        vmovaps (%r8,%rax), %zmm3
        vmovaps (%r9,%rax), %zmm4
        vmovaps (%rsi,%rax), %zmm5
        vfmadd231ps     (%rdi,%rax), %zmm3, %zmm2
        vfmadd231ps     (%rdi,%rax), %zmm4, %zmm1
        vfmadd231ps     (%rdi,%rax), %zmm5, %zmm0
        addq    $64, %rax
        cmpq    %rax, %rdx
        jne     .L3

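The loop loads (%rdi,%rax) three times, once as the memory operand of each
vfmadd231ps. It would be better to load (%rdi,%rax) into a zmm register once
and reuse it, which cuts the loads per iteration from six to four: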

.L3:
        vmovaps (%rdi,%rax), %zmm0
        vfmadd231ps     (%r8,%rax), %zmm0, %zmm3
        vfmadd231ps     (%r9,%rax), %zmm0, %zmm2
        vfmadd231ps     (%rsi,%rax), %zmm0, %zmm1
        addq    $64, %rax
        cmpq    %rax, %rdx
        jne     .L3
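
For illustration only, the dataflow requested above (read the shared pv[i]
operand once per iteration and feed it to all three multiply-adds) can be
sketched in portable scalar C. This is a rough stand-in, not the testcase:
float replaces __m512 so it runs without AVX-512, the name foo_scalar and the
temporary v are hypothetical, and the unused ps parameter is dropped. It is
not expected to change what GCC generates for the intrinsic version.

```c
#include <stddef.h>

/* Hypothetical scalar analogue of foo(): one float stands in for one
   __m512 vector.  Only the load/reuse pattern is modelled here.  */
void
foo_scalar (const float *pv, size_t n, float *pdest,
            const float *p1, const float *p2, const float *p3)
{
    float a = 0.0f;
    float b = 0.0f;
    float c = 0.0f;
    for (size_t i = 0; i != n; i++)
    {
        float v = pv[i];    /* single load of the shared operand */
        a += p1[i] * v;     /* p1[i], p2[i], p3[i] are each read once */
        b += p2[i] * v;
        c += p3[i] * v;
    }
    pdest[0] = a;
    pdest[1] = b;
    pdest[2] = c;
}
```

Per iteration this reads four values (v, p1[i], p2[i], p3[i]), matching the
four memory accesses in the preferred loop above.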