From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Thu, 25 Jan 2024 09:05:44 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
>
> --- Comment #5 from JuzheZhong ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
>
> https://godbolt.org/z/MjbTbYf1G
>
> But I am not sure whether X86 ICC or X86 Clang is better.

gather/scatter are possibly slow (and gather now has that Intel security
issue).  The reason is a "cost" one:

t.c:47:21: note:  ==> examining statement: _4 = *_3;
t.c:47:21: missed:  no array mode for V8DF[20]
t.c:47:21: missed:  no array mode for V8DF[20]
t.c:47:21: missed:  the size of the group of accesses is not a power of 2 or not equal to 3
t.c:47:21: missed:  not falling back to elementwise accesses
t.c:58:15: missed:  not vectorized: relevant stmt not supported: _4 = *_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters, we don't attempt to
SLP either.

Disabling the above heuristic, we get this vectorized as well (see the
assembly below), avoiding gather/scatter by implementing them manually
and using a quite high VF of 8 (with -mprefer-vector-width=256 you get
VF 4 and likely faster code in the end).  But yes, I doubt that either
the ICC or the clang vectorized code is faster anywhere (and without
specifying a uarch you only get some generic cost modelling applied).
Maybe SPR doesn't have the gather bug and does have reasonable gather
and scatter (zen4 scatter sucks).
.L3:
        vmovsd  952(%rax), %xmm0
        vmovsd  -8(%rax), %xmm2
        addq    $1280, %rsi
        addq    $1280, %rax
        vmovhpd -168(%rax), %xmm0, %xmm1
        vmovhpd -1128(%rax), %xmm2, %xmm2
        vmovsd  -648(%rax), %xmm0
        vmovhpd -488(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
        vmovsd  -968(%rax), %xmm1
        vmovhpd -808(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
        vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
        vmovsd  -320(%rax), %xmm0
        vmovhpd -160(%rax), %xmm0, %xmm1
        vmovsd  -640(%rax), %xmm0
        vmovhpd -480(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
        vmovsd  -960(%rax), %xmm0
        vmovhpd -800(%rax), %xmm0, %xmm8
        vmovsd  -1280(%rax), %xmm0
        vmovhpd -1120(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
        vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovsd  -312(%rax), %xmm1
        vmovhpd -152(%rax), %xmm1, %xmm8
        vmovsd  -632(%rax), %xmm1
        vmovhpd -472(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
        vmovsd  -952(%rax), %xmm1
        vmovhpd -792(%rax), %xmm1, %xmm9
        vmovsd  -1272(%rax), %xmm1
        vmovhpd -1112(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vaddpd  %zmm7, %zmm2, %zmm1
        vfnmadd132pd    %zmm3, %zmm2, %zmm1
        vfmadd132pd     %zmm6, %zmm5, %zmm0
        valignq $3, %ymm1, %ymm1, %ymm2
        vmovlpd %xmm1, -1280(%rsi)
        vextractf64x2   $1, %ymm1, %xmm8
        vmovhpd %xmm1, -1120(%rsi)
        vextractf64x4   $0x1, %zmm1, %ymm1
        vmovlpd %xmm1, -640(%rsi)
        vmovhpd %xmm1, -480(%rsi)
        vmovsd  %xmm2, -800(%rsi)
        vextractf64x2   $1, %ymm1, %xmm2
        vmovsd  %xmm8, -960(%rsi)
        valignq $3, %ymm1, %ymm1, %ymm1
        vmovsd  %xmm2, -320(%rsi)
        vmovsd  %xmm1, -160(%rsi)
        vmovsd  -320(%rax), %xmm1
        vmovhpd -160(%rax), %xmm1, %xmm2
        vmovsd  -640(%rax), %xmm1
        vmovhpd -480(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
        vmovsd  -960(%rax), %xmm1
        vmovhpd -800(%rax), %xmm1, %xmm8
        vmovsd  -1280(%rax), %xmm1
        vmovhpd -1120(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
        vfnmadd132pd    %zmm3, %zmm1, %zmm0
        vaddpd  %zmm4, %zmm0, %zmm0
        valignq $3, %ymm0, %ymm0, %ymm1
        vmovlpd %xmm0, 14728(%rsi)
        vextractf64x2   $1, %ymm0, %xmm2
        vmovhpd %xmm0, 14888(%rsi)
        vextractf64x4   $0x1, %zmm0, %ymm0
        vmovlpd %xmm0, 15368(%rsi)
        vmovhpd %xmm0, 15528(%rsi)
        vmovsd  %xmm1, 15208(%rsi)
        vextractf64x2   $1, %ymm0, %xmm1
        vmovsd  %xmm2, 15048(%rsi)
        valignq $3, %ymm0, %ymm0, %ymm0
        vmovsd  %xmm1, 15688(%rsi)
        vmovsd  %xmm0, 15848(%rsi)
        cmpq    %rdx, %rsi
        jne     .L3
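
The element-insert sequence above is the "manual" gather: pairs of
stride-20 scalar loads (vmovsd + vmovhpd) are packed into 128-bit lanes,
then merged into 256-bit and finally 512-bit registers (vinsertf32x4,
vinsertf64x4); the store side open-codes the scatter the same way with
extracts.  As a rough illustration only (the helper name and the use of
intrinsics are hypothetical, not GCC output), the load side corresponds
to something like:

#include <immintrin.h>

/* Sketch of the open-coded gather in the loop above: collect eight
   doubles that sit 20 elements (160 bytes) apart into one zmm register
   using scalar loads and lane inserts instead of a vgatherdpd
   instruction.  Stride and name are illustrative only.  */
static inline __m512d
gather_stride20 (const double *p)
{
  /* Two stride-20 elements per 128-bit lane (vmovsd + vmovhpd).  */
  __m128d q0 = _mm_loadh_pd (_mm_load_sd (p +   0), p +  20);
  __m128d q1 = _mm_loadh_pd (_mm_load_sd (p +  40), p +  60);
  __m128d q2 = _mm_loadh_pd (_mm_load_sd (p +  80), p + 100);
  __m128d q3 = _mm_loadh_pd (_mm_load_sd (p + 120), p + 140);

  /* Merge the 128-bit lanes into 256-bit halves ...  */
  __m256d lo = _mm256_insertf128_pd (_mm256_castpd128_pd256 (q0), q1, 1);
  __m256d hi = _mm256_insertf128_pd (_mm256_castpd128_pd256 (q2), q3, 1);

  /* ... and the halves into the full 512-bit vector.  */
  return _mm512_insertf64x4 (_mm512_castpd256_pd512 (lo), hi, 1);
}

Built with -O2 -mavx512f this should expand to roughly the same
vmovsd/vmovhpd/vinsert* pattern; the cost question is whether eight
scalar loads plus the inserts beat a hardware gather on a given uarch.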
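
For context on where the stride of 20 comes from: lbm keeps 20 doubles
per lattice cell, so reading one entry of every cell is a constant-stride
load and writing one entry of every (neighbour) cell is a constant-stride
store.  A reduced, hypothetical kernel with that shape (not the t.c
quoted in the dump above) looks roughly like:

#define CELL_ENTRIES 20   /* doubles per lattice cell, as in 519.lbm */

/* Hypothetical reduced example: each iteration reads and writes a few
   entries of a 20-double cell, so every access has a known constant
   stride of 20 doubles.  The vectorizer models the loads as strided
   loads (gathers) and the stores as strided stores (scatters).  */
void
stream_collide (double *restrict dst, const double *restrict src,
                long ncells)
{
  for (long i = 0; i < ncells; i++)
    {
      double c = src[i * CELL_ENTRIES + 0];
      double n = src[i * CELL_ENTRIES + 1];
      double s = src[i * CELL_ENTRIES + 2];

      dst[i * CELL_ENTRIES + 0] = c - 0.1 * (c - 0.5 * (n + s));
      dst[i * CELL_ENTRIES + 1] = n - 0.1 * (n - c);
      dst[i * CELL_ENTRIES + 2] = s - 0.1 * (s - c);
    }
}

Whether such a cut-down loop hits exactly the same "no array mode for
V8DF[20]" messages depends on how many of the 20 entries are touched,
but it shows the constant-stride access pattern the cost heuristic
discussed above applies to.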