From: "rguenther at suse dot de"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Thu, 25 Jan 2024 09:05:44 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
>
> --- Comment #5 from JuzheZhong ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
>
> https://godbolt.org/z/MjbTbYf1G
>
> But I am not sure whether X86 ICC or X86 Clang is better.

gather/scatter are possibly slow (and gather now has that Intel security
issue).  The reason is a "cost" one:

t.c:47:21: note:  ==> examining statement: _4 = *_3;
t.c:47:21: missed:  no array mode for V8DF[20]
t.c:47:21: missed:  no array mode for V8DF[20]
t.c:47:21: missed:  the size of the group of accesses is not a power of 2 or not equal to 3
t.c:47:21: missed:  not falling back to elementwise accesses
t.c:58:15: missed:  not vectorized: relevant stmt not supported: _4 = *_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters, we don't attempt to
SLP either.

Disabling the above heuristic, we get this vectorized as well (see the
assembly below), avoiding gather/scatter by implementing them manually
and using a quite high VF of 8 (with -mprefer-vector-width=256 you get
VF 4 and likely faster code in the end).  But yes, I doubt that either
the ICC or the clang vectorized code is faster anywhere (and without
specifying a uarch you only get some generic cost modelling applied).
Maybe SPR doesn't have the gather bug and does have reasonable gather
and scatter (zen4 scatter sucks).
.L3:
        vmovsd  952(%rax), %xmm0
        vmovsd  -8(%rax), %xmm2
        addq    $1280, %rsi
        addq    $1280, %rax
        vmovhpd -168(%rax), %xmm0, %xmm1
        vmovhpd -1128(%rax), %xmm2, %xmm2
        vmovsd  -648(%rax), %xmm0
        vmovhpd -488(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
        vmovsd  -968(%rax), %xmm1
        vmovhpd -808(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
        vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
        vmovsd  -320(%rax), %xmm0
        vmovhpd -160(%rax), %xmm0, %xmm1
        vmovsd  -640(%rax), %xmm0
        vmovhpd -480(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
        vmovsd  -960(%rax), %xmm0
        vmovhpd -800(%rax), %xmm0, %xmm8
        vmovsd  -1280(%rax), %xmm0
        vmovhpd -1120(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
        vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovsd  -312(%rax), %xmm1
        vmovhpd -152(%rax), %xmm1, %xmm8
        vmovsd  -632(%rax), %xmm1
        vmovhpd -472(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
        vmovsd  -952(%rax), %xmm1
        vmovhpd -792(%rax), %xmm1, %xmm9
        vmovsd  -1272(%rax), %xmm1
        vmovhpd -1112(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vaddpd  %zmm7, %zmm2, %zmm1
        vfnmadd132pd    %zmm3, %zmm2, %zmm1
        vfmadd132pd     %zmm6, %zmm5, %zmm0
        valignq $3, %ymm1, %ymm1, %ymm2
        vmovlpd %xmm1, -1280(%rsi)
        vextractf64x2   $1, %ymm1, %xmm8
        vmovhpd %xmm1, -1120(%rsi)
        vextractf64x4   $0x1, %zmm1, %ymm1
        vmovlpd %xmm1, -640(%rsi)
        vmovhpd %xmm1, -480(%rsi)
        vmovsd  %xmm2, -800(%rsi)
        vextractf64x2   $1, %ymm1, %xmm2
        vmovsd  %xmm8, -960(%rsi)
        valignq $3, %ymm1, %ymm1, %ymm1
        vmovsd  %xmm2, -320(%rsi)
        vmovsd  %xmm1, -160(%rsi)
        vmovsd  -320(%rax), %xmm1
        vmovhpd -160(%rax), %xmm1, %xmm2
        vmovsd  -640(%rax), %xmm1
        vmovhpd -480(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
        vmovsd  -960(%rax), %xmm1
        vmovhpd -800(%rax), %xmm1, %xmm8
        vmovsd  -1280(%rax), %xmm1
        vmovhpd -1120(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
        vfnmadd132pd    %zmm3, %zmm1, %zmm0
        vaddpd  %zmm4, %zmm0, %zmm0
        valignq $3, %ymm0, %ymm0, %ymm1
        vmovlpd %xmm0, 14728(%rsi)
        vextractf64x2   $1, %ymm0, %xmm2
        vmovhpd %xmm0, 14888(%rsi)
        vextractf64x4   $0x1, %zmm0, %ymm0
        vmovlpd %xmm0, 15368(%rsi)
        vmovhpd %xmm0, 15528(%rsi)
        vmovsd  %xmm1, 15208(%rsi)
        vextractf64x2   $1, %ymm0, %xmm1
        vmovsd  %xmm2, 15048(%rsi)
        valignq $3, %ymm0, %ymm0, %ymm0
        vmovsd  %xmm1, 15688(%rsi)
        vmovsd  %xmm0, 15848(%rsi)
        cmpq    %rdx, %rsi
        jne     .L3
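
The element-insert sequence above is the "manual" gather: pairs of
stride-20 scalar loads (vmovsd + vmovhpd) are packed into 128-bit lanes,
then merged into 256-bit and finally 512-bit registers (vinsertf32x4,
vinsertf64x4); the store side open-codes the scatter the same way with
extracts.  As a rough illustration only (the helper name and the use of
intrinsics are hypothetical, not GCC output), the load side corresponds
to something like:

#include <immintrin.h>

/* Sketch of the open-coded gather in the loop above: collect eight
   doubles that sit 20 elements (160 bytes) apart into one zmm register
   using scalar loads and lane inserts instead of a vgatherdpd
   instruction.  Stride and name are illustrative only.  */
static inline __m512d
gather_stride20 (const double *p)
{
  /* Two stride-20 elements per 128-bit lane (vmovsd + vmovhpd).  */
  __m128d q0 = _mm_loadh_pd (_mm_load_sd (p +   0), p +  20);
  __m128d q1 = _mm_loadh_pd (_mm_load_sd (p +  40), p +  60);
  __m128d q2 = _mm_loadh_pd (_mm_load_sd (p +  80), p + 100);
  __m128d q3 = _mm_loadh_pd (_mm_load_sd (p + 120), p + 140);

  /* Merge the 128-bit lanes into 256-bit halves ...  */
  __m256d lo = _mm256_insertf128_pd (_mm256_castpd128_pd256 (q0), q1, 1);
  __m256d hi = _mm256_insertf128_pd (_mm256_castpd128_pd256 (q2), q3, 1);

  /* ... and the halves into the full 512-bit vector.  */
  return _mm512_insertf64x4 (_mm512_castpd256_pd512 (lo), hi, 1);
}

Built with -O2 -mavx512f this should expand to roughly the same
vmovsd/vmovhpd/vinsert* pattern; the cost question is whether eight
scalar loads plus the inserts beat a hardware gather on a given uarch.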
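
For context on where the stride of 20 comes from: lbm keeps 20 doubles
per lattice cell, so reading one entry of every cell is a constant-stride
load and writing one entry of every (neighbour) cell is a constant-stride
store.  A reduced, hypothetical kernel with that shape (not the t.c
quoted in the dump above) looks roughly like:

#define CELL_ENTRIES 20   /* doubles per lattice cell, as in 519.lbm */

/* Hypothetical reduced example: each iteration reads and writes a few
   entries of a 20-double cell, so every access has a known constant
   stride of 20 doubles.  The vectorizer models the loads as strided
   loads (gathers) and the stores as strided stores (scatters).  */
void
stream_collide (double *restrict dst, const double *restrict src,
                long ncells)
{
  for (long i = 0; i < ncells; i++)
    {
      double c = src[i * CELL_ENTRIES + 0];
      double n = src[i * CELL_ENTRIES + 1];
      double s = src[i * CELL_ENTRIES + 2];

      dst[i * CELL_ENTRIES + 0] = c - 0.1 * (c - 0.5 * (n + s));
      dst[i * CELL_ENTRIES + 1] = n - 0.1 * (n - c);
      dst[i * CELL_ENTRIES + 2] = s - 0.1 * (s - c);
    }
}

Whether such a cut-down loop hits exactly the same "no array mode for
V8DF[20]" messages depends on how many of the 20 entries are touched,
but it shows the constant-stride access pattern the cost heuristic
discussed above applies to.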