From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 3299A3858D3C; Mon, 28 Nov 2022 07:41:06 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3299A3858D3C DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1669621266; bh=L2uIAQsLcFuo41mlmRaQBl9lx2NKlr+CM4b6cZFmq9g=; h=From:To:Subject:Date:From; b=Pl43HymPthnOkKNORvCAYY+pVctHtd/Ws2B6rDywz05sgfw71faptpk9JR63dee/W AV1xa2xtpFxdp5qPwCbcDK+r6G3WNFTOwdxco2pqDZXQUedCN6MQQH2HhT5RYrO/UQ yz6nF0Wopf15mSTTIPMZsDOO3RJ5gn+ZBLPxFNX8= From: "ebiggers3 at gmail dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic Date: Mon, 28 Nov 2022 07:41:05 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: rtl-optimization X-Bugzilla-Version: 13.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: ebiggers3 at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D107892 Bug ID: 107892 Summary: Unnecessary move between ymm registers in loop using AVX2 intrinsic Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ebiggers3 at gmail dot com Target Milestone: --- To reproduce with the latest trunk, compile the following .c file on x86_64= at -O2: #include int __attribute__((target("avx2"))) sum_ints(const __m256i *p, size_t n) { __m256i a =3D _mm256_setzero_si256(); __m128i b; do { a =3D _mm256_add_epi32(a, *p++); } while (--n); b =3D _mm_add_epi32(_mm256_extracti128_si256(a, 0), _mm256_extracti128_si256(a, 1)); b =3D _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31)); b =3D _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02)); return _mm_cvtsi128_si32(b); } The assembly that gcc generates is: 0000000000000000 : 0: c5 f1 ef c9 vpxor %xmm1,%xmm1,%xmm1 4: 0f 1f 40 00 nopl 0x0(%rax) 8: c5 f5 fe 07 vpaddd (%rdi),%ymm1,%ymm0 c: 48 83 c7 20 add $0x20,%rdi 10: c5 fd 6f c8 vmovdqa %ymm0,%ymm1 14: 48 83 ee 01 sub $0x1,%rsi 18: 75 ee jne 8 1a: c4 e3 7d 39 c1 01 vextracti128 $0x1,%ymm0,%xmm1 20: c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0 24: c5 f9 70 c8 31 vpshufd $0x31,%xmm0,%xmm1 29: c5 f1 fe c8 vpaddd %xmm0,%xmm1,%xmm1 2d: c5 f9 70 c1 02 vpshufd $0x2,%xmm1,%xmm0 32: c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0 36: c5 f9 7e c0 vmovd %xmm0,%eax 3a: c5 f8 77 vzeroupper 3d: c3 ret The bug is that the inner loop contains an unnecessary vmovdqa: 8: vpaddd (%rdi),%ymm1,%ymm0 add $0x20,%rdi vmovdqa %ymm0,%ymm1 sub $0x1,%rsi jne 8 It should look like the following instead: 8: vpaddd (%rdi),%ymm0,%ymm0 add $0x20,%rdi sub $0x1,%rsi jne 8 Strangely, the bug goes away if the __v8si type is used instead of __m256i = and the addition is done using "+=3D" instead of _mm256_add_epi32(): int __attribute__((target("avx2"))) sum_ints_good(const __v8si *p, size_t n) { __v8si a =3D {}; __m128i b; do { a +=3D *p++; } while (--n); b =3D _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0), _mm256_extracti128_si256((__m256i)a, 1)); b =3D _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31)); b =3D _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02)); return _mm_cvtsi128_si32(b); } In the bad version, I noticed that the RTL initially has two separate insns= for 'a +=3D *p': one to do the addition and write the result to a new pseudo register, and one to convert the value from mode V8SI to V4DI and assign it= to the original pseudo register. These two separate insns never get combined.= =20 (That sort of explains why the bug isn't seen with the __v8si and +=3D meth= od; gcc doesn't do a type conversion with that method.) So, I'm wondering if t= he bug is in the instruction combining pass. Or perhaps the RTL should never = have had two separate insns in the first place?=