From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 3299A3858D3C; Mon, 28 Nov 2022 07:41:06 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3299A3858D3C
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1669621266;
	bh=L2uIAQsLcFuo41mlmRaQBl9lx2NKlr+CM4b6cZFmq9g=;
	h=From:To:Subject:Date:From;
	b=Pl43HymPthnOkKNORvCAYY+pVctHtd/Ws2B6rDywz05sgfw71faptpk9JR63dee/W
	 AV1xa2xtpFxdp5qPwCbcDK+r6G3WNFTOwdxco2pqDZXQUedCN6MQQH2HhT5RYrO/UQ
	 yz6nF0Wopf15mSTTIPMZsDOO3RJ5gn+ZBLPxFNX8=
From: "ebiggers3 at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/107892] New: Unnecessary move between ymm
 registers in loop using AVX2 intrinsic
Date: Mon, 28 Nov 2022 07:41:05 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ebiggers3 at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-107892-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

            Bug ID: 107892
           Summary: Unnecessary move between ymm registers in loop using
                    AVX2 intrinsic
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:

        #include <immintrin.h>

        int __attribute__((target("avx2")))
        sum_ints(const __m256i *p, size_t n)
        {
                __m256i a = _mm256_setzero_si256();
                __m128i b;

                do {
                        a = _mm256_add_epi32(a, *p++);
                } while (--n);

                b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
                                  _mm256_extracti128_si256(a, 1));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
                return _mm_cvtsi128_si32(b);
        }

The assembly that gcc generates is:

        0000000000000000 <sum_ints>:
           0:   c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
           4:   0f 1f 40 00             nopl   0x0(%rax)
           8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
           c:   48 83 c7 20             add    $0x20,%rdi
          10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
          14:   48 83 ee 01             sub    $0x1,%rsi
          18:   75 ee                   jne    8 <sum_ints+0x8>
          1a:   c4 e3 7d 39 c1 01       vextracti128 $0x1,%ymm0,%xmm1
          20:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
          24:   c5 f9 70 c8 31          vpshufd $0x31,%xmm0,%xmm1
          29:   c5 f1 fe c8             vpaddd %xmm0,%xmm1,%xmm1
          2d:   c5 f9 70 c1 02          vpshufd $0x2,%xmm1,%xmm0
          32:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
          36:   c5 f9 7e c0             vmovd  %xmm0,%eax
          3a:   c5 f8 77                vzeroupper
          3d:   c3                      ret

The bug is that the inner loop contains an unnecessary vmovdqa:

           8:   vpaddd (%rdi),%ymm1,%ymm0
                add    $0x20,%rdi
                vmovdqa %ymm0,%ymm1
                sub    $0x1,%rsi
                jne    8 <sum_ints+0x8>

It should look like the following instead:

           8:   vpaddd (%rdi),%ymm0,%ymm0
                add    $0x20,%rdi
                sub    $0x1,%rsi
                jne    8 <sum_ints+0x8>

Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():

        int __attribute__((target("avx2")))
        sum_ints_good(const __v8si *p, size_t n)
        {
                __v8si a = {};
                __m128i b;

                do {
                        a += *p++;
                } while (--n);

                b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
                                  _mm256_extracti128_si256((__m256i)a, 1));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
                return _mm_cvtsi128_si32(b);
        }

In the bad version, I noticed that the RTL initially has two separate insns for
the accumulation step 'a = _mm256_add_epi32(a, *p++)': one to do the addition
and write the result to a new pseudo register, and one to convert the value
from mode V8SI to V4DI and assign it to the original pseudo register.  These
two separate insns never get combined.  (That sort of explains why the bug
isn't seen with the __v8si and += method; gcc doesn't do a type conversion with
that method.)  So, I'm wondering if the bug is in the instruction combining
pass.  Or perhaps the RTL should never have had two separate insns in the first
place?
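
For illustration, here is a rough sketch of the mode conversion described
above, written with GCC's generic vector extensions so that it does not depend
on <immintrin.h>.  It assumes that _mm256_add_epi32() amounts to casting both
operands to an 8 x 32-bit vector type, adding, and casting the result back to
__m256i, which is how I understand GCC's header to implement it.  The names
v4di, v8si, sum_via_v4di and sum_via_v8si are made up for this sketch; they
only stand in for __m256i, __v8si and the two functions above.  This is not a
new reproducer and I have not checked what code it generates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: v4di mirrors __m256i (4 x 64-bit lanes, may_alias),
   v8si mirrors __v8si (8 x 32-bit lanes). */
typedef long long v4di __attribute__((vector_size(32), may_alias));
typedef int v8si __attribute__((vector_size(32)));

/* Accumulator kept as v4di: each iteration adds in the v8si view and casts
   the result back, which is where the V8SI -> V4DI conversion comes from. */
int sum_via_v4di(const v4di *p, size_t n)
{
        v4di a = {0};
        v8si r;
        int total = 0;

        assert(n != 0);  /* the do-while loop requires at least one vector */
        do {
                a = (v4di)((v8si)a + (v8si)*p++);
        } while (--n);

        r = (v8si)a;
        for (int i = 0; i < 8; i++)
                total += r[i];
        return total;
}

/* Accumulator kept as v8si: no conversion inside the loop. */
int sum_via_v8si(const v8si *p, size_t n)
{
        v8si a = {0};
        int total = 0;

        assert(n != 0);
        do {
                a += *p++;
        } while (--n);

        for (int i = 0; i < 8; i++)
                total += a[i];
        return total;
}
```

Both functions compute the same sum; the only difference is the type of the
loop-carried accumulator, which is the difference between the bad and good
versions above.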