public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: ebiggers3 at gmail dot com @ 2022-11-28  7:41 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

            Bug ID: 107892
           Summary: Unnecessary move between ymm registers in loop using
                    AVX2 intrinsic
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:

        #include <immintrin.h>
        #include <stddef.h>        /* for size_t */

        int __attribute__((target("avx2")))
        sum_ints(const __m256i *p, size_t n)
        {
                __m256i a = _mm256_setzero_si256();
                __m128i b;

                do {
                        a = _mm256_add_epi32(a, *p++);
                } while (--n);

                b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
                                  _mm256_extracti128_si256(a, 1));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
                return _mm_cvtsi128_si32(b);
        }

The assembly that gcc generates is:

        0000000000000000 <sum_ints>:
           0:   c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
           4:   0f 1f 40 00             nopl   0x0(%rax)
           8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
           c:   48 83 c7 20             add    $0x20,%rdi
          10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
          14:   48 83 ee 01             sub    $0x1,%rsi
          18:   75 ee                   jne    8 <sum_ints+0x8>
          1a:   c4 e3 7d 39 c1 01       vextracti128 $0x1,%ymm0,%xmm1
          20:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
          24:   c5 f9 70 c8 31          vpshufd $0x31,%xmm0,%xmm1
          29:   c5 f1 fe c8             vpaddd %xmm0,%xmm1,%xmm1
          2d:   c5 f9 70 c1 02          vpshufd $0x2,%xmm1,%xmm0
          32:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
          36:   c5 f9 7e c0             vmovd  %xmm0,%eax
          3a:   c5 f8 77                vzeroupper
          3d:   c3                      ret

The bug is that the inner loop contains an unnecessary vmovdqa:

           8:   vpaddd (%rdi),%ymm1,%ymm0
                add    $0x20,%rdi
                vmovdqa %ymm0,%ymm1
                sub    $0x1,%rsi
                jne    8 <sum_ints+0x8>

It should look like the following instead:

           8:   vpaddd (%rdi),%ymm0,%ymm0
                add    $0x20,%rdi
                sub    $0x1,%rsi
                jne    8 <sum_ints+0x8>

Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():

        int __attribute__((target("avx2")))
        sum_ints_good(const __v8si *p, size_t n)
        {
                __v8si a = {};
                __m128i b;

                do {
                        a += *p++;
                } while (--n);

                b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
                                  _mm256_extracti128_si256((__m256i)a, 1));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
                b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
                return _mm_cvtsi128_si32(b);
        }

In the bad version, I noticed that the RTL initially has two separate insns for
'a += *p': one to do the addition and write the result to a new pseudo
register, and one to convert the value from mode V8SI to V4DI and assign it to
the original pseudo register.  These two separate insns never get combined. 
(That sort of explains why the bug isn't seen with the __v8si and += method;
gcc doesn't do a type conversion with that method.)  So, I'm wondering if the
bug is in the instruction combining pass.  Or perhaps the RTL should never have
had two separate insns in the first place?
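
For reference, the two insns look roughly like this (a sketch, not an actual
RTL dump; the pseudo register numbers are made up):

        (set (reg:V8SI 91)
             (plus:V8SI (reg:V8SI 90) (mem:V8SI (reg:DI 89))))
        (set (reg:V4DI 88)
             (subreg:V4DI (reg:V8SI 91) 0))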


* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: ebiggers3 at gmail dot com @ 2022-11-28  8:15 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #1 from Eric Biggers <ebiggers3 at gmail dot com> ---
The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk.  I can
do a bisection if needed.

However, I actually still see the bug with gcc-11.1.0 on my original
unminimized code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142.
 So maybe the reproducer I gave is not the best one.  Here is a slightly
different reproducer that reproduces the bug with both gcc-11.1.0 and trunk:

        #include <immintrin.h>
        #include <stddef.h>        /* for size_t */

        __m256i __attribute__((target("avx2")))
        f(const __m256i *p, size_t n)
        {
                __m256i a = _mm256_setzero_si256();

                do {
                        a = _mm256_add_epi32(a, *p++);
                } while (--n);

                return _mm256_madd_epi16(a, a);
        }

The assembly of the loop has the unnecessary vmovdqa:

   8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20             add    $0x20,%rdi
  10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01             sub    $0x1,%rsi
  18:   75 ee                   jne    8 <f+0x8>


* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: ebiggers3 at gmail dot com @ 2022-11-28  8:23 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #2 from Eric Biggers <ebiggers3 at gmail dot com> ---
This is also reproducible with SSE2.
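
For example, an analogous reproducer using __m128i (a sketch of the same
pattern, not the exact code I tested):

        #include <immintrin.h>
        #include <stddef.h>

        __m128i __attribute__((target("sse2")))
        f(const __m128i *p, size_t n)
        {
                __m128i a = _mm_setzero_si128();

                do {
                        a = _mm_add_epi32(a, *p++);
                } while (--n);

                return _mm_madd_epi16(a, a);
        }

The shape is the same: a vector accumulator updated in a loop, then used once
after the loop.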


* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: crazylht at gmail dot com @ 2022-11-28  8:43 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---

> In the bad version, I noticed that the RTL initially has two separate insns
> for 'a += *p': one to do the addition and write the result to a new pseudo
> register, and one to convert the value from mode V8SI to V4DI and assign it
> to the original pseudo register.

That's because __m256i is defined as __v4di, so the RTL uses a subreg to
bitcast the __v8si reg to a __v4di one.

> These two separate insns never get combined.  (That sort of explains why the
> bug isn't seen with the __v8si and += method; gcc doesn't do a type
> conversion with that method.)  So, I'm wondering if the bug is in the
> instruction combining pass.  Or perhaps the RTL should never have had two
> separate insns in the first place?

Combine fails to merge them because the __v8si reg is also used outside of the
loop.
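
For reference, here is roughly how GCC's avxintrin.h declares the two types
involved (a trimmed sketch, not the verbatim header):

        typedef long long __m256i __attribute__ ((__vector_size__ (32), __may_alias__));
        typedef int __v8si __attribute__ ((__vector_size__ (32)));

Since __m256i is the V4DImode type, storing the V8SImode result of
_mm256_add_epi32() back into an __m256i variable is what introduces the
subreg bitcast.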

