public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic
@ 2022-11-28 7:41 ebiggers3 at gmail dot com
2022-11-28 8:15 ` [Bug rtl-optimization/107892] " ebiggers3 at gmail dot com
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: ebiggers3 at gmail dot com @ 2022-11-28 7:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892
Bug ID: 107892
Summary: Unnecessary move between ymm registers in loop using
AVX2 intrinsic
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ebiggers3 at gmail dot com
Target Milestone: ---
To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:
#include <immintrin.h>
#include <stddef.h>  /* size_t */

int __attribute__((target("avx2")))
sum_ints(const __m256i *p, size_t n)
{
    __m256i a = _mm256_setzero_si256();
    __m128i b;

    do {
        a = _mm256_add_epi32(a, *p++);
    } while (--n);
    b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
                      _mm256_extracti128_si256(a, 1));
    b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
    b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
    return _mm_cvtsi128_si32(b);
}
The assembly that gcc generates is:
0000000000000000 <sum_ints>:
0: c5 f1 ef c9 vpxor %xmm1,%xmm1,%xmm1
4: 0f 1f 40 00 nopl 0x0(%rax)
8: c5 f5 fe 07 vpaddd (%rdi),%ymm1,%ymm0
c: 48 83 c7 20 add $0x20,%rdi
10: c5 fd 6f c8 vmovdqa %ymm0,%ymm1
14: 48 83 ee 01 sub $0x1,%rsi
18: 75 ee jne 8 <sum_ints+0x8>
1a: c4 e3 7d 39 c1 01 vextracti128 $0x1,%ymm0,%xmm1
20: c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0
24: c5 f9 70 c8 31 vpshufd $0x31,%xmm0,%xmm1
29: c5 f1 fe c8 vpaddd %xmm0,%xmm1,%xmm1
2d: c5 f9 70 c1 02 vpshufd $0x2,%xmm1,%xmm0
32: c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0
36: c5 f9 7e c0 vmovd %xmm0,%eax
3a: c5 f8 77 vzeroupper
3d: c3 ret
The bug is that the inner loop contains an unnecessary vmovdqa:
8: vpaddd (%rdi),%ymm1,%ymm0
add $0x20,%rdi
vmovdqa %ymm0,%ymm1
sub $0x1,%rsi
jne 8 <sum_ints+0x8>
It should look like the following instead:
8: vpaddd (%rdi),%ymm0,%ymm0
add $0x20,%rdi
sub $0x1,%rsi
jne 8 <sum_ints+0x8>
Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():
int __attribute__((target("avx2")))
sum_ints_good(const __v8si *p, size_t n)
{
    __v8si a = {};
    __m128i b;

    do {
        a += *p++;
    } while (--n);
    b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
                      _mm256_extracti128_si256((__m256i)a, 1));
    b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
    b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
    return _mm_cvtsi128_si32(b);
}
In the bad version, I noticed that the RTL initially has two separate insns for
'a += *p': one to do the addition and write the result to a new pseudo
register, and one to convert the value from mode V8SI to V4DI and assign it to
the original pseudo register. These two separate insns never get combined.
(That sort of explains why the bug isn't seen with the __v8si and += method;
gcc doesn't do a type conversion with that method.) So, I'm wondering if the
bug is in the instruction combining pass. Or perhaps the RTL should never have
had two separate insns in the first place?
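For concreteness, the two insns described above would look something like this in an RTL dump (a hand-written illustration, not actual compiler output; the pseudo-register numbers are invented):

```
;; insn 1: do the addition in V8SI mode, writing a fresh pseudo
(set (reg:V8SI 85)
     (plus:V8SI (subreg:V8SI (reg:V4DI 84) 0)
                (mem:V8SI (reg:DI 86))))

;; insn 2: bitcast the V8SI result to V4DI (the mode backing __m256i)
;; and assign it to the original accumulator pseudo
(set (reg:V4DI 84)
     (subreg:V4DI (reg:V8SI 85) 0))
```

If insn 2's copy is never folded away, it survives into the final code as the vmovdqa seen in the loop.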
* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: ebiggers3 at gmail dot com @ 2022-11-28 8:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892
--- Comment #1 from Eric Biggers <ebiggers3 at gmail dot com> ---
The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk. I can
do a bisection if needed.
However, I actually still see the bug with gcc-11.1.0 on my original
unminimized code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142.
So maybe the reproducer I gave is not the best one. Here is a slightly
different reproducer that reproduces the bug with both gcc-11.1.0 and trunk:
#include <immintrin.h>
#include <stddef.h>  /* size_t */

__m256i __attribute__((target("avx2")))
f(const __m256i *p, size_t n)
{
    __m256i a = _mm256_setzero_si256();

    do {
        a = _mm256_add_epi32(a, *p++);
    } while (--n);
    return _mm256_madd_epi16(a, a);
}
The assembly of the loop has the unnecessary vmovdqa:
8: c5 f5 fe 07 vpaddd (%rdi),%ymm1,%ymm0
c: 48 83 c7 20 add $0x20,%rdi
10: c5 fd 6f c8 vmovdqa %ymm0,%ymm1
14: 48 83 ee 01 sub $0x1,%rsi
18: 75 ee jne 8 <f+0x8>
* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: ebiggers3 at gmail dot com @ 2022-11-28 8:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892
--- Comment #2 from Eric Biggers <ebiggers3 at gmail dot com> ---
This is also reproducible with SSE2.
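The comment does not include an SSE2 test case; the following is a plausible 128-bit analogue of the first reproducer (a reconstruction, so the exact code used may differ):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Same accumulate-then-horizontal-sum pattern as sum_ints() above,
   using 128-bit vectors so only baseline SSE2 is required. */
int sum_ints_sse2(const __m128i *p, size_t n)
{
    __m128i a = _mm_setzero_si128();

    do {
        a = _mm_add_epi32(a, *p++);
    } while (--n);
    /* horizontal sum of the four 32-bit lanes */
    a = _mm_add_epi32(a, _mm_shuffle_epi32(a, 0x31));
    a = _mm_add_epi32(a, _mm_shuffle_epi32(a, 0x02));
    return _mm_cvtsi128_si32(a);
}
```

With the bug present, the loop body would carry an extra movdqa between xmm registers, analogous to the vmovdqa in the AVX2 version.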
* [Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
From: crazylht at gmail dot com @ 2022-11-28 8:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
> In the bad version, I noticed that the RTL initially has two separate insns
> for 'a += *p': one to do the addition and write the result to a new pseudo
> register, and one to convert the value from mode V8SI to V4DI and assign it
Because we define __m256i in terms of __v4di, and the RTL uses a subreg to
bitcast the __v8si reg to a __v4di one.
> to the original pseudo register. These two separate insns never get
> combined. (That sort of explains why the bug isn't seen with the __v8si and
> += method; gcc doesn't do a type conversion with that method.) So, I'm
Combine failed to combine them because the __v8si reg is also used outside of
the loop.
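As an illustration of that failure mode (a sketch with invented pseudo-register numbers, not an actual dump): combine can only merge the addition into the copy if the intermediate V8SI result dies at the copy, but here the post-loop code reads it too:

```
;; inside the loop:
(set (reg:V8SI 85)                  ; addition result in V8SI mode
     (plus:V8SI (subreg:V8SI (reg:V4DI 84) 0)
                (mem:V8SI (reg:DI 86))))
(set (reg:V4DI 84)                  ; copy back to the accumulator
     (subreg:V4DI (reg:V8SI 85) 0))

;; after the loop, the reduction also reads reg 85, so reg 85
;; is not dead after the copy and combine keeps both insns
```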
> wondering if the bug is in the instruction combining pass. Or perhaps the
> RTL should never have had two separate insns in the first place?