public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/108401] New: gcc defeats vector constant generation with intrinsics
@ 2023-01-14 0:24 andysem at mail dot ru
2023-01-16 5:06 ` [Bug target/108401] " pinskia at gcc dot gnu.org
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: andysem at mail dot ru @ 2023-01-14 0:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
Bug ID: 108401
Summary: gcc defeats vector constant generation with intrinsics
Product: gcc
Version: 11.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
Consider the following code:
#include <immintrin.h>
__m256i load_00FF()
{
__m256i mm = _mm256_setzero_si256();
return _mm256_srli_epi16(_mm256_cmpeq_epi64(mm, mm), 8);
}
This function generates a vector constant of alternating 0xFF and 0x00 bytes.
The code is written this way to avoid a load from memory, which may cause a
cache miss. The expected generated code is this:
vpcmpeqq ymm0, ymm0, ymm0
vpsrlw ymm0, ymm0, 8
ret
which is almost exactly what gcc 8 generates (it uses vpcmpeqd instead of
vpcmpeqq, which is fine). However, gcc 9 through 11 generates a memory load
instead, defeating the attempt to avoid it:
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
ret
and gcc 12 generates a worse code:
movabs rax, 71777214294589695
vmovq xmm1, rax
vpbroadcastq ymm0, xmm1
ret
In all cases, the compiler flags are: -O3 -march=haswell
Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9
I think the compiler should follow the code in intrinsics more closely since
despite the apparent equivalence, the choice of instructions can have
performance implications. The original code that is written by the developer is
better anyway, so it's not clear why the compiler is being so creative in this
case.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
@ 2023-01-16 5:06 ` pinskia at gcc dot gnu.org
2023-01-16 5:23 ` pinskia at gcc dot gnu.org
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-01-16 5:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>and gcc 12 generates a worse code:
it is not worse really; depending on the how fast moving between the register
sets is.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
2023-01-16 5:06 ` [Bug target/108401] " pinskia at gcc dot gnu.org
@ 2023-01-16 5:23 ` pinskia at gcc dot gnu.org
2023-01-16 5:28 ` crazylht at gmail dot com
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-01-16 5:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
r12-1958-gedafb35bdadf30 changed the behavior in GCC 12 to be better ...
(see the commit message that it shows it is better than doing a memory load).
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
2023-01-16 5:06 ` [Bug target/108401] " pinskia at gcc dot gnu.org
2023-01-16 5:23 ` pinskia at gcc dot gnu.org
@ 2023-01-16 5:28 ` crazylht at gmail dot com
2023-01-16 7:00 ` amonakov at gcc dot gnu.org
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: crazylht at gmail dot com @ 2023-01-16 5:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
Hongtao.liu <crazylht at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
> and gcc 12 generates a worse code:
>
> movabs rax, 71777214294589695
> vmovq xmm1, rax
> vpbroadcastq ymm0, xmm1
> ret
>
It's on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09 since
microbenchmark shows moving from imm is faster than memory.
> In all cases, the compiler flags are: -O3 -march=haswell
>
> Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9
>
> I think the compiler should follow the code in intrinsics more closely since
> despite the apparent equivalence, the choice of instructions can have
> performance implications. The original code that is written by the developer
> is better anyway, so it's not clear why the compiler is being so creative in
> this case.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
` (2 preceding siblings ...)
2023-01-16 5:28 ` crazylht at gmail dot com
@ 2023-01-16 7:00 ` amonakov at gcc dot gnu.org
2023-01-16 7:47 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-01-16 7:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #3)
> > and gcc 12 generates a worse code:
> >
> > movabs rax, 71777214294589695
> > vmovq xmm1, rax
> > vpbroadcastq ymm0, xmm1
> > ret
> >
>
> It's on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09 since
> microbenchmark shows moving from imm is faster than memory.
But the bug is not asking you to reinstate loading from memory. The bug is
asking you to notice that the result can be constructed via cmpeq+psrlw, which
is even better than a broadcast (cmpeq with dst same as src is usually a
dependency-breaking instruction that does not occupy an execution port).
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
` (3 preceding siblings ...)
2023-01-16 7:00 ` amonakov at gcc dot gnu.org
@ 2023-01-16 7:47 ` rguenth at gcc dot gnu.org
2023-01-16 10:04 ` andysem at mail dot ru
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-16 7:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2023-01-16
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. We expand from
return { 71777214294589695, 71777214294589695, 71777214294589695,
71777214294589695 };
where we could reduce the DImode broadcast to a HImode one (if that exists).
But sure, the x86 backend could implement the intrinsic suggested way to
generate this particular pattern.
I'll also note that -O0 produces quite bad code here.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
` (4 preceding siblings ...)
2023-01-16 7:47 ` rguenth at gcc dot gnu.org
@ 2023-01-16 10:04 ` andysem at mail dot ru
2023-01-16 10:42 ` andysem at mail dot ru
2023-01-17 5:42 ` crazylht at gmail dot com
7 siblings, 0 replies; 9+ messages in thread
From: andysem at mail dot ru @ 2023-01-16 10:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
--- Comment #6 from andysem at mail dot ru ---
(In reply to Andrew Pinski from comment #1)
> >and gcc 12 generates a worse code:
>
> it is not worse really; depending on the how fast moving between the
> register sets is.
I meant "worse" compared to vpcmpeq+vpsrlw pair.
(Side note about the broadcast version: it could have been smaller if it used a
32-bit constant and vpbroadcastd. vpcmpeq+vpsrlw would still be better in this
particular case, but if broadcast is needed, a smaller footprint code is
preferred.)
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
` (5 preceding siblings ...)
2023-01-16 10:04 ` andysem at mail dot ru
@ 2023-01-16 10:42 ` andysem at mail dot ru
2023-01-17 5:42 ` crazylht at gmail dot com
7 siblings, 0 replies; 9+ messages in thread
From: andysem at mail dot ru @ 2023-01-16 10:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
--- Comment #7 from andysem at mail dot ru ---
To be clear, I'm not asking the compiler to recognize the particular pattern of
alternating 0x00 and 0xFF bytes. Because hardcoding this particular pattern
won't improve generated code in other cases.
Rather, I'm asking to tune down code transformations for intrinsics. If the
developer wrote a sequence of intrinsics to generate a constant then he
probably wanted that sequence instead of a simple _mm_set1_epi32 or a load from
memory.
But, if you're going to improve constant generation, please make it so that it
can recognize not only the particular pattern described in this bug. More
importantly, it should recognize the all-ones case (as a single pcmpeq) as a
starting point. Then it can apply shifts to achieve the final result from the
all-ones vector - shifts of any width, length or direction, including
psrldq/pslldq. This would improve generated code in a wider range of cases.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/108401] gcc defeats vector constant generation with intrinsics
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
` (6 preceding siblings ...)
2023-01-16 10:42 ` andysem at mail dot ru
@ 2023-01-17 5:42 ` crazylht at gmail dot com
7 siblings, 0 replies; 9+ messages in thread
From: crazylht at gmail dot com @ 2023-01-17 5:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
> But, if you're going to improve constant generation, please make it so that
> it can recognize not only the particular pattern described in this bug. More
> importantly, it should recognize the all-ones case (as a single pcmpeq) as a
> starting point. Then it can apply shifts to achieve the final result from
> the all-ones vector - shifts of any width, length or direction, including
> psrldq/pslldq. This would improve generated code in a wider range of cases.
yes, we will try to do that. Generally fold intrinsic into compiler IR helps
performance, and for this case we need to optimize codegen for special
immediate broadcast(all-ones + shift)
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-01-17 5:42 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-14 0:24 [Bug target/108401] New: gcc defeats vector constant generation with intrinsics andysem at mail dot ru
2023-01-16 5:06 ` [Bug target/108401] " pinskia at gcc dot gnu.org
2023-01-16 5:23 ` pinskia at gcc dot gnu.org
2023-01-16 5:28 ` crazylht at gmail dot com
2023-01-16 7:00 ` amonakov at gcc dot gnu.org
2023-01-16 7:47 ` rguenth at gcc dot gnu.org
2023-01-16 10:04 ` andysem at mail dot ru
2023-01-16 10:42 ` andysem at mail dot ru
2023-01-17 5:42 ` crazylht at gmail dot com
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).