public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/111874] New: Missed mask_fold_left_plus with AVX512
@ 2023-10-19 8:49 rguenth at gcc dot gnu.org
2023-10-19 10:46 ` [Bug target/111874] " crazylht at gmail dot com
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-19 8:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
Bug ID: 111874
Summary: Missed mask_fold_left_plus with AVX512
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
Currently fold-left reductions are open-coded by the vectorizer, extracting
scalar elements and doing in-order adds. That's probably as good as it can
get.
For the case of conditional (or loop masked) fold-left reductions the scalar
fallback isn't implemented. But AVX512 has vpcompress that could be used
to implement a more efficient sequence for a masked fold-left, possibly
using a loop and population count of the mask.
It might be interesting to experiment with this, not so much for the
fully masked loop case but for conditional reduction. Maybe there's
some expert at Intel or AMD who can produce a good instruction sequence
here.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/111874] Missed mask_fold_left_plus with AVX512
2023-10-19 8:49 [Bug target/111874] New: Missed mask_fold_left_plus with AVX512 rguenth at gcc dot gnu.org
@ 2023-10-19 10:46 ` crazylht at gmail dot com
2023-10-19 11:26 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: crazylht at gmail dot com @ 2023-10-19 10:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
For integer, We have _mm512_mask_reduce_add_epi32 defined as
extern __inline int
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
{
__A = _mm512_maskz_mov_epi32 (__U, __A);
__MM512_REDUCE_OP (+);
}
#undef __MM512_REDUCE_OP
#define __MM512_REDUCE_OP(op) \
__v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1); \
__v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0); \
__m256i __T3 = (__m256i) (__T1 op __T2); \
__v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1); \
__v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0); \
__v4si __T6 = __T4 op __T5; \
__v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 }); \
__v4si __T8 = __T6 op __T7; \
return __T8[0] op __T8[1]
There's correponding floating point version, but it's not in-order adds.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/111874] Missed mask_fold_left_plus with AVX512
2023-10-19 8:49 [Bug target/111874] New: Missed mask_fold_left_plus with AVX512 rguenth at gcc dot gnu.org
2023-10-19 10:46 ` [Bug target/111874] " crazylht at gmail dot com
@ 2023-10-19 11:26 ` rguenth at gcc dot gnu.org
2023-10-24 2:50 ` crazylht at gmail dot com
2023-11-12 23:28 ` pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-19 11:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #1)
> For integer, We have _mm512_mask_reduce_add_epi32 defined as
>
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
> __A = _mm512_maskz_mov_epi32 (__U, __A);
> __MM512_REDUCE_OP (+);
> }
>
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
> __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1); \
> __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0); \
> __m256i __T3 = (__m256i) (__T1 op __T2); \
> __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1); \
> __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0); \
> __v4si __T6 = __T4 op __T5; \
> __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 }); \
> __v4si __T8 = __T6 op __T7; \
> return __T8[0] op __T8[1]
>
> There's correponding floating point version, but it's not in-order adds.
It also doesn't handle signed zeros correctly which would require
not using _mm512_maskz_mov_epi32 but merge masking with { -0.0, -0.0, ... }
for FP. Of course as it's not doing in-order processing not handling
signed zeros correctly might be a minor thing.
So yes, we're looking for -O3 without -ffast-math vectorization of
a conditional reduction that's currently not supported (correctly).
double a[1024];
double foo()
{
double res = 0.0;
for (int i = 0; i < 1024; ++i)
{
if (a[i] < 0.)
res += a[i];
}
return res;
}
should be vectorizable also with -frounding-math (where the trick using
-0.0 for masked elements doesn't work). Currently we are using 0.0 for
them (but there's a pending patch).
Maybe we don't care about -frounding-math and so -0.0 adds are OK. We
get something like the following with znver4, it could be that trying
to optimize the case of a sparse mask with vcompress isn't worth it
.L2:
vmovapd (%rax), %zmm1
addq $64, %rax
vminpd %zmm5, %zmm1, %zmm1
valignq $3, %ymm1, %ymm1, %ymm2
vunpckhpd %xmm1, %xmm1, %xmm3
vaddsd %xmm1, %xmm0, %xmm0
vaddsd %xmm3, %xmm0, %xmm0
vextractf64x2 $1, %ymm1, %xmm3
vextractf64x4 $0x1, %zmm1, %ymm1
vaddsd %xmm3, %xmm0, %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm1, %xmm0, %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vextractf64x2 $1, %ymm1, %xmm2
valignq $3, %ymm1, %ymm1, %ymm1
vaddsd %xmm2, %xmm0, %xmm0
vaddsd %xmm1, %xmm0, %xmm0
cmpq $a+8192, %rax
jne .L2
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/111874] Missed mask_fold_left_plus with AVX512
2023-10-19 8:49 [Bug target/111874] New: Missed mask_fold_left_plus with AVX512 rguenth at gcc dot gnu.org
2023-10-19 10:46 ` [Bug target/111874] " crazylht at gmail dot com
2023-10-19 11:26 ` rguenth at gcc dot gnu.org
@ 2023-10-24 2:50 ` crazylht at gmail dot com
2023-11-12 23:28 ` pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: crazylht at gmail dot com @ 2023-10-24 2:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
> For the case of conditional (or loop masked) fold-left reductions the scalar
> fallback isn't implemented. But AVX512 has vpcompress that could be used
> to implement a more efficient sequence for a masked fold-left, possibly
> using a loop and population count of the mask.
There's extra kmov + vpcompress + popcnt, I'm afraid the performance could be
worse than the scalar version.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/111874] Missed mask_fold_left_plus with AVX512
2023-10-19 8:49 [Bug target/111874] New: Missed mask_fold_left_plus with AVX512 rguenth at gcc dot gnu.org
` (2 preceding siblings ...)
2023-10-24 2:50 ` crazylht at gmail dot com
@ 2023-11-12 23:28 ` pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-12 23:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2023-11-12
Severity|normal |enhancement
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-11-12 23:28 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 8:49 [Bug target/111874] New: Missed mask_fold_left_plus with AVX512 rguenth at gcc dot gnu.org
2023-10-19 10:46 ` [Bug target/111874] " crazylht at gmail dot com
2023-10-19 11:26 ` rguenth at gcc dot gnu.org
2023-10-24 2:50 ` crazylht at gmail dot com
2023-11-12 23:28 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).