public inbox for gcc-bugs@sourceware.org
* [Bug target/111874] New: Missed mask_fold_left_plus with AVX512
From: rguenth at gcc dot gnu.org @ 2023-10-19  8:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

            Bug ID: 111874
           Summary: Missed mask_fold_left_plus with AVX512
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Currently fold-left reductions are open-coded by the vectorizer, extracting
scalar elements and doing in-order adds.  That's probably as good as it can
get.
For the case of conditional (or loop masked) fold-left reductions the scalar
fallback isn't implemented.  But AVX512 has vpcompress that could be used
to implement a more efficient sequence for a masked fold-left, possibly
using a loop and population count of the mask.

It might be interesting to experiment with this, not so much for the
fully masked loop case but for conditional reduction.  Maybe there's
some expert at Intel or AMD who can produce a good instruction sequence
here.
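
A sketch of the idea (not from the report; the function and variable names
are illustrative), assuming the AVX512F compress/popcnt intrinsics:
vcompresspd packs the active lanes to the front while keeping their order,
and the population count of the mask says how many in-order adds to do.

#include <immintrin.h>

/* Hypothetical masked in-order (fold-left) add of one 8 x double vector
   into RES: the lanes of V selected by M are added left to right.  */
static inline double
mask_fold_left_add_pd (double res, __m512d v, __mmask8 m)
{
  __m512d c = _mm512_maskz_compress_pd (m, v); /* vcompresspd: pack actives */
  int n = _mm_popcnt_u32 ((unsigned) m);       /* number of active lanes */
  double buf[8];
  _mm512_storeu_pd (buf, c);
  for (int i = 0; i < n; ++i)                  /* in-order scalar adds */
    res += buf[i];
  return res;
}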


* [Bug target/111874] Missed mask_fold_left_plus with AVX512
From: crazylht at gmail dot com @ 2023-10-19 10:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
For integers, we have _mm512_mask_reduce_add_epi32 defined as:

extern __inline int
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
{
  __A = _mm512_maskz_mov_epi32 (__U, __A);
  __MM512_REDUCE_OP (+);
}

#undef __MM512_REDUCE_OP
#define __MM512_REDUCE_OP(op) \
  __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);            \
  __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);            \
  __m256i __T3 = (__m256i) (__T1 op __T2);                              \
  __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);            \
  __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);            \
  __v4si __T6 = __T4 op __T5;                                           \
  __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });      \
  __v4si __T8 = __T6 op __T7;                                           \
  return __T8[0] op __T8[1]
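
For reference, a hypothetical use (the pointer and the mask are made up):

  __m512i v = _mm512_loadu_si512 (some_ints);          /* 16 x int32 */
  int sum = _mm512_mask_reduce_add_epi32 (0x00ff, v);  /* sum of low 8 lanes */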

There's a corresponding floating point version, but it's not in-order adds.
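
(For contrast, the masked fold-left semantics the vectorizer would need are
roughly the following, shown here as an illustrative plain-C helper rather
than anything GCC emits:

static double
fold_left_add (double init, const double a[8], unsigned mask)
{
  double acc = init;
  for (int i = 0; i < 8; ++i)   /* strictly left to right */
    if (mask & (1u << i))
      acc += a[i];
  return acc;
}

whereas the reduction above combines lanes tree-wise.)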


* [Bug target/111874] Missed mask_fold_left_plus with AVX512
From: rguenth at gcc dot gnu.org @ 2023-10-19 11:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #1)
> For integers, we have _mm512_mask_reduce_add_epi32 defined as:
> 
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
>   __A = _mm512_maskz_mov_epi32 (__U, __A);
>   __MM512_REDUCE_OP (+);
> }
> 
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
>   __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);		\
>   __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);		\
>   __m256i __T3 = (__m256i) (__T1 op __T2);				\
>   __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);		\
>   __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);		\
>   __v4si __T6 = __T4 op __T5;						\
>   __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });	\
>   __v4si __T8 = __T6 op __T7;						\
>   return __T8[0] op __T8[1]
> 
> There's a corresponding floating point version, but it's not in-order adds.

It also doesn't handle signed zeros correctly; for FP that would require
not using _mm512_maskz_mov_epi32 but merge masking with { -0.0, -0.0, ... }.
Of course, as it's not doing in-order processing, not handling signed zeros
correctly might be a minor thing.
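
A sketch of that merge-masking trick, with made-up variable names (inactive
lanes become -0.0, the identity for FP addition under the default rounding
mode, instead of +0.0):

  __m512d vals  = _mm512_loadu_pd (p);
  __m512d fixed = _mm512_mask_mov_pd (_mm512_set1_pd (-0.0), m, vals);
  /* FIXED has -0.0 in the lanes M masks off and the original values
     elsewhere, so a following reduction can't turn a -0.0 sum into +0.0.  */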

So yes, we're looking for -O3 without -ffast-math vectorization of
a conditional reduction that's currently not supported (correctly).

double a[1024];
double foo()
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    {
      if (a[i] < 0.)
         res += a[i];
    }
  return res;
}

should be vectorizable also with -frounding-math (where the trick using
-0.0 for masked elements doesn't work).  Currently we are using 0.0 for
them (but there's a pending patch).

Maybe we don't care about -frounding-math and so -0.0 adds are OK.  We
currently get something like the following with znver4; it could be that
trying to optimize the case of a sparse mask with vcompress isn't worth it:

.L2:
        vmovapd (%rax), %zmm1
        addq    $64, %rax
        vminpd  %zmm5, %zmm1, %zmm1
        valignq $3, %ymm1, %ymm1, %ymm2
        vunpckhpd       %xmm1, %xmm1, %xmm3
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm3
        vextractf64x4   $0x1, %zmm1, %ymm1
        vaddsd  %xmm3, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm2
        valignq $3, %ymm1, %ymm1, %ymm1
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        cmpq    $a+8192, %rax
        jne     .L2


* [Bug target/111874] Missed mask_fold_left_plus with AVX512
From: crazylht at gmail dot com @ 2023-10-24  2:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
> For the case of conditional (or loop masked) fold-left reductions the scalar
> fallback isn't implemented.  But AVX512 has vpcompress that could be used
> to implement a more efficient sequence for a masked fold-left, possibly
> using a loop and population count of the mask.
There are extra kmov + vpcompress + popcnt instructions per vector; I'm
afraid the performance could be worse than the scalar version.


* [Bug target/111874] Missed mask_fold_left_plus with AVX512
From: pinskia at gcc dot gnu.org @ 2023-11-12 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-11-12
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
.


