public inbox for gcc-bugs@sourceware.org
* [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop
From: rguenth at gcc dot gnu.org @ 2022-09-30  9:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

            Bug ID: 107093
           Summary: AVX512 mask operations not simplified in fully masked
                    loop
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Trying to implement WHILE_ULT for AVX512, I ran into optimization issues.
Consider

double a[1024], b[1024];

void foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}
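
In the fully masked scheme every vector iteration is guarded by a
per-lane mask produced by WHILE_ULT.  Conceptually (a scalar C sketch,
assuming V4DF, i.e. 4 doubles per vector; not the actual codegen):

/* Scalar model of the fully masked loop; the lane condition is the
   mask WHILE_ULT (i, n) produces.  */
void foo_masked_sketch (int n)
{
  for (int i = 0; i < n; i += 4)
    for (int lane = 0; lane < 4; ++lane)
      if (i + lane < n)
        a[i + lane] = b[i + lane] * 3.;
}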

compiled with -O3 -march=cascadelake --param vect-partial-vector-usage=2

I get snippets like

        kxnorb  %k1, %k1, %k1
        kortestb        %k1, %k1
        je      .L11

or

        kxorb   %k1, %k1, %k1
        kxnorb  %k1, %k1, %k1

where we fail to simplify the operations.  Looking at the RTL it looks
like missed jump threading, but I do see the ops being

(insn 18 72 74 5 (parallel [
            (set (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                (not:QI (xor:QI (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                        (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86]))))
            (unspec [
                    (const_int 0 [0])
                ] UNSPEC_MASKOP)
        ]) 1912 {kxnorqi}
     (expr_list:REG_EQUAL (const_int -1 [0xffffffffffffffff])
        (nil)))

thus having an UNSPEC in them.  When emitting a SET from constm1 I end
up with mask<->GPR moves and if-converted code, which isn't optimal
either.  With -fno-if-conversion I get

.L7:
        vmovapd b(%rax), %ymm1{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm2, %ymm1, %ymm0
        subl    %ecx, %edx
        vmovapd %ymm0, a(%rax){%k1}
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edx
        jge     .L5
        vpbroadcastd    %edx, %xmm0
        vpcmpd  $1, %xmm0, %xmm3, %k1
.L5:
        addq    $32, %rax
        kortestb        %k1, %k1
        jne     .L7

which also doesn't have the desired short-cut from the cmpl $4, %edx.
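
For reference, the missed simplifications are plain bitwise identities;
a minimal C sketch (not tied to any GCC internals):

/* kxnor computes ~(a ^ b); with both operands the same register it
   yields all-ones for any input.  The kxorb feeding it is therefore
   dead, and kortestb on the result never sets ZF, so the je shown
   above can never be taken.  */
unsigned char kxnor_same (unsigned char k)
{
  return ~(k ^ k);  /* == 0xff for any k */
}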


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: rguenth at gcc dot gnu.org @ 2022-09-30  9:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 53645
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53645&action=edit
prototype for WHILE_ULT

I'm playing with the attached.  Note it requires the third operand patch for
WHILE_ULT (but it's basically approved).
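
For reference, WHILE_ULT (i, n) just computes a "lane index below
bound" bitmask; a scalar C sketch for an 8-lane QImode mask (the new
third operand is not modeled here):

/* Bit l of the mask is set while i + l < n.  */
unsigned char while_ult (unsigned long i, unsigned long n)
{
  unsigned char mask = 0;
  for (int l = 0; l < 8; ++l)
    if (i + l < n)
      mask |= (unsigned char) (1u << l);
  return mask;
}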


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: crazylht at gmail dot com @ 2022-10-10  1:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
For the UNSPEC part, we can create a new define_insn with a general
operation that accepts both GPR and mask alternatives, just like the
other logic patterns.  The GPR version can then be split into xor + not
after reload, and the mask version into kxnor with the UNSPEC after
reload.

It should help general simplification of xor and not.
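
A sketch of the kind of input this would let combine catch (a
hypothetical testcase using the AVX512 mask intrinsics; the committed
pr107093.c may differ):

#include <immintrin.h>

/* With a general notxor pattern, knot (kxor (a, b)) can become a
   single kxnorw instead of kxorw + knotw.  */
__mmask16 foo_kxnor (__mmask16 a, __mmask16 b)
{
  return _knot_mask16 (_kxor_mask16 (a, b));
}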


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: cvs-commit at gcc dot gnu.org @ 2022-10-11  9:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:498ad738690b3c464f901d63dcd4d0f49a50dd00

commit r13-3218-g498ad738690b3c464f901d63dcd4d0f49a50dd00
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Oct 10 11:31:48 2022 +0800

    Add define_insn_and_split to support general version of "kxnor".

    For general_reg_operand, it will be split into xor + not.
    For mask_reg_operand, it will be split with UNSPEC_MASK_OP just
    like what we did for other logic operations.

    The patch will optimize xor+not to kxnor when possible.

    gcc/ChangeLog:

            PR target/107093
            * config/i386/i386.md (*notxor<mode>_1): New post_reload
            define_insn_and_split.
            (*notxorqi_1): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr107093.c: New test.


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: crazylht at gmail dot com @ 2022-10-11 10:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
change "*k, CBC" to "?k, CBC", in *mov{qi,hi,si,di}_internal.
then RA works good to choose kxnor for setting constm1_rtx to mask register,
and i got below with your attached patch(change #if 0 to #if 1), seems better
than orginal patch.

foo:
.LFB0:
        .cfi_startproc
        testl   %edi, %edi
        jle     .L9
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edi
        jl      .L11
.L3:
        vbroadcastsd    .LC2(%rip), %ymm3
        vmovdqa .LC0(%rip), %xmm2
        xorl    %eax, %eax
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L7:
        vmovapd b(%rax), %ymm0{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm3, %ymm0, %ymm1
        subl    %ecx, %edx
        cmpl    $4, %edx
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastd    %edx, %xmm1
        movl    $-1, %edx
        vpcmpd  $1, %xmm1, %xmm2, %k1
        kmovb   %k1, %esi
        cmovge  %edx, %esi
        addq    $32, %rax
        kmovb   %esi, %k1
        kortestb        %k1, %k1
        jne     .L7
        vzeroupper
.L9:
        ret
        .p2align 4,,10
        .p2align 3
.L11:
        vmovdqa .LC0(%rip), %xmm2
        vpbroadcastd    %edi, %xmm1
        vpcmpd  $1, %xmm1, %xmm2, %k1
        jmp     .L3
        .cfi_endproc


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: crazylht at gmail dot com @ 2022-10-11 10:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
Also I think a masked epilogue (--param=vect-partial-vector-usage=1)
should be good for the general case under AVX512, especially when the
main loop's vector width is 512 and the remaining trip count is not
enough for 256-bit vectorization but enough for 128-bit vectorization.
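
Structurally that is (a scalar C sketch of
--param=vect-partial-vector-usage=1, reusing a/b from the testcase;
the widths are illustrative):

void foo_masked_epilogue (int n)
{
  int i = 0;
  /* Unmasked full-width main loop (e.g. 512-bit, V8DF).  */
  for (; i + 8 <= n; i += 8)
    for (int l = 0; l < 8; ++l)
      a[i + l] = b[i + l] * 3.;
  /* One masked vector iteration covers the remainder instead of a
     scalar epilogue loop.  */
  for (int l = 0; i + l < n; ++l)
    a[i + l] = b[i + l] * 3.;
}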


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: rguenth at gcc dot gnu.org @ 2022-10-11 10:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #4)
> Changing "*k, CBC" to "?k, CBC" in *mov{qi,hi,si,di}_internal makes
> the RA choose kxnor for setting constm1_rtx in a mask register.  With
> your attached patch (changing #if 0 to #if 1) I then get the code
> below, which seems better than the original patch.
> 
> foo:
> .LFB0:
>         .cfi_startproc
>         testl   %edi, %edi
>         jle     .L9
>         kxnorb  %k1, %k1, %k1
>         cmpl    $4, %edi
>         jl      .L11
> .L3:
>         vbroadcastsd    .LC2(%rip), %ymm3
>         vmovdqa .LC0(%rip), %xmm2
>         xorl    %eax, %eax
>         xorl    %ecx, %ecx
>         .p2align 4,,10
>         .p2align 3
> .L7:
>         vmovapd b(%rax), %ymm0{%k1}
>         addl    $4, %ecx
>         movl    %edi, %edx
>         vmulpd  %ymm3, %ymm0, %ymm1
>         subl    %ecx, %edx
>         cmpl    $4, %edx
>         vmovapd %ymm1, a(%rax){%k1}
>         vpbroadcastd    %edx, %xmm1
>         movl    $-1, %edx
>         vpcmpd  $1, %xmm1, %xmm2, %k1
>         kmovb   %k1, %esi
>         cmovge  %edx, %esi

I'm not sure the round-trip to GPRs for the sake of a cmovge is
worth it; I guess a branch would be better.


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: rguenth at gcc dot gnu.org @ 2022-10-11 10:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> Also i think masked epilog(--param=vect-partial-vector-usage=1) should be
> good for general cases under AVX512, espicially when main loop's vector
> width is 512, and the remain tripcount is not enough for 256-bit
> vectorization but ok for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its
very-cheap (size-wise) cost model.  Since we don't vectorize the
epilogue of a vectorized epilogue (yet), going fully masked there
should indeed help.  Also, when we start to use the unroll hint, the
vectorized epilogue might get full-width iterations to handle as well.

One downside for a fully masked body is that we're using masked stores,
which usually have higher latency due to the "merge" semantics, which
means an extra memory input + merge operation.  Not sure if modern
uArchs can optimize the all-ones mask case.  The vectorizer, for
.MASK_STORE, still has the code to change those to emit a mask compare
against all-zeros and only do the .MASK_STORE conditionally.  That
could be enhanced to single out the all-ones case, at least for the
.MASK_STOREs in a fully masked main loop where the mask comes only
from the iteration (rather than from conditional execution).
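
A sketch of that branch-around with the all-ones case singled out
(AVX512VL intrinsics; the all-ones arm is the suggested enhancement,
not what the vectorizer emits today):

#include <immintrin.h>

static inline void mask_store_v4df (double *p, __m256d val, __mmask8 mask)
{
  if (mask == 0xf)
    /* All lanes active: plain store, forwards normally.  */
    _mm256_store_pd (p, val);
  else if (mask != 0)
    /* Partial mask: masked store with merge semantics.  */
    _mm256_mask_store_pd (p, mask, val);
  /* All-zeros mask: skip the store entirely.  */
}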


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: crazylht at gmail dot com @ 2022-10-11 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---

> 
> One downside for a fully masked body is that we're using masked stores,
> which usually have higher latency due to the "merge" semantics, which
> means an extra memory input + merge operation.  Not sure if modern
> uArchs can optimize the all-ones mask case.

Also I guess a masked store won't be store-forwarded even when the
load lies entirely within the masked store.


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: rguenther at suse dot de @ 2022-10-11 11:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Oct 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
> 
> --- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > 
> > One downside for a fully masked body is that we're using masked stores,
> > which usually have higher latency due to the "merge" semantics, which
> > means an extra memory input + merge operation.  Not sure if modern
> > uArchs can optimize the all-ones mask case.
> Also I guess a masked store won't be store-forwarded even when the
> load lies entirely within the masked store.

I guess the masking of the store is resolved in the load-store unit
and not by splitting the operation into a load, modify, store because
that cannot easily hide exceptions.  So yes, a masked store in the
store buffer likely cannot act as a forwarding source (though the
actual mask should be fully resolved there) since the actual
merging will take place later.


* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
From: rguenth at gcc dot gnu.org @ 2023-07-24  8:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Icelake is able to forward a masked store with an all-ones mask;
Zen4 isn't able to do that.  Other masked stores indeed do not forward.

There's also a related problem when an outer loop causes a low-trip
inner loop to use masked loads/stores on vectors that then overlap:

outer iteration 1
   ... = .MASK_LOAD (p, {-1, -1, -1, -1, 0, 0, 0, 0});
   ...
   .MASK_STORE (p, val, {-1, -1, -1, -1, 0, 0, 0, 0});

outer iteration 2
   ... = .MASK_LOAD (p + delta, {-1, -1, -1, -1, 0, 0, 0, 0});
   ...
   .MASK_STORE (p + delta, val, {-1, -1, -1, -1, 0, 0, 0, 0});

with delta causing the next outer iteration to access the masked-out
values from the previous iteration.  That gets an STLF failure
(obviously), but we now also need to wait for the masked store to
retire before the masked load of iteration 2 can be carried out.
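
A sketch of a loop nest with that shape (names and sizes illustrative):

/* The inner loop trips 4 times; with V8DF vectors it becomes one
   masked load/store per row with mask {-1,-1,-1,-1,0,0,0,0}.  Row i's
   64-byte store footprint overlaps row i+1's load 32 bytes later, so
   the load hits the masked-out half of the pending store.  */
double m[1024][4];

void scale_rows (int rows)
{
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < 4; ++j)
      m[i][j] *= 3.;
}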

We are hitting this case in SPEC CPU 2017 with masked epilogues (the
inner loop just iterates 4 times, vectorized with V8DFmode vectors).

Ideally the implementation (the CPU) would "shorten" loads/stores for
trailing sequences of zeros so this hazard doesn't occur.  Not sure if
that would be allowed by the x86 memory model though (I didn't find
anything specific there with respect to load/store masking).  ISTR store
buffer entries are usually assigned at instruction issue time; I'm not
sure if the mask is resolved there or whether the size of the store could
be adjusted later when it is.  The implementation could also somehow
ignore the "conflict".

Note I didn't yet fully benchmark masked epilogues with
-mprefer-vector-width=512 on Icelake or Sapphire Rapids; maybe Intel
CPUs are not affected by this issue.

The original issue in the description seems solved: we now generate
the following with the code-generation variant that's now on trunk:

.L3:
        vmovapd b(%rax), %ymm0{%k1}
        movl    %edx, %ecx
        subl    $4, %edx
        kmovw   %edx, %k0
        vmulpd  %ymm3, %ymm0, %ymm1{%k1}{z}
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastmw2d %k0, %xmm1
        addq    $32, %rax
        vpcmpud $6, %xmm2, %xmm1, %k1
        cmpw    $4, %cx
        ja      .L3

that avoids using the slow mask ops for loop control.  It oddly does

        subl    $4, %edx
        kmovw   %edx, %k0
        vpbroadcastmw2d %k0, %xmm1

with -march=cascadelake - with -march=znver4 I get the expected

        subl    $8, %edx
        vpbroadcastw    %edx, %xmm1

but I can reproduce the mask register "spill" with -mprefer-vector-width=256.

We expand to

(insn 14 13 15 (set (reg:V4SI 96)
        (vec_duplicate:V4SI (reg:SI 93 [ _27 ])))
     8167 {*avx512vl_vec_dup_gprv4si}
     (nil))

I'll file a separate bug report for this.

