public inbox for gcc-bugs@sourceware.org
* [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop
@ 2022-09-30 9:06 rguenth at gcc dot gnu.org
2022-09-30 9:09 ` [Bug target/107093] " rguenth at gcc dot gnu.org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-09-30 9:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
Bug ID: 107093
Summary: AVX512 mask operations not simplified in fully masked
loop
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
While trying to implement WHILE_ULT for AVX512 I ran into optimization issues.
Consider
double a[1024], b[1024];
void foo (int n)
{
for (int i = 0; i < n; ++i)
a[i] = b[i] * 3.;
}
compiled with -O3 -march=cascadelake --param vect-partial-vector-usage=2
I get snippets like
kxnorb %k1, %k1, %k1
kortestb %k1, %k1
je .L11
or
kxorb %k1, %k1, %k1
kxnorb %k1, %k1, %k1
where we fail to simplify the operations. Looking at the RTL, this looks
like missed jump threading, but I do see the ops being
(insn 18 72 74 5 (parallel [
(set (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
(not:QI (xor:QI (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
(reg:QI 69 k1 [orig:86 loop_mask_15 ] [86]))))
(unspec [
(const_int 0 [0])
] UNSPEC_MASKOP)
]) 1912 {kxnorqi}
(expr_list:REG_EQUAL (const_int -1 [0xffffffffffffffff])
(nil)))
thus having an UNSPEC in them. When emitting a SET from constm1 I end up
with mask<->GPR moves and if-converted code, which isn't optimal either.
When doing -fno-if-conversion I get
.L7:
vmovapd b(%rax), %ymm1{%k1}
addl $4, %ecx
movl %edi, %edx
vmulpd %ymm2, %ymm1, %ymm0
subl %ecx, %edx
vmovapd %ymm0, a(%rax){%k1}
kxnorb %k1, %k1, %k1
cmpl $4, %edx
jge .L5
vpbroadcastd %edx, %xmm0
vpcmpd $1, %xmm0, %xmm3, %k1
.L5:
addq $32, %rax
kortestb %k1, %k1
jne .L7
which also doesn't have the desired short-cut from the cmpl $4, %edx.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
@ 2022-09-30 9:09 ` rguenth at gcc dot gnu.org
2022-10-10 1:50 ` crazylht at gmail dot com
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-09-30 9:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 53645
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53645&action=edit
prototype for WHILE_ULT
I'm playing with the attached. Note it requires the third operand patch for
WHILE_ULT (but it's basically approved).
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
2022-09-30 9:09 ` [Bug target/107093] " rguenth at gcc dot gnu.org
@ 2022-10-10 1:50 ` crazylht at gmail dot com
2022-10-11 9:23 ` cvs-commit at gcc dot gnu.org
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2022-10-10 1:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
For the UNSPEC part, we can create a new define_insn with a general operation
that accepts both gpr and mask alternatives, just like the other logic patterns.
For the gpr version, we can split it into xor + not after reload.
For the mask version, we can split it into kxnor with UNSPEC after reload.
It should help general simplification of xor and not.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
2022-09-30 9:09 ` [Bug target/107093] " rguenth at gcc dot gnu.org
2022-10-10 1:50 ` crazylht at gmail dot com
@ 2022-10-11 9:23 ` cvs-commit at gcc dot gnu.org
2022-10-11 10:05 ` crazylht at gmail dot com
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-10-11 9:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:498ad738690b3c464f901d63dcd4d0f49a50dd00
commit r13-3218-g498ad738690b3c464f901d63dcd4d0f49a50dd00
Author: liuhongt <hongtao.liu@intel.com>
Date: Mon Oct 10 11:31:48 2022 +0800
Add define_insn_and_split to support general version of "kxnor".
For general_reg_operand, it will be split into xor + not.
For mask_reg_operand, it will be split with UNSPEC_MASK_OP just
like what we did for other logic operations.
The patch will optimize xor+not to kxnor when possible.
gcc/ChangeLog:
PR target/107093
* config/i386/i386.md (*notxor<mode>_1): New post_reload
define_insn_and_split.
(*notxorqi_1): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr107093.c: New test.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (2 preceding siblings ...)
2022-10-11 9:23 ` cvs-commit at gcc dot gnu.org
@ 2022-10-11 10:05 ` crazylht at gmail dot com
2022-10-11 10:13 ` crazylht at gmail dot com
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2022-10-11 10:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
Change "*k, CBC" to "?k, CBC" in *mov{qi,hi,si,di}_internal;
then the RA does a good job of choosing kxnor for setting constm1_rtx to a mask
register, and I got the below with your attached patch (change #if 0 to #if 1),
which seems better than the original patch.
foo:
.LFB0:
        .cfi_startproc
        testl   %edi, %edi
        jle     .L9
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edi
        jl      .L11
.L3:
        vbroadcastsd    .LC2(%rip), %ymm3
        vmovdqa .LC0(%rip), %xmm2
        xorl    %eax, %eax
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L7:
        vmovapd b(%rax), %ymm0{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm3, %ymm0, %ymm1
        subl    %ecx, %edx
        cmpl    $4, %edx
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastd    %edx, %xmm1
        movl    $-1, %edx
        vpcmpd  $1, %xmm1, %xmm2, %k1
        kmovb   %k1, %esi
        cmovge  %edx, %esi
        addq    $32, %rax
        kmovb   %esi, %k1
        kortestb        %k1, %k1
        jne     .L7
        vzeroupper
.L9:
        ret
        .p2align 4,,10
        .p2align 3
.L11:
        vmovdqa .LC0(%rip), %xmm2
        vpbroadcastd    %edi, %xmm1
        vpcmpd  $1, %xmm1, %xmm2, %k1
        jmp     .L3
        .cfi_endproc
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (3 preceding siblings ...)
2022-10-11 10:05 ` crazylht at gmail dot com
@ 2022-10-11 10:13 ` crazylht at gmail dot com
2022-10-11 10:51 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2022-10-11 10:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
Also, I think a masked epilog (--param=vect-partial-vector-usage=1) should be
good for general cases under AVX512, especially when the main loop's vector
width is 512 and the remaining trip count is not enough for 256-bit
vectorization but is OK for 128-bit vectorization.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (4 preceding siblings ...)
2022-10-11 10:13 ` crazylht at gmail dot com
@ 2022-10-11 10:51 ` rguenth at gcc dot gnu.org
2022-10-11 10:59 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-11 10:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #4)
> change "*k, CBC" to "?k, CBC", in *mov{qi,hi,si,di}_internal.
> then RA works good to choose kxnor for setting constm1_rtx to mask register,
> and i got below with your attached patch(change #if 0 to #if 1), seems
> better than orginal patch.
>
> foo:
> .LFB0:
>         .cfi_startproc
>         testl   %edi, %edi
>         jle     .L9
>         kxnorb  %k1, %k1, %k1
>         cmpl    $4, %edi
>         jl      .L11
> .L3:
>         vbroadcastsd    .LC2(%rip), %ymm3
>         vmovdqa .LC0(%rip), %xmm2
>         xorl    %eax, %eax
>         xorl    %ecx, %ecx
>         .p2align 4,,10
>         .p2align 3
> .L7:
>         vmovapd b(%rax), %ymm0{%k1}
>         addl    $4, %ecx
>         movl    %edi, %edx
>         vmulpd  %ymm3, %ymm0, %ymm1
>         subl    %ecx, %edx
>         cmpl    $4, %edx
>         vmovapd %ymm1, a(%rax){%k1}
>         vpbroadcastd    %edx, %xmm1
>         movl    $-1, %edx
>         vpcmpd  $1, %xmm1, %xmm2, %k1
>         kmovb   %k1, %esi
>         cmovge  %edx, %esi
Not sure the round-trip to GPRs for the sake of a cmovge is worth it;
I guess a branch would be better.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (5 preceding siblings ...)
2022-10-11 10:51 ` rguenth at gcc dot gnu.org
@ 2022-10-11 10:59 ` rguenth at gcc dot gnu.org
2022-10-11 11:08 ` crazylht at gmail dot com
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-11 10:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> Also i think masked epilog(--param=vect-partial-vector-usage=1) should be
> good for general cases under AVX512, espicially when main loop's vector
> width is 512, and the remain tripcount is not enough for 256-bit
> vectorization but ok for 128-bit vectorization.
Yes, for the fully masked variant I was mostly targeting -O2 with its
very-cheap (size-wise) cost model. Since we don't vectorize the
epilogue of a vectorized epilogue (yet), going fully masked there
should indeed help. Also, when we start to use the unroll hint, the
vectorized epilogue might get full-width iterations to handle as well.
One downside of a fully masked body is that we're using masked stores,
which usually have higher latency due to the "merge" semantics, which
mean an extra memory input + merge operation. Not sure if modern
uArchs can optimize the all-ones mask case; the vectorizer, for
.MASK_STORE, still has the code to change those to emit a mask
compare against all-zeros and only conditionally do a .MASK_STORE.
That could be enhanced to single out the all-ones case, at least for
the .MASK_STOREs in a main fully masked loop when the mask only comes
from the iteration (rather than from conditional execution).
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (6 preceding siblings ...)
2022-10-11 10:59 ` rguenth at gcc dot gnu.org
@ 2022-10-11 11:08 ` crazylht at gmail dot com
2022-10-11 11:14 ` rguenther at suse dot de
2023-07-24 8:21 ` rguenth at gcc dot gnu.org
9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2022-10-11 11:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
>
> One downside for a fully masked body is that we're using masked stores
> which usually have higher latency due to the "merge" semantics which
> means an extra memory input + merge operation. Not sure if modern
> uArchs can optimize the all-ones mask case, the vectorizer, for
Also, I guess a mask store can't be a store-forwarding source even when the
load lies entirely within the masked store.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (7 preceding siblings ...)
2022-10-11 11:08 ` crazylht at gmail dot com
@ 2022-10-11 11:14 ` rguenther at suse dot de
2023-07-24 8:21 ` rguenth at gcc dot gnu.org
9 siblings, 0 replies; 11+ messages in thread
From: rguenther at suse dot de @ 2022-10-11 11:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Oct 2022, crazylht at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
>
> --- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
>
> >
> > One downside for a fully masked body is that we're using masked stores
> > which usually have higher latency due to the "merge" semantics which
> > means an extra memory input + merge operation. Not sure if modern
> > uArchs can optimize the all-ones mask case, the vectorizer, for
> Also I guess mask store won't be store forward even load is inside the mask
> store.
I guess the masking of the store is resolved in the load-store unit
and not by splitting the operation into a load, modify and store, because
that cannot easily hide exceptions. So yes, a masked store in the
store buffer likely cannot act as a forwarding source (though the
actual mask should be fully resolved there) since the actual
merging will take place later.
* [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
2022-09-30 9:06 [Bug target/107093] New: AVX512 mask operations not simplified in fully masked loop rguenth at gcc dot gnu.org
` (8 preceding siblings ...)
2022-10-11 11:14 ` rguenther at suse dot de
@ 2023-07-24 8:21 ` rguenth at gcc dot gnu.org
9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-24 8:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|UNCONFIRMED |RESOLVED
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
icelake is able to forward a masked store with an all-ones mask; Zen4 isn't
able to do that. Other masked stores indeed do not forward.
There's also a related problem when an outer loop causes a low-trip-count
inner loop to use masked loads/stores on overlapping vectors:
outer iteration 1
... = .MASK_LOAD (p, {-1, -1, -1, -1, 0, 0, 0, 0});
...
.MASK_STORE (p, val, {-1, -1, -1, -1, 0, 0, 0, 0});
outer iteration 2
... = .MASK_LOAD (p + delta, {-1, -1, -1, -1, 0, 0, 0, 0});
...
.MASK_STORE (p + delta, val, {-1, -1, -1, -1, 0, 0, 0, 0});
with delta causing the next outer iteration to access the masked-out values
from the previous iteration. That gets an STLF failure (obviously), but
we now also need to wait for the masked store to retire before the masked
load of iteration 2 can be carried out.
We are hitting this case in SPEC CPU 2017 with masked epilogues (the
inner loop just iterates 4 times, vectorized with V8DFmode vectors).
Ideally the implementation (the CPU) would "shorten" loads/stores for
trailing sequences of zeros so this hazard doesn't occur. Not sure if
that would be allowed by the x86 memory model though (I didn't find
anything specific there with respect to load/store masking). ISTR store
buffer entries are usually assigned at instruction issue time, I'm not
sure if the mask is resolved there or whether the size of the store could
be adjusted later when it is. The implementation could also somehow
ignore the "conflict".
Note I didn't yet fully benchmark masked epilogues with
-mpreferred-vector-width=512 on icelake or sapphire rapids; maybe Intel CPUs
are not affected by this issue.
The original issue in the description seems solved; we now generate the
following with the code-generation variant that's now on trunk:
.L3:
vmovapd b(%rax), %ymm0{%k1}
movl %edx, %ecx
subl $4, %edx
kmovw %edx, %k0
vmulpd %ymm3, %ymm0, %ymm1{%k1}{z}
vmovapd %ymm1, a(%rax){%k1}
vpbroadcastmw2d %k0, %xmm1
addq $32, %rax
vpcmpud $6, %xmm2, %xmm1, %k1
cmpw $4, %cx
ja .L3
that avoids using the slow mask ops for loop control. It oddly does
subl $4, %edx
kmovw %edx, %k0
vpbroadcastmw2d %k0, %xmm1
with -march=cascadelake - with -march=znver4 I get the expected
subl $8, %edx
vpbroadcastw %edx, %xmm1
but I can reproduce the mask register "spill" with -mprefer-vector-width=256.
We expand to
(insn 14 13 15 (set (reg:V4SI 96)
(vec_duplicate:V4SI (reg:SI 93 [ _27 ]))) 8167
{*avx512vl_vec_dup_gprv4si}
(nil))
I'll file a separate bugreport for this.