public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops
@ 2024-06-18 3:14 tnfchris at gcc dot gnu.org
2024-06-18 3:18 ` [Bug tree-optimization/115531] " pinskia at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-18 3:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
Bug ID: 115531
Summary: vectorizer generates inefficient code for masked
conditional update loops
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following code:
void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
if (stride <= 1)
return;
for (int i = 0; i < n; i++)
{
int res = c[i];
int t = b[i+stride];
if (a[i] != 0)
res = t;
c[i] = res;
}
}
generates at -O3 -g0 -mcpu=generic+sve:
.L3:
ld1b z29.s, p7/z, [x0, x5]
ld1w z31.s, p7/z, [x2, x5, lsl 2]
ld1w z30.s, p7/z, [x1, x5, lsl 2]
cmpne p15.b, p6/z, z29.b, #0
sel z30.s, p15, z30.s, z31.s
st1w z30.s, p7, [x2, x5, lsl 2]
add x5, x5, x4
whilelo p7.s, w5, w3
b.any .L3
.L1:
and makes vectorization unprofitable until very high iterations of n.
This is because the vector code has more instructions than needed.
Since it's a masked store, whenever a value is being conditionally set we don't
need the intermediate VEC_COND_EXPR. This loop can be vectorized as:
.L3:
ld1b z29.s, p7/z, [x0, x5]
ld1w z31.s, p7/z, [x2, x5, lsl 2]
cmpne p4.b, p6/z, z29.b, #0
st1w z31.s, p4, [x2, x5, lsl 2]
add x5, x5, x4
whilelo p7.s, w5, w3
b.any .L3
.L1:
I currently prototyped a load-to-store forward optimization in forwprop but
looking to move it into the vectorizer to cost it properly, however I'm not
entirely sure what the best way to do so is.
I can certainly fix it up during codegen but to cost it I need to do so during
analysis. I could detect it during vectorizable_condition but then the dead
load is still costed. Or I could maybe use a pattern, but unsure how to
represent the mask into the load.
Is it valid to produce a pattern with .IFN_MASK_STORE?
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
@ 2024-06-18 3:18 ` pinskia at gcc dot gnu.org
2024-06-18 3:18 ` pinskia at gcc dot gnu.org
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-06-18 3:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2024-06-18
See Also| |https://gcc.gnu.org/bugzill
| |a/show_bug.cgi?id=20999
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect PR 20999 would fix this ...
but we have to be careful since without masked stores, you could still
vectorize this unlike the transformed version.
Maybe ifcvt can produce a masked store version if this pattern ...
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
2024-06-18 3:18 ` [Bug tree-optimization/115531] " pinskia at gcc dot gnu.org
@ 2024-06-18 3:18 ` pinskia at gcc dot gnu.org
2024-06-18 3:22 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-06-18 3:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
2024-06-18 3:18 ` [Bug tree-optimization/115531] " pinskia at gcc dot gnu.org
2024-06-18 3:18 ` pinskia at gcc dot gnu.org
@ 2024-06-18 3:22 ` pinskia at gcc dot gnu.org
2024-06-18 3:30 ` tnfchris at gcc dot gnu.org
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-06-18 3:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Maybe ifcvt can produce a masked store version if this pattern ...
Maybe add another argument to .MASK_STORE to say it was originally
unconditional store? Or something like that.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
` (2 preceding siblings ...)
2024-06-18 3:22 ` pinskia at gcc dot gnu.org
@ 2024-06-18 3:30 ` tnfchris at gcc dot gnu.org
2024-06-18 6:44 ` rguenth at gcc dot gnu.org
2024-06-25 5:29 ` tnfchris at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-18 3:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> I suspect PR 20999 would fix this ...
> but we have to be careful since without masked stores, you could still
> vectorize this unlike the transformed version.
>
> Maybe ifcvt can produce a masked store version if this pattern ...
doing so during ifcvt forces you to commit to a masked operation. So you loose
the ability to not vectorize for non-fully masked architectures.
So it's too early. A vector pattern doesn't have this problem. This question
was mostly to what degree the vectorizer has support for MASK_STORE as an
input. vect_get_vector_types_for_stmt seems to have support for it so it looks
like it may work.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
` (3 preceding siblings ...)
2024-06-18 3:30 ` tnfchris at gcc dot gnu.org
@ 2024-06-18 6:44 ` rguenth at gcc dot gnu.org
2024-06-25 5:29 ` tnfchris at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-06-18 6:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
AVX512 produces
.L3:
vmovdqu8 (%rsi), %zmm9{%k1}
kshiftrq $32, %k1, %k5
kshiftrq $48, %k1, %k4
movl %r9d, %eax
vmovdqu32 128(%rcx), %zmm7{%k5}
subl %esi, %eax
movl $64, %edi
vmovdqu32 128(%rdx), %zmm3{%k5}
kshiftrq $16, %k1, %k6
addl %r10d, %eax
vmovdqu32 192(%rcx), %zmm8{%k4}
cmpl %edi, %eax
vmovdqu32 192(%rdx), %zmm4{%k4}
cmova %edi, %eax
addq $64, %rsi
addq $256, %rcx
vmovdqu32 -256(%rcx), %zmm5{%k1}
vmovdqu32 (%rdx), %zmm1{%k1}
vmovdqu32 -192(%rcx), %zmm6{%k6}
vmovdqu32 64(%rdx), %zmm2{%k6}
vpcmpb $4, %zmm14, %zmm9, %k2
kshiftrq $32, %k2, %k3
vpblendmd %zmm7, %zmm3, %zmm10{%k3}
kshiftrd $16, %k3, %k3
vpblendmd %zmm8, %zmm4, %zmm0{%k3}
vpblendmd %zmm5, %zmm1, %zmm12{%k2}
vmovdqu32 %zmm10, 128(%rdx){%k5}
kshiftrd $16, %k2, %k2
vmovdqu32 %zmm0, 192(%rdx){%k4}
vpblendmd %zmm6, %zmm2, %zmm11{%k2}
vpbroadcastb %eax, %zmm0
movl %r9d, %eax
vmovdqu32 %zmm12, (%rdx){%k1}
subl %esi, %eax
addl %r8d, %eax
vmovdqu32 %zmm11, 64(%rdx){%k6}
addq $256, %rdx
vpcmpub $6, %zmm13, %zmm0, %k1
cmpl $64, %eax
ja .L3
The vectorizer sees
<bb 4> [local count: 955630225]:
# i_26 = PHI <i_23(9), 0(16)>
_1 = (long unsigned int) i_26;
_2 = _1 * 4;
_3 = c_17(D) + _2;
res_18 = *_3;
_4 = stride_14(D) + i_26;
_5 = (long unsigned int) _4;
_6 = _5 * 4;
_7 = b_19(D) + _6;
t_20 = *_7;
_8 = a_21(D) + _1;
_9 = *_8;
_34 = _9 != 0;
res_11 = _34 ? t_20 : res_18;
*_3 = res_11;
i_23 = i_26 + 1;
if (n_16(D) > i_23)
I believe that to get proper vectorizer costing we want to have an
optimization phase that can take into account whether we use a masked
loop or not. Note that your intended transform relies on identifying
the open-coded "conditional store"
int res = c[i];
if (a[i] != 0)
res = t;
c[i] = res;
As Andrew says when that's a .MASK_STORE it's going to be easier to
identify the opportunity. So yeah, if-conversion could recognize
this pattern and produce a .MASK_STORE from it as a first step.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
` (4 preceding siblings ...)
2024-06-18 6:44 ` rguenth at gcc dot gnu.org
@ 2024-06-25 5:29 ` tnfchris at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-25 5:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org
Status|NEW |ASSIGNED
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-06-25 5:29 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-18 3:14 [Bug tree-optimization/115531] New: vectorizer generates inefficient code for masked conditional update loops tnfchris at gcc dot gnu.org
2024-06-18 3:18 ` [Bug tree-optimization/115531] " pinskia at gcc dot gnu.org
2024-06-18 3:18 ` pinskia at gcc dot gnu.org
2024-06-18 3:22 ` pinskia at gcc dot gnu.org
2024-06-18 3:30 ` tnfchris at gcc dot gnu.org
2024-06-18 6:44 ` rguenth at gcc dot gnu.org
2024-06-25 5:29 ` tnfchris at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).