public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
From: nathanael.schaeffer at gmail dot com @ 2024-02-25 23:40 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Bug ID: 114107
Summary: poor vectorization at -O3 when dealing with arrays of
different multiplicity, good with -O2
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: nathanael.schaeffer at gmail dot com
Target Milestone: ---
A simple loop multiplying two arrays with different multiplicity fails to
vectorize efficiently with -O3.
Target is AVX x86_64.
The loop is the following, where 4 consecutive values in data are multiplied by
the same factor:
    for (int i=0; i<n; i++) {
        for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }
See the very poor assembly generated with -O3 on godbolt, whereas gcc 12.1+
with -O2 generates the correct solution, a simple vbroadcastsd:
https://godbolt.org/z/fWj34bbhq
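For readers without access to the godbolt link, here is a minimal self-contained sketch of the loop. It assumes the `rescale_x4` signature that appears later in this thread; `rescale_x4_check` is a hypothetical helper added only to validate results and is not part of the original testcase.

```c
/* Testcase from the report: each factor[i] scales 4 consecutive doubles. */
void rescale_x4(double *__restrict data, const double *__restrict factor, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < 4; k++)
            data[4 * i + k] *= factor[i];
    }
}

/* Hypothetical helper: returns 1 if out[j] == in[j] * factor[j / 4]
   for all 4*n elements, 0 otherwise. */
int rescale_x4_check(const double *in, const double *out,
                     const double *factor, int n)
{
    for (int j = 0; j < 4 * n; j++) {
        if (out[j] != in[j] * factor[j / 4])
            return 0;
    }
    return 1;
}
```

Compiling `rescale_x4` with the flag combinations discussed in this thread and inspecting the assembly should show the difference being reported.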
From: pinskia at gcc dot gnu.org @ 2024-02-25 23:46 UTC
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 57534
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57534&action=edit
Full testcase
`-O3 -march=skylake`
From: pinskia at gcc dot gnu.org @ 2024-02-25 23:56 UTC
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not 100% sure that is always better.
What is happening is that GCC is vectorizing the outer loop as well.
This is easier to see in the aarch64 asm:
.L4:
        ldr     q27, [x3], 16
        ld4     {v28.2d - v31.2d}, [x4]
        fmul    v24.2d, v27.2d, v28.2d
        fmul    v25.2d, v27.2d, v29.2d
        fmul    v26.2d, v27.2d, v30.2d
        fmul    v27.2d, v27.2d, v31.2d
        st4     {v24.2d - v27.2d}, [x4], 64
        cmp     x3, x5
        bne     .L4
Have you benchmarked both?
If anything this is a cost model issue.
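To make the outer-loop strategy concrete, here is a scalar C emulation of what the aarch64 loop above appears to do with 2-lane vectors. This is only an illustrative sketch: the function name `rescale_x4_outer` and the lane arrays are hypothetical, and the mapping to `ldr`/`ld4`/`fmul`/`st4` in the comments is my reading of the listing, not compiler output.

```c
/* Scalar emulation of outer-loop vectorization with 2 lanes.
   Lane l handles outer iteration i + l. */
void rescale_x4_outer(double *__restrict data, const double *__restrict factor, int n)
{
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        double f[2];     /* like ldr q27: {factor[i], factor[i+1]} */
        double v[4][2];  /* like ld4: v[k][l] = data[4*(i+l)+k] */

        for (int l = 0; l < 2; l++) {
            f[l] = factor[i + l];
            for (int k = 0; k < 4; k++)
                v[k][l] = data[4 * (i + l) + k];   /* de-interleave */
        }
        for (int k = 0; k < 4; k++)                /* the four fmul */
            for (int l = 0; l < 2; l++)
                v[k][l] *= f[l];
        for (int l = 0; l < 2; l++)                /* like st4: re-interleave */
            for (int k = 0; k < 4; k++)
                data[4 * (i + l) + k] = v[k][l];
    }
    for (; i < n; i++)                             /* scalar tail for odd n */
        for (int k = 0; k < 4; k++)
            data[4 * i + k] *= factor[i];
}
```

On aarch64 the de-interleave and re-interleave steps are single `ld4`/`st4` instructions. A target without such instructions has to build them out of permutes, which is where the cost question comes from.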
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 0:12 UTC
--- Comment #3 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I have not benchmarked.
For the 4 vmulpd doing the actual work, there are more than 40 permute/mov
instructions, including 24 vpermd instructions, each with a 3-cycle latency.
That is 6 vpermd per vmulpd.
There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10
times slower than the vbroadcastsd loop.
If you want, I can benchmark it tomorrow.
If this is a cost model problem, it is a bad one. Even ignoring the decoding of
all these instructions, how can adding 6 vpermd to each vmulpd be faster?
I would rather think (hope?) the optimizer does not consider the vbroadcastsd
solution at all.
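For comparison, the broadcast-based form can be written explicitly with GCC's generic vector extensions. This is a hedged sketch rather than what the compiler does internally: the type `v4df` and the function `rescale_x4_splat` are names chosen here for illustration, and whether the splat is actually emitted as vbroadcastsd depends on the target flags and compiler version.

```c
#include <string.h>

/* 4 x double vector type using the GCC/Clang vector_size extension. */
typedef double v4df __attribute__((vector_size(32)));

void rescale_x4_splat(double *__restrict data, const double *__restrict factor, int n)
{
    for (int i = 0; i < n; i++) {
        double f = factor[i];
        v4df vf = {f, f, f, f};                 /* splat of one scalar */
        v4df vd;

        memcpy(&vd, data + 4 * i, sizeof vd);   /* unaligned vector load */
        vd *= vf;                               /* one vector multiply */
        memcpy(data + 4 * i, &vd, sizeof vd);   /* unaligned vector store */
    }
}
```

The `memcpy` calls are there to avoid assuming 32-byte alignment of `data`. Only one element of `factor` is read per iteration, so no permutation of `data` is needed.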
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 0:13 UTC
--- Comment #4 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
... and thank you for your quick reply!
From: pinskia at gcc dot gnu.org @ 2024-02-26 0:27 UTC
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #3)
> If this is a cost model problem, it is a bad one.
It is almost definitely a cost model issue in the x86_64 backend, because when
I tried aarch64 with -march=armv9-a+sve we get only the vectorization of the
inner loop for both -O2 and -O3:
```
.L3:
        ldp     q29, q30, [x0]
        ld1r    {v31.2d}, [x1], 8
        fmul    v30.2d, v30.2d, v31.2d
        fmul    v29.2d, v29.2d, v31.2d
        stp     q29, q30, [x0], 32
        cmp     x2, x1
        bne     .L3
```
With the default generic armv8-a cost model we get the ld4 there and
vectorizing the outer loop.
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 0:34 UTC
--- Comment #6 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
Indeed, the aarch64 assembly looks very good.
From: liuhongt at gcc dot gnu.org @ 2024-02-26 2:51 UTC
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org
--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
perm_cost is very low in the x86 backend, which may be OK for 128-bit vectors,
since pshufb/shufps are available for most cases.
But for 256/512-bit vectors, when the permutation is cross-lane, the cost could
be higher. One solution is to increase perm_cost when the vector size is more
than 128 bits, since vperm is most likely used instead of
vblend/vpblend/vpshuf/vshuf.
From: liuhongt at gcc dot gnu.org @ 2024-02-26 3:28 UTC
--- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #7)
> perm_cost is very low in the x86 backend, which may be OK for 128-bit
> vectors, since pshufb/shufps are available for most cases.
> But for 256/512-bit vectors, when the permutation is cross-lane, the cost
> could be higher. One solution is to increase perm_cost when the vector size
> is more than 128 bits, since vperm is most likely used instead of
> vblend/vpblend/vpshuf/vshuf.
Furthermore, if we can get indices in the backend when calculating vec_perm
cost, we can check if the permutation is cross-lane or not, and set cost more
accurately for 256/512-bit vector permutation.
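The cross-lane test mentioned here could be sketched roughly as follows. This is a hypothetical standalone helper, not GCC code and not the backend's actual costing interface. It assumes a single-input permutation described by `sel[i]`, the source element index for destination element `i`, and an element size that divides 16 bytes.

```c
/* Hypothetical helper: returns 1 if the permutation moves any element
   across a 128-bit lane boundary, 0 if every element stays in its lane. */
int perm_crosses_lane(const unsigned *sel, unsigned nelts, unsigned elem_bytes)
{
    unsigned per_lane = 16 / elem_bytes;   /* elements per 128-bit lane */

    for (unsigned i = 0; i < nelts; i++) {
        /* Compare the lane of the source with the lane of the destination. */
        if (sel[i] / per_lane != i / per_lane)
            return 1;
    }
    return 0;
}
```

Under this test a splat such as `{0, 0, 0, 0}` on 4 doubles counts as cross-lane even though x86 has a dedicated broadcast for it, so a real cost function would presumably want to special-case such patterns.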
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 7:42 UTC
--- Comment #9 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
In addition, optimizing for size with -Os leads to a non-vectorized double-loop
(51 bytes) while the vectorized loop with vbroadcastsd (produced by clang -Os)
leads to 40 bytes.
It is thus also a missed optimization for -Os.
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 7:49 UTC
--- Comment #10 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal
code using vbroadcastsd with the following options:
-O2 -march=skylake -ftree-vectorize
From: liuhongt at gcc dot gnu.org @ 2024-02-26 7:54 UTC
--- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #9)
> In addition, optimizing for size with -Os leads to a non-vectorized
> double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced
> by clang -Os) leads to 40 bytes.
> It is thus also a missed optimization for -Os.
Vectorization is enabled at -O2 but not at -Os.
From: nathanael.schaeffer at gmail dot com @ 2024-02-26 8:13 UTC
--- Comment #12 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I found the "offending" option, and it does indeed seem to be a cost-model
problem, as Andrew Pinski said.
Good code is generated by:
gcc -O2 -ftree-vectorize -march=skylake (since gcc 6.1)
gcc -O1 -ftree-vectorize -march=skylake (since gcc 8.1)
gcc -O3 -fvect-cost-model=very-cheap -march=skylake (with gcc 13.1+)
Bad code is generated otherwise, in particular by:
gcc -O2 -march=skylake (does not vectorize)
gcc -O3 -march=skylake (bad vectorization with so many permutations)
From: rguenth at gcc dot gnu.org @ 2024-02-26 9:10 UTC
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947
          Component|target                      |tree-optimization
   Last reconfirmed|                            |2024-02-26
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that we fail to SLP vectorize this (at -O3 we unroll the inner loop):
t.c:4:20: note: ==> examining statement: _34 = *_33;
t.c:4:20: missed: peeling for gaps insufficient for access
t.c:5:51: missed: not vectorized: relevant stmt not supported: _34 = *_33;
t.c:4:20: note: removing SLP instance operations starting from: *_29 = _35;
t.c:4:20: missed: unsupported SLP instances
which is because 'factor[i]' is treated as vector load
t.c:4:20: note: node 0x687f730 (max_nunits=4, refcnt=2) const vector(4) double
t.c:4:20: note: op template: _34 = *_33;
t.c:4:20: note: stmt 0 _34 = *_33;
t.c:4:20: note: stmt 1 _34 = *_33;
t.c:4:20: note: stmt 2 _34 = *_33;
t.c:4:20: note: stmt 3 _34 = *_33;
t.c:4:20: note: load permutation { 0 0 0 0 }
and we don't anticipate that we can do this with a load-and-splat (I'm not sure
we would even end up doing that).
I think we might have a duplicate bug report for this issue.
Note with GCC 13 we refuse to SLP because
t.c:4:20: missed: Build SLP failed: not grouped load _35 = *_34;
You can help GCC by doing
void rescale_x4(double* __restrict data, const double * __restrict factor, int n)
{
    for (int i=0; i<n; i++) {
#pragma GCC unroll 0
        for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }
}
which will get you
rescale_x4:
.LFB0:
        .cfi_startproc
        testl   %edx, %edx
        jle     .L5
        movslq  %edx, %rdx
        salq    $5, %rdx
        leaq    (%rdi,%rdx), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        vbroadcastsd    (%rsi), %ymm0
        addq    $32, %rdi
        addq    $8, %rsi
        vmulpd  -32(%rdi), %ymm0, %ymm0
        vmovupd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
        vzeroupper
.L5:
        ret
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
From: rguenth at gcc dot gnu.org @ 2024-02-26 14:40 UTC
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|NEW                           |ASSIGNED
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Mine.
From: cvs-commit at gcc dot gnu.org @ 2024-06-13 6:22 UTC
--- Comment #15 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:1fe55a1794863b5ad9eeca5062782834716016b2
commit r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2
Author: Richard Biener <rguenther@suse.de>
Date: Fri Jun 7 11:29:05 2024 +0200
tree-optimization/114107 - avoid peeling for gaps in more cases
The following refactors the code to detect necessary peeling for
gaps, in particular the PR103116 case when there is no gap but
the group size is smaller than the vector size. The testcase in
PR114107 shows we fail to SLP
for (int i=0; i<n; i++)
for (int k=0; k<4; k++)
data[4*i+k] *= factor[i];
because peeling one scalar iteration isn't enough to cover a gap
of 3 elements of factor[i]. But the code detecting this is placed
after the logic that detects cases we handle properly already as
we'd code generate { factor[i], 0., 0., 0. } for V4DFmode vectorization
already. In fact the check to detect when peeling a single iteration
isn't enough seems improperly guarded as it should apply to all cases.
I'm not sure we correctly handle VMAT_CONTIGUOUS_REVERSE but I
checked that VMAT_STRIDED_SLP and VMAT_ELEMENTWISE correctly avoid
touching excess elements.
With this change we can use SLP for the above testcase and the
PR103116 testcases no longer require an epilogue on x86-64. It
might be different on other targets so I made those testcases
runtime FAIL only instead of relying on dump scanning there's
currently no easy way to properly constrain.
PR tree-optimization/114107
PR tree-optimization/110445
* tree-vect-stmts.cc (get_group_load_store_type): Refactor
contiguous access case. Make sure peeling for gap constraints
are always tested and consistently relax when we know we can
avoid touching excess elements during code generation. But
rewrite the check poly-int aware.
* gcc.dg/vect/pr114107.c: New testcase.
* gcc.dg/vect/pr103116-1.c: Adjust.
* gcc.dg/vect/pr103116-2.c: Likewise.
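The "gap of 3 elements" argument in the commit message can be illustrated with a deliberately simplified model. This is not GCC's actual check in get_group_load_store_type; the function `excess_elements` and its parameters are hypothetical. It assumes a unit-stride load of `nunits` elements starting at element `i`, which touches elements `i` through `i + nunits - 1`, and a vector loop whose last iteration is `i = n - 1 - peeled`.

```c
/* Simplified model: number of elements past the end of an n-element array
   that the last full-vector load would touch, given `peeled` scalar
   iterations left for the epilogue. The result does not depend on n. */
int excess_elements(int nunits, int peeled)
{
    int excess = nunits - 1 - peeled;
    return excess > 0 ? excess : 0;
}
```

For V4DFmode (`nunits` = 4), peeling a single scalar iteration still leaves 2 elements of overrun in this model, which matches the observation that one peeled iteration is not enough. Loading only `factor[i]`, as in the `{ factor[i], 0., 0., 0. }` form the commit message mentions, avoids touching the excess elements altogether.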
From: rguenth at gcc dot gnu.org @ 2024-06-13 7:13 UTC
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
   Target Milestone|---                         |15.0
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is fixed now, we produce
.L3:
        vbroadcastsd    (%rsi,%rax), %ymm0
        vmulpd  (%rdi,%rax,4), %ymm0, %ymm0
        vmovupd %ymm0, (%rdi,%rax,4)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L3