public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: pinskia at gcc dot gnu.org @ 2021-08-25 0:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2021-08-25
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Hmm, on aarch64 we do a decent job at vectorizing this (since GCC 11):
ldp d4, d0, [x1]
ldr d7, [x0, 16]
ldp d6, d5, [x0]
fmul v3.2s, v0.2s, v7.s[1]
ldr d1, [x1, 16]
fmul v2.2s, v0.2s, v6.s[1]
fmul v0.2s, v0.2s, v5.s[1]
fmla v3.2s, v4.2s, v7.s[0]
fmla v2.2s, v4.2s, v6.s[0]
fmla v0.2s, v4.2s, v5.s[0]
fadd v1.2s, v1.2s, v3.2s
stp d2, d0, [x8]
str d1, [x8, 16]
I suspect this is because V2SF does not exist on x86_64.
Using -Dfloat=double seems to give better code for x86_64 (with -mavx2):
vmovupd (%rdx), %ymm0
vpermilpd $0, (%rsi), %ymm1
movq %rdi, %rax
vmovsd 32(%rsi), %xmm5
vmovsd 40(%rsi), %xmm4
vpermpd $68, %ymm0, %ymm2
vpermpd $238, %ymm0, %ymm3
vmulpd %ymm2, %ymm1, %ymm2
vpermilpd $15, (%rsi), %ymm1
vmulpd %ymm3, %ymm1, %ymm1
vaddpd %ymm1, %ymm2, %ymm1
vmulsd %xmm5, %xmm0, %xmm2
vmovupd %ymm1, (%rdi)
vmovapd %xmm0, %xmm1
vextractf128 $0x1, %ymm0, %xmm0
vmulsd %xmm4, %xmm0, %xmm3
vunpckhpd %xmm1, %xmm1, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vmulsd %xmm5, %xmm1, %xmm1
vmulsd %xmm4, %xmm0, %xmm0
vaddsd %xmm3, %xmm2, %xmm2
vaddsd 32(%rdx), %xmm2, %xmm2
vaddsd %xmm0, %xmm1, %xmm1
vaddsd 40(%rdx), %xmm1, %xmm1
vmovsd %xmm2, 32(%rdi)
vmovsd %xmm1, 40(%rdi)
From: rguenth at gcc dot gnu.org @ 2021-08-25 7:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
x86 actually does have V2SF. The issue is that there is an opportunity for
V4SF vectorization and one for V2SF, both arriving at the same load groups,
and that causes a conflict (there are other PRs about this general issue),
so we kill one part:
t.C:18:12: missed: desired vector type conflicts with earlier one for _2 = b_35(D)->m11;
t.C:18:12: note: removing SLP instance operations starting from: <retval>.dx = _27;
We also have a bunch of live lanes off the remaining vectorized piece, which
makes the code a bit awkward.
Unfortunately we have no way to force 64-bit vectors (V2SF) here to see whether
splitting up the V4SFmode partition would help (I guess it would, as can be
seen from using 'double').
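The testcase itself is not quoted in this thread. The following is a minimal sketch of a kernel with the shape the diagnostics above suggest, assuming a six-float struct. Only `m11` and `dx` are named in the thread; the other field names and the function name `multiply` are assumptions made for illustration. The four stores to the 2x2 part form a group that fits one V4SF operation, the two stores to the translation part form a group that fits V2SF, and both groups read the same fields of `b`.

```cpp
// Hypothetical reconstruction of the kernel under discussion. Only the
// fields m11 and dx appear in the thread; every other name is an assumption.
struct matrix {
    float m11, m12, m21, m22, dx, dy;
};

matrix multiply(const matrix &a, const matrix &b) {
    matrix r;
    // Group 1: four adjacent stores, a candidate for one V4SF operation.
    r.m11 = a.m11 * b.m11 + a.m12 * b.m21;
    r.m12 = a.m11 * b.m12 + a.m12 * b.m22;
    r.m21 = a.m21 * b.m11 + a.m22 * b.m21;
    r.m22 = a.m21 * b.m12 + a.m22 * b.m22;
    // Group 2: two adjacent stores, a candidate for one V2SF operation.
    // It reads b.m11 .. b.m22 too, so both groups share the loads from b.
    r.dx = a.dx * b.m11 + a.dy * b.m21 + b.dx;
    r.dy = a.dx * b.m12 + a.dy * b.m22 + b.dy;
    return r;
}
```

Compiling such a file with `-O3 -fopt-info-vec-missed` is one way to see whether a given GCC version reports a vector type conflict for it.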
From: rguenth at gcc dot gnu.org @ 2021-09-20 11:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have a patch that produces
vect__1.5_42 = MEM <const vector(4) float> [(float *)a_34(D)];
vect__1.7_47 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 0, 0, 2, 2 }>;
vect__2.10_49 = MEM <const vector(4) float> [(float *)b_35(D)];
vect__2.12_53 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 0, 1, 0, 1 }>;
vect__3.13_54 = vect__1.7_47 * vect__2.12_53;
vect__2.30_73 = MEM <const vector(2) float> [(float *)b_35(D)];
vect__1.18_61 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 1, 1, 3, 3 }>;
vect__2.23_68 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 2, 3, 2, 3 }>;
vect__6.24_69 = vect__1.18_61 * vect__2.23_68;
vect__7.25_70 = vect__3.13_54 + vect__6.24_69;
vect__5.40_85 = MEM <const vector(2) float> [(float *)b_35(D) + 8B];
MEM <vector(4) float> [(float *)&<retval>] = vect__7.25_70;
vect__21.35_81 = MEM <const vector(2) float> [(float *)a_34(D) + 16B];
vect__1.36_82 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 0, 0 }>;
vect__22.37_83 = vect__2.30_73 * vect__1.36_82;
vect__1.46_94 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 1, 1 }>;
vect__24.47_95 = vect__5.40_85 * vect__1.46_94;
vect__25.48_96 = vect__22.37_83 + vect__24.47_95;
vect__26.51_98 = MEM <const vector(2) float> [(float *)b_35(D) + 16B];
vect__27.52_100 = vect__25.48_96 + vect__26.51_98;
MEM <vector(2) float> [(float *)&<retval> + 16B] = vect__27.52_100;
That means it ends up with some odd vector loads, but with SSE 4.2 it becomes:
movups (%rsi), %xmm5
movups (%rdx), %xmm1
movq %rdi, %rax
movq (%rdx), %xmm4
movq 8(%rdx), %xmm3
movsldup %xmm5, %xmm0
movaps %xmm1, %xmm2
movlhps %xmm1, %xmm2
shufps $238, %xmm1, %xmm1
mulps %xmm0, %xmm2
movshdup %xmm5, %xmm0
mulps %xmm1, %xmm0
movq 16(%rsi), %xmm1
addps %xmm2, %xmm0
movups %xmm0, (%rdi)
movsldup %xmm1, %xmm0
movshdup %xmm1, %xmm1
mulps %xmm4, %xmm0
mulps %xmm3, %xmm1
addps %xmm1, %xmm0
movq 16(%rdx), %xmm1
addps %xmm1, %xmm0
movlps %xmm0, 16(%rdi)
Alternatively, -mavx can do some of the required permutes as part of the
loads, and with -mfma we can use an FMA as well:
vpermilps $238, (%rdx), %xmm1
vpermilps $245, (%rsi), %xmm0
movq %rdi, %rax
vpermilps $160, (%rsi), %xmm3
vpermilps $68, (%rdx), %xmm4
vmulps %xmm1, %xmm0, %xmm0
vmovq (%rdx), %xmm2
vfmadd231ps %xmm4, %xmm3, %xmm0
vmovq 8(%rdx), %xmm3
vmovups %xmm0, (%rdi)
vmovq 16(%rsi), %xmm0
vmovsldup %xmm0, %xmm1
vmovshdup %xmm0, %xmm0
vmulps %xmm3, %xmm0, %xmm0
vfmadd132ps %xmm1, %xmm0, %xmm2
vmovq 16(%rdx), %xmm0
vaddps %xmm2, %xmm0, %xmm0
vmovlps %xmm0, 16(%rdi)
I'm not sure whether the vmovups + vmovs{l,h}dup sequence is any better than
doing two scalar loads + dups, though. It might at least avoid some
store-to-load forwarding (STLF) conflict with earlier smaller stores.
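To make the data flow of the GIMPLE in this comment easier to follow, here is a minimal sketch that models each vector statement with plain arrays. The helpers `load`, `perm`, `mul`, `add` and the function `multiply_slp` are hypothetical names, not GCC or intrinsic APIs, and the layout of six consecutive floats per operand is an assumption taken from the `+ 8B` and `+ 16B` offsets in the dump.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for "vector(N) float" values in the dump.
template <std::size_t N> using vec = std::array<float, N>;
template <std::size_t N> using sel = std::array<std::size_t, N>;

// Models MEM <const vector(N) float>: load N consecutive floats from p.
template <std::size_t N> vec<N> load(const float *p) {
    vec<N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = p[i];
    return r;
}

// Models VEC_PERM_EXPR <v, v, s>: lane i of the result is v[s[i]].
template <std::size_t N> vec<N> perm(const vec<N> &v, const sel<N> &s) {
    vec<N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = v[s[i]];
    return r;
}

template <std::size_t N> vec<N> mul(const vec<N> &x, const vec<N> &y) {
    vec<N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = x[i] * y[i];
    return r;
}

template <std::size_t N> vec<N> add(const vec<N> &x, const vec<N> &y) {
    vec<N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = x[i] + y[i];
    return r;
}

// a, b and out each point at six floats (assumed layout).
void multiply_slp(const float *a, const float *b, float *out) {
    // Four-lane part: two permuted products, summed, stored to out[0..3].
    vec<4> va = load<4>(a);
    vec<4> vb = load<4>(b);
    vec<4> p0 = mul(perm(va, sel<4>{{0, 0, 2, 2}}), perm(vb, sel<4>{{0, 1, 0, 1}}));
    vec<4> p1 = mul(perm(va, sel<4>{{1, 1, 3, 3}}), perm(vb, sel<4>{{2, 3, 2, 3}}));
    vec<4> r4 = add(p0, p1);
    // Two-lane part: reloads halves of b as two-lane vectors (the "odd" loads).
    vec<2> t = load<2>(a + 4);                          // a + 16B
    vec<2> q0 = mul(load<2>(b), perm(t, sel<2>{{0, 0}}));
    vec<2> q1 = mul(load<2>(b + 2), perm(t, sel<2>{{1, 1}}));  // b + 8B
    vec<2> r2 = add(add(q0, q1), load<2>(b + 4));       // b + 16B
    for (std::size_t i = 0; i < 4; ++i) out[i] = r4[i];
    for (std::size_t i = 0; i < 2; ++i) out[4 + i] = r2[i];
}
```

The two-lane loads from `b` overlap the earlier four-lane load of the same memory, which is what the remark about odd vector loads refers to.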
From: cvs-commit at gcc dot gnu.org @ 2021-09-27 8:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426
--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:6390c5047adb75960f86d56582e6322aaa4d9281
commit r12-3893-g6390c5047adb75960f86d56582e6322aaa4d9281
Author: Richard Biener <rguenther@suse.de>
Date: Wed Nov 18 09:36:57 2020 +0100
Allow different vector types for stmt groups
This allows vectorization (in practice non-loop vectorization) to
have a stmt participate in different vector type vectorizations.
It allows us to remove vect_update_shared_vectype and replace it
by pushing/popping STMT_VINFO_VECTYPE from SLP_TREE_VECTYPE around
vect_analyze_stmt and vect_transform_stmt.
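The push/pop idea in the paragraph above can be sketched as a scoped override. This is only an illustration under assumed types: `StmtInfo`, `SlpNode`, `ScopedVectype` and `analyze` are hypothetical names and do not reflect GCC's actual data structures or the code in the patch.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins, not GCC's stmt_vec_info or slp_tree.
struct StmtInfo {
    std::string vectype;        // vector type currently seen on the stmt
};
struct SlpNode {
    StmtInfo *representative;   // stmt shared by several SLP nodes
    std::string vectype;        // vector type chosen for this node
};

// Installs the node's vector type on the shared stmt for the lifetime of
// the guard and restores the previous value afterwards.
class ScopedVectype {
public:
    explicit ScopedVectype(const SlpNode &node)
        : stmt_(*node.representative),
          saved_(std::exchange(stmt_.vectype, node.vectype)) {}
    ~ScopedVectype() { stmt_.vectype = std::move(saved_); }
    ScopedVectype(const ScopedVectype &) = delete;
    ScopedVectype &operator=(const ScopedVectype &) = delete;

private:
    StmtInfo &stmt_;
    std::string saved_;
};

// Returns the vector type the shared stmt carries while this node is visited.
std::string analyze(const SlpNode &node) {
    ScopedVectype guard(node);
    return node.representative->vectype;
}
```

With this shape, the same stmt can be visited once from a V4SF node and once from a V2SF node without either visit leaving its type behind on the stmt.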
For data-ref the situation is a bit more complicated since we
analyze alignment info with a specific vector type in mind which
doesn't play well when that changes.
So the bulk of the change is passing down the actual vector type
used for a vectorized access to the various accessors of alignment
info, first and foremost dr_misalignment but also aligned_access_p,
known_alignment_for_access_p, vect_known_alignment_in_bytes and
vect_supportable_dr_alignment. I took the liberty to replace
ALL_CAPS macro accessors with the lower-case function invocations.
The actual changes to the behavior are in dr_misalignment which now
is the place factoring in the negative step adjustment as well as
handling alignment queries for a vector type with bigger alignment
requirements than what we can (or have) analyze(d).
vect_slp_analyze_node_alignment makes use of this and upon receiving
a vector type with a bigger alignment desire re-analyzes the DR
with respect to it but keeps an older more precise result if possible.
In this context it might be possible to do the analysis just once
but, instead of analyzing with respect to a specific desired alignment,
look for the biggest alignment for which we can compute a known misalignment.
The ChangeLog includes the functional changes but not the bulk due
to the alignment accessor API changes - I hope that's something good.
2021-09-17 Richard Biener <rguenther@suse.de>
PR tree-optimization/97351
PR tree-optimization/97352
PR tree-optimization/82426
* tree-vectorizer.h (dr_misalignment): Add vector type
argument.
(aligned_access_p): Likewise.
(known_alignment_for_access_p): Likewise.
(vect_supportable_dr_alignment): Likewise.
(vect_known_alignment_in_bytes): Likewise. Refactor.
(DR_MISALIGNMENT): Remove.
(vect_update_shared_vectype): Likewise.
* tree-vect-data-refs.c (dr_misalignment): Refactor, handle
a vector type with larger alignment requirement and apply
the negative step adjustment here.
(vect_calculate_target_alignment): Remove.
(vect_compute_data_ref_alignment): Get explicit vector type
argument, do not apply a negative step alignment adjustment
here.
(vect_slp_analyze_node_alignment): Re-analyze alignment
when we re-visit the DR with a bigger desired alignment but
keep more precise results from smaller alignments.
* tree-vect-slp.c (vect_update_shared_vectype): Remove.
(vect_slp_analyze_node_operations_1): Do not update the
shared vector type on stmts.
* tree-vect-stmts.c (vect_analyze_stmt): Push/pop the
vector type of an SLP node to the representative stmt-info.
(vect_transform_stmt): Likewise.
* gcc.target/i386/vect-pr82426.c: New testcase.
* gcc.target/i386/vect-pr97352.c: Likewise.
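As a rough model of what an alignment query parameterized by vector type involves, the sketch below answers a misalignment query for a requested alignment from a single earlier analysis. It is a simplification under stated assumptions and not the code in tree-vect-data-refs.c: `AlignInfo`, `misalignment_for` and `is_aligned_for` are hypothetical names, alignments are assumed to be powers of two in bytes, and reducing the result modulo the requested alignment is a choice made for this sketch.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical record of one alignment analysis for a data reference.
struct AlignInfo {
    std::uint64_t analyzed_alignment;           // bytes, power of two
    std::optional<std::uint64_t> misalignment;  // empty means unknown
};

// Misalignment in bytes of an access at byte_offset from the analyzed
// address, relative to requested_alignment, or nullopt when unknown.
std::optional<std::uint64_t>
misalignment_for(const AlignInfo &info, std::uint64_t requested_alignment,
                 std::uint64_t byte_offset = 0) {
    if (!info.misalignment)
        return std::nullopt;
    // An analysis done for a smaller alignment cannot answer a query for a
    // bigger one; the caller would have to re-analyze.
    if (requested_alignment > info.analyzed_alignment)
        return std::nullopt;
    return (*info.misalignment + byte_offset) % requested_alignment;
}

bool is_aligned_for(const AlignInfo &info, std::uint64_t requested_alignment,
                    std::uint64_t byte_offset = 0) {
    std::optional<std::uint64_t> m =
        misalignment_for(info, requested_alignment, byte_offset);
    return m.has_value() && *m == 0;
}
```

For the accesses in this PR, a 16-byte analysis can serve both the 16-byte (V4SF) access at offset 0 and the 8-byte (V2SF) access at offset 16, while a query for a 32-byte vector type comes back unknown.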
From: rguenth at gcc dot gnu.org @ 2021-09-27 8:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0
             Status|ASSIGNED                    |RESOLVED
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 12.