public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc
@ 2021-03-05 14:54 hubicka at gcc dot gnu.org
2021-03-08 8:32 ` [Bug tree-optimization/99412] " rguenth at gcc dot gnu.org
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-05 14:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Bug ID: 99412
Summary: s352 benchmark of TSVC is vectorized by clang and not
by gcc
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
typedef float real_t;
#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D];
int main ()
{
    // loop rerolling
    // unrolled dot product
    real_t dot;
    for (int nl = 0; nl < 8*iterations; nl++) {
        dot = (real_t)0.;
        for (int i = 0; i < LEN_1D; i += 5) {
            dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2]
                * b[i + 2] + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
        }
    }
    return dot;
}
clang does:
main: # @main
        .cfi_startproc
# %bb.0:
        xorl %eax, %eax
        .p2align 4, 0x90
.LBB0_1: # =>This Loop Header: Depth=1
         # Child Loop BB0_2 Depth 2
        vxorps %xmm0, %xmm0, %xmm0
        movq $-5, %rcx
        .p2align 4, 0x90
.LBB0_2: # Parent Loop BB0_1 Depth=1
         # => This Inner Loop Header: Depth=2
        vmovups b+20(,%rcx,4), %xmm1
        vmovss b+36(,%rcx,4), %xmm2 # xmm2 = mem[0],zero,zero,zero
        vmulps a+20(,%rcx,4), %xmm1, %xmm1
        vpermilpd $1, %xmm1, %xmm3 # xmm3 = xmm1[1,0]
        vaddps %xmm3, %xmm1, %xmm1
        vmovshdup %xmm1, %xmm3 # xmm3 = xmm1[1,1,3,3]
        vaddss %xmm3, %xmm1, %xmm1
        vfmadd231ss a+36(,%rcx,4), %xmm2, %xmm1 # xmm1 = (xmm2 * mem) + xmm1
        addq $5, %rcx
        vaddss %xmm0, %xmm1, %xmm0
        cmpq $31995, %rcx # imm = 0x7CFB
        jb .LBB0_2
# %bb.3: # in Loop: Header=BB0_1 Depth=1
        incl %eax
        cmpl $800000, %eax # imm = 0xC3500
        jne .LBB0_1
# %bb.4:
        vcvttss2si %xmm0, %eax
        retq
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2021-03-08 8:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|middle-end |tree-optimization
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Blocks| |53947
Keywords| |missed-optimization
Ever confirmed|0 |1
Depends on| |97832
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2021-03-08
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -fno-tree-reassoc we detect the reduction chain and produce
.L3:
        vmovaps b(%rax), %ymm5
        vmovaps b+32(%rax), %ymm6
        addq $160, %rax
        vfmadd231ps a-160(%rax), %ymm5, %ymm1
        vmovaps b-96(%rax), %ymm7
        vfmadd231ps a-128(%rax), %ymm6, %ymm0
        vmovaps b-64(%rax), %ymm5
        vmovaps b-32(%rax), %ymm6
        vfmadd231ps a-96(%rax), %ymm7, %ymm2
        vfmadd231ps a-64(%rax), %ymm5, %ymm3
        vfmadd231ps a-32(%rax), %ymm6, %ymm4
        cmpq $128000, %rax
        jne .L3
        vaddps %ymm1, %ymm0, %ymm0
        vaddps %ymm2, %ymm0, %ymm0
        vaddps %ymm3, %ymm0, %ymm0
        vaddps %ymm4, %ymm0, %ymm0
        vextractf128 $0x1, %ymm0, %xmm1
        vaddps %xmm0, %xmm1, %xmm1
        vmovhlps %xmm1, %xmm1, %xmm0
        vaddps %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm0, %xmm0, %xmm1
        vaddps %xmm0, %xmm1, %xmm0
        decl %edx
        jne .L2
we're not re-rolling and thus are forced to use a VF of 4 here.
Note that LLVM doesn't seem to vectorize the loop but instead vectorizes
the basic block, which isn't what TSVC looks for (but that would work for
non-fast-math).
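
For readers unfamiliar with the term, "re-rolling" here means treating the five-way unrolled body as five iterations of a plain one-product loop. The following is only a sketch of that equivalence in C, not anything GCC emits; the helper names (`fill_exact`, `dot_unrolled`, `dot_rerolled`) are made up for illustration, and the input values are chosen as small integers so that every partial sum is exact in float and the comparison does not depend on association order.

```c
#include <stddef.h>

#define LEN_1D 32000

typedef float real_t;

/* Hypothetical helper: small integer values keep every partial sum
   exactly representable in float (the total stays far below 2^24).  */
static void fill_exact (real_t *a, real_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      a[i] = (real_t) (i % 7);
      b[i] = (real_t) (i % 3);
    }
}

/* Inner loop as written in s352: five products per iteration.
   Assumes n is a multiple of 5, as LEN_1D is.  */
static real_t dot_unrolled (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i + 4 < n; i += 5)
    dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2]
          + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
  return dot;
}

/* Re-rolled form: one product per iteration, same left-to-right
   order of additions as the source expression above.  */
static real_t dot_rerolled (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i < n; i++)
    dot += a[i] * b[i];
  return dot;
}
```

Because the source expression is evaluated left to right, the re-rolled loop performs the additions in the same order as the original, which is why a loop vectorizer working on the re-rolled form would not be tied to the group size of 5.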
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
[Bug 97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2021-06-09 12:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Bug 99412 depends on bug 97832, which changed state.
Bug 97832 Summary: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: hubicka at gcc dot gnu.org @ 2023-01-11 19:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This is also seen with zen4 comparing gcc and aocc (about a 2.3 times
difference).
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2023-01-12 9:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #2)
> This is also seen with zen4 comparing gcc and aocc. (about 2.3 times
> difference)
Disabling
@@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
binary op are chosen wisely. */
int len = ops.length ();
if (len >= 3)
swap_ops_for_binary_stmt (ops, len - 3, stmt);
will naturally create the reduction chain (or leave it in place) given the
current rank computation. We do have (somewhat) robust fallback from
reduction chain to reduction (via reduction path support), so I think this
change would be OK.
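
As a rough illustration of what "reduction chain" means here, the sketch below contrasts two associations of the same sum in C. This is not GIMPLE and not actual reassoc output; the function names are hypothetical, and the second form is only one possible association in which the accumulator is no longer threaded through every addition.

```c
#include <stddef.h>

typedef float real_t;

/* Reduction-chain shape: the accumulator flows through every add,
   so each statement in the body is a link in one chain.  */
static real_t dot_chain (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i + 4 < n; i += 5)
    {
      real_t t = dot + a[i] * b[i];
      t = t + a[i + 1] * b[i + 1];
      t = t + a[i + 2] * b[i + 2];
      t = t + a[i + 3] * b[i + 3];
      dot = t + a[i + 4] * b[i + 4];
    }
  return dot;
}

/* Illustrative reassociated shape (only valid under -ffast-math style
   rules): the products are summed first and the accumulator is added
   once at the end, so there is a single reduction statement instead
   of a five-link chain.  */
static real_t dot_accum_last (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i + 4 < n; i += 5)
    {
      real_t s = a[i] * b[i] + a[i + 1] * b[i + 1];
      s = s + a[i + 2] * b[i + 2];
      s = s + a[i + 3] * b[i + 3];
      s = s + a[i + 4] * b[i + 4];
      dot = s + dot;
    }
  return dot;
}
```

Both functions compute the same mathematical sum; they can differ in the last bits for general float inputs, which is why the second association is only permitted with -ffast-math.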
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2023-01-12 9:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> (In reply to Jan Hubicka from comment #2)
> > This is also seen with zen4 comparing gcc and aocc. (about 2.3 times
> > difference)
>
> Disabling
>
> @@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
> binary op are chosen wisely. */
> int len = ops.length ();
> if (len >= 3)
> swap_ops_for_binary_stmt (ops, len - 3, stmt);
>
> will naturally create the reduction chain (or leave it in place) given the
> current rank computation. We do have (somewhat) robust fallback from
> reduction chain to reduction (via reduction path support), so I think this
> change would be OK.
The code originated from r0-111616-gdf7b0cc4aae062, the reassoc rewrite by
Jeff back in 2005 for GCC 4.3 (or 4.2, don't have that tree around anymore).
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: cvs-commit at gcc dot gnu.org @ 2023-01-12 13:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:b073f2b098ba7819450d6c14a0fb96cb1c09f242
commit r13-5122-gb073f2b098ba7819450d6c14a0fb96cb1c09f242
Author: Richard Biener <rguenther@suse.de>
Date: Thu Jan 12 11:18:22 2023 +0100
tree-optimization/99412 - reassoc and reduction chains
With -ffast-math we end up associating reduction chains and break
them - this is because of old code that tries to rectify reductions
into a shape liked by the vectorizer. Nowadays the rank compute
produces correct association for reduction chains and the vectorizer
has robust support to fall back to a regular reduction (via
reduction path) when it turns out to be not a proper reduction chain.
So this patch removes the special code in reassoc which makes
the TSVC s352 vectorized with -Ofast (it is already without
-ffast-math).
PR tree-optimization/99412
* tree-ssa-reassoc.cc (is_phi_for_stmt): Remove.
(swap_ops_for_binary_stmt): Remove reduction handling.
(rewrite_expr_tree_parallel): Adjust.
(reassociate_bb): Likewise.
* tree-parloops.cc (build_new_reduction): Handle MINUS_EXPR.
* gcc.dg/vect/pr99412.c: New testcase.
* gcc.dg/tree-ssa/reassoc-47.c: Adjust comment.
* gcc.dg/tree-ssa/reassoc-48.c: Remove.
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2023-01-12 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC now does
.L2:
        vmovaps b(%rax), %xmm6
        vmulps a(%rax), %xmm6, %xmm0
        addq $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps %xmm0, %xmm5, %xmm5
        vmulps a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps %xmm0, %xmm1, %xmm1
        vmulps a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps %xmm0, %xmm4, %xmm4
        vmulps a-32(%rax), %xmm7, %xmm0
        vaddps %xmm0, %xmm2, %xmm2
        vmulps a-16(%rax), %xmm6, %xmm0
        vaddps %xmm0, %xmm3, %xmm3
        cmpq $128000, %rax
        jne .L2
thus uses a VF of 4 with -Ofast. Your LLVM snippet uses 5 lanes
instead of our 20, with four lanes in a V4SF and one lane in a scalar.
That's interesting but not something we support.
Re-rolling would mean using a single V4SF four-lane vector here. For
a pure SLP loop something like this should be possible without too
much hassle I think. We'd just need to try ... (and think about whether
it's worth it in real life).
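
To make the lane counts concrete, here is a scalar C sketch of the two shapes discussed above, written from a reading of the two assembly listings in this thread rather than taken from either compiler. The function names are hypothetical, the "vectors" are modelled as small arrays, and `dot_gcc_shape` assumes the length is a multiple of 20 (as LEN_1D is).

```c
#include <stddef.h>

typedef float real_t;

/* Scalar reference: plain dot product.  */
static real_t dot_scalar (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i < n; i++)
    dot += a[i] * b[i];
  return dot;
}

/* Shape of the GCC loop: five 4-lane accumulators, i.e. 20 lanes
   covering four source iterations (VF 4 times group size 5).  */
static real_t dot_gcc_shape (const real_t *a, const real_t *b, size_t n)
{
  real_t acc[5][4] = {{0}};
  for (size_t i = 0; i + 19 < n; i += 20)
    for (int v = 0; v < 5; v++)
      for (int l = 0; l < 4; l++)
        acc[v][l] += a[i + 4 * v + l] * b[i + 4 * v + l];

  /* Epilogue: reduce the 20 lanes to one scalar after the loop.  */
  real_t dot = (real_t) 0.;
  for (int v = 0; v < 5; v++)
    for (int l = 0; l < 4; l++)
      dot += acc[v][l];
  return dot;
}

/* Shape of the clang loop: per source iteration, four lanes in one
   vector plus one scalar lane, reduced inside the loop body.  */
static real_t dot_clang_shape (const real_t *a, const real_t *b, size_t n)
{
  real_t dot = (real_t) 0.;
  for (size_t i = 0; i + 4 < n; i += 5)
    {
      real_t p[4];
      for (int l = 0; l < 4; l++)
        p[l] = a[i + l] * b[i + l];                 /* vmulps */
      real_t h = (p[0] + p[2]) + (p[1] + p[3]);     /* horizontal add */
      h = a[i + 4] * b[i + 4] + h;                  /* scalar FMA lane */
      dot = h + dot;                                /* vaddss into dot */
    }
  return dot;
}
```

The difference in where the horizontal reduction happens is the main point: the 20-lane form keeps it out of the loop, while the 4+1 form pays for it in every iteration.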
For the purpose of the Summary this is fixed now.
* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2023-01-12 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |13.0