public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting
@ 2022-02-06 9:13 tnfchris at gcc dot gnu.org
2022-02-06 9:43 ` [Bug tree-optimization/104408] " tnfchris at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-02-06 9:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
Bug ID: 104408
Summary: SLP discovery fails due to -Ofast rewriting
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following testcase:
typedef struct { float r, i; } cf;
void
f (cf *restrict a, cf *restrict b, cf *restrict c, cf *restrict d, cf e)
{
for (int i = 0; i < 100; ++i)
{
b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
}
}
when compiled at -O3 forms an SLP tree but fails at -Ofast because match.pd
rewrites the expression into
b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
and so introduces a different interleaving in the second multiply operation.
It's unclear to me what the gain of actually doing this is as it results in
worse vector and scalar code due to you losing the sharing of the computed
value of the nodes.
Without the rewriting the first code can re-use the load from the first vector
and just reverse the elements:
.L2:
ldr q1, [x3, x0]
ldr q0, [x2, x0]
fsub v0.4s, v0.4s, v1.4s
fmul v1.4s, v2.4s, v0.4s
fmul v0.4s, v3.4s, v0.4s
rev64 v1.4s, v1.4s
fneg v0.2d, v0.2d
fadd v0.4s, v0.4s, v1.4s
str q0, [x1, x0]
add x0, x0, 16
cmp x0, 800
bne .L2
While with the rewrite it forces an increase in VF to be able to handle the
interleaving
.L2:
ld2 {v0.4s - v1.4s}, [x3], 32
ld2 {v4.4s - v5.4s}, [x2], 32
fsub v2.4s, v1.4s, v5.4s
fsub v3.4s, v4.4s, v0.4s
fsub v5.4s, v5.4s, v1.4s
fmul v2.4s, v2.4s, v6.4s
fmul v4.4s, v6.4s, v3.4s
fmla v2.4s, v7.4s, v3.4s
fmla v4.4s, v5.4s, v7.4s
mov v0.16b, v2.16b
mov v1.16b, v4.16b
st2 {v0.4s - v1.4s}, [x1], 32
cmp x5, x1
bne .L2
in scalar you lose the ability to re-use the subtract so you get an extra sub.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
@ 2022-02-06 9:43 ` tnfchris at gcc dot gnu.org
2022-02-06 12:41 ` tnfchris at gcc dot gnu.org
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-02-06 9:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
In particular, the rewrite should probably be gated on the expression being
single use.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
2022-02-06 9:43 ` [Bug tree-optimization/104408] " tnfchris at gcc dot gnu.org
@ 2022-02-06 12:41 ` tnfchris at gcc dot gnu.org
2022-02-07 8:55 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-02-06 12:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Mine for GCC 13.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
2022-02-06 9:43 ` [Bug tree-optimization/104408] " tnfchris at gcc dot gnu.org
2022-02-06 12:41 ` tnfchris at gcc dot gnu.org
@ 2022-02-07 8:55 ` rguenth at gcc dot gnu.org
2022-02-07 8:56 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-07 8:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Blocks| |53947
CC| |rguenth at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
match.pd just does canonicalization here. SLP discovery could handle this
in the existing swap operands or reassoc support but I guess the desire here
is to pull out a Complex SLP pattern.
So what should really be done in the end is get rid of the restriction during
SLP build that a node has to be from a single interleaving chain. We can
handle {d[i].i, c[i].r} as permute of {d[i].r, d[i].i} and {c[i].r, c[i].i}
during discovery. Of course doing that interferes with the swap-operands
logic which when successful will produce a more optimal initial SLP graph
but that relies on the recursive SLP discovery to first fail.
Forming an optimal SLP graph is likely NP complete so we have to employ
heuristics to some extent.
So - no perfect idea yet how to reliably match a Complex pattern here but
trying to attack this from the match.pd side sounds wrong.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
` (2 preceding siblings ...)
2022-02-07 8:55 ` rguenth at gcc dot gnu.org
@ 2022-02-07 8:56 ` rguenth at gcc dot gnu.org
2022-02-07 9:58 ` tnfchris at gcc dot gnu.org
2022-02-07 10:55 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-07 8:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2022-02-07
Ever confirmed|0 |1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
` (3 preceding siblings ...)
2022-02-07 8:56 ` rguenth at gcc dot gnu.org
@ 2022-02-07 9:58 ` tnfchris at gcc dot gnu.org
2022-02-07 10:55 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-02-07 9:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> match.pd just does canonicalization here. SLP discovery could handle this
> in the existing swap operands or reassoc support but I guess the desire here
> is to pull out a Complex SLP pattern.
Yes, though also to optimize the case where you don't have the optab, currently
the generated code is much worse at -Ofast.
>
> So - no perfect idea yet how to reliably match a Complex pattern here but
> trying to attack this from the match.pd side sounds wrong.
Well the problem is that the scalar code is suboptimal too. even without
matching a complex pattern, so the epilogue here does an extra sub on each
unrolled step.
So I initially figured we'd want to not perform the canonization if it's coming
at the expense of sharing. However that looks harder than I though at first as
there are multiple points in const-fold.c that will try and force this form.
I can probably fix the epilogue post vectorization but that seemed like a worse
solution.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/104408] SLP discovery fails due to -Ofast rewriting
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
` (4 preceding siblings ...)
2022-02-07 9:58 ` tnfchris at gcc dot gnu.org
@ 2022-02-07 10:55 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-07 10:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > match.pd just does canonicalization here. SLP discovery could handle this
> > in the existing swap operands or reassoc support but I guess the desire here
> > is to pull out a Complex SLP pattern.
>
> Yes, though also to optimize the case where you don't have the optab,
> currently the generated code is much worse at -Ofast.
>
> >
> > So - no perfect idea yet how to reliably match a Complex pattern here but
> > trying to attack this from the match.pd side sounds wrong.
>
> Well the problem is that the scalar code is suboptimal too. even without
> matching a complex pattern, so the epilogue here does an extra sub on each
> unrolled step.
Well, the issue is then why the scalar code is not optimized (yes, it's
not so easy).
> So I initially figured we'd want to not perform the canonization if it's
> coming at the expense of sharing. However that looks harder than I though at
> first as there are multiple points in const-fold.c that will try and force
> this form.
Yep. The canonicalization likely happens early before we do CSE.
> I can probably fix the epilogue post vectorization but that seemed like a
> worse solution.
Well, the CSE opportunity needs to be realized despite the canonialization.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-02-07 10:55 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-06 9:13 [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting tnfchris at gcc dot gnu.org
2022-02-06 9:43 ` [Bug tree-optimization/104408] " tnfchris at gcc dot gnu.org
2022-02-06 12:41 ` tnfchris at gcc dot gnu.org
2022-02-07 8:55 ` rguenth at gcc dot gnu.org
2022-02-07 8:56 ` rguenth at gcc dot gnu.org
2022-02-07 9:58 ` tnfchris at gcc dot gnu.org
2022-02-07 10:55 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).