public inbox for gcc-bugs@sourceware.org
* [Bug target/39821] 120% slowdown with vectorizer
[not found] <bug-39821-4@http.gcc.gnu.org/bugzilla/>
@ 2021-07-26 6:19 ` pinskia at gcc dot gnu.org
2021-07-27 7:24 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-26 6:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The code generation for aarch64 looks fine:
dotproduct_order4:
.LFB1:
.cfi_startproc
ldr q1, [x0]
ldr q2, [x1]
smull v0.2d, v2.2s, v1.2s
smlal2 v0.2d, v2.4s, v1.4s
addp d0, v0.2d
fmov x0, d0
ret
vect__6.41_18 = MEM <vector(4) int> [(int32_t *)v1_2(D)];
vect__10.44_13 = MEM <vector(4) int> [(int32_t *)v2_3(D)];
vect_patt_25.45_8 = WIDEN_MULT_LO_EXPR <vect__10.44_13, vect__6.41_18>;
vect_patt_25.45_4 = WIDEN_MULT_HI_EXPR <vect__10.44_13, vect__6.41_18>;
vect_accum_14.46_31 = vect_patt_25.45_4 + vect_patt_25.45_8;
_33 = .REDUC_PLUS (vect_accum_14.46_31); [tail call]
---- CUT ----
Even the gimple level for x86_64 looks ok:
vect__6.41_18 = MEM <vector(4) int> [(int32_t *)v1_2(D)];
vect__10.44_13 = MEM <vector(4) int> [(int32_t *)v2_3(D)];
vect_patt_25.45_8 = WIDEN_MULT_LO_EXPR <vect__10.44_13, vect__6.41_18>;
vect_patt_25.45_4 = WIDEN_MULT_HI_EXPR <vect__10.44_13, vect__6.41_18>;
vect_accum_14.46_31 = vect_patt_25.45_4 + vect_patt_25.45_8;
_33 = VEC_PERM_EXPR <vect_accum_14.46_31, { 0, 0 }, { 1, 2 }>;
_34 = vect_accum_14.46_31 + _33;
stmp_accum_14.47_35 = BIT_FIELD_REF <_34, 64, 0>;
But the expansion looks bad.
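For readers less familiar with the GIMPLE dump, the sequence above can be read as the following scalar model. This is only a sketch: the function name dotproduct_order4_model is hypothetical, and the assignment of lanes 0-1 to the LO half and lanes 2-3 to the HI half is an assumption (it depends on the target's endianness and does not change the final sum).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the vectorized GIMPLE for one group of four int32 lanes.
   The LO/HI lane split used here is an assumption; the sum is unaffected. */
int64_t dotproduct_order4_model(const int32_t *v1, const int32_t *v2)
{
    int64_t lo[2], hi[2], accum[2];

    assert(v1 && v2);

    /* WIDEN_MULT_LO_EXPR: widen two lanes to 64 bits and multiply. */
    for (int i = 0; i < 2; i++)
        lo[i] = (int64_t)v2[i] * v1[i];

    /* WIDEN_MULT_HI_EXPR: the same for the other two lanes. */
    for (int i = 0; i < 2; i++)
        hi[i] = (int64_t)v2[i + 2] * v1[i + 2];

    /* vect_accum = hi + lo, a two-lane 64-bit add. */
    for (int i = 0; i < 2; i++)
        accum[i] = hi[i] + lo[i];

    /* Horizontal add: .REDUC_PLUS on aarch64; on x86_64 the
       VEC_PERM_EXPR, add and BIT_FIELD_REF together do the same job. */
    return accum[0] + accum[1];
}
```

Both targets compute this same value; the complaint in the comment is about how the x86_64 backend expands these statements, not about the GIMPLE itself.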
* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-07-27 7:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note: Cost model analysis:
Vector inside of loop cost: 40
Vector prologue cost: 4
Vector epilogue cost: 108
Scalar iteration cost: 40
Scalar outside cost: 32
Vector outside cost: 112
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 3
so clearly the widening multiplication is not costed correctly. With SSE 4.2
we can do better:
.L4:
movdqu (%rcx,%rax), %xmm0
movdqu (%rsi,%rax), %xmm1
addq $16, %rax
movdqa %xmm0, %xmm3
movdqa %xmm1, %xmm4
punpckldq %xmm0, %xmm3
punpckldq %xmm1, %xmm4
punpckhdq %xmm0, %xmm0
pmuldq %xmm4, %xmm3
punpckhdq %xmm1, %xmm1
pmuldq %xmm1, %xmm0
paddq %xmm3, %xmm2
paddq %xmm0, %xmm2
cmpq %rdi, %rax
jne .L4
but even there the costing is imprecise. The vectorizer unhelpfully
categorizes the widening multiply as vec_promote_demote, so the target's
cost hook never runs into
case MULT_EXPR:
case WIDEN_MULT_EXPR:
case MULT_HIGHPART_EXPR:
stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
break;
Fixing that yields
0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body
for SSE2, SSE4.2 and AVX2 alike, so the cost is now over-estimated via
/* V*DImode is emulated with 5-8 insns. */
else if (mode == V2DImode || mode == V4DImode)
{
if (TARGET_XOP && mode == V2DImode)
return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
else
return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
}
with cost->mulss == 16. I suppose it is somehow failing to realize it's
doing a widening multiply.
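The figure of 136 can be reproduced from the quoted formula with a little arithmetic. The sketch below is not GCC code: struct cost_model and v2di_mult_cost are hypothetical stand-ins, sse_op == 4 is an assumption chosen because it is consistent with the dump, and any scaling that ix86_vec_cost may apply is ignored. Only mulss == 16 comes from the comment itself.

```c
#include <assert.h>

/* Hypothetical stand-in for the two cost-table fields used above.
   mulss == 16 is quoted in the comment; sse_op == 4 is an assumption. */
struct cost_model {
    int mulss;
    int sse_op;
};

/* Mirrors the V2DImode/V4DImode arithmetic quoted above, treating
   ix86_vec_cost as the identity (an assumption). */
int v2di_mult_cost(const struct cost_model *c, int have_xop)
{
    assert(c);
    if (have_xop)
        return c->mulss * 2 + c->sse_op * 3;
    return c->mulss * 3 + c->sse_op * 5;
}
```

With mulss = 16 and sse_op = 4 the non-XOP branch gives 16 * 3 + 4 * 5 = 68 per statement, and the "2 times" in the dump doubles that to 136. That is the cost of an emulated full 64-bit vector multiply, which is more work than the widening multiply actually requires.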
* [Bug target/39821] 120% slowdown with vectorizer
From: cvs-commit at gcc dot gnu.org @ 2021-07-27 8:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:c8ce54c6e67295b70052d1b9f9a2f7ce9e2f8f0d
commit r12-2524-gc8ce54c6e67295b70052d1b9f9a2f7ce9e2f8f0d
Author: Richard Biener <rguenther@suse.de>
Date: Tue Jul 27 09:24:57 2021 +0200
tree-optimization/39821 - fix cost classification for widening arith
This adjusts the vectorizer to cost vector_stmt for widening
arithmetic instead of vec_promote_demote in the line of telling
the target that stmt_info->stmt is the meaningful piece we cost.
2021-07-27 Richard Biener <rguenther@suse.de>
PR tree-optimization/39821
* tree-vect-stmts.c (vect_model_promotion_demotion_cost): Use
vector_stmt for widening arithmetic.
(vectorizable_conversion): Adjust.
* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-07-27 8:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
I've pushed the change that makes us run into ix86_multiplication_cost but as
said that doesn't differentiate highpart or widening multiply yet and thus
we're now missing optimizations because of too conservative costing.
* [Bug target/39821] 120% slowdown with vectorizer
From: crazylht at gmail dot com @ 2021-07-28 5:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #8)
> I've pushed the change that makes us run into ix86_multiplication_cost but
> as said that doesn't differentiate highpart or widening multiply yet and
> thus we're now missing optimizations because of too conservative costing.
For MULT_HIGHPART_EXPR, x86 only has pmulhw, so it's probably OK to go into
ix86_multiplication_cost.
For WIDEN_MULT_EXPR, we need a separate cost function which should also accept
sign info, since we have pmuludq under SSE2 but pmuldq only under SSE4.1;
i.e. we should vectorize udotproduct under SSE2, but sdotproduct only under SSE4.1:
#include <stdint.h>
uint64_t udotproduct(uint32_t *v1, uint32_t *v2, int order)
{
  uint64_t accum = 0;
  while (order--)
    accum += (uint64_t) *v1++ * *v2++;
  return accum;
}

#include <stdint.h>
int64_t sdotproduct(int32_t *v1, int32_t *v2, int order)
{
  int64_t accum = 0;
  while (order--)
    accum += (int64_t) *v1++ * *v2++;
  return accum;
}
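Why the sign matters can be seen in a scalar model of a single multiply lane. This is a sketch, not the proposed cost function: the helper names are hypothetical, and the two small functions only model the per-lane arithmetic of pmuludq (zero-extending) and pmuldq (sign-extending).

```c
#include <assert.h>
#include <stdint.h>

/* Model of one pmuludq lane: both 32-bit inputs are zero-extended. */
uint64_t widen_mul_unsigned(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Model of one pmuldq lane: both 32-bit inputs are sign-extended. */
int64_t widen_mul_signed(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* What a signed dot product would compute if only the unsigned
   multiply were available and applied to the raw bit patterns. */
uint64_t sdot_with_unsigned_multiply(const int32_t *v1, const int32_t *v2,
                                     int order)
{
    uint64_t accum = 0;

    assert(order >= 0);
    for (int i = 0; i < order; i++)
        accum += widen_mul_unsigned((uint32_t)v1[i], (uint32_t)v2[i]);
    return accum;
}
```

For non-negative inputs the two multiplies agree, but as soon as an element is negative the zero-extending form produces a different 64-bit product, so a signed widening multiply cannot simply reuse pmuludq and would need extra fix-up work on plain SSE2. A cost function that knows the signedness can reflect that difference.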
* [Bug target/39821] 120% slowdown with vectorizer
From: cvs-commit at gcc dot gnu.org @ 2021-07-29 1:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:231bcc77b953406b8381c7f55a3ec181da67d1e7
commit r12-2586-g231bcc77b953406b8381c7f55a3ec181da67d1e7
Author: liuhongt <hongtao.liu@intel.com>
Date: Wed Jul 28 16:24:52 2021 +0800
Add a separate function to calculate cost for WIDEN_MULT_EXPR.
gcc/ChangeLog:
PR target/39821
* config/i386/i386.c (ix86_widen_mult_cost): New function.
(ix86_add_stmt_cost): Use ix86_widen_mult_cost for
WIDEN_MULT_EXPR.
gcc/testsuite/ChangeLog:
PR target/39821
* gcc.target/i386/sse2-pr39821.c: New test.
* gcc.target/i386/sse4-pr39821.c: New test.
* [Bug target/39821] 120% slowdown with vectorizer
From: crazylht at gmail dot com @ 2021-07-29 1:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
Fixed in GCC 12.
* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-08-26 12:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0
             Status|NEW                         |RESOLVED
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.