public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
@ 2023-08-18 11:48 hubicka at gcc dot gnu.org
2023-08-18 12:39 ` [Bug target/111064] " hubicka at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-18 11:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
Bug ID: 111064
Summary: 5-10% regression of parest on icelake between
g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15
2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a
(Aug 16)
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
-Ofast -march=native
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=801.457.0
-Ofast -march=native -flto + PGO
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=792.457.0
It does not seem to show on Zen or Altra.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
@ 2023-08-18 12:39 ` hubicka at gcc dot gnu.org
2023-08-18 13:24 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-18 12:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Maybe
commit 3064d1f5c48cb6ce1b4133570dd08ecca8abb52d
Author: liuhongt <hongtao.liu@intel.com>
Date: Thu Aug 10 11:41:39 2023 +0800
Software mitigation: Disable gather generation in vectorization for GDS
affected Intel Processors.
For more details of GDS (Gather Data Sampling), refer to
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
After microcode update, there's performance regression. To avoid that,
the patch disables gather generation in autovectorization but uses
gather scalar emulation instead.
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
2023-08-18 12:39 ` [Bug target/111064] " hubicka at gcc dot gnu.org
@ 2023-08-18 13:24 ` rguenth at gcc dot gnu.org
2023-08-22 2:54 ` crazylht at gmail dot com
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-18 13:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
might be - parest is the test that improved with emulated gather on Zen.
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
2023-08-18 12:39 ` [Bug target/111064] " hubicka at gcc dot gnu.org
2023-08-18 13:24 ` rguenth at gcc dot gnu.org
@ 2023-08-22 2:54 ` crazylht at gmail dot com
2023-08-25 6:02 ` crazylht at gmail dot com
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: crazylht at gmail dot com @ 2023-08-22 2:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
I didn't find any regression when testing the patch.
I guess it's because my tester does a full-copy run and the options are
-march=native -Ofast -flto -funroll-loops.
Let me verify it.
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
` (2 preceding siblings ...)
2023-08-22 2:54 ` crazylht at gmail dot com
@ 2023-08-25 6:02 ` crazylht at gmail dot com
2023-08-25 8:16 ` fkastl at suse dot cz
2023-08-29 6:22 ` crazylht at gmail dot com
5 siblings, 0 replies; 7+ messages in thread
From: crazylht at gmail dot com @ 2023-08-25 6:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
The loop is like:
double foo (double* a, unsigned* b, double* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    {
      sum += a[i] * c[b[i]];
    }
  return sum;
}
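To make the transformation concrete, here is a sketch in plain C of what "gather scalar emulation" amounts to for this kernel. `foo_emulated` is a hypothetical name for illustration only; the real transformation happens on vectorized GIMPLE, not in source.

```c
#include <assert.h>

/* The original kernel: a[i] is a contiguous load, c[b[i]] is a gather.  */
static double
foo (double *a, unsigned *b, double *c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * c[b[i]];
  return sum;
}

/* Sketch of gather scalar emulation: instead of one hardware gather per
   4-element chunk, each c[b[i]] is fetched with a scalar load and the
   results are assembled into a 4-wide group (the vec_construct seen in
   the cost dump), followed by a scalar epilogue for the tail.  */
static double
foo_emulated (double *a, unsigned *b, double *c, int n)
{
  double sum = 0;
  int i = 0;
  for (; i + 4 <= n; i += 4)
    {
      /* Four scalar loads replacing one hardware gather.  */
      double g[4] = { c[b[i]], c[b[i + 1]], c[b[i + 2]], c[b[i + 3]] };
      for (int j = 0; j < 4; j++)
        sum += a[i + j] * g[j];
    }
  for (; i != n; i++)  /* scalar epilogue */
    sum += a[i] * c[b[i]];
  return sum;
}
```

Both versions compute the same reduction; the difference the cost model has to judge is purely in how the indexed loads are issued.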
After disabling gather, gather scalar emulation is used instead, and the cost
model is only profitable for xmm, not ymm, which causes the regression.
When -fno-vect-cost-model is added manually, the regression is almost gone.
Microbenchmark data:
[liuhongt@intel gather_emulation]$ ./gather.out; ./nogather_xmm.out; ./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 30000000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 30000000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 30000000 iterations
And I looked at the cost model:
_13 + sum_24 1 times scalar_to_vec costs 4 in prologue
_13 + sum_24 1 times vector_stmt costs 16 in epilogue
_13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
_13 + sum_24 2 times vector_stmt costs 32 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_7 1 times unaligned_load (misalign -1) costs 16 in body
(long unsigned int) _8 2 times vec_promote_demote costs 8 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
_4 * _12 2 times vector_stmt costs 32 in body
test.c:6:21: note: operating on full vectors.
test.c:6:21: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown.
*_3 4 times scalar_load costs 64 in epilogue
*_7 4 times scalar_load costs 48 in epilogue
(long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
*_11 4 times scalar_load costs 64 in epilogue
_4 * _12 4 times scalar_stmt costs 64 in epilogue
_13 + sum_24 4 times scalar_stmt costs 64 in epilogue
<unknown> 1 times cond_branch_taken costs 12 in epilogue
test.c:6:21: note: Cost model analysis:
  Vector inside of loop cost: 648
  Vector prologue cost: 4
  Vector epilogue cost: 352
  Scalar iteration cost: 80
  Scalar outside cost: 24
  Vector outside cost: 356
  prologue iterations: 0
  epilogue iterations: 4
test.c:6:21: missed: cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.
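The rejection printed above boils down to comparing the vector body cost against the scalar iteration cost times the vectorization factor. A minimal sketch of that final check, using only the numbers from the dump (this simplifies GCC's full profitability analysis down to the single comparison in the "missed" message):

```c
#include <assert.h>

/* Simplified profitability check: vectorization pays off only if one
   vector iteration (covering VF scalar iterations) costs less than VF
   scalar iterations.  This is a sketch, not GCC's actual code.  */
static int
profitable_p (int vec_inside_cost, int scalar_iter_cost, int vf)
{
  return vec_inside_cost < scalar_iter_cost * vf;
}
```

With the dumped numbers, 648 is not below 80 * 8 = 640, so the ymm loop is rejected; a body cost of 595 would pass.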
For the gather emulation part, it tries to generate the code below:
<bb 18> [local count: 83964060]:
bnd.23_154 = niters.22_130 >> 2;
_165 = (sizetype) _65;
_166 = _165 * 8;
vectp_a.28_164 = a_18(D) + _166;
_174 = _165 * 4;
vectp_b.32_172 = b_19(D) + _174;
_180 = (sizetype) c_20(D);
vect__33.29_169 = MEM <vector(2) double> [(double *)vectp_a.28_164];
vectp_a.27_170 = vectp_a.28_164 + 16;
vect__33.30_171 = MEM <vector(2) double> [(double *)vectp_a.27_170];
vect__30.33_177 = MEM <vector(4) unsigned int> [(unsigned int *)vectp_b.32_172];
vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
_181 = BIT_FIELD_REF <vect__29.34_178, 64, 0>;
_182 = _181 * 8;
_183 = _180 + _182;
_184 = (void *) _183;
_185 = MEM[(double *)_184];
_186 = BIT_FIELD_REF <vect__29.34_178, 64, 64>;
_187 = _186 * 8;
_188 = _180 + _187;
_189 = (void *) _188;
_190 = MEM[(double *)_189];
vect__23.35_191 = {_185, _190};
_192 = BIT_FIELD_REF <vect__29.34_179, 64, 0>;
_193 = _192 * 8;
_194 = _180 + _193;
_195 = (void *) _194;
_196 = MEM[(double *)_195];
_197 = BIT_FIELD_REF <vect__29.34_179, 64, 64>;
_198 = _197 * 8;
_199 = _180 + _198;
_200 = (void *) _199;
_201 = MEM[(double *)_200];
vect__23.36_202 = {_196, _201};
vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
vect_sum_14.38_205 = _162 + vect__15.37_203;
vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
_208 = .REDUC_PLUS (vect_sum_14.38_206);
niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
_157 = (int) niters_vector_mult_vf.24_155;
tmp.25_156 = i_60 + _157;
if (niters.22_130 == niters_vector_mult_vf.24_155)
So there's 1 unaligned_load for the index vector (cost 16), 2 times
vec_promote_demote (cost 8), and 8 times vec_to_scalar (cost 160) to get each
index for the elements.
But why do we need that? It's just 8 times scalar_load (cost 128) for the
indices; there is no need to load them as a vector and then do
vec_promote_demote + vec_to_scalar.
If we calculate the cost model correctly, the total cost 595 < 640 (scalar
iteration cost 80 * VF 8), so ymm gather emulation is still profitable.
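A rough cross-check of this accounting (a sketch only: it reproduces the shape of the argument from the dumped per-stmt costs and lands at 592 rather than the 595 quoted above, so treat the exact constant as approximate):

```c
#include <assert.h>

/* Replace the vector-side index handling from the cost dump
   (1 unaligned_load = 16, 2x vec_promote_demote = 8 total,
   8x vec_to_scalar = 160) with 8 plain scalar loads at 16 each, and
   adjust the dumped body cost of 648 by the difference.  The result is
   below the break-even cost of scalar_iter_cost * VF = 80 * 8 = 640,
   so ymm gather emulation becomes profitable again.  */
static int
adjusted_body_cost (void)
{
  int body_cost = 648;                 /* "Vector inside of loop cost" */
  int index_as_vector = 16 + 8 + 160;  /* current index accounting     */
  int index_as_scalars = 8 * 16;       /* 8 scalar loads for indices   */
  return body_cost - index_as_vector + index_as_scalars;
}
```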
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
` (3 preceding siblings ...)
2023-08-25 6:02 ` crazylht at gmail dot com
@ 2023-08-25 8:16 ` fkastl at suse dot cz
2023-08-29 6:22 ` crazylht at gmail dot com
5 siblings, 0 replies; 7+ messages in thread
From: fkastl at suse dot cz @ 2023-08-25 8:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
Filip Kastl <fkastl at suse dot cz> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fkastl at suse dot cz
--- Comment #5 from Filip Kastl <fkastl at suse dot cz> ---
*** Bug 111152 has been marked as a duplicate of this bug. ***
* [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
` (4 preceding siblings ...)
2023-08-25 8:16 ` fkastl at suse dot cz
@ 2023-08-29 6:22 ` crazylht at gmail dot com
5 siblings, 0 replies; 7+ messages in thread
From: crazylht at gmail dot com @ 2023-08-29 6:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
>
> [liuhongt@intel gather_emulation]$ ./gather.out; ./nogather_xmm.out; ./nogather_ymm.out
> elapsed time: 1.75997 seconds for gather with 30000000 iterations
> elapsed time: 2.42473 seconds for no_gather_xmm with 30000000 iterations
> elapsed time: 1.86436 seconds for no_gather_ymm with 30000000 iterations
>
For 510.parest_r, enabling gather emulation for ymm can bring back 3% of the
performance, though it is still not as good as the gather instruction due to
being throughput bound.
end of thread, other threads:[~2023-08-29 6:22 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-18 11:48 [Bug target/111064] New: 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16) hubicka at gcc dot gnu.org
2023-08-18 12:39 ` [Bug target/111064] " hubicka at gcc dot gnu.org
2023-08-18 13:24 ` rguenth at gcc dot gnu.org
2023-08-22 2:54 ` crazylht at gmail dot com
2023-08-25 6:02 ` crazylht at gmail dot com
2023-08-25 8:16 ` fkastl at suse dot cz
2023-08-29 6:22 ` crazylht at gmail dot com