From: "crazylht at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
Date: Fri, 25 Aug 2023 06:02:29 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.1.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: crazylht at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064

--- Comment #4 from Hongtao.liu ---
The loop is like

double
foo (double* a, unsigned* b, double* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    {
      sum += a[i] * c[b[i]];
    }
  return sum;
}

After disabling gather, the vectorizer uses gather scalar emulation, and the
cost model finds it profitable only for xmm, not ymm, which causes the
regression. When -fno-vect-cost-model is added manually, the regression is
almost gone.

Microbenchmark data:

[liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 30000000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 30000000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 30000000 iterations

I also looked at the cost model:

_13 + sum_24 1 times scalar_to_vec costs 4 in prologue
_13 + sum_24 1 times vector_stmt costs 16 in epilogue
_13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
_13 + sum_24 2 times vector_stmt costs 32 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_7 1 times unaligned_load (misalign -1) costs 16 in body
(long unsigned int) _8 2 times vec_promote_demote costs 8 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
_4 * _12 2 times vector_stmt costs 32 in body
test.c:6:21: note: operating on full vectors.
test.c:6:21: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
*_3 4 times scalar_load costs 64 in epilogue
*_7 4 times scalar_load costs 48 in epilogue
(long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
*_11 4 times scalar_load costs 64 in epilogue
_4 * _12 4 times scalar_stmt costs 64 in epilogue
_13 + sum_24 4 times scalar_stmt costs 64 in epilogue
 1 times cond_branch_taken costs 12 in epilogue
test.c:6:21: note: Cost model analysis:
  Vector inside of loop cost: 648
  Vector prologue cost: 4
  Vector epilogue cost: 352
  Scalar iteration cost: 80
  Scalar outside cost: 24
  Vector outside cost: 356
  prologue iterations: 0
  epilogue iterations: 4
test.c:6:21: missed: cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.

For the gather emulation part, it tries to generate the following:

 [local count: 83964060]:
  bnd.23_154 = niters.22_130 >> 2;
  _165 = (sizetype) _65;
  _166 = _165 * 8;
  vectp_a.28_164 = a_18(D) + _166;
  _174 = _165 * 4;
  vectp_b.32_172 = b_19(D) + _174;
  _180 = (sizetype) c_20(D);
  vect__33.29_169 = MEM [(double *)vectp_a.28_164];
  vectp_a.27_170 = vectp_a.28_164 + 16;
  vect__33.30_171 = MEM [(double *)vectp_a.27_170];
  vect__30.33_177 = MEM [(unsigned int *)vectp_b.32_172];
  vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
  vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
  _181 = BIT_FIELD_REF ;
  _182 = _181 * 8;
  _183 = _180 + _182;
  _184 = (void *) _183;
  _185 = MEM[(double *)_184];
  _186 = BIT_FIELD_REF ;
  _187 = _186 * 8;
  _188 = _180 + _187;
  _189 = (void *) _188;
  _190 = MEM[(double *)_189];
  vect__23.35_191 = {_185, _190};
  _192 = BIT_FIELD_REF ;
  _193 = _192 * 8;
  _194 = _180 + _193;
  _195 = (void *) _194;
  _196 = MEM[(double *)_195];
  _197 = BIT_FIELD_REF ;
  _198 = _197 * 8;
  _199 = _180 + _198;
  _200 = (void *) _199;
  _201 = MEM[(double *)_200];
  vect__23.36_202 = {_196, _201};
  vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
  vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
  vect_sum_14.38_205 = _162 + vect__15.37_203;
  vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
  _208 = .REDUC_PLUS (vect_sum_14.38_206);
  niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
  _157 = (int) niters_vector_mult_vf.24_155;
  tmp.25_156 = i_60 + _157;
  if (niters.22_130 == niters_vector_mult_vf.24_155)

So there is 1 unaligned_load for the index vector (cost 16), 2
vec_promote_demote (cost 8), and 8 vec_to_scalar (cost 160) to extract the
index for each element.

But why do we need that? It is just 8 scalar_load (cost 128) for the indices;
there is no need to load them as a vector and then do vec_promote_demote +
vec_to_scalar.
If we calculate the cost model correctly, the total cost is 595 < 640 (scalar
iteration cost 80 * VF 8), so ymm gather emulation is still profitable.
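
The arithmetic behind the rejected ymm case can be rechecked with a small C sketch. It is only an illustration, not the vectorizer's code: the array holds the "in body" costs copied from the dump above, and the names `ymm_body_costs`, `sum_costs`, `passes_cost_check` and `adjusted_body_cost` are hypothetical helpers introduced here. The check is a simplified form of the condition in the "missed" note (vector iteration cost compared against scalar iteration cost times VF).

```c
#include <stddef.h>

/* Per-statement "in body" costs copied from the ymm cost dump above. */
const int ymm_body_costs[] = {
    32,          /* _13 + sum_24: 2 vector_stmt */
    16, 16, 16,  /* *_3, *_3, *_7: 3 unaligned_load */
    8,           /* (long unsigned int) _8: 2 vec_promote_demote */
    80, 64, 120, /* *_11: 4 vec_to_scalar, 4 scalar_load, 1 vec_construct */
    80, 64, 120, /* *_11: the same three entries for the second half */
    32           /* _4 * _12: 2 vector_stmt */
};
const size_t ymm_body_count =
    sizeof ymm_body_costs / sizeof ymm_body_costs[0];

/* Add up a list of per-statement costs. */
int sum_costs(const int *costs, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += costs[i];
    return total;
}

/* Simplified profitability test (an assumption, not GCC's actual code):
   the vector iteration must cost less than VF scalar iterations. */
int passes_cost_check(int vector_iter_cost, int scalar_iter_cost, int vf)
{
    return vector_iter_cost < scalar_iter_cost * vf;
}

/* Hypothetical adjustment following the proposal above: drop the index
   vector load (16), vec_promote_demote (8) and vec_to_scalar (160), and
   charge 8 scalar index loads (128) instead. */
int adjusted_body_cost(int body_cost)
{
    return body_cost - 16 - 8 - 160 + 128;
}
```

With these inputs, `sum_costs(ymm_body_costs, ymm_body_count)` reproduces the reported 648, which fails the check against 80 * 8 = 640. Substituting the quoted costs directly gives 592 rather than the 595 stated above; both values are below 640, so the check passes either way.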