From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id AF9F03858C2C; Fri, 25 Aug 2023 06:02:30 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org AF9F03858C2C
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1692943350;
	bh=m7fn+pGmIOARsselvsJkISu31MMT2lAgv+fHj9/QLHw=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=nonwpQDle9TtTe9f/3vrHE4YjfZ06DaZ6Z0n2JfsrFtIn+NSsWQNoQXjAA+9Hbql/
	 obwjWvlEo65OkF7/t3pmixf3UGbE9/b4Iv0NcOWaRPvgHftj32waFzgxXM4v07pw8d
	 njfSuJYRYSLN50UqxZDPgmQ+Q8Q+HzuUWBolBiYI=
From: "crazylht at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/111064] 5-10% regression of parest on icelake between
 g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and
 g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
Date: Fri, 25 Aug 2023 06:02:29 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.1.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: crazylht at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-111064-4-Kxp8yKgXOU@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-111064-4@http.gcc.gnu.org/bugzilla/>
References: <bug-111064-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
The loop is like

double foo (double* a, unsigned* b, double* c, int n)
{
    double sum = 0;
    for (int i = 0; i != n; i++)
    {
        sum += a[i] * c[b[i]];
    }
    return sum;
}

After disabling gather, the vectorizer uses scalar emulation of gather, and the
cost model finds it profitable only for xmm, not ymm, which causes the
regression.
When I manually add -fno-vect-cost-model, the regression is almost gone.

microbenchmark data

[liuhongt@intel gather_emulation]$ ./gather.out;./nogather_xmm.out;./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 30000000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 30000000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 30000000 iterations


And I looked at the cost model:

_13 + sum_24 1 times scalar_to_vec costs 4 in prologue
_13 + sum_24 1 times vector_stmt costs 16 in epilogue
_13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
_13 + sum_24 2 times vector_stmt costs 32 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_3 1 times unaligned_load (misalign -1) costs 16 in body
*_7 1 times unaligned_load (misalign -1) costs 16 in body
(long unsigned int) _8 2 times vec_promote_demote costs 8 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 64 in body
*_11 1 times vec_construct costs 120 in body
_4 * _12 2 times vector_stmt costs 32 in body
test.c:6:21: note:  operating on full vectors.
test.c:6:21: note:  cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
*_3 4 times scalar_load costs 64 in epilogue
*_7 4 times scalar_load costs 48 in epilogue
(long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
*_11 4 times scalar_load costs 64 in epilogue
_4 * _12 4 times scalar_stmt costs 64 in epilogue
_13 + sum_24 4 times scalar_stmt costs 64 in epilogue
<unknown> 1 times cond_branch_taken costs 12 in epilogue
test.c:6:21: note:  Cost model analysis:
  Vector inside of loop cost: 648
  Vector prologue cost: 4
  Vector epilogue cost: 352
  Scalar iteration cost: 80
  Scalar outside cost: 24
  Vector outside cost: 356
  prologue iterations: 0
  epilogue iterations: 4
test.c:6:21: missed:  cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.
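
The rejection in the last line of the dump can be checked by hand. The following is a minimal sketch of the comparison that message describes, not GCC's actual source; the function name vector_body_rejected is made up for illustration, and only the three numbers come from the dump.

```c
#include <stdbool.h>

/* Hypothetical helper (not a GCC function): the loop is rejected when one
   vector iteration costs at least as much as VF scalar iterations. */
bool vector_body_rejected(int vec_inside_cost, int scalar_iter_cost, int vf)
{
    return vec_inside_cost >= scalar_iter_cost * vf;
}
```

With the values above, vector_body_rejected(648, 80, 8) compares 648 against 80 * 8 = 640 and returns true, so the ymm variant misses the threshold by only 8 cost units.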

For gather emulation part, it tries to generate below

  <bb 18> [local count: 83964060]:
  bnd.23_154 = niters.22_130 >> 2;
  _165 = (sizetype) _65;
  _166 = _165 * 8;
  vectp_a.28_164 = a_18(D) + _166;
  _174 = _165 * 4;
  vectp_b.32_172 = b_19(D) + _174;
  _180 = (sizetype) c_20(D);
  vect__33.29_169 = MEM <vector(2) double> [(double *)vectp_a.28_164];
  vectp_a.27_170 = vectp_a.28_164 + 16;
  vect__33.30_171 = MEM <vector(2) double> [(double *)vectp_a.27_170];
  vect__30.33_177 = MEM <vector(4) unsigned int> [(unsigned int *)vectp_b.32_172];
  vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
  vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
  _181 = BIT_FIELD_REF <vect__29.34_178, 64, 0>;
  _182 = _181 * 8;
  _183 = _180 + _182;
  _184 = (void *) _183;
  _185 = MEM[(double *)_184];
  _186 = BIT_FIELD_REF <vect__29.34_178, 64, 64>;
  _187 = _186 * 8;
  _188 = _180 + _187;
  _189 = (void *) _188;
  _190 = MEM[(double *)_189];
  vect__23.35_191 = {_185, _190};
  _192 = BIT_FIELD_REF <vect__29.34_179, 64, 0>;
  _193 = _192 * 8;
  _194 = _180 + _193;
  _195 = (void *) _194;
  _196 = MEM[(double *)_195];
  _197 = BIT_FIELD_REF <vect__29.34_179, 64, 64>;
  _198 = _197 * 8;
  _199 = _180 + _198;
  _200 = (void *) _199;
  _201 = MEM[(double *)_200];
  vect__23.36_202 = {_196, _201};
  vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
  vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
  vect_sum_14.38_205 = _162 + vect__15.37_203;
  vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
  _208 = .REDUC_PLUS (vect_sum_14.38_206);
  niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
  _157 = (int) niters_vector_mult_vf.24_155;
  tmp.25_156 = i_60 + _157;
  if (niters.22_130 == niters_vector_mult_vf.24_155)
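
To make the shape of that code easier to follow, here is a rough C rendering of the same idea: two 2-lane double vectors per step (4 elements), indices widened to 64 bits, each gathered element fetched with a scalar load and packed into a vector. This is only an illustrative sketch, not compiler output; the name foo_emulated_gather and the array-based "vectors" are assumptions, n is assumed to be non-negative, and because the sum is reassociated across lanes the result can differ from the scalar loop in the last bits.

```c
#include <stddef.h>

/* Hypothetical hand-written equivalent of emulated gather with 2-lane
   double vectors; arrays of 2 doubles stand in for xmm registers. */
double foo_emulated_gather(const double *a, const unsigned *b,
                           const double *c, int n)
{
    double acc[2] = { 0.0, 0.0 };  /* vector accumulator (2 lanes) */
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Load 4 indices and widen them to 64 bits (unpack lo/hi). */
        size_t idx[4];
        for (int k = 0; k < 4; k++)
            idx[k] = (size_t) b[i + k];

        /* Emulated gather: one scalar load per lane, then build vectors. */
        double g0[2] = { c[idx[0]], c[idx[1]] };
        double g1[2] = { c[idx[2]], c[idx[3]] };

        /* Two vector multiplies and two vector adds into the accumulator. */
        acc[0] += a[i]     * g0[0];
        acc[1] += a[i + 1] * g0[1];
        acc[0] += a[i + 2] * g1[0];
        acc[1] += a[i + 3] * g1[1];
    }

    /* Horizontal reduction of the accumulator (the .REDUC_PLUS step). */
    double sum = acc[0] + acc[1];

    /* Scalar epilogue for the remaining iterations. */
    for (; i < n; i++)
        sum += a[i] * c[b[i]];

    return sum;
}
```

The index handling inside the loop is the part the cost model charges as a vector load, a promotion, and per-lane extracts.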


So for the index there is 1 unaligned_load of the index vector (cost 16), 2
times vec_promote_demote (cost 8), and 8 times vec_to_scalar (cost 160) to
extract the index for each element.

But why do we need that? The index only needs 8 times scalar_load (cost 128);
there is no need to load it as a vector and then do vec_promote_demote +
vec_to_scalar.

If we calculate the cost model correctly, the total cost is 595 < 640 (scalar
iteration cost 80 * VF 8), so ymm gather emulation is still profitable.
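
The adjustment can be spelled out as arithmetic. The sketch below only re-adds the figures quoted in this comment; the helper names are made up for illustration. Plugging in exactly those figures gives 592 rather than the 595 quoted above (how 595 was derived is not shown), but either value is below the 640 threshold.

```c
/* Costs copied from the dump quoted in this comment. */
enum {
    VEC_INSIDE_COST   = 648, /* "Vector inside of loop cost" */
    IDX_VECTOR_LOAD   = 16,  /* 1 unaligned_load of the index vector */
    IDX_PROMOTE       = 8,   /* 2 times vec_promote_demote */
    IDX_VEC_TO_SCALAR = 160, /* 8 times vec_to_scalar */
    IDX_SCALAR_LOADS  = 128, /* 8 times scalar_load, as proposed */
    SCALAR_ITER_COST  = 80,
    VECT_FACTOR       = 8
};

/* Hypothetical helper: body cost if indices were costed as scalar loads. */
int adjusted_vec_inside_cost(void)
{
    return VEC_INSIDE_COST
           - (IDX_VECTOR_LOAD + IDX_PROMOTE + IDX_VEC_TO_SCALAR)
           + IDX_SCALAR_LOADS;
}

/* Hypothetical helper: the body cost must stay below this to be profitable. */
int profitability_threshold(void)
{
    return SCALAR_ITER_COST * VECT_FACTOR;
}
```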