From: "hubicka at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)
Date: Sun, 16 Jul 2023 17:39:02 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hubicka at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #6 from Jan Hubicka ---
I tried zen3 with -march=native -Ofast:

Samples: 1M of event 'cycles:u', Event count (approx.): 2309002237334, DSO: s
  Overhead  Command          Symbol
    42.51%  sphinx_livepret  [.] mgau_eval
    24.36%  sphinx_livepret  [.] vector_gautbl_eval_logs3
     6.81%  sphinx_livepret  [.] subvq_mgau_shortlist
     6.43%  sphinx_livepret  [.] logs3_add
     4.91%  sphinx_livepret  [.] approx_cont_mgau_frame_eval
     4.32%  sphinx_livepret  [.] mdef_sseq2sen_active
     2.62%  sphinx_livepret  [.] dict2pid_comsenscr
     1.50%  sphinx_livepret  [.] hmm_vit_eval_3st
     0.84%  sphinx_livepret  [.] lextree_hmm_eval
     0.67%  sphinx_livepret  [.] lextree_hmm_propagate
     0.64%  sphinx_livepret  [.] lextree_enter
     0.61%  sphinx_livepret  [.] fe_fft
     0.45%  sphinx_livepret  [.] dict2pid_comsseq2sen_active
     0.32%  sphinx_livepret  [.] lextree_ssid_active
     0.18%  sphinx_livepret  [.] vithist_rescore
     0.14%  sphinx_livepret  [.] utt_decode_block
     0.12%  sphinx_livepret  [.] fe_mel_cep

Prior to vectorization there is no invalid profile in mgau_eval. The loop is:

    for (c = 0; c < mgau->n_comp-1; c += 2) { /* Interleave 2 components for speed */
        m1 = mgau->mean[c];
        m2 = mgau->mean[c+1];
        v1 = mgau->var[c];
        v2 = mgau->var[c+1];
        dval1 = mgau->lrd[c];
        dval2 = mgau->lrd[c+1];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
            /* E_INFO("x %10f m1 %10f m2 %10f v1 %10f, v2 %10f\n",x[i],m1[i],m2[i],v1[i],v2[i]);
               E_INFO("diff1 %10f,dval1 %10f, diff2 %10f, dval2 %10f\n",diff1,dval1,diff2,dval2);*/
        }

        if (dval1 < g->distfloor) /* Floor */
            dval1 = g->distfloor;
        if (dval2 < g->distfloor)
            dval2 = g->distfloor;

        score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
        score = logs3_add (score, (int32)(f * dval2) + mgau->mixw[c+1]);
    }

and the inner loop iterates 47 times on average. The vectorizer has a
profitability threshold of 8 and vectorizes to 32-byte vectors. The epilogue
has a threshold of 4 and is vectorized with 16-byte vectors.

There is a second, similar loop nest in the function:

    for (j = 0; active[j] >= 0; j++) {
#ifdef SPEC_CPU
        considered++;
#endif
        c = active[j];

        m1 = mgau->mean[c];
        v1 = mgau->var[c];
        dval1 = mgau->lrd[c];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
        }

        if (dval1 < g->distfloor)
            dval1 = g->distfloor;

        score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
    }

which is executed 10% of the time and is also vectorized twice.

We then believe that the inner loop iterates 5 times (I would expect 47/4
times).

In the cunroll pass we then see:

Loop 4 iterates at most 2147483647 times.
Loop 4 likely iterates at most 2147483647 times.
Not unrolling loop 4 (--param max-completely-peel-times limit reached).

This is the outer loop.

Loop 7 iterates at most 2 times.
Loop 7 likely iterates at most 2 times.
Loop size: 22
Estimated size after unrolling: 42
cont_mgau.c:604:20: optimized: loop with 2 iterations completely unrolled (header execution count 1065258)

This is the scalar epilogue loop.

Loop 6 iterates at most 0 times.
Loop 6 likely iterates at most 0 times.
cont_mgau.c:575:7: optimized: loop turned into non-loop; it never loops

This is the vectorized epilogue loop (really a non-loop). So this looks OK,
but it introduced one mismatch in the profile.

Before the pass we had:

;;   basic block 14, loop depth 2, count 171249098 (guessed, freq 23.9461), maybe hot
;;    prev block 51, next block 66, flags: (NEW, VISITED)
;;    pred:       24 [always]  count:142707582 (guessed, freq 19.9550) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                51 [always]  count:28541516 (guessed, freq 3.9910) (FALLTHRU,EXECUTABLE)

and now we get:

;;   basic block 14, loop depth 2, count 13764235 (guessed, freq 1.9247), maybe hot
;;   Invalid sum of incoming counts 25234431 (guessed, freq 3.5286), should be 13764235 (guessed, freq 1.9247)
;;    prev block 83, next block 66, flags: (NEW, VISITED)
;;    pred:       24 [always]  count:11470196 (guessed, freq 1.6039) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                83 [always]  count:13764235 (guessed, freq 1.9247) (FALLTHRU,EXECUTABLE)

This does look wrong: the loop was not unrolled, yet its profile was reduced
significantly.

I also noticed that in another (not hot) function we get the following BB
with nonsensical exit edges:

;;   basic block 74, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
;;   Invalid sum of outgoing probabilities 120.0%
;;    prev block 155, next block 175, flags: (NEW, REACHABLE, VISITED)
;;    pred:       97 [always]  count:215550 (guessed, freq 215550.0000) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                155 [always]  count:43110 (guessed, freq 43110.0000) (FALLTHRU,EXECUTABLE)
  # i_212 = PHI
  # n_94 = PHI <_453(97), n_244(155)>
  # vect_n_94.158_583 = PHI
  # vectp.159_584 = PHI
  # vectp.165_594 = PHI
  # ivtmp_612 = PHI
  # DEBUG BEGIN_STMT
  _224 = (long unsigned int) i_212;
  _225 = _224 * 4;
  _226 = _222 + _225;
  vect__227.161_586 = MEM [(float32 *)vectp.159_584];
  _227 = *_226;
  vect__228.162_587 = [vec_unpack_lo_expr] vect__227.161_586;
  vect__228.162_588 = [vec_unpack_hi_expr] vect__227.161_586;
  _228 = (double) _227;
  mask__470.163_590 = vect_cst__589 > vect__228.162_587;
  mask__470.163_591 = vect_cst__589 > vect__228.162_588;
  _470 = varfloor_23(D) > _228;
  # DEBUG BEGIN_STMT
  mask_patt_538.164_592 = VEC_PACK_TRUNC_EXPR ;
  if (mask_patt_538.164_592 == { 0, 0, 0, 0, 0, 0, 0, 0 })
    goto ; [100.00%]
  else
    goto ; [20.00%]

The edge to 174 seems just wrong:

;;   basic block 174, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
;;   Invalid sum of incoming counts 310392 (guessed, freq 310392.0000), should be 258660 (guessed, freq 258660.0000)
;;    prev block 175, next block 97, flags: (NEW, VISITED)
;;    pred:       74 [always]  count:258660 (guessed, freq 258660.0000) (TRUE_VALUE,EXECUTABLE)
;;                175 [always]  count:51732 (guessed, freq 51732.0000) (FALLTHRU,EXECUTABLE)
  # DEBUG BEGIN_STMT
  #

So if the probability were 80% it would be almost right. This problem repeats
twice.