From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 82929385E02A; Thu, 2 Apr 2020 12:31:46 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 82929385E02A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1585830706; bh=TEMMrDLFkbBYJZXVTyjFe+KCs68MyvI/1Blw+amfTUY=; h=From:To:Subject:Date:In-Reply-To:References:From; b=OYRvDX33pspPa5awI/ECvNTZ4pZvUyH/k/rCIrXhpaYGNVtli7BIuXd4WqP62TOT8 9IRAzurATQPsHRbva2HCg6zuexRlSPVyALK/dMTaUciSaZNQerX4/lQod59NwlJo47 KKE1yURm0BJ57u95T2nxhF5rNTg8y7eLDG69v+VQ= From: "cvs-commit at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug rtl-optimization/92264] [10 Regression] Compile time hog in 521.wrf_r with -Ofast -march=znver2 -g since r276318 Date: Thu, 02 Apr 2020 12:31:46 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: rtl-optimization X-Bugzilla-Version: 10.0 X-Bugzilla-Keywords: compile-time-hog X-Bugzilla-Severity: normal X-Bugzilla-Who: cvs-commit at gcc dot gnu.org X-Bugzilla-Status: ASSIGNED X-Bugzilla-Resolution: X-Bugzilla-Priority: P1 X-Bugzilla-Assigned-To: rsandifo at gcc dot gnu.org X-Bugzilla-Target-Milestone: 10.0 X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 02 Apr 2020 12:31:46 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D92264 --- Comment #39 from CVS Commits --- The master branch has been updated by Jakub Jelinek : https://gcc.gnu.org/g:2c0fa3ecf70d199af18785702e9e0548fd3ab793 commit r10-7515-g2c0fa3ecf70d199af18785702e9e0548fd3ab793 Author: Jakub Jelinek Date: Thu Apr 2 14:28:14 2020 +0200 cselib: Reuse VALUEs on sp adjustments [PR92264] As discussed in the PR, if !ACCUMULATE_OUTGOING_ARGS on large functions= we can have hundreds of thousands of stack pointer adjustments and cselib creates a new VALUE after each sp adjustment, which form extremely deep VALUE chains, which is very harmful e.g. for find_base_term. E.g. if we have sp -=3D 4 sp -=3D 4 sp +=3D 4 sp +=3D 4 sp -=3D 4 sp +=3D 4 that means 7 VALUEs, one for the sp at beginning (val1), than val2 =3D = val1 - 4, then val3 =3D val2 - 4, then val4 =3D val3 + 4, then val5 =3D val4 += 4, then val6 =3D val5 - 4, then val7 =3D val6 + 4. This patch tweaks cselib, so that it is smarter about sp adjustments. When cselib_lookup (stack_pointer_rtx, Pmode, 1, VOIDmode) and we know nothing about sp yet (this happens at the start of the function, for non-var-tracking also after cselib_reset_table and for var-tracking aft= er processing fp_setter insn where we forget about former sp values because that is now hfp related while everything after it is sp related), we look it up normally, but in addition to what we have been doing before we mark the VALUE as SP_DERIVED_VALUE_P. Further lookups of sp + offset are then special cased, so that it is canonicalized to that SP_DERIVED_VALUE_P VALUE + CONST_INT (if possible). So, for the above, we get val1 with SP_DERIVED_VALUE_P set, then val2 =3D val1 - 4, val3 = =3D val1 - 8 (note, no longer val2 - 4!), then we get val2 again, val1 again, val2 again, val1 again. In the find_base_term visited_vals.length () > 100 find_base_term statistics during combined x86_64-linux and i686-linux bootstrap+regtest cycle, without the patch I see: find_base_term > 100 returning NULL returning non-NULL 32-bit compilations 4229178 407 64-bit compilations 217523 0 with largest visited_vals.length () when returning non-NULL being 206. With the patch the same numbers are: 32-bit compilations 1249588 135 64-bit compilations 3510 0 with largest visited_vals.length () when returning non-NULL being 173. This shows significant reduction of the deep VALUE chains. On powerpc64{,le}-linux, these stats didn't change at all, we have 1008 0 for all of -m32, -m64 and little-endian -m64, just the gcc.dg/pr85180.c and gcc.dg/pr87985.c testcases which are unrelated to = sp. My earlier version of the patch, which contained just the rtl.h and cselib.c changes, regressed some tests: gcc.dg/guality/{pr36728-{1,3},pr68860-{1,2}}.c gcc.target/i386/{pr88416,sse-{13,23,24,25,26}}.c The problem with the former tests was worse debug info, where with -m32 where arg7 was passed in a stack slot we though a push later on might h= ave invalidated it, when it couldn't. This is something I've solved with t= he var-tracking.c (vt_initialize) changes. In those problematic functions= , we create a cfa_base VALUE (argp) and want to record that at the start of the function the argp VALUE is sp + off and also record that current sp VALUE is argp's VALUE - off. The second permanent equivalence didn't m= ake it after the patch though, because cselib_add_permanent_equiv will cselib_lookup the value of the expression it wants to add as the equivalence and if it is the same VALUE as we are calling it on, it doesn't do anything; and due to the cselib changes for sp based accesses that is exactly what happened. By reversing the order of the cselib_add_permanent_equiv cal= ls we get both equivalences though and thus are able to canonicalize the sp b= ased accesses in var-tracking to the cfa_base value + offset. The i386 FAILs were all ICEs, where we had pushf instruction pushing fl= ags and then pop pseudo reading that value again. With the cselib changes, cselib during RTL DSE is able to see through the sp adjustment and want= ed to replace_read what was done pushf, by moving the flags register into a pseudo and replace the memory read in the pop with that pseudo. That is wrong for two reasons: one is that the backend doesn't have an instruct= ion to move the flags hard register into some other register, but replace_r= ead has been validating just the mem -> pseudo replacement and not the insns emitted by copy_to_mode_reg. And the second issue is that it is obviou= sly wrong to replace a stack pop which contains stack post-increment by a c= opy of pseudo into destination. dse.c has some code to handle RTX_AUTOINC,= but only uses it when actually removing stores and only when there is REG_I= NC note (stack RTX_AUTOINC does not have those), in check_for_inc_dec* whe= re it emits the reg adjustment(s) before the insn that is going to be dele= ted. replace_read doesn't remove the insn, so if it e.g. contained REG_INC n= ote, it would be kept there and we might have the RTX_AUTOINC not just in *l= oc, but other spots. So, the dse.c changes try to validate the added insns and punt on all RTX_AUTOINC in *loc. Furthermore, it seems that with the cselib.c chan= ges on the gfortran.dg/pr87360.f90 and gcc.target/i386/pr88416.c testcases check_for_inc_dec{,_1} happily throws stack pointer autoinc on the floo= r, which is also wrong. While we could perhaps do the for_each_inc_dec call regardless of whether we have REG_INC note or not, we aren't prepa= red to handle e.g. REG_ARGS_SIZE distribution and thus could end up with wr= ong unwind info or ICEs during dwarf2cfi.c. So the patch also punts on tho= se, after all, if we'd in theory managed to try to optimize such pushes bef= ore, we'd create wrong-code. On x86_64-linux and i686-linux, the patch has some minor debug info coverage differences, but it doesn't appear very significant to me. https://github.com/pmachata/dwlocstat tool gives (where before is vanil= la trunk + the rtl.h patch but not {cselib,var-tracking,dse}.c --enable-checking=3Dyes,rtl,extra bootstrapped, then {cselib,var-tracking,dse}.c hunks applied and make cc1plus, while after is trunk with the whole pat= ch applied). 64-bit cc1plus before cov% samples cumul 0..10 1232756/48% 1232756/48% 11..20 31089/1% 1263845/49% 21..30 39172/1% 1303017/51% 31..40 38853/1% 1341870/52% 41..50 47473/1% 1389343/54% 51..60 45171/1% 1434514/56% 61..70 69393/2% 1503907/59% 71..80 61988/2% 1565895/61% 81..90 104528/4% 1670423/65% 91..100 875402/34% 2545825/100% after cov% samples cumul 0..10 1233238/48% 1233238/48% 11..20 31086/1% 1264324/49% 21..30 39157/1% 1303481/51% 31..40 38819/1% 1342300/52% 41..50 47447/1% 1389747/54% 51..60 45151/1% 1434898/56% 61..70 69379/2% 1504277/59% 71..80 61946/2% 1566223/61% 81..90 104508/4% 1670731/65% 91..100 875094/34% 2545825/100% 32-bit cc1plus before cov% samples cumul 0..10 1231221/48% 1231221/48% 11..20 30992/1% 1262213/49% 21..30 36422/1% 1298635/51% 31..40 35793/1% 1334428/52% 41..50 47102/1% 1381530/54% 51..60 41201/1% 1422731/56% 61..70 65467/2% 1488198/58% 71..80 59560/2% 1547758/61% 81..90 104076/4% 1651834/65% 91..100 881879/34% 2533713/100% after cov% samples cumul 0..10 1230469/48% 1230469/48% 11..20 30390/1% 1260859/49% 21..30 36362/1% 1297221/51% 31..40 36042/1% 1333263/52% 41..50 47619/1% 1380882/54% 51..60 41674/1% 1422556/56% 61..70 65849/2% 1488405/58% 71..80 59857/2% 1548262/61% 81..90 104178/4% 1652440/65% 91..100 881273/34% 2533713/100% 2020-04-02 Jakub Jelinek PR rtl-optimization/92264 * rtl.h (struct rtx_def): Mention that call bit is used as SP_DERIVED_VALUE_P in cselib.c. * cselib.c (SP_DERIVED_VALUE_P): Define. (PRESERVED_VALUE_P, SP_BASED_VALUE_P): Move definitions earlier. (cselib_hasher::equal): Handle equality between SP_DERIVED_VALU= E_P val_rtx and sp based expression where offsets cancel each other. (preserve_constants_and_equivs): Formatting fix. (cselib_reset_table): Add reverse op loc to SP_DERIVED_VALUE_P locs list for cfa_base_preserved_val if needed. Formatting fix. (autoinc_split): If the to be returned value is a REG, MEM or VALUE which has SP_DERIVED_VALUE_P + CONST_INT as one of its locs, return the SP_DERIVED_VALUE_P VALUE and adjust *off. (rtx_equal_for_cselib_1): Call autoinc_split even if both expressions are PLUS in Pmode with CONST_INT second operands. Handle SP_DERIVED_VALUE_P cases. (cselib_hash_plus_const_int): New function. (cselib_hash_rtx): Use it for PLUS in Pmode with CONST_INT second operand, as well as for PRE_DEC etc. that ought to be hashed the same way. (cselib_subst_to_values): Substitute PLUS with Pmode and CONST_INT operand if the first operand is a VALUE which has SP_DERIVED_VALUE_P + CONST_INT as one of its locs for the SP_DERIVED_VALUE_P + adjusted offset. (cselib_lookup_1): When creating a new VALUE for stack_pointer_= rtx, set SP_DERIVED_VALUE_P on it. Set PRESERVED_VALUE_P when adding SP_DERIVED_VALUE_P PRESERVED_VALUE_P subseted VALUE location. * var-tracking.c (vt_initialize): Call cselib_add_permanent_equ= iv on the sp value before calling cselib_add_permanent_equiv on the cfa_base value. * dse.c (check_for_inc_dec_1, check_for_inc_dec): Punt on RTX_AUTOINC in the insn without REG_INC note. (replace_read): Punt on RTX_AUTOINC in the *loc being replaced. Punt on invalid insns added by copy_to_mode_reg. Formatting fi= xes.=