From: "rguenth at gcc dot gnu.org"
To: glibc-bugs@sourceware.org
Subject: [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
Date: Mon, 01 Mar 2021 14:49:48 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: string
X-Bugzilla-Version: 2.31
X-Bugzilla-Status: NEW
X-Bugzilla-Assigned-To: hjl.tools at gmail dot com
X-Bugzilla-Target-Milestone: 2.34

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #20 from Richard Biener ---
(In reply to H.J. Lu from comment #18)
> (In reply to rguenther from comment #16)
> > On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > 
> > > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> > > 
> > > --- Comment #15 from H.J. Lu ---
> > > (In reply to Richard Biener from comment #14)
> > > > 
> > > > Note that according to Agner, vzeroall, for example on Haswell, decodes to
> > > > 20 uops while vzeroupper only requires 4. On Skylake it's even worse
> > > > (34 uops).
> > > > For short sizes (as in our benchmark, which had 16-31 byte
> > > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > > off an early xtest result. That said, why not, for HTM + AVX2 CPUs,
> > > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > > using xtest? That leaves the actual implementations unchanged and thus
> > > > with known performance characteristics.
> > > 
> > > It is implemented on the users/hjl/pr27457/wrapper branch:
> > > 
> > > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > > 
> > > There are 2 problems:
> > > 
> > > 1. Many RTM tests failed for other reasons.
> > > 2. Even with the vzeroall overhead, the AVX version may still be faster
> > >    than the SSE version.
> > 
> > And the SSE version may still be faster than the AVX version with
> > vzeroall.
> 
> Here is some data:
> 
> Function: strcmp
> Variant: default
>                                       __strcmp_avx2  __strcmp_sse2_unaligned
> length=14, align1=14, align2=14:      11.36          17.50
> length=14, align1=14, align2=14:      11.36          15.59
> length=14, align1=14, align2=14:      11.43          15.55
> length=15, align1=15, align2=15:      11.36          17.42
> length=15, align1=15, align2=15:      11.96          17.41
> length=15, align1=15, align2=15:      11.36          16.97
> length=16, align1=16, align2=16:      11.36          18.58
> length=16, align1=16, align2=16:      11.36          17.41
> length=16, align1=16, align2=16:      11.43          17.34
> length=17, align1=17, align2=17:      11.36          21.37
> length=17, align1=17, align2=17:      11.36          18.52
> length=17, align1=17, align2=17:      11.36          17.94
> length=18, align1=18, align2=18:      11.36          19.73
> length=18, align1=18, align2=18:      11.36          19.20
> length=18, align1=18, align2=18:      11.36          19.13
> length=19, align1=19, align2=19:      11.36          20.38
> length=19, align1=19, align2=19:      11.36          19.39
> length=19, align1=19, align2=19:      11.36          20.39
> length=20, align1=20, align2=20:      11.36          21.53
> length=20, align1=20, align2=20:      11.36          20.98
> length=20, align1=20, align2=20:      11.36          20.93
> length=21, align1=21, align2=21:      11.36          22.83
> length=21, align1=21, align2=21:      11.36          22.26
> length=21, align1=21, align2=21:      11.36          22.25
> length=22, align1=22, align2=22:      11.43          23.37
> length=22, align1=22, align2=22:      11.36          22.78
> length=22, align1=22, align2=22:      12.29          22.12
> length=23, align1=23, align2=23:      11.36          24.63
> length=23, align1=23, align2=23:      12.53          23.97
> length=23, align1=23, align2=23:      11.36          23.97
> length=24, align1=24, align2=24:      11.36          24.52
> length=24, align1=24, align2=24:      11.36          43.47
> length=24, align1=24, align2=24:      11.36          44.47
> length=25, align1=25, align2=25:      11.36          39.50
> length=25, align1=25, align2=25:      11.36          48.97
> length=25, align1=25, align2=25:      11.36          48.53
> length=26, align1=26, align2=26:      11.36          47.87
> length=26, align1=26, align2=26:      11.36          47.20
> length=26, align1=26, align2=26:      11.36          47.15
> length=27, align1=27, align2=27:      11.36          50.90
> length=27, align1=27, align2=27:      11.44          49.98
> length=27, align1=27, align2=27:      11.36          49.77
> length=28, align1=28, align2=28:      11.36          49.74
> length=28, align1=28, align2=28:      11.36          48.86
> length=28, align1=28, align2=28:      11.36          49.08
> length=29, align1=29, align2=29:      11.36          52.74
> length=29, align1=29, align2=29:      11.36          54.04
> length=29, align1=29, align2=29:      11.36          29.49
> length=30, align1=30, align2=30:      11.36          50.91
> length=30, align1=30, align2=30:      11.36          51.09
> length=30, align1=30, align2=30:      11.36          51.13
> length=31, align1=31, align2=31:      12.36          54.33
> length=31, align1=31, align2=31:      11.36          53.49
> length=31, align1=31, align2=31:      11.36          53.29
> length=16, align1=0, align2=0:        11.36          18.02
> length=16, align1=0, align2=0:        11.36          18.58
> length=16, align1=0, align2=0:        11.36          17.34
> length=16, align1=0, align2=0:        11.44          19.88
> length=16, align1=0, align2=0:        11.36          16.74
> length=16, align1=0, align2=0:        11.36          17.42
> length=16, align1=0, align2=3:        11.36          17.34
> length=16, align1=3, align2=4:        11.36          17.34
> length=32, align1=0, align2=0:        12.29          61.07
> length=32, align1=0, align2=0:        12.63          61.08
> length=32, align1=0, align2=0:        11.36          60.48
> length=32, align1=0, align2=0:        11.36          60.48
> length=32, align1=0, align2=0:        11.36          60.40
> length=32, align1=0, align2=0:        11.36          60.40
> length=32, align1=0, align2=4:        11.36          60.40
> length=32, align1=4, align2=5:        12.10          59.72

Is that with or without the vzeroall actually executing?

> > I guess we should mostly care about optimizing for "modern" CPUs,
> > which likely means HTM + AVX512, which should already be optimal
> > on your branches by using %ymm16+. So we're talking about
> > the "legacy" AVX2 + HTM path.
> > 
> > And there I think we should optimize the path that is _not_ in
> > a transaction, since that will be 99% of the cases. Which to
> > me means using the proven, tuned (on their respective ISA subsets)
> > SSE2 and AVX2 variants and simply switching between them based on
> > xtest. Yeah, so strcmp of a large string inside a transaction
> 
> I tried it and I got RTM aborts for other reasons.
> 
> > might not run at optimal AVX2 speed. But it will be faster
> > than before the xtest dispatch, since before that it would have
> > aborted the transaction.
> 
> Please give my current approach a try.
Well, I know that even unconditionally doing vzeroall will fix our
observed regression, since the time is dominated by all the other code
inside the transaction, which is then retried a few times (and always
re-fails with vzeroupper); the strcmp part is just ~1%. I also only have
AVX512 HW with HTM, so I can't easily test the AVX2 + HTM path.

That said, I'm fine with the xtest/vzero{all,upper} epilogue.

But I have also received reports of libmicro regressions for strcpy with
length 10 (32-byte aligned), hot cache, when using AVX2 vs. SSE2 (SSE2
being faster by 20%). [note strcpy, not strcmp here] Yeah, a stupid
benchmark... but it likely shows that for small lengths every detail
matters in case you want to shave off the last ns.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
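As a footnote, the xtest/vzero{all,upper} epilogue being discussed could look roughly like this inside an AVX2 string routine (a sketch in AT&T syntax; the label is illustrative, not glibc's actual macro). xtest clears ZF when the processor is inside an RTM transaction:

```
	/* End of an AVX2 string function.  */
	xtest			/* ZF=1 if NOT in an RTM transaction */
	jz	1f		/* not in a transaction: fast path */
	vzeroall		/* transaction-safe, but many more uops */
	ret
1:
	vzeroupper		/* cheap, but would abort an RTM transaction */
	ret
```

This keeps the bodies of the AVX2 functions unchanged and pays the xtest cost only once, in the return path.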