From: "hjl.tools at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
Date: Mon, 01 Mar 2021 14:37:33 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: string
X-Bugzilla-Version: 2.31
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: hjl.tools at gmail dot com
X-Bugzilla-Target-Milestone: 2.34

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #18 from H.J. Lu ---
(In reply to rguenther from comment #16)
> On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
>
> > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> >
> > --- Comment #15 from H.J. Lu ---
> > (In reply to Richard Biener from comment #14)
> > >
> > > Note according to Agner vzeroall, for example on Haswell, decodes to
> > > 20 uops while vzeroupper only requires 4. On Skylake it's even worse
> > > (34 uops).
> > > For short sizes (as in our benchmark which had 16-31 byte
> > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > off an early xtest result. That said, why not, for HTM + AVX2 CPUs,
> > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > using xtest? That leaves the actual implementations unchanged and thus
> > > with known performance characteristic?
> >
> > It is implemented on users/hjl/pr27457/wrapper branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> >
> > There are 2 problems:
> >
> > 1. Many RTM tests failed for other reasons.
> > 2. Even with vzeroall overhead, AVX version may still be faster than
> >    SSE version.
>
> And the SSE version may still be faster than the AVX version with
> vzeroall.

Here is some data:

Function: strcmp
Variant: default
                                   __strcmp_avx2  __strcmp_sse2_unaligned
length=14, align1=14, align2=14:       11.36          17.50
length=14, align1=14, align2=14:       11.36          15.59
length=14, align1=14, align2=14:       11.43          15.55
length=15, align1=15, align2=15:       11.36          17.42
length=15, align1=15, align2=15:       11.96          17.41
length=15, align1=15, align2=15:       11.36          16.97
length=16, align1=16, align2=16:       11.36          18.58
length=16, align1=16, align2=16:       11.36          17.41
length=16, align1=16, align2=16:       11.43          17.34
length=17, align1=17, align2=17:       11.36          21.37
length=17, align1=17, align2=17:       11.36          18.52
length=17, align1=17, align2=17:       11.36          17.94
length=18, align1=18, align2=18:       11.36          19.73
length=18, align1=18, align2=18:       11.36          19.20
length=18, align1=18, align2=18:       11.36          19.13
length=19, align1=19, align2=19:       11.36          20.38
length=19, align1=19, align2=19:       11.36          19.39
length=19, align1=19, align2=19:       11.36          20.39
length=20, align1=20, align2=20:       11.36          21.53
length=20, align1=20, align2=20:       11.36          20.98
length=20, align1=20, align2=20:       11.36          20.93
length=21, align1=21, align2=21:       11.36          22.83
length=21, align1=21, align2=21:       11.36          22.26
length=21, align1=21, align2=21:       11.36          22.25
length=22, align1=22, align2=22:       11.43          23.37
length=22, align1=22, align2=22:       11.36          22.78
length=22, align1=22, align2=22:       12.29          22.12
length=23, align1=23, align2=23:       11.36          24.63
length=23, align1=23, align2=23:       12.53          23.97
length=23, align1=23, align2=23:       11.36          23.97
length=24, align1=24, align2=24:       11.36          24.52
length=24, align1=24, align2=24:       11.36          43.47
length=24, align1=24, align2=24:       11.36          44.47
length=25, align1=25, align2=25:       11.36          39.50
length=25, align1=25, align2=25:       11.36          48.97
length=25, align1=25, align2=25:       11.36          48.53
length=26, align1=26, align2=26:       11.36          47.87
length=26, align1=26, align2=26:       11.36          47.20
length=26, align1=26, align2=26:       11.36          47.15
length=27, align1=27, align2=27:       11.36          50.90
length=27, align1=27, align2=27:       11.44          49.98
length=27, align1=27, align2=27:       11.36          49.77
length=28, align1=28, align2=28:       11.36          49.74
length=28, align1=28, align2=28:       11.36          48.86
length=28, align1=28, align2=28:       11.36          49.08
length=29, align1=29, align2=29:       11.36          52.74
length=29, align1=29, align2=29:       11.36          54.04
length=29, align1=29, align2=29:       11.36          29.49
length=30, align1=30, align2=30:       11.36          50.91
length=30, align1=30, align2=30:       11.36          51.09
length=30, align1=30, align2=30:       11.36          51.13
length=31, align1=31, align2=31:       12.36          54.33
length=31, align1=31, align2=31:       11.36          53.49
length=31, align1=31, align2=31:       11.36          53.29
length=16, align1=0, align2=0:         11.36          18.02
length=16, align1=0, align2=0:         11.36          18.58
length=16, align1=0, align2=0:         11.36          17.34
length=16, align1=0, align2=0:         11.44          19.88
length=16, align1=0, align2=0:         11.36          16.74
length=16, align1=0, align2=0:         11.36          17.42
length=16, align1=0, align2=3:         11.36          17.34
length=16, align1=3, align2=4:         11.36          17.34
length=32, align1=0, align2=0:         12.29          61.07
length=32, align1=0, align2=0:         12.63          61.08
length=32, align1=0, align2=0:         11.36          60.48
length=32, align1=0, align2=0:         11.36          60.48
length=32, align1=0, align2=0:         11.36          60.40
length=32, align1=0, align2=0:         11.36          60.40
length=32, align1=0, align2=4:         11.36          60.40
length=32, align1=4, align2=5:         12.10          59.72

> I guess we should mostly care about optimizing for "modern" CPUs
> which likely means HTM + AVX512 which should be already optimal
> on your branches by using %ymm16+. So we're talking about
> the "legacy" AVX2 + HTM path.
>
> And there I think we should optimize the path that is _not_ in
> a transaction since that will be 99% of the cases. Which to
> me means using the proven tuned (on their respective ISA subsets)
> SSE2 and AVX2 variants and simply switch between them based on
> xtest.
> Yeah, so strcmp of a large string inside a transaction

I tried it and got RTM aborts for other reasons.

> might not run at optimal AVX2 speed. But it will be faster
> than before the xtest dispatch since before that it would have
> aborted the transaction.

Please give my current approach a try.
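
For reference, the xtest-based dispatch discussed in this thread could look
roughly like the sketch below. It is an illustration only and is not the code
on the users/hjl/pr27457/wrapper branch. All names in it (txn_probe_fn,
strcmp_avx2_standin, strcmp_sse2_standin, select_strcmp, strcmp_dispatch,
probe_inside, probe_outside) are hypothetical. The transaction probe is passed
in as a function pointer so the sketch also builds and runs on machines
without RTM; on RTM hardware the probe would presumably wrap the xtest
instruction, e.g. _xtest () from <immintrin.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe type: returns nonzero when the caller is inside
   an RTM transaction.  Assumption: on real hardware this would be a
   thin wrapper around the xtest instruction.  */
typedef int (*txn_probe_fn) (void);

typedef int (*strcmp_fn) (const char *, const char *);

/* Stand-ins for the two tuned variants.  Both just call strcmp here;
   the real variants are separate hand-written implementations.  */
int
strcmp_avx2_standin (const char *a, const char *b)
{
  return strcmp (a, b);
}

int
strcmp_sse2_standin (const char *a, const char *b)
{
  return strcmp (a, b);
}

/* Choose the variant for one call: the SSE2 variant inside a
   transaction (it needs no vzeroupper), the AVX2 variant otherwise.  */
strcmp_fn
select_strcmp (txn_probe_fn in_transaction)
{
  assert (in_transaction != NULL);
  return in_transaction () ? strcmp_sse2_standin : strcmp_avx2_standin;
}

/* Intermediate dispatcher: probe first, then tail into the variant.  */
int
strcmp_dispatch (const char *a, const char *b, txn_probe_fn in_transaction)
{
  return select_strcmp (in_transaction) (a, b);
}

/* Fixed probes used only to exercise both paths of the sketch.  */
int
probe_outside (void)
{
  return 0;
}

int
probe_inside (void)
{
  return 1;
}
```

With this shape, select_strcmp (probe_outside) yields the AVX2 stand-in and
select_strcmp (probe_inside) yields the SSE2 stand-in, so the tuned
implementations themselves stay untouched and only the per-call probe is
added on top.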