From: "rguenther at suse dot de"
To: glibc-bugs@sourceware.org
Subject: [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
Date: Mon, 01 Mar 2021 14:14:38 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #16 from rguenther at suse dot de ---
On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:

> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
> --- Comment #15 from H.J. Lu ---
> (In reply to Richard Biener from comment #14)
> >
> > Note according to Agner vzeroall, for example on Haswell, decodes to
> > 20 uops while vzeroupper only requires 4. On Skylake it's even worse
> > (34 uops). For short sizes (as in our benchmark which had 16-31 byte
> > strcmp) this might be a bigger difference than using the SSE2 variant
> > off an early xtest result.
> > That said, why not, for HTM + AVX2 CPUs,
> > have an intermediate dispatcher between the AVX2 and the SSE variant
> > using xtest? That leaves the actual implementations unchanged and thus
> > with known performance characteristic?
>
> It is implemented on users/hjl/pr27457/wrapper branch:
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
>
> There are 2 problems:
>
> 1. Many RTM tests failed for other reasons.
> 2. Even with vzeroall overhead, AVX version may still be faster than
> SSE version.

And the SSE version may still be faster than the AVX version with
vzeroall.

I guess we should mostly care about optimizing for "modern" CPUs,
which likely means HTM + AVX512, and that should already be optimal
on your branches by using %ymm16+.

So we're talking about the "legacy" AVX2 + HTM path. There I think
we should optimize the path that is _not_ in a transaction, since that
will be 99% of the cases. To me that means using the proven SSE2 and
AVX2 variants, each tuned on its respective ISA subset, and simply
switching between them based on xtest.

Yeah, so strcmp of a large string inside a transaction might not run
at optimal AVX2 speed. But it will be faster than before the xtest
dispatch, since before that it would have aborted the transaction.
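
To make the proposal concrete, here is a rough C sketch of an xtest-based dispatcher. It is not what is on the wrapper branch: the names strcmp_sse2_stub, strcmp_avx2_stub, strcmp_xtest_dispatch, cpu_has_rtm and last_variant are all made up for illustration, and both "variants" just call the library strcmp. The real thing would be an assembly wrapper selected by the ifunc resolver on AVX2 + RTM CPUs, so the CPUID check would not be repeated per call.

```c
#include <stdbool.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support. */
static bool cpu_has_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 11) & 1u) != 0;
}

/* xtest raises #UD on CPUs without RTM/HLE, so it is only executed
   after the CPUID check.  _xtest() is nonzero inside a transaction.  */
__attribute__((target("rtm")))
static bool in_transaction(void)
{
    return cpu_has_rtm() && _xtest() != 0;
}
#else
/* No TSX on other architectures: never in a transaction. */
static bool in_transaction(void)
{
    return false;
}
#endif

/* Hypothetical stand-ins for the tuned SSE2 and AVX2 implementations.
   last_variant only records the choice so the sketch can be checked.  */
enum variant { VARIANT_SSE2, VARIANT_AVX2 };
static enum variant last_variant;

static int strcmp_sse2_stub(const char *a, const char *b)
{
    last_variant = VARIANT_SSE2;   /* no vzeroupper needed on this path */
    return strcmp(a, b);
}

static int strcmp_avx2_stub(const char *a, const char *b)
{
    last_variant = VARIANT_AVX2;   /* real code ends with vzeroupper */
    return strcmp(a, b);
}

/* Intermediate dispatcher: inside a transaction take the SSE2 variant,
   so vzeroupper never runs there; otherwise take the AVX2 variant.  */
static int strcmp_xtest_dispatch(const char *a, const char *b)
{
    return in_transaction() ? strcmp_sse2_stub(a, b)
                            : strcmp_avx2_stub(a, b);
}
```

Called outside a transaction, strcmp_xtest_dispatch("abc", "abd") goes to the AVX2 stand-in; the only cost added to that common path is the xtest and a branch, and both tuned implementations stay untouched.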