From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 48180384B808; Mon,  1 Mar 2021 14:37:33 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 48180384B808
From: "hjl.tools at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug string/27457] vzeroupper use in AVX2 multiarch string functions
 cause HTM aborts
Date: Mon, 01 Mar 2021 14:37:33 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: string
X-Bugzilla-Version: 2.31
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hjl.tools at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: hjl.tools at gmail dot com
X-Bugzilla-Target-Milestone: 2.34
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-27457-131-GaRqP7vqgT@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-27457-131@http.sourceware.org/bugzilla/>
References: <bug-27457-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: glibc-bugs@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Glibc-bugs mailing list <glibc-bugs.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/glibc-bugs>,
 <mailto:glibc-bugs-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/glibc-bugs/>
List-Help: <mailto:glibc-bugs-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/glibc-bugs>,
 <mailto:glibc-bugs-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 01 Mar 2021 14:37:33 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #18 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to rguenther from comment #16)
> On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
>
> > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> >
> > --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> > (In reply to Richard Biener from comment #14)
> > >
> > > Note according to Agner vzeroall, for example on Haswell, decodes to
> > > 20 uops while vzeroupper only requires 4.  On Skylake it's even worse
> > > (34 uops).  For short sizes (as in our benchmark which had 16-31 byte
> > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > off an early xtest result.  That said, why not, for HTM + AVX2 CPUs,
> > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > using xtest?  That leaves the actual implementations unchanged and thus
> > > with known performance characteristic?
> >
> > It is implemented on users/hjl/pr27457/wrapper branch:
> >
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> >
> > There are 2 problems:
> >
> > 1. Many RTM tests failed for other reasons.
> > 2. Even with vzeroall overhead, AVX version may still be faster than
> > SSE version.
>
> And the SSE version may still be faster than the AVX version with
> vzeroall.

Here is some data:

Function: strcmp
Variant: default
                                       __strcmp_avx2    __strcmp_sse2_unaligned
     length=14, align1=14, align2=14:        11.36             17.50
     length=14, align1=14, align2=14:        11.36             15.59
     length=14, align1=14, align2=14:        11.43             15.55
     length=15, align1=15, align2=15:        11.36             17.42
     length=15, align1=15, align2=15:        11.96             17.41
     length=15, align1=15, align2=15:        11.36             16.97
     length=16, align1=16, align2=16:        11.36             18.58
     length=16, align1=16, align2=16:        11.36             17.41
     length=16, align1=16, align2=16:        11.43             17.34
     length=17, align1=17, align2=17:        11.36             21.37
     length=17, align1=17, align2=17:        11.36             18.52
     length=17, align1=17, align2=17:        11.36             17.94
     length=18, align1=18, align2=18:        11.36             19.73
     length=18, align1=18, align2=18:        11.36             19.20
     length=18, align1=18, align2=18:        11.36             19.13
     length=19, align1=19, align2=19:        11.36             20.38
     length=19, align1=19, align2=19:        11.36             19.39
     length=19, align1=19, align2=19:        11.36             20.39
     length=20, align1=20, align2=20:        11.36             21.53
     length=20, align1=20, align2=20:        11.36             20.98
     length=20, align1=20, align2=20:        11.36             20.93
     length=21, align1=21, align2=21:        11.36             22.83
     length=21, align1=21, align2=21:        11.36             22.26
     length=21, align1=21, align2=21:        11.36             22.25
     length=22, align1=22, align2=22:        11.43             23.37
     length=22, align1=22, align2=22:        11.36             22.78
     length=22, align1=22, align2=22:        12.29             22.12
     length=23, align1=23, align2=23:        11.36             24.63
     length=23, align1=23, align2=23:        12.53             23.97
     length=23, align1=23, align2=23:        11.36             23.97
     length=24, align1=24, align2=24:        11.36             24.52
     length=24, align1=24, align2=24:        11.36             43.47
     length=24, align1=24, align2=24:        11.36             44.47
     length=25, align1=25, align2=25:        11.36             39.50
     length=25, align1=25, align2=25:        11.36             48.97
     length=25, align1=25, align2=25:        11.36             48.53
     length=26, align1=26, align2=26:        11.36             47.87
     length=26, align1=26, align2=26:        11.36             47.20
     length=26, align1=26, align2=26:        11.36             47.15
     length=27, align1=27, align2=27:        11.36             50.90
     length=27, align1=27, align2=27:        11.44             49.98
     length=27, align1=27, align2=27:        11.36             49.77
     length=28, align1=28, align2=28:        11.36             49.74
     length=28, align1=28, align2=28:        11.36             48.86
     length=28, align1=28, align2=28:        11.36             49.08
     length=29, align1=29, align2=29:        11.36             52.74
     length=29, align1=29, align2=29:        11.36             54.04
     length=29, align1=29, align2=29:        11.36             29.49
     length=30, align1=30, align2=30:        11.36             50.91
     length=30, align1=30, align2=30:        11.36             51.09
     length=30, align1=30, align2=30:        11.36             51.13
     length=31, align1=31, align2=31:        12.36             54.33
     length=31, align1=31, align2=31:        11.36             53.49
     length=31, align1=31, align2=31:        11.36             53.29
       length=16, align1=0, align2=0:        11.36             18.02
       length=16, align1=0, align2=0:        11.36             18.58
       length=16, align1=0, align2=0:        11.36             17.34
       length=16, align1=0, align2=0:        11.44             19.88
       length=16, align1=0, align2=0:        11.36             16.74
       length=16, align1=0, align2=0:        11.36             17.42
       length=16, align1=0, align2=3:        11.36             17.34
       length=16, align1=3, align2=4:        11.36             17.34
       length=32, align1=0, align2=0:        12.29             61.07
       length=32, align1=0, align2=0:        12.63             61.08
       length=32, align1=0, align2=0:        11.36             60.48
       length=32, align1=0, align2=0:        11.36             60.48
       length=32, align1=0, align2=0:        11.36             60.40
       length=32, align1=0, align2=0:        11.36             60.40
       length=32, align1=0, align2=4:        11.36             60.40
       length=32, align1=4, align2=5:        12.10             59.72
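
One way to read the two columns is as a ratio. The sketch below is only an
illustration: the struct and function names are made up for this mail, and
the three rows are the first sample listed at lengths 14, 24 and 32 in the
table above.

```c
/* One benchmark row: string length and the two timings as reported. */
struct row {
    int length;
    double avx2;   /* __strcmp_avx2 column */
    double sse2;   /* __strcmp_sse2_unaligned column */
};

/* First sample listed at lengths 14, 24 and 32, copied from the table. */
static const struct row sample_rows[] = {
    { 14, 11.36, 17.50 },
    { 24, 11.36, 24.52 },
    { 32, 12.29, 61.07 },
};

/* How many times longer the SSE2 variant took than the AVX2 variant. */
static double sse2_over_avx2(const struct row *r)
{
    return r->sse2 / r->avx2;
}
```

For these three rows the ratio comes out at roughly 1.5, 2.2 and 5.0.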

> I guess we should mostly care about optimizing for "modern" CPUs
> which likely means HTM + AVX512 which should be already optimal
> on your branches by using %ymm16+.  So we're talking about
> the "legacy" AVX2 + HTM path.
>
> And there I think we should optimize the path that is _not_ in
> a transaction since that will be 99% of the cases.  Which to
> me means using the proven tuned (on their respective ISA subsets)
> SSE2 and AVX2 variants and simply switch between them based on
> xtest.  Yeah, so strcmp of a large string inside an transaction

I tried it and got RTM aborts for other reasons.

> might not run at optimal AVX2 speed.  But it will be faster
> than before the xtest dispatch since before that it would have
> aborted the transaction.
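
For readers following along, the xtest-based switch being discussed could be
sketched in C roughly as below. This is an assumption-laden illustration, not
the code on the wrapper branch: the function names are hypothetical, the two
"variants" simply forward to strcmp so the sketch runs anywhere, and the
CPUID check is there because xtest raises #UD on CPUs without RTM/HLE.

```c
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Nonzero if CPUID leaf 7, subleaf 0 reports RTM (EBX bit 11). */
static int cpu_has_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 11) & 1;
}

/* Nonzero if executing inside a transaction.
   xtest clears ZF when a transaction is active. */
static int in_transaction(void)
{
    unsigned char active;
    if (!cpu_has_rtm())
        return 0;                 /* never execute xtest without RTM */
    __asm__ volatile("xtest\n\tsetnz %0" : "=r"(active) : : "cc");
    return active;
}
#else
/* Non-x86-64 builds: there is no transaction to detect. */
static int in_transaction(void) { return 0; }
#endif

typedef int (*strcmp_fn)(const char *, const char *);

/* Hypothetical stand-ins for __strcmp_avx2 and __strcmp_sse2_unaligned. */
static int strcmp_avx2_standin(const char *a, const char *b)
{
    return strcmp(a, b);
}

static int strcmp_sse2_standin(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Inside a transaction take the SSE2 path (no vzeroupper);
   otherwise take the AVX2 path unchanged. */
static int strcmp_xtest_dispatch(const char *a, const char *b)
{
    strcmp_fn impl = in_transaction() ? strcmp_sse2_standin
                                      : strcmp_avx2_standin;
    return impl(a, b);
}
```

The cost of this shape is one extra test and branch on every call, including
the common case outside a transaction.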

Please give my current approach a try.

-- 
You are receiving this mail because:
You are on the CC list for the bug.