From: "rguenth at gcc dot gnu.org"
To: glibc-bugs@sourceware.org
Subject: [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
Date: Mon, 01 Mar 2021 14:49:48 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: string
X-Bugzilla-Version: 2.31
X-Bugzilla-Status: NEW
X-Bugzilla-Assigned-To: hjl.tools at gmail dot com
X-Bugzilla-Target-Milestone: 2.34

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #20 from Richard Biener ---
(In reply to H.J. Lu from comment #18)
> (In reply to rguenther from comment #16)
> > On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > 
> > > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> > > 
> > > --- Comment #15 from H.J. Lu ---
> > > (In reply to Richard Biener from comment #14)
> > > > 
> > > > Note that according to Agner, vzeroall, for example on Haswell, decodes to
> > > > 20 uops while vzeroupper only requires 4. On Skylake it's even worse
> > > > (34 uops).
> > > > For short sizes (as in our benchmark, which had 16-31 byte
> > > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > > off an early xtest result. That said, why not, for HTM + AVX2 CPUs,
> > > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > > using xtest? That leaves the actual implementations unchanged and thus
> > > > with known performance characteristics.
> > > 
> > > It is implemented on the users/hjl/pr27457/wrapper branch:
> > > 
> > > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > > 
> > > There are 2 problems:
> > > 
> > > 1. Many RTM tests failed for other reasons.
> > > 2. Even with the vzeroall overhead, the AVX version may still be faster
> > >    than the SSE version.
> > 
> > And the SSE version may still be faster than the AVX version with
> > vzeroall.
> 
> Here is some data:
> 
> Function: strcmp
> Variant: default
>                                       __strcmp_avx2  __strcmp_sse2_unaligned
> length=14, align1=14, align2=14:      11.36          17.50
> length=14, align1=14, align2=14:      11.36          15.59
> length=14, align1=14, align2=14:      11.43          15.55
> length=15, align1=15, align2=15:      11.36          17.42
> length=15, align1=15, align2=15:      11.96          17.41
> length=15, align1=15, align2=15:      11.36          16.97
> length=16, align1=16, align2=16:      11.36          18.58
> length=16, align1=16, align2=16:      11.36          17.41
> length=16, align1=16, align2=16:      11.43          17.34
> length=17, align1=17, align2=17:      11.36          21.37
> length=17, align1=17, align2=17:      11.36          18.52
> length=17, align1=17, align2=17:      11.36          17.94
> length=18, align1=18, align2=18:      11.36          19.73
> length=18, align1=18, align2=18:      11.36          19.20
> length=18, align1=18, align2=18:      11.36          19.13
> length=19, align1=19, align2=19:      11.36          20.38
> length=19, align1=19, align2=19:      11.36          19.39
> length=19, align1=19, align2=19:      11.36          20.39
> length=20, align1=20, align2=20:      11.36          21.53
> length=20, align1=20, align2=20:      11.36          20.98
> length=20, align1=20, align2=20:      11.36          20.93
> length=21, align1=21, align2=21:      11.36          22.83
> length=21, align1=21, align2=21:      11.36          22.26
> length=21, align1=21, align2=21:      11.36          22.25
> length=22, align1=22, align2=22:      11.43          23.37
> length=22, align1=22, align2=22:      11.36          22.78
> length=22, align1=22, align2=22:      12.29          22.12
> length=23, align1=23, align2=23:      11.36          24.63
> length=23, align1=23, align2=23:      12.53          23.97
> length=23, align1=23, align2=23:      11.36          23.97
> length=24, align1=24, align2=24:      11.36          24.52
> length=24, align1=24, align2=24:      11.36          43.47
> length=24, align1=24, align2=24:      11.36          44.47
> length=25, align1=25, align2=25:      11.36          39.50
> length=25, align1=25, align2=25:      11.36          48.97
> length=25, align1=25, align2=25:      11.36          48.53
> length=26, align1=26, align2=26:      11.36          47.87
> length=26, align1=26, align2=26:      11.36          47.20
> length=26, align1=26, align2=26:      11.36          47.15
> length=27, align1=27, align2=27:      11.36          50.90
> length=27, align1=27, align2=27:      11.44          49.98
> length=27, align1=27, align2=27:      11.36          49.77
> length=28, align1=28, align2=28:      11.36          49.74
> length=28, align1=28, align2=28:      11.36          48.86
> length=28, align1=28, align2=28:      11.36          49.08
> length=29, align1=29, align2=29:      11.36          52.74
> length=29, align1=29, align2=29:      11.36          54.04
> length=29, align1=29, align2=29:      11.36          29.49
> length=30, align1=30, align2=30:      11.36          50.91
> length=30, align1=30, align2=30:      11.36          51.09
> length=30, align1=30, align2=30:      11.36          51.13
> length=31, align1=31, align2=31:      12.36          54.33
> length=31, align1=31, align2=31:      11.36          53.49
> length=31, align1=31, align2=31:      11.36          53.29
> length=16, align1=0, align2=0:        11.36          18.02
> length=16, align1=0, align2=0:        11.36          18.58
> length=16, align1=0, align2=0:        11.36          17.34
> length=16, align1=0, align2=0:        11.44          19.88
> length=16, align1=0, align2=0:        11.36          16.74
> length=16, align1=0, align2=0:        11.36          17.42
> length=16, align1=0, align2=3:        11.36          17.34
> length=16, align1=3, align2=4:        11.36          17.34
> length=32, align1=0, align2=0:        12.29          61.07
> length=32, align1=0, align2=0:        12.63          61.08
> length=32, align1=0, align2=0:        11.36          60.48
> length=32, align1=0, align2=0:        11.36          60.48
> length=32, align1=0, align2=0:        11.36          60.40
> length=32, align1=0, align2=0:        11.36          60.40
> length=32, align1=0, align2=4:        11.36          60.40
> length=32, align1=4, align2=5:        12.10          59.72

Is that with or without the vzeroall actually executing?

> > I guess we should mostly care about optimizing for "modern" CPUs,
> > which likely means HTM + AVX512, which should already be optimal
> > on your branches by using %ymm16+. So we're talking about
> > the "legacy" AVX2 + HTM path.
> > 
> > And there I think we should optimize the path that is _not_ in
> > a transaction, since that will be 99% of the cases. Which to
> > me means using the proven, tuned (on their respective ISA subsets)
> > SSE2 and AVX2 variants and simply switching between them based on
> > xtest. Yeah, so strcmp of a large string inside a transaction
> 
> I tried it and I got RTM aborts for other reasons.
> 
> > might not run at optimal AVX2 speed. But it will be faster
> > than before the xtest dispatch, since before that it would have
> > aborted the transaction.
> 
> Please give my current approach a try.
Well, I know that even unconditionally doing vzeroall will fix our
observed regression, since the time is dominated by all the other code
inside the transaction, which is then retried a few times (and always
re-fails with vzeroupper); the strcmp part is just ~1%. I also only have
AVX512 HW with HTM, so I can't easily test the AVX2 + HTM path.

That said, I'm fine with the xtest/vzero{all,upper} epilogue.

But I have also received reports of libmicro regressions for strcpy with
length 10 (32-byte aligned), hot cache, when using AVX2 vs. SSE2 (SSE2
being faster by 20%). [note strcpy, not strcmp here] Yeah, a stupid
benchmark... but it likely shows that for small lengths every detail
matters in case you want to shave off the last ns.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
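As a footnote, the xtest/vzero{all,upper} epilogue being discussed could look roughly like this inside an AVX2 string routine (a sketch in AT&T syntax; the label is illustrative, not glibc's actual macro). xtest clears ZF when the processor is inside an RTM transaction:

```
	/* End of an AVX2 string function.  */
	xtest			/* ZF=1 if NOT in an RTM transaction */
	jz	1f		/* not in a transaction: fast path */
	vzeroall		/* transaction-safe, but many more uops */
	ret
1:
	vzeroupper		/* cheap, but would abort an RTM transaction */
	ret
```

This keeps the bodies of the AVX2 functions unchanged and pays the xtest cost only once, in the return path.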