From mboxrd@z Thu Jan 1 00:00:00 1970
From: Darren Tristano
To: Noah Goldstein, Libc-stable Mailing List, Hongjiu Lu, Sunil Pandey
CC: GNU C Library
Subject: Re: [PATCH v5 2/2] x86: Optimize strlen-avx2.S
Date: Wed, 28 Sep 2022 14:02:30 +0000
References: <20210419233607.916848-1-goldstein.w.n@gmail.com> <20210419233607.916848-2-goldstein.w.n@gmail.com>

Please remove me from this string. I should not be on it.

________________________________
From: Libc-stable on behalf of Sunil Pandey via Libc-stable
Sent: Wednesday, September 28, 2022 8:54 AM
To: Noah Goldstein; Libc-stable Mailing List; Hongjiu Lu
Cc: GNU C Library
Subject: Re: [PATCH v5 2/2] x86: Optimize strlen-avx2.S

Attached patch fixes BZ# 29611. I would like to backport it to 2.32, 2.31,
2.30 and 2.29. Let me know if there is any objection.

On Sun, Sep 25, 2022 at 7:00 AM Noah Goldstein via Libc-alpha wrote:
>
> On Sun, Sep 25, 2022 at 1:19 AM Aurelien Jarno wrote:
> >
> > On 2021-04-19 19:36, Noah Goldstein via Libc-alpha wrote:
> > > No bug. This commit optimizes strlen-avx2.S. The optimizations are
> > > mostly small things but they add up to roughly 10-30% performance
> > > improvement for strlen. The results for strnlen are a bit more
> > > ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
> > > are all passing.
> > >
> > > Signed-off-by: Noah Goldstein
> > > ---
> > > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 16 +-
> > > sysdeps/x86_64/multiarch/strlen-avx2.S | 532 +++++++++++++--------
> > > 2 files changed, 334 insertions(+), 214 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > index c377cab629..651b32908e 100644
> > > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > @@ -293,10 +293,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > > /* Support sysdeps/x86_64/multiarch/strlen.c.
*/ > > > IFUNC_IMPL (i, name, strlen, > > > IFUNC_IMPL_ADD (array, i, strlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __strlen_avx2) > > > IFUNC_IMPL_ADD (array, i, strlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __strlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, strlen, > > > @@ -309,10 +311,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/strnlen.c. */ > > > IFUNC_IMPL (i, name, strnlen, > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __strnlen_avx2) > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __strnlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > @@ -654,10 +658,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/wcslen.c. */ > > > IFUNC_IMPL (i, name, wcslen, > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __wcslen_avx2) > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __wcslen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > @@ -670,10 +676,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/wcsnlen.c. */ > > > IFUNC_IMPL (i, name, wcsnlen, > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __wcsnlen_avx2) > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __wcsnlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > diff --git a/sysdeps/x86_64/multiarch/strlen-avx2.S b/sysdeps/x86_64/= multiarch/strlen-avx2.S > > > index 1caae9e6bc..bd2e6ee44a 100644 > > > --- a/sysdeps/x86_64/multiarch/strlen-avx2.S > > > +++ b/sysdeps/x86_64/multiarch/strlen-avx2.S > > > @@ -27,9 +27,11 @@ > > > # ifdef USE_AS_WCSLEN > > > # define VPCMPEQ vpcmpeqd > > > # define VPMINU vpminud > > > +# define CHAR_SIZE 4 > > > # else > > > # define VPCMPEQ vpcmpeqb > > > # define VPMINU vpminub > > > +# define CHAR_SIZE 1 > > > # endif > > > > > > # ifndef VZEROUPPER > > > @@ -41,349 +43,459 @@ > > > # endif > > > > > > # define VEC_SIZE 32 > > > +# define PAGE_SIZE 4096 > > > > > > .section SECTION(.text),"ax",@progbits > > > ENTRY (STRLEN) > > > # ifdef USE_AS_STRNLEN > > > - /* Check for zero length. */ > > > + /* Check zero length. */ > > > test %RSI_LP, %RSI_LP > > > jz L(zero) > > > + /* Store max len in R8_LP before adjusting if using WCSLEN. */ > > > + mov %RSI_LP, %R8_LP > > > # ifdef USE_AS_WCSLEN > > > shl $2, %RSI_LP > > > # elif defined __ILP32__ > > > /* Clear the upper 32 bits. */ > > > movl %esi, %esi > > > # endif > > > - mov %RSI_LP, %R8_LP > > > # endif > > > - movl %edi, %ecx > > > + movl %edi, %eax > > > movq %rdi, %rdx > > > vpxor %xmm0, %xmm0, %xmm0 > > > - > > > + /* Clear high bits from edi. Only keeping bits relevant to page > > > + cross check. 
*/ > > > + andl $(PAGE_SIZE - 1), %eax > > > /* Check if we may cross page boundary with one vector load. */ > > > - andl $(2 * VEC_SIZE - 1), %ecx > > > - cmpl $VEC_SIZE, %ecx > > > - ja L(cros_page_boundary) > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %eax > > > + ja L(cross_page_boundary) > > > > > > /* Check the first VEC_SIZE bytes. */ > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - > > > + VPCMPEQ (%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > # ifdef USE_AS_STRNLEN > > > - jnz L(first_vec_x0_check) > > > - /* Adjust length and check the end of data. */ > > > - subq $VEC_SIZE, %rsi > > > - jbe L(max) > > > -# else > > > - jnz L(first_vec_x0) > > > + /* If length < VEC_SIZE handle special. */ > > > + cmpq $VEC_SIZE, %rsi > > > + jbe L(first_vec_x0) > > > # endif > > > - > > > - /* Align data for aligned loads in the loop. */ > > > - addq $VEC_SIZE, %rdi > > > - andl $(VEC_SIZE - 1), %ecx > > > - andq $-VEC_SIZE, %rdi > > > + /* If empty continue to aligned_more. Otherwise return bit > > > + position of first match. */ > > > + testl %eax, %eax > > > + jz L(aligned_more) > > > + tzcntl %eax, %eax > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > # ifdef USE_AS_STRNLEN > > > - /* Adjust length. */ > > > - addq %rcx, %rsi > > > +L(zero): > > > + xorl %eax, %eax > > > + ret > > > > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > + .p2align 4 > > > +L(first_vec_x0): > > > + /* Set bit for max len so that tzcnt will return min of max len > > > + and position of first match. */ > > > + btsq %rsi, %rax > > > + tzcntl %eax, %eax > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > # endif > > > - jmp L(more_4x_vec) > > > > > > .p2align 4 > > > -L(cros_page_boundary): > > > - andl $(VEC_SIZE - 1), %ecx > > > - andq $-VEC_SIZE, %rdi > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - /* Remove the leading bytes. */ > > > - sarl %cl, %eax > > > - testl %eax, %eax > > > - jz L(aligned_more) > > > +L(first_vec_x1): > > > tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE * 4 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + incl %edi > > > + addl %edi, %eax > > > # endif > > > - addq %rdi, %rax > > > - addq %rcx, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > - shrq $2, %rax > > > + shrl $2, %eax > > > # endif > > > -L(return_vzeroupper): > > > - ZERO_UPPER_VEC_REGISTERS_RETURN > > > + VZEROUPPER_RETURN > > > > > > .p2align 4 > > > -L(aligned_more): > > > +L(first_vec_x2): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - /* "rcx" is less than VEC_SIZE. Calculate "rdx + rcx - VEC_= SIZE" > > > - with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC= _SIZE" > > > - to void possible addition overflow. */ > > > - negq %rcx > > > - addq $VEC_SIZE, %rcx > > > - > > > - /* Check the end of data. */ > > > - subq %rcx, %rsi > > > - jbe L(max) > > > + /* Use ecx which was computed earlier to compute correct value. 
> > > + */ > > > + subl $(VEC_SIZE * 3 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE + 1), %edi > > > + addl %edi, %eax > > > # endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > - addq $VEC_SIZE, %rdi > > > + .p2align 4 > > > +L(first_vec_x3): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > +# ifdef USE_AS_STRNLEN > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE * 2 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE * 2 + 1), %edi > > > + addl %edi, %eax > > > +# endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > + .p2align 4 > > > +L(first_vec_x4): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE * 3 + 1), %edi > > > + addl %edi, %eax > > > # endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > -L(more_4x_vec): > > > + .p2align 5 > > > +L(aligned_more): > > > + /* Align data to VEC_SIZE - 1. This is the same number of > > > + instructions as using andq with -VEC_SIZE but saves 4 bytes = of > > > + code on the x4 check. */ > > > + orq $(VEC_SIZE - 1), %rdi > > > +L(cross_page_continue): > > > /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > since data is only aligned to VEC_SIZE. */ > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x0) > > > - > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > +# ifdef USE_AS_STRNLEN > > > + /* + 1 because rdi is aligned to VEC_SIZE - 1. + CHAR_SIZE beca= use > > > + it simplies the logic in last_4x_vec_or_less. */ > > > + leaq (VEC_SIZE * 4 + CHAR_SIZE + 1)(%rdi), %rcx > > > + subq %rdx, %rcx > > > +# endif > > > + /* Load first VEC regardless. */ > > > + VPCMPEQ 1(%rdi), %ymm0, %ymm1 > > > +# ifdef USE_AS_STRNLEN > > > + /* Adjust length. If near end handle specially. */ > > > + subq %rcx, %rsi > > > + jb L(last_4x_vec_or_less) > > > +# endif > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x1) > > > > > > - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x2) > > > > > > - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ (VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x3) > > > > > > - addq $(VEC_SIZE * 4), %rdi > > > - > > > -# ifdef USE_AS_STRNLEN > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > -# endif > > > - > > > - /* Align data to 4 * VEC_SIZE. 
*/ > > > - movq %rdi, %rcx > > > - andl $(4 * VEC_SIZE - 1), %ecx > > > - andq $-(4 * VEC_SIZE), %rdi > > > + VPCMPEQ (VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > + testl %eax, %eax > > > + jnz L(first_vec_x4) > > > > > > + /* Align data to VEC_SIZE * 4 - 1. */ > > > # ifdef USE_AS_STRNLEN > > > - /* Adjust length. */ > > > + /* Before adjusting length check if at last VEC_SIZE * 4. */ > > > + cmpq $(VEC_SIZE * 4 - 1), %rsi > > > + jbe L(last_4x_vec_or_less_load) > > > + incq %rdi > > > + movl %edi, %ecx > > > + orq $(VEC_SIZE * 4 - 1), %rdi > > > + andl $(VEC_SIZE * 4 - 1), %ecx > > > + /* Readjust length. */ > > > addq %rcx, %rsi > > > +# else > > > + incq %rdi > > > + orq $(VEC_SIZE * 4 - 1), %rdi > > > # endif > > > - > > > + /* Compare 4 * VEC at a time forward. */ > > > .p2align 4 > > > L(loop_4x_vec): > > > - /* Compare 4 * VEC at a time forward. */ > > > - vmovdqa (%rdi), %ymm1 > > > - vmovdqa VEC_SIZE(%rdi), %ymm2 > > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > > > - VPMINU %ymm1, %ymm2, %ymm5 > > > - VPMINU %ymm3, %ymm4, %ymm6 > > > - VPMINU %ymm5, %ymm6, %ymm5 > > > - > > > - VPCMPEQ %ymm5, %ymm0, %ymm5 > > > - vpmovmskb %ymm5, %eax > > > - testl %eax, %eax > > > - jnz L(4x_vec_end) > > > - > > > - addq $(VEC_SIZE * 4), %rdi > > > - > > > -# ifndef USE_AS_STRNLEN > > > - jmp L(loop_4x_vec) > > > -# else > > > +# ifdef USE_AS_STRNLEN > > > + /* Break if at end of length. */ > > > subq $(VEC_SIZE * 4), %rsi > > > - ja L(loop_4x_vec) > > > - > > > -L(last_4x_vec_or_less): > > > - /* Less than 4 * VEC and aligned to VEC_SIZE. */ > > > - addl $(VEC_SIZE * 2), %esi > > > - jle L(last_2x_vec) > > > + jb L(last_4x_vec_or_less_cmpeq) > > > +# endif > > > + /* Save some code size by microfusing VPMINU with the load. Sin= ce > > > + the matches in ymm2/ymm4 can only be returned if there where= no > > > + matches in ymm1/ymm3 respectively there is no issue with ove= rlap. > > > + */ > > > + vmovdqa 1(%rdi), %ymm1 > > > + VPMINU (VEC_SIZE + 1)(%rdi), %ymm1, %ymm2 > > > + vmovdqa (VEC_SIZE * 2 + 1)(%rdi), %ymm3 > > > + VPMINU (VEC_SIZE * 3 + 1)(%rdi), %ymm3, %ymm4 > > > + > > > + VPMINU %ymm2, %ymm4, %ymm5 > > > + VPCMPEQ %ymm5, %ymm0, %ymm5 > > > + vpmovmskb %ymm5, %ecx > > > > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x0) > > > + subq $-(VEC_SIZE * 4), %rdi > > > + testl %ecx, %ecx > > > + jz L(loop_4x_vec) > > > > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x1) > > > > > > - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ %ymm1, %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > + subq %rdx, %rdi > > > testl %eax, %eax > > > + jnz L(last_vec_return_x0) > > > > > > - jnz L(first_vec_x2_check) > > > - subl $VEC_SIZE, %esi > > > - jle L(max) > > > - > > > - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ %ymm2, %ymm0, %ymm2 > > > + vpmovmskb %ymm2, %eax > > > testl %eax, %eax > > > - > > > - jnz L(first_vec_x3_check) > > > - movq %r8, %rax > > > -# ifdef USE_AS_WCSLEN > > > + jnz L(last_vec_return_x1) > > > + > > > + /* Combine last 2 VEC. */ > > > + VPCMPEQ %ymm3, %ymm0, %ymm3 > > > + vpmovmskb %ymm3, %eax > > > + /* rcx has combined result from all 4 VEC. It will only be used= if > > > + the first 3 other VEC all did not contain a match. 
*/ > > > + salq $32, %rcx > > > + orq %rcx, %rax > > > + tzcntq %rax, %rax > > > + subq $(VEC_SIZE * 2 - 1), %rdi > > > + addq %rdi, %rax > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > + > > > +# ifdef USE_AS_STRNLEN > > > .p2align 4 > > > -L(last_2x_vec): > > > - addl $(VEC_SIZE * 2), %esi > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > +L(last_4x_vec_or_less_load): > > > + /* Depending on entry adjust rdi / prepare first VEC in ymm1. = */ > > > + subq $-(VEC_SIZE * 4), %rdi > > > +L(last_4x_vec_or_less_cmpeq): > > > + VPCMPEQ 1(%rdi), %ymm0, %ymm1 > > > +L(last_4x_vec_or_less): > > > > > > - jnz L(first_vec_x0_check) > > > - subl $VEC_SIZE, %esi > > > - jle L(max) > > > + vpmovmskb %ymm1, %eax > > > + /* If remaining length > VEC_SIZE * 2. This works if esi is off= by > > > + VEC_SIZE * 4. */ > > > + testl $(VEC_SIZE * 2), %esi > > > + jnz L(last_4x_vec) > > > > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + /* length may have been negative or positive by an offset of > > > + VEC_SIZE * 4 depending on where this was called from. This f= ixes > > > + that. */ > > > + andl $(VEC_SIZE * 4 - 1), %esi > > > testl %eax, %eax > > > - jnz L(first_vec_x1_check) > > > - movq %r8, %rax > > > -# ifdef USE_AS_WCSLEN > > > - shrq $2, %rax > > > -# endif > > > - VZEROUPPER_RETURN > > > + jnz L(last_vec_x1_check) > > > > > > - .p2align 4 > > > -L(first_vec_x0_check): > > > + subl $VEC_SIZE, %esi > > > + jb L(max) > > > + > > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > tzcntl %eax, %eax > > > /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > + cmpl %eax, %esi > > > + jb L(max) > > > + subq %rdx, %rdi > > > + addl $(VEC_SIZE + 1), %eax > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > # endif > > > VZEROUPPER_RETURN > > > +# endif > > > > > > .p2align 4 > > > -L(first_vec_x1_check): > > > +L(last_vec_return_x0): > > > tzcntl %eax, %eax > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $VEC_SIZE, %rax > > > + subq $(VEC_SIZE * 4 - 1), %rdi > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > -# ifdef USE_AS_WCSLEN > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > .p2align 4 > > > -L(first_vec_x2_check): > > > +L(last_vec_return_x1): > > > tzcntl %eax, %eax > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $(VEC_SIZE * 2), %rax > > > + subq $(VEC_SIZE * 3 - 1), %rdi > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > -# ifdef USE_AS_WCSLEN > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > +# ifdef USE_AS_STRNLEN > > > .p2align 4 > > > -L(first_vec_x3_check): > > > +L(last_vec_x1_check): > > > + > > > tzcntl %eax, %eax > > > /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $(VEC_SIZE * 3), %rax > > > + cmpl %eax, %esi > > > + jb L(max) > > > + subq %rdx, %rdi > > > + incl %eax > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > # endif > > > VZEROUPPER_RETURN > > > > > > - .p2align 4 > > > L(max): > > > movq %r8, %rax > > > + VZEROUPPER_RETURN > > > + > > > + .p2align 4 > > > +L(last_4x_vec): > > > + /* Test first 2x VEC normally. 
*/
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x1)
> > > +
> > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x2)
> > > +
> > > + /* Normalize length. */
> > > + andl $(VEC_SIZE * 4 - 1), %esi
> > > + VPCMPEQ (VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x3)
> > > +
> > > + subl $(VEC_SIZE * 3), %esi
> > > + jb L(max)
> > > +
> > > + VPCMPEQ (VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + tzcntl %eax, %eax
> > > + /* Check the end of data. */
> > > + cmpl %eax, %esi
> > > + jb L(max)
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE * 3 + 1), %eax
> > > + addq %rdi, %rax
> > > # ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > # endif
> > > VZEROUPPER_RETURN
> > >
> > > - .p2align 4
> > > -L(zero):
> > > - xorl %eax, %eax
> > > - ret
> > > -# endif
> > >
> > > .p2align 4
> > > -L(first_vec_x0):
> > > +L(last_vec_x1):
> > > + /* essentially duplicates of first_vec_x1 but use 64 bit
> > > + instructions. */
> > > tzcntl %eax, %eax
> > > + subq %rdx, %rdi
> > > + incl %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > VZEROUPPER_RETURN
> > >
> > > .p2align 4
> > > -L(first_vec_x1):
> > > +L(last_vec_x2):
> > > + /* essentially duplicates of first_vec_x1 but use 64 bit
> > > + instructions. */
> > > tzcntl %eax, %eax
> > > - addq $VEC_SIZE, %rax
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE + 1), %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > VZEROUPPER_RETURN
> > >
> > > .p2align 4
> > > -L(first_vec_x2):
> > > +L(last_vec_x3):
> > > tzcntl %eax, %eax
> > > - addq $(VEC_SIZE * 2), %rax
> > > + subl $(VEC_SIZE * 2), %esi
> > > + /* Check the end of data. */
> > > + cmpl %eax, %esi
> > > + jb L(max_end)
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE * 2 + 1), %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > + VZEROUPPER_RETURN
> > > +L(max_end):
> > > + movq %r8, %rax
> > > VZEROUPPER_RETURN
> > > +# endif
> > >
> > > + /* Cold case for crossing page with first load. */
> > > .p2align 4
> > > -L(4x_vec_end):
> > > - VPCMPEQ %ymm1, %ymm0, %ymm1
> > > - vpmovmskb %ymm1, %eax
> > > - testl %eax, %eax
> > > - jnz L(first_vec_x0)
> > > - VPCMPEQ %ymm2, %ymm0, %ymm2
> > > - vpmovmskb %ymm2, %eax
> > > +L(cross_page_boundary):
> > > + /* Align data to VEC_SIZE - 1. */
> > > + orq $(VEC_SIZE - 1), %rdi
> > > + VPCMPEQ -(VEC_SIZE - 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + /* Remove the leading bytes. sarxl only uses bits [5:0] of COUNT
> > > + so no need to manually mod rdx. */
> > > + sarxl %edx, %eax, %eax
> >
> > This is a BMI2 instruction, which is not necessarily available when AVX2
> > is available. This causes SIGILL on some CPUs. I have reported that in
> > https://sourceware.org/bugzilla/show_bug.cgi?id=29611
>
> This is not a bug on master as:
>
> commit 83c5b368226c34a2f0a5287df40fc290b2b34359
> Author: H.J. Lu
> Date:   Mon Apr 19 10:45:07 2021 -0700
>
>     x86-64: Require BMI2 for strchr-avx2.S
>
> is already in tree. The issue is that the avx2 changes were backported
> without H.J.'s changes.
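A minimal, self-contained sketch of the selection rule under discussion is
below. It is not glibc's resolver code and the helper names are made up; it
only illustrates that an AVX2 strlen variant relying on BMI2 instructions
such as sarx must be picked only when the CPU also reports BMI2, otherwise
it can fault with SIGILL exactly as reported in BZ #29611.

  #include <stddef.h>
  #include <stdio.h>

  /* Stand-in implementations (hypothetical names, not glibc's).  */
  static size_t
  strlen_scalar (const char *s)
  {
    const char *p = s;
    while (*p != '\0')
      p++;
    return (size_t) (p - s);
  }

  /* Pretend this one were compiled from AVX2 + BMI2 assembly; on a CPU
     without BMI2 an instruction like sarx would raise SIGILL.  */
  static size_t
  strlen_avx2_bmi2 (const char *s)
  {
    return strlen_scalar (s);  /* placeholder body */
  }

  /* Resolver sketch mirroring the ifunc-impl-list change above: the AVX2
     variant is only eligible when BMI2 is usable as well.  */
  static size_t (*pick_strlen (void)) (const char *)
  {
    __builtin_cpu_init ();
    if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("bmi2"))
      return strlen_avx2_bmi2;
    return strlen_scalar;
  }

  int
  main (void)
  {
    size_t (*impl) (const char *) = pick_strlen ();
    printf ("len = %zu\n", impl ("hello"));
    return 0;
  }

Built with GCC on x86-64, the dispatcher falls back to the scalar routine on
an AVX2-but-no-BMI2 machine instead of handing out code it cannot run, which
is what the extra CPU_FEATURE_USABLE (BMI2) checks achieve inside glibc.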
> >
> > Regards
> > Aurelien
> > --
> > Aurelien Jarno          GPG: 4096R/1DDD8C9B
> > aurelien@aurel32.net    http://www.aurel32.net
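Much of the rewritten assembly above is about loads that may cross a page
boundary (the PAGE_SIZE constant and the L(cross_page_boundary) path). A
common way to exercise that path, sketched here as a generic POSIX test
program rather than anything taken from this thread, is to place strings so
that the terminating null byte is the last byte before an inaccessible page:

  #include <assert.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int
  main (void)
  {
    long page = sysconf (_SC_PAGESIZE);
    /* Map two pages and make the second one inaccessible, so any read
       past the first page faults immediately.  */
    char *buf = mmap (NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert (buf != MAP_FAILED);
    assert (mprotect (buf + page, page, PROT_NONE) == 0);

    /* Strings whose null byte sits closer and closer to the page end
       force strlen through its page-cross handling; a correct
       implementation never reads into the protected page.  */
    for (long len = 1; len < 64 && len < page; len++)
      {
        char *s = buf + page - (len + 1);
        memset (s, 'x', len);
        s[len] = '\0';
        assert (strlen (s) == (size_t) len);
      }
    return 0;
  }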