From mboxrd@z Thu Jan 1 00:00:00 1970
From: Darren Tristano
To: Noah Goldstein, Libc-stable Mailing List, Hongjiu Lu, Sunil Pandey
CC: GNU C Library
Subject: Re: [PATCH v5 2/2] x86: Optimize strlen-avx2.S
Date: Wed, 28 Sep 2022 14:02:30 +0000
References: <20210419233607.916848-1-goldstein.w.n@gmail.com> <20210419233607.916848-2-goldstein.w.n@gmail.com>

Please remove me from this string. I should not be on it.

________________________________
From: Libc-stable on behalf of Sunil Pandey via Libc-stable
Sent: Wednesday, September 28, 2022 8:54 AM
To: Noah Goldstein; Libc-stable Mailing List; Hongjiu Lu
Cc: GNU C Library
Subject: Re: [PATCH v5 2/2] x86: Optimize strlen-avx2.S

Attached patch fixes BZ# 29611. I would like to backport it to 2.32, 2.31,
2.30 and 2.29. Let me know if there is any objection.

On Sun, Sep 25, 2022 at 7:00 AM Noah Goldstein via Libc-alpha wrote:
>
> On Sun, Sep 25, 2022 at 1:19 AM Aurelien Jarno wrote:
> >
> > On 2021-04-19 19:36, Noah Goldstein via Libc-alpha wrote:
> > > No bug. This commit optimizes strlen-avx2.S. The optimizations are
> > > mostly small things but they add up to roughly 10-30% performance
> > > improvement for strlen. The results for strnlen are a bit more
> > > ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
> > > are all passing.
> > >
> > > Signed-off-by: Noah Goldstein
> > > ---
> > > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 16 +-
> > > sysdeps/x86_64/multiarch/strlen-avx2.S | 532 +++++++++++++--------
> > > 2 files changed, 334 insertions(+), 214 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > index c377cab629..651b32908e 100644
> > > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > > @@ -293,10 +293,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > > /* Support sysdeps/x86_64/multiarch/strlen.c.
*/ > > > IFUNC_IMPL (i, name, strlen, > > > IFUNC_IMPL_ADD (array, i, strlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __strlen_avx2) > > > IFUNC_IMPL_ADD (array, i, strlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __strlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, strlen, > > > @@ -309,10 +311,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/strnlen.c. */ > > > IFUNC_IMPL (i, name, strnlen, > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __strnlen_avx2) > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __strnlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, strnlen, > > > @@ -654,10 +658,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/wcslen.c. */ > > > IFUNC_IMPL (i, name, wcslen, > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __wcslen_avx2) > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __wcslen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, wcslen, > > > @@ -670,10 +676,12 @@ __libc_ifunc_impl_list (const char *name, struc= t libc_ifunc_impl *array, > > > /* Support sysdeps/x86_64/multiarch/wcsnlen.c. */ > > > IFUNC_IMPL (i, name, wcsnlen, > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > - CPU_FEATURE_USABLE (AVX2), > > > + (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2)), > > > __wcsnlen_avx2) > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > (CPU_FEATURE_USABLE (AVX2) > > > + && CPU_FEATURE_USABLE (BMI2) > > > && CPU_FEATURE_USABLE (RTM)), > > > __wcsnlen_avx2_rtm) > > > IFUNC_IMPL_ADD (array, i, wcsnlen, > > > diff --git a/sysdeps/x86_64/multiarch/strlen-avx2.S b/sysdeps/x86_64/= multiarch/strlen-avx2.S > > > index 1caae9e6bc..bd2e6ee44a 100644 > > > --- a/sysdeps/x86_64/multiarch/strlen-avx2.S > > > +++ b/sysdeps/x86_64/multiarch/strlen-avx2.S > > > @@ -27,9 +27,11 @@ > > > # ifdef USE_AS_WCSLEN > > > # define VPCMPEQ vpcmpeqd > > > # define VPMINU vpminud > > > +# define CHAR_SIZE 4 > > > # else > > > # define VPCMPEQ vpcmpeqb > > > # define VPMINU vpminub > > > +# define CHAR_SIZE 1 > > > # endif > > > > > > # ifndef VZEROUPPER > > > @@ -41,349 +43,459 @@ > > > # endif > > > > > > # define VEC_SIZE 32 > > > +# define PAGE_SIZE 4096 > > > > > > .section SECTION(.text),"ax",@progbits > > > ENTRY (STRLEN) > > > # ifdef USE_AS_STRNLEN > > > - /* Check for zero length. */ > > > + /* Check zero length. */ > > > test %RSI_LP, %RSI_LP > > > jz L(zero) > > > + /* Store max len in R8_LP before adjusting if using WCSLEN. */ > > > + mov %RSI_LP, %R8_LP > > > # ifdef USE_AS_WCSLEN > > > shl $2, %RSI_LP > > > # elif defined __ILP32__ > > > /* Clear the upper 32 bits. */ > > > movl %esi, %esi > > > # endif > > > - mov %RSI_LP, %R8_LP > > > # endif > > > - movl %edi, %ecx > > > + movl %edi, %eax > > > movq %rdi, %rdx > > > vpxor %xmm0, %xmm0, %xmm0 > > > - > > > + /* Clear high bits from edi. Only keeping bits relevant to page > > > + cross check. 
*/ > > > + andl $(PAGE_SIZE - 1), %eax > > > /* Check if we may cross page boundary with one vector load. */ > > > - andl $(2 * VEC_SIZE - 1), %ecx > > > - cmpl $VEC_SIZE, %ecx > > > - ja L(cros_page_boundary) > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %eax > > > + ja L(cross_page_boundary) > > > > > > /* Check the first VEC_SIZE bytes. */ > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - > > > + VPCMPEQ (%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > # ifdef USE_AS_STRNLEN > > > - jnz L(first_vec_x0_check) > > > - /* Adjust length and check the end of data. */ > > > - subq $VEC_SIZE, %rsi > > > - jbe L(max) > > > -# else > > > - jnz L(first_vec_x0) > > > + /* If length < VEC_SIZE handle special. */ > > > + cmpq $VEC_SIZE, %rsi > > > + jbe L(first_vec_x0) > > > # endif > > > - > > > - /* Align data for aligned loads in the loop. */ > > > - addq $VEC_SIZE, %rdi > > > - andl $(VEC_SIZE - 1), %ecx > > > - andq $-VEC_SIZE, %rdi > > > + /* If empty continue to aligned_more. Otherwise return bit > > > + position of first match. */ > > > + testl %eax, %eax > > > + jz L(aligned_more) > > > + tzcntl %eax, %eax > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > # ifdef USE_AS_STRNLEN > > > - /* Adjust length. */ > > > - addq %rcx, %rsi > > > +L(zero): > > > + xorl %eax, %eax > > > + ret > > > > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > + .p2align 4 > > > +L(first_vec_x0): > > > + /* Set bit for max len so that tzcnt will return min of max len > > > + and position of first match. */ > > > + btsq %rsi, %rax > > > + tzcntl %eax, %eax > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > # endif > > > - jmp L(more_4x_vec) > > > > > > .p2align 4 > > > -L(cros_page_boundary): > > > - andl $(VEC_SIZE - 1), %ecx > > > - andq $-VEC_SIZE, %rdi > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - /* Remove the leading bytes. */ > > > - sarl %cl, %eax > > > - testl %eax, %eax > > > - jz L(aligned_more) > > > +L(first_vec_x1): > > > tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE * 4 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + incl %edi > > > + addl %edi, %eax > > > # endif > > > - addq %rdi, %rax > > > - addq %rcx, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > - shrq $2, %rax > > > + shrl $2, %eax > > > # endif > > > -L(return_vzeroupper): > > > - ZERO_UPPER_VEC_REGISTERS_RETURN > > > + VZEROUPPER_RETURN > > > > > > .p2align 4 > > > -L(aligned_more): > > > +L(first_vec_x2): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - /* "rcx" is less than VEC_SIZE. Calculate "rdx + rcx - VEC_= SIZE" > > > - with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC= _SIZE" > > > - to void possible addition overflow. */ > > > - negq %rcx > > > - addq $VEC_SIZE, %rcx > > > - > > > - /* Check the end of data. */ > > > - subq %rcx, %rsi > > > - jbe L(max) > > > + /* Use ecx which was computed earlier to compute correct value. 
> > > + */ > > > + subl $(VEC_SIZE * 3 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE + 1), %edi > > > + addl %edi, %eax > > > # endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > - addq $VEC_SIZE, %rdi > > > + .p2align 4 > > > +L(first_vec_x3): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > +# ifdef USE_AS_STRNLEN > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE * 2 + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE * 2 + 1), %edi > > > + addl %edi, %eax > > > +# endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > + .p2align 4 > > > +L(first_vec_x4): > > > + tzcntl %eax, %eax > > > + /* Safe to use 32 bit instructions as these are only called for > > > + size =3D [1, 159]. */ > > > # ifdef USE_AS_STRNLEN > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > + /* Use ecx which was computed earlier to compute correct value. > > > + */ > > > + subl $(VEC_SIZE + 1), %ecx > > > + addl %ecx, %eax > > > +# else > > > + subl %edx, %edi > > > + addl $(VEC_SIZE * 3 + 1), %edi > > > + addl %edi, %eax > > > # endif > > > +# ifdef USE_AS_WCSLEN > > > + shrl $2, %eax > > > +# endif > > > + VZEROUPPER_RETURN > > > > > > -L(more_4x_vec): > > > + .p2align 5 > > > +L(aligned_more): > > > + /* Align data to VEC_SIZE - 1. This is the same number of > > > + instructions as using andq with -VEC_SIZE but saves 4 bytes = of > > > + code on the x4 check. */ > > > + orq $(VEC_SIZE - 1), %rdi > > > +L(cross_page_continue): > > > /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > since data is only aligned to VEC_SIZE. */ > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x0) > > > - > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > +# ifdef USE_AS_STRNLEN > > > + /* + 1 because rdi is aligned to VEC_SIZE - 1. + CHAR_SIZE beca= use > > > + it simplies the logic in last_4x_vec_or_less. */ > > > + leaq (VEC_SIZE * 4 + CHAR_SIZE + 1)(%rdi), %rcx > > > + subq %rdx, %rcx > > > +# endif > > > + /* Load first VEC regardless. */ > > > + VPCMPEQ 1(%rdi), %ymm0, %ymm1 > > > +# ifdef USE_AS_STRNLEN > > > + /* Adjust length. If near end handle specially. */ > > > + subq %rcx, %rsi > > > + jb L(last_4x_vec_or_less) > > > +# endif > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x1) > > > > > > - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x2) > > > > > > - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ (VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > testl %eax, %eax > > > jnz L(first_vec_x3) > > > > > > - addq $(VEC_SIZE * 4), %rdi > > > - > > > -# ifdef USE_AS_STRNLEN > > > - subq $(VEC_SIZE * 4), %rsi > > > - jbe L(last_4x_vec_or_less) > > > -# endif > > > - > > > - /* Align data to 4 * VEC_SIZE. 
*/ > > > - movq %rdi, %rcx > > > - andl $(4 * VEC_SIZE - 1), %ecx > > > - andq $-(4 * VEC_SIZE), %rdi > > > + VPCMPEQ (VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > + testl %eax, %eax > > > + jnz L(first_vec_x4) > > > > > > + /* Align data to VEC_SIZE * 4 - 1. */ > > > # ifdef USE_AS_STRNLEN > > > - /* Adjust length. */ > > > + /* Before adjusting length check if at last VEC_SIZE * 4. */ > > > + cmpq $(VEC_SIZE * 4 - 1), %rsi > > > + jbe L(last_4x_vec_or_less_load) > > > + incq %rdi > > > + movl %edi, %ecx > > > + orq $(VEC_SIZE * 4 - 1), %rdi > > > + andl $(VEC_SIZE * 4 - 1), %ecx > > > + /* Readjust length. */ > > > addq %rcx, %rsi > > > +# else > > > + incq %rdi > > > + orq $(VEC_SIZE * 4 - 1), %rdi > > > # endif > > > - > > > + /* Compare 4 * VEC at a time forward. */ > > > .p2align 4 > > > L(loop_4x_vec): > > > - /* Compare 4 * VEC at a time forward. */ > > > - vmovdqa (%rdi), %ymm1 > > > - vmovdqa VEC_SIZE(%rdi), %ymm2 > > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm3 > > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm4 > > > - VPMINU %ymm1, %ymm2, %ymm5 > > > - VPMINU %ymm3, %ymm4, %ymm6 > > > - VPMINU %ymm5, %ymm6, %ymm5 > > > - > > > - VPCMPEQ %ymm5, %ymm0, %ymm5 > > > - vpmovmskb %ymm5, %eax > > > - testl %eax, %eax > > > - jnz L(4x_vec_end) > > > - > > > - addq $(VEC_SIZE * 4), %rdi > > > - > > > -# ifndef USE_AS_STRNLEN > > > - jmp L(loop_4x_vec) > > > -# else > > > +# ifdef USE_AS_STRNLEN > > > + /* Break if at end of length. */ > > > subq $(VEC_SIZE * 4), %rsi > > > - ja L(loop_4x_vec) > > > - > > > -L(last_4x_vec_or_less): > > > - /* Less than 4 * VEC and aligned to VEC_SIZE. */ > > > - addl $(VEC_SIZE * 2), %esi > > > - jle L(last_2x_vec) > > > + jb L(last_4x_vec_or_less_cmpeq) > > > +# endif > > > + /* Save some code size by microfusing VPMINU with the load. Sin= ce > > > + the matches in ymm2/ymm4 can only be returned if there where= no > > > + matches in ymm1/ymm3 respectively there is no issue with ove= rlap. > > > + */ > > > + vmovdqa 1(%rdi), %ymm1 > > > + VPMINU (VEC_SIZE + 1)(%rdi), %ymm1, %ymm2 > > > + vmovdqa (VEC_SIZE * 2 + 1)(%rdi), %ymm3 > > > + VPMINU (VEC_SIZE * 3 + 1)(%rdi), %ymm3, %ymm4 > > > + > > > + VPMINU %ymm2, %ymm4, %ymm5 > > > + VPCMPEQ %ymm5, %ymm0, %ymm5 > > > + vpmovmskb %ymm5, %ecx > > > > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x0) > > > + subq $-(VEC_SIZE * 4), %rdi > > > + testl %ecx, %ecx > > > + jz L(loop_4x_vec) > > > > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > - jnz L(first_vec_x1) > > > > > > - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ %ymm1, %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > + subq %rdx, %rdi > > > testl %eax, %eax > > > + jnz L(last_vec_return_x0) > > > > > > - jnz L(first_vec_x2_check) > > > - subl $VEC_SIZE, %esi > > > - jle L(max) > > > - > > > - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + VPCMPEQ %ymm2, %ymm0, %ymm2 > > > + vpmovmskb %ymm2, %eax > > > testl %eax, %eax > > > - > > > - jnz L(first_vec_x3_check) > > > - movq %r8, %rax > > > -# ifdef USE_AS_WCSLEN > > > + jnz L(last_vec_return_x1) > > > + > > > + /* Combine last 2 VEC. */ > > > + VPCMPEQ %ymm3, %ymm0, %ymm3 > > > + vpmovmskb %ymm3, %eax > > > + /* rcx has combined result from all 4 VEC. It will only be used= if > > > + the first 3 other VEC all did not contain a match. 
*/ > > > + salq $32, %rcx > > > + orq %rcx, %rax > > > + tzcntq %rax, %rax > > > + subq $(VEC_SIZE * 2 - 1), %rdi > > > + addq %rdi, %rax > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > + > > > +# ifdef USE_AS_STRNLEN > > > .p2align 4 > > > -L(last_2x_vec): > > > - addl $(VEC_SIZE * 2), %esi > > > - VPCMPEQ (%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > - testl %eax, %eax > > > +L(last_4x_vec_or_less_load): > > > + /* Depending on entry adjust rdi / prepare first VEC in ymm1. = */ > > > + subq $-(VEC_SIZE * 4), %rdi > > > +L(last_4x_vec_or_less_cmpeq): > > > + VPCMPEQ 1(%rdi), %ymm0, %ymm1 > > > +L(last_4x_vec_or_less): > > > > > > - jnz L(first_vec_x0_check) > > > - subl $VEC_SIZE, %esi > > > - jle L(max) > > > + vpmovmskb %ymm1, %eax > > > + /* If remaining length > VEC_SIZE * 2. This works if esi is off= by > > > + VEC_SIZE * 4. */ > > > + testl $(VEC_SIZE * 2), %esi > > > + jnz L(last_4x_vec) > > > > > > - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1 > > > - vpmovmskb %ymm1, %eax > > > + /* length may have been negative or positive by an offset of > > > + VEC_SIZE * 4 depending on where this was called from. This f= ixes > > > + that. */ > > > + andl $(VEC_SIZE * 4 - 1), %esi > > > testl %eax, %eax > > > - jnz L(first_vec_x1_check) > > > - movq %r8, %rax > > > -# ifdef USE_AS_WCSLEN > > > - shrq $2, %rax > > > -# endif > > > - VZEROUPPER_RETURN > > > + jnz L(last_vec_x1_check) > > > > > > - .p2align 4 > > > -L(first_vec_x0_check): > > > + subl $VEC_SIZE, %esi > > > + jb L(max) > > > + > > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1 > > > + vpmovmskb %ymm1, %eax > > > tzcntl %eax, %eax > > > /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > + cmpl %eax, %esi > > > + jb L(max) > > > + subq %rdx, %rdi > > > + addl $(VEC_SIZE + 1), %eax > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > # endif > > > VZEROUPPER_RETURN > > > +# endif > > > > > > .p2align 4 > > > -L(first_vec_x1_check): > > > +L(last_vec_return_x0): > > > tzcntl %eax, %eax > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $VEC_SIZE, %rax > > > + subq $(VEC_SIZE * 4 - 1), %rdi > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > -# ifdef USE_AS_WCSLEN > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > .p2align 4 > > > -L(first_vec_x2_check): > > > +L(last_vec_return_x1): > > > tzcntl %eax, %eax > > > - /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $(VEC_SIZE * 2), %rax > > > + subq $(VEC_SIZE * 3 - 1), %rdi > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > -# ifdef USE_AS_WCSLEN > > > +# ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > -# endif > > > +# endif > > > VZEROUPPER_RETURN > > > > > > +# ifdef USE_AS_STRNLEN > > > .p2align 4 > > > -L(first_vec_x3_check): > > > +L(last_vec_x1_check): > > > + > > > tzcntl %eax, %eax > > > /* Check the end of data. */ > > > - cmpq %rax, %rsi > > > - jbe L(max) > > > - addq $(VEC_SIZE * 3), %rax > > > + cmpl %eax, %esi > > > + jb L(max) > > > + subq %rdx, %rdi > > > + incl %eax > > > addq %rdi, %rax > > > - subq %rdx, %rax > > > # ifdef USE_AS_WCSLEN > > > shrq $2, %rax > > > # endif > > > VZEROUPPER_RETURN > > > > > > - .p2align 4 > > > L(max): > > > movq %r8, %rax > > > + VZEROUPPER_RETURN > > > + > > > + .p2align 4 > > > +L(last_4x_vec): > > > + /* Test first 2x VEC normally. 
*/
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x1)
> > > +
> > > + VPCMPEQ (VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x2)
> > > +
> > > + /* Normalize length. */
> > > + andl $(VEC_SIZE * 4 - 1), %esi
> > > + VPCMPEQ (VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + testl %eax, %eax
> > > + jnz L(last_vec_x3)
> > > +
> > > + subl $(VEC_SIZE * 3), %esi
> > > + jb L(max)
> > > +
> > > + VPCMPEQ (VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + tzcntl %eax, %eax
> > > + /* Check the end of data. */
> > > + cmpl %eax, %esi
> > > + jb L(max)
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE * 3 + 1), %eax
> > > + addq %rdi, %rax
> > > # ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > # endif
> > > VZEROUPPER_RETURN
> > >
> > > - .p2align 4
> > > -L(zero):
> > > - xorl %eax, %eax
> > > - ret
> > > -# endif
> > >
> > > .p2align 4
> > > -L(first_vec_x0):
> > > +L(last_vec_x1):
> > > + /* essentially duplicates of first_vec_x1 but use 64 bit
> > > + instructions. */
> > > tzcntl %eax, %eax
> > > + subq %rdx, %rdi
> > > + incl %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > VZEROUPPER_RETURN
> > >
> > > .p2align 4
> > > -L(first_vec_x1):
> > > +L(last_vec_x2):
> > > + /* essentially duplicates of first_vec_x1 but use 64 bit
> > > + instructions. */
> > > tzcntl %eax, %eax
> > > - addq $VEC_SIZE, %rax
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE + 1), %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > VZEROUPPER_RETURN
> > >
> > > .p2align 4
> > > -L(first_vec_x2):
> > > +L(last_vec_x3):
> > > tzcntl %eax, %eax
> > > - addq $(VEC_SIZE * 2), %rax
> > > + subl $(VEC_SIZE * 2), %esi
> > > + /* Check the end of data. */
> > > + cmpl %eax, %esi
> > > + jb L(max_end)
> > > + subq %rdx, %rdi
> > > + addl $(VEC_SIZE * 2 + 1), %eax
> > > addq %rdi, %rax
> > > - subq %rdx, %rax
> > > -# ifdef USE_AS_WCSLEN
> > > +# ifdef USE_AS_WCSLEN
> > > shrq $2, %rax
> > > -# endif
> > > +# endif
> > > + VZEROUPPER_RETURN
> > > +L(max_end):
> > > + movq %r8, %rax
> > > VZEROUPPER_RETURN
> > > +# endif
> > >
> > > + /* Cold case for crossing page with first load. */
> > > .p2align 4
> > > -L(4x_vec_end):
> > > - VPCMPEQ %ymm1, %ymm0, %ymm1
> > > - vpmovmskb %ymm1, %eax
> > > - testl %eax, %eax
> > > - jnz L(first_vec_x0)
> > > - VPCMPEQ %ymm2, %ymm0, %ymm2
> > > - vpmovmskb %ymm2, %eax
> > > +L(cross_page_boundary):
> > > + /* Align data to VEC_SIZE - 1. */
> > > + orq $(VEC_SIZE - 1), %rdi
> > > + VPCMPEQ -(VEC_SIZE - 1)(%rdi), %ymm0, %ymm1
> > > + vpmovmskb %ymm1, %eax
> > > + /* Remove the leading bytes. sarxl only uses bits [5:0] of COUNT
> > > + so no need to manually mod rdx. */
> > > + sarxl %edx, %eax, %eax
> >
> > This is a BMI2 instruction, which is not necessarily available when AVX2
> > is available. This causes SIGILL on some CPUs. I have reported that in
> > https://sourceware.org/bugzilla/show_bug.cgi?id=29611
>
> This is not a bug on master as:
>
> commit 83c5b368226c34a2f0a5287df40fc290b2b34359
> Author: H.J. Lu
> Date:   Mon Apr 19 10:45:07 2021 -0700
>
>     x86-64: Require BMI2 for strchr-avx2.S
>
> is already in tree. The issue is that the avx2 changes were backported
> without H.J.'s changes.
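A minimal, self-contained sketch of the selection rule under discussion is
below. It is not glibc's resolver code and the helper names are made up; it
only illustrates that an AVX2 strlen variant relying on BMI2 instructions
such as sarx must be picked only when the CPU also reports BMI2, otherwise
it can fault with SIGILL exactly as reported in BZ #29611.

  #include <stddef.h>
  #include <stdio.h>

  /* Stand-in implementations (hypothetical names, not glibc's).  */
  static size_t
  strlen_scalar (const char *s)
  {
    const char *p = s;
    while (*p != '\0')
      p++;
    return (size_t) (p - s);
  }

  /* Pretend this one were compiled from AVX2 + BMI2 assembly; on a CPU
     without BMI2 an instruction like sarx would raise SIGILL.  */
  static size_t
  strlen_avx2_bmi2 (const char *s)
  {
    return strlen_scalar (s);  /* placeholder body */
  }

  /* Resolver sketch mirroring the ifunc-impl-list change above: the AVX2
     variant is only eligible when BMI2 is usable as well.  */
  static size_t (*pick_strlen (void)) (const char *)
  {
    __builtin_cpu_init ();
    if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("bmi2"))
      return strlen_avx2_bmi2;
    return strlen_scalar;
  }

  int
  main (void)
  {
    size_t (*impl) (const char *) = pick_strlen ();
    printf ("len = %zu\n", impl ("hello"));
    return 0;
  }

Built with GCC on x86-64, the dispatcher falls back to the scalar routine on
an AVX2-but-no-BMI2 machine instead of handing out code it cannot run, which
is what the extra CPU_FEATURE_USABLE (BMI2) checks achieve inside glibc.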
> >
> > Regards
> > Aurelien
> > --
> > Aurelien Jarno          GPG: 4096R/1DDD8C9B
> > aurelien@aurel32.net    http://www.aurel32.net
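Much of the rewritten assembly above is about loads that may cross a page
boundary (the PAGE_SIZE constant and the L(cross_page_boundary) path). A
common way to exercise that path, sketched here as a generic POSIX test
program rather than anything taken from this thread, is to place strings so
that the terminating null byte is the last byte before an inaccessible page:

  #include <assert.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int
  main (void)
  {
    long page = sysconf (_SC_PAGESIZE);
    /* Map two pages and make the second one inaccessible, so any read
       past the first page faults immediately.  */
    char *buf = mmap (NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert (buf != MAP_FAILED);
    assert (mprotect (buf + page, page, PROT_NONE) == 0);

    /* Strings whose null byte sits closer and closer to the page end
       force strlen through its page-cross handling; a correct
       implementation never reads into the protected page.  */
    for (long len = 1; len < 64 && len < page; len++)
      {
        char *s = buf + page - (len + 1);
        memset (s, 'x', len);
        s[len] = '\0';
        assert (strlen (s) == (size_t) len);
      }
    return 0;
  }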