From: "H.J. Lu"
Date: Tue, 12 Jul 2022 13:26:17 -0700
Subject: Re: [PATCH v1] x86: Move wcslen SSE2 implementation to multiarch/wcslen-sse2.S
To: Noah Goldstein
Cc: GNU C Library, "Carlos O'Donell"
References: <20220712192910.351121-1-goldstein.w.n@gmail.com> <20220712192910.351121-8-goldstein.w.n@gmail.com>
In-Reply-To: <20220712192910.351121-8-goldstein.w.n@gmail.com>
List-Id: Libc-alpha mailing list

On Tue, Jul 12, 2022 at 12:29 PM Noah Goldstein wrote:
>
> This commit doesn't affect libc.so.6, it's just housekeeping to prepare
> for adding explicit ISA level support.
>
> Tested build on x86_64 and x86_32 with/without multiarch.
> ---
>  sysdeps/x86_64/multiarch/wcslen-sse2.S | 221 ++++++++++++++++++++++++-
>  sysdeps/x86_64/wcslen.S                | 216 +-----------------------
>  2 files changed, 218 insertions(+), 219 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/wcslen-sse2.S b/sysdeps/x86_64/multiarch/wcslen-sse2.S
> index 2b3a9efd64..944c3bd9c6 100644
> --- a/sysdeps/x86_64/multiarch/wcslen-sse2.S
> +++ b/sysdeps/x86_64/multiarch/wcslen-sse2.S
> @@ -17,10 +17,221 @@
>     .  */
>
>  #if IS_IN (libc)
> -# define __wcslen __wcslen_sse2
> -
> -# undef weak_alias
> -# define weak_alias(__wcslen, wcslen)
> +# ifndef WCSLEN
> +#  define WCSLEN __wcslen_sse2
> +# endif
>  #endif
>
> -#include "../wcslen.S"
> +#include
> +
> +	.text
> +ENTRY (WCSLEN)
> +	cmpl	$0, (%rdi)
> +	jz	L(exit_tail0)
> +	cmpl	$0, 4(%rdi)
> +	jz	L(exit_tail1)
> +	cmpl	$0, 8(%rdi)
> +	jz	L(exit_tail2)
> +	cmpl	$0, 12(%rdi)
> +	jz	L(exit_tail3)
> +	cmpl	$0, 16(%rdi)
> +	jz	L(exit_tail4)
> +	cmpl	$0, 20(%rdi)
> +	jz	L(exit_tail5)
> +	cmpl	$0, 24(%rdi)
> +	jz	L(exit_tail6)
> +	cmpl	$0, 28(%rdi)
> +	jz	L(exit_tail7)
> +
> +	pxor	%xmm0, %xmm0
> +
> +	lea	32(%rdi), %rax
> +	addq	$16, %rdi
> +	and	$-16, %rax
> +
> +	pcmpeqd	(%rax), %xmm0
> +	pmovmskb %xmm0, %edx
> +	pxor	%xmm1, %xmm1
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm1
> +	pmovmskb %xmm1, %edx
> +	pxor	%xmm2, %xmm2
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm2
> +	pmovmskb %xmm2, %edx
> +	pxor	%xmm3, %xmm3
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm0
> +	pmovmskb %xmm0, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm1
> +	pmovmskb %xmm1, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm2
> +	pmovmskb %xmm2, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm0
> +	pmovmskb %xmm0, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm1
> +	pmovmskb %xmm1, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm2
> +	pmovmskb %xmm2, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	(%rax), %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$16, %rax
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	and	$-0x40, %rax
> +
> +	.p2align 4
> +L(aligned_64_loop):
> +	movaps	(%rax), %xmm0
> +	movaps	16(%rax), %xmm1
> +	movaps	32(%rax), %xmm2
> +	movaps	48(%rax), %xmm6
> +
> +	pminub	%xmm1, %xmm0
> +	pminub	%xmm6, %xmm2
> +	pminub	%xmm0, %xmm2
> +	pcmpeqd	%xmm3, %xmm2
> +	pmovmskb %xmm2, %edx
> +	addq	$64, %rax
> +	test	%edx, %edx
> +	jz	L(aligned_64_loop)
> +
> +	pcmpeqd	-64(%rax), %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$48, %rdi
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	%xmm1, %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$-16, %rdi
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	-32(%rax), %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$-16, %rdi
> +	test	%edx, %edx
> +	jnz	L(exit)
> +
> +	pcmpeqd	%xmm6, %xmm3
> +	pmovmskb %xmm3, %edx
> +	addq	$-16, %rdi
> +	test	%edx, %edx
> +	jz	L(aligned_64_loop)
> +
> +	.p2align 4
> +L(exit):
> +	sub	%rdi, %rax
> +	shr	$2, %rax
> +	test	%dl, %dl
> +	jz	L(exit_high)
> +
> +	andl	$15, %edx
> +	jz	L(exit_1)
> +	ret
> +
> +	/* No align here. Naturally aligned % 16 == 1.  */
> +L(exit_high):
> +	andl	$(15 << 8), %edx
> +	jz	L(exit_3)
> +	add	$2, %rax
> +	ret
> +
> +	.p2align 3
> +L(exit_1):
> +	add	$1, %rax
> +	ret
> +
> +	.p2align 3
> +L(exit_3):
> +	add	$3, %rax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail0):
> +	xorl	%eax, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail1):
> +	movl	$1, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail2):
> +	movl	$2, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail3):
> +	movl	$3, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail4):
> +	movl	$4, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail5):
> +	movl	$5, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail6):
> +	movl	$6, %eax
> +	ret
> +
> +	.p2align 3
> +L(exit_tail7):
> +	movl	$7, %eax
> +	ret
> +
> +END (WCSLEN)
> diff --git a/sysdeps/x86_64/wcslen.S b/sysdeps/x86_64/wcslen.S
> index d641141d75..588a0fbe01 100644
> --- a/sysdeps/x86_64/wcslen.S
> +++ b/sysdeps/x86_64/wcslen.S
> @@ -16,218 +16,6 @@
>     License along with the GNU C Library; if not, see
>     .  */
>
> -#include
> -
> -	.text
> -ENTRY (__wcslen)
> -	cmpl	$0, (%rdi)
> -	jz	L(exit_tail0)
> -	cmpl	$0, 4(%rdi)
> -	jz	L(exit_tail1)
> -	cmpl	$0, 8(%rdi)
> -	jz	L(exit_tail2)
> -	cmpl	$0, 12(%rdi)
> -	jz	L(exit_tail3)
> -	cmpl	$0, 16(%rdi)
> -	jz	L(exit_tail4)
> -	cmpl	$0, 20(%rdi)
> -	jz	L(exit_tail5)
> -	cmpl	$0, 24(%rdi)
> -	jz	L(exit_tail6)
> -	cmpl	$0, 28(%rdi)
> -	jz	L(exit_tail7)
> -
> -	pxor	%xmm0, %xmm0
> -
> -	lea	32(%rdi), %rax
> -	addq	$16, %rdi
> -	and	$-16, %rax
> -
> -	pcmpeqd	(%rax), %xmm0
> -	pmovmskb %xmm0, %edx
> -	pxor	%xmm1, %xmm1
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm1
> -	pmovmskb %xmm1, %edx
> -	pxor	%xmm2, %xmm2
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm2
> -	pmovmskb %xmm2, %edx
> -	pxor	%xmm3, %xmm3
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm0
> -	pmovmskb %xmm0, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm1
> -	pmovmskb %xmm1, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm2
> -	pmovmskb %xmm2, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm0
> -	pmovmskb %xmm0, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm1
> -	pmovmskb %xmm1, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm2
> -	pmovmskb %xmm2, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	(%rax), %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$16, %rax
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	and	$-0x40, %rax
> -
> -	.p2align 4
> -L(aligned_64_loop):
> -	movaps	(%rax), %xmm0
> -	movaps	16(%rax), %xmm1
> -	movaps	32(%rax), %xmm2
> -	movaps	48(%rax), %xmm6
> -
> -	pminub	%xmm1, %xmm0
> -	pminub	%xmm6, %xmm2
> -	pminub	%xmm0, %xmm2
> -	pcmpeqd	%xmm3, %xmm2
> -	pmovmskb %xmm2, %edx
> -	addq	$64, %rax
> -	test	%edx, %edx
> -	jz	L(aligned_64_loop)
> -
> -	pcmpeqd	-64(%rax), %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$48, %rdi
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	%xmm1, %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$-16, %rdi
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	-32(%rax), %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$-16, %rdi
> -	test	%edx, %edx
> -	jnz	L(exit)
> -
> -	pcmpeqd	%xmm6, %xmm3
> -	pmovmskb %xmm3, %edx
> -	addq	$-16, %rdi
> -	test	%edx, %edx
> -	jz	L(aligned_64_loop)
> -
> -	.p2align 4
> -L(exit):
> -	sub	%rdi, %rax
> -	shr	$2, %rax
> -	test	%dl, %dl
> -	jz	L(exit_high)
> -
> -	andl	$15, %edx
> -	jz	L(exit_1)
> -	ret
> -
> -	/* No align here. Naturally aligned % 16 == 1.  */
> -L(exit_high):
> -	andl	$(15 << 8), %edx
> -	jz	L(exit_3)
> -	add	$2, %rax
> -	ret
> -
> -	.p2align 3
> -L(exit_1):
> -	add	$1, %rax
> -	ret
> -
> -	.p2align 3
> -L(exit_3):
> -	add	$3, %rax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail0):
> -	xorl	%eax, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail1):
> -	movl	$1, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail2):
> -	movl	$2, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail3):
> -	movl	$3, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail4):
> -	movl	$4, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail5):
> -	movl	$5, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail6):
> -	movl	$6, %eax
> -	ret
> -
> -	.p2align 3
> -L(exit_tail7):
> -	movl	$7, %eax
> -	ret
> -
> -END (__wcslen)
> -
> +#define WCSLEN	__wcslen
> +#include "multiarch/wcslen-sse2.S"
>  weak_alias(__wcslen, wcslen)
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.
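
Editorial aside for readers following the L(exit) path in the quoted patch: the branches there decode the 16-bit byte mask that pmovmskb yields after pcmpeqd against zero, where each 4-byte wide character contributes four mask bits. The C below is only a rough model of that decoding, not glibc code. The helper name first_zero_lane is hypothetical, and it assumes the mask is nonzero, as the assembly guarantees before it jumps to L(exit).

```c
#include <stddef.h>

/* Hypothetical helper, not part of glibc.  MASK models the pmovmskb
   result for one 16-byte block: bits 0-3 belong to lane 0, bits 4-7
   to lane 1, bits 8-11 to lane 2 and bits 12-15 to lane 3.  Returns
   the index of the first lane that compared equal to zero.  The
   caller must pass a nonzero mask.  */
static size_t
first_zero_lane (unsigned int mask)
{
  if ((mask & 0xff) != 0)		/* test %dl, %dl */
    return (mask & 0x0f) ? 0 : 1;	/* andl $15, %edx; jz L(exit_1) */
  return (mask & 0x0f00) ? 2 : 3;	/* andl $(15 << 8), %edx; jz L(exit_3) */
}
```

In the assembly the lane index is added to (%rax - %rdi) >> 2, which is the number of wide characters that precede the matching block once %rdi has been adjusted to cancel the post-load increment of %rax.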