From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20220620174628.2820531-1-danilak@google.com>
In-Reply-To:
From: Danila Kutenin
Date: Tue, 21 Jun 2022 10:28:42 +0100
Message-ID:
Subject: Re: [PATCH] aarch64: Optimize string functions with shrn instruction
To: Szabolcs Nagy
Cc: libc-alpha@sourceware.org, Danila Kutenin
Content-Type: text/plain; charset="UTF-8"
List-Id: Libc-alpha mailing list

It's a contribution from Google. Google has a copyright assignment with the
FSF; I think that should cover it.

Sorry for the email confusion, I realized the mess quite late: my git client
was configured with my Yandex address while the commit was set up with the
Google account. Yandex has nothing to do with this work. If needed, I can
recreate the patch.

On Tue, Jun 21, 2022, 10:08 Szabolcs Nagy wrote:

> The 06/20/2022 17:46, Danila Kutenin wrote:
> > From: Danila Kutenin
> >
> > We found that string functions were using AND+ADDP
> > to find the nibble/syndrome mask but there is an easier
> > opportunity through `SHRN dst, src, 4` and has same
> > latency on all SIMD ARMv8 targets as ADDP. There are also
> > gaps for memcmp but that's probably for another patch
> >
> > We see 10-20% savings for small-mid size cases which are
> > primary cases for general workloads https://pastebin.com/hA5Fd8eM
> >
> > I don't have commit rights, asking maintainers to do that
> >
> > Signed-off-by: Danila Kutenin
>
> is this a contribution from google or yandex or personal?
>
> (e.g. if your company has copyright assignment with fsf then
> you dont need signed-off-by, otherwise it's better to have
> the email address consistent with the author address)
>
> > ---
> >  sysdeps/aarch64/memchr.S | 19 +++++++------------
> >  sysdeps/aarch64/memrchr.S | 25 +++++++++----------------
> >  sysdeps/aarch64/strchrnul.S | 29 +++++++++++------------------
> >  sysdeps/aarch64/strcpy.S | 32 ++++++++++++--------------------
> >  sysdeps/aarch64/strlen.S | 25 +++++++++----------------
> >  sysdeps/aarch64/strnlen.S | 25 +++++++++----------------
> >  6 files changed, 57 insertions(+), 98 deletions(-)
> >
> > diff --git a/sysdeps/aarch64/memchr.S b/sysdeps/aarch64/memchr.S
> > index b060eee97d..b983489491 100644
> > --- a/sysdeps/aarch64/memchr.S
> > +++ b/sysdeps/aarch64/memchr.S
> > @@ -53,12 +53,11 @@
> >
> >  /*
> >  Core algorithm:
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting leading zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (MEMCHR)
> >  PTR_ARG (0)
> > @@ -67,12 +66,9 @@ ENTRY (MEMCHR)
> >  cbz cntin, L(nomatch)
> >  ld1 {vdata.16b}, [src]
> >  dup vrepchr.16b, chrin
> > - mov wtmp, 0xf00f
> > - dup vrepmask.8h, wtmp
> >  cmeq vhas_chr.16b, vdata.16b, vrepchr.16b
> >  lsl shift, srcin, 2
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  lsr synd, synd, shift
> >  cbz synd, L(start_loop)
> > @@ -111,8 +107,7 @@ L(loop32_2):
> >  fmov synd, dend
> >  cbz synd, L(loop32)
> >  L(end):
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  add tmp, srcin, cntin
> >  sub cntrem, tmp, src
> > diff --git a/sysdeps/aarch64/memrchr.S b/sysdeps/aarch64/memrchr.S
> > index e0efbad91c..5179320720 100644
> > --- a/sysdeps/aarch64/memrchr.S
> > +++ b/sysdeps/aarch64/memrchr.S
> > @@ -37,7 +37,6 @@
> >  #define synd x5
> >  #define shift x6
> >  #define tmp x7
> > -#define wtmp w7
> >  #define end x8
> >  #define endm1 x9
> >
> > @@ -45,18 +44,16 @@
> >  #define qdata q1
> >  #define vdata v1
> >  #define vhas_chr v2
> > -#define vrepmask v3
> > -#define vend v4
> > -#define dend d4
> > +#define vend v3
> > +#define dend d3
> >
> >  /*
> >  Core algorithm:
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting leading zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (__memrchr)
> >  PTR_ARG (0)
> > @@ -67,12 +64,9 @@ ENTRY (__memrchr)
> >  cbz cntin, L(nomatch)
> >  ld1 {vdata.16b}, [src]
> >  dup vrepchr.16b, chrin
> > - mov wtmp, 0xf00f
> > - dup vrepmask.8h, wtmp
> >  cmeq vhas_chr.16b, vdata.16b, vrepchr.16b
> >  neg shift, end, lsl 2
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  lsl synd, synd, shift
> >  cbz synd, L(start_loop)
> > @@ -109,8 +103,7 @@ L(loop32_2):
> >  fmov synd, dend
> >  cbz synd, L(loop32)
> >  L(end):
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >
> >  add tmp, src, 15
> > diff --git a/sysdeps/aarch64/strchrnul.S b/sysdeps/aarch64/strchrnul.S
> > index 442726fd49..ee154ab74b 100644
> > --- a/sysdeps/aarch64/strchrnul.S
> > +++ b/sysdeps/aarch64/strchrnul.S
> > @@ -33,38 +33,32 @@
> >  #define src x2
> >  #define tmp1 x1
> >  #define tmp2 x3
> > -#define tmp2w w3
> >
> >  #define vrepchr v0
> >  #define vdata v1
> >  #define qdata q1
> >  #define vhas_nul v2
> >  #define vhas_chr v3
> > -#define vrepmask v4
> > -#define vend v5
> > -#define dend d5
> > +#define vend v4
> > +#define dend d4
> >
> > -/* Core algorithm:
> > -
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > +/*
> > + Core algorithm:
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting leading zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (__strchrnul)
> >  PTR_ARG (0)
> >  bic src, srcin, 15
> >  dup vrepchr.16b, chrin
> >  ld1 {vdata.16b}, [src]
> > - mov tmp2w, 0xf00f
> > - dup vrepmask.8h, tmp2w
> >  cmeq vhas_chr.16b, vdata.16b, vrepchr.16b
> >  cmhs vhas_chr.16b, vhas_chr.16b, vdata.16b
> >  lsl tmp2, srcin, 2
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov tmp1, dend
> >  lsr tmp1, tmp1, tmp2 /* Mask padding bits. */
> >  cbz tmp1, L(loop)
> > @@ -83,8 +77,7 @@ L(loop):
> >  fmov tmp1, dend
> >  cbz tmp1, L(loop)
> >
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov tmp1, dend
> >  #ifndef __AARCH64EB__
> >  rbit tmp1, tmp1
> > diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> > index da53170ece..78d27b4aa6 100644
> > --- a/sysdeps/aarch64/strcpy.S
> > +++ b/sysdeps/aarch64/strcpy.S
> > @@ -40,7 +40,6 @@
> >  #define len x4
> >  #define synd x4
> >  #define tmp x5
> > -#define wtmp w5
> >  #define shift x5
> >  #define data1 x6
> >  #define dataw1 w6
> > @@ -50,9 +49,8 @@
> >  #define dataq q0
> >  #define vdata v0
> >  #define vhas_nul v1
> > -#define vrepmask v2
> > -#define vend v3
> > -#define dend d3
> > +#define vend v2
> > +#define dend d2
> >  #define dataq2 q1
> >
> >  #ifdef BUILD_STPCPY
> > @@ -63,34 +61,29 @@
> >  # define IFSTPCPY(X,...)
> >  #endif
> >
> > -/* Core algorithm:
> > -
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > +/*
> > + Core algorithm:
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting leading zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (STRCPY)
> >  PTR_ARG (0)
> >  PTR_ARG (1)
> >  bic src, srcin, 15
> > - mov wtmp, 0xf00f
> >  ld1 {vdata.16b}, [src]
> > - dup vrepmask.8h, wtmp
> >  cmeq vhas_nul.16b, vdata.16b, 0
> >  lsl shift, srcin, 2
> > - and vhas_nul.16b, vhas_nul.16b, vrepmask.16b
> > - addp vend.16b, vhas_nul.16b, vhas_nul.16b
> > + shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  lsr synd, synd, shift
> >  cbnz synd, L(tail)
> >
> >  ldr dataq, [src, 16]!
> >  cmeq vhas_nul.16b, vdata.16b, 0
> > - and vhas_nul.16b, vhas_nul.16b, vrepmask.16b
> > - addp vend.16b, vhas_nul.16b, vhas_nul.16b
> > + shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  cbz synd, L(start_loop)
> >
> > @@ -162,8 +155,7 @@ L(loop):
> >  fmov synd, dend
> >  cbz synd, L(loop)
> >
> > - and vhas_nul.16b, vhas_nul.16b, vrepmask.16b
> > - addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
> > + shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  #ifndef __AARCH64EB__
> >  rbit synd, synd
> > diff --git a/sysdeps/aarch64/strlen.S b/sysdeps/aarch64/strlen.S
> > index a2310871c2..3a5d088407 100644
> > --- a/sysdeps/aarch64/strlen.S
> > +++ b/sysdeps/aarch64/strlen.S
> > @@ -34,35 +34,29 @@
> >  #define src x1
> >  #define synd x2
> >  #define tmp x3
> > -#define wtmp w3
> >  #define shift x4
> >
> >  #define data q0
> >  #define vdata v0
> >  #define vhas_nul v1
> > -#define vrepmask v2
> > -#define vend v3
> > -#define dend d3
> > +#define vend v2
> > +#define dend d2
> >
> >  /* Core algorithm:
> >
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting trailing zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (STRLEN)
> >  PTR_ARG (0)
> >  bic src, srcin, 15
> > - mov wtmp, 0xf00f
> >  ld1 {vdata.16b}, [src]
> > - dup vrepmask.8h, wtmp
> >  cmeq vhas_nul.16b, vdata.16b, 0
> >  lsl shift, srcin, 2
> > - and vhas_nul.16b, vhas_nul.16b, vrepmask.16b
> > - addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
> > + shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  lsr synd, synd, shift
> >  cbz synd, L(loop)
> > @@ -80,8 +74,7 @@ L(loop):
> >  fmov synd, dend
> >  cbz synd, L(loop)
> >
> > - and vhas_nul.16b, vhas_nul.16b, vrepmask.16b
> > - addp vend.16b, vhas_nul.16b, vhas_nul.16b /* 128->64 */
> > + shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
> >  sub result, src, srcin
> >  fmov synd, dend
> >  #ifndef __AARCH64EB__
> > diff --git a/sysdeps/aarch64/strnlen.S b/sysdeps/aarch64/strnlen.S
> > index 0dbecb0ce9..282bddc9aa 100644
> > --- a/sysdeps/aarch64/strnlen.S
> > +++ b/sysdeps/aarch64/strnlen.S
> > @@ -33,39 +33,33 @@
> >  #define src x2
> >  #define synd x3
> >  #define shift x4
> > -#define wtmp w4
> >  #define tmp x4
> >  #define cntrem x5
> >
> >  #define qdata q0
> >  #define vdata v0
> >  #define vhas_chr v1
> > -#define vrepmask v2
> > -#define vend v3
> > -#define dend d3
> > +#define vend v2
> > +#define dend d2
> >
> >  /*
> >  Core algorithm:
> >
> > - For each 16-byte chunk we calculate a 64-bit syndrome value with four bits
> > - per byte. For even bytes, bits 0-3 are set if the relevant byte matched the
> > - requested character or the byte is NUL. Bits 4-7 must be zero. Bits 4-7 are
> > - set likewise for odd bytes so that adjacent bytes can be merged. Since the
> > - bits in the syndrome reflect the order in which things occur in the original
> > - string, counting trailing zeros identifies exactly which byte matched. */
> > + For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> > + per byte. We take 4 bits of every comparison byte with shift right and narrow
> > + by 4 instruction. Since the bits in the nibble mask reflect the order in
> > + which things occur in the original string, counting trailing zeros identifies
> > + exactly which byte matched. */
> >
> >  ENTRY (__strnlen)
> >  PTR_ARG (0)
> >  SIZE_ARG (1)
> >  bic src, srcin, 15
> > - mov wtmp, 0xf00f
> >  cbz cntin, L(nomatch)
> >  ld1 {vdata.16b}, [src], 16
> > - dup vrepmask.8h, wtmp
> >  cmeq vhas_chr.16b, vdata.16b, 0
> >  lsl shift, srcin, 2
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  fmov synd, dend
> >  lsr synd, synd, shift
> >  cbz synd, L(start_loop)
> > @@ -103,8 +97,7 @@ L(loop32_2):
> >  cbz synd, L(loop32)
> >
> >  L(end):
> > - and vhas_chr.16b, vhas_chr.16b, vrepmask.16b
> > - addp vend.16b, vhas_chr.16b, vhas_chr.16b /* 128->64 */
> > + shrn vend.8b, vhas_chr.8h, 4 /* 128->64 */
> >  sub src, src, 16
> >  mov synd, vend.d[0]
> >  sub result, src, srcin
> > --
> > 2.37.0.rc0.104.g0611611a94-goog
> >
>
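
[Editor's note, not part of the original message] For readers unfamiliar with
the trick being discussed, the nibble-mask technique the patch switches to can
be sketched roughly in C with NEON intrinsics as below. This is an
illustration only: the helper name is made up, a little-endian target is
assumed, and the real glibc routines in the patch are hand-written assembly.

/* Illustrative sketch (assumed names, little-endian): return the index of
   the first byte equal to C in the 16-byte chunk at P, or 16 if none.  */
#include <arm_neon.h>
#include <stdint.h>

static inline unsigned
first_match_index16 (const uint8_t *p, uint8_t c)
{
  uint8x16_t data = vld1q_u8 (p);
  /* Each matching byte becomes 0xff, every other byte 0x00.  */
  uint8x16_t cmp = vceqq_u8 (data, vdupq_n_u8 (c));
  /* SHRN #4 narrows each 16-bit lane to 8 bits, leaving a 64-bit mask with
     4 bits per input byte; this single instruction replaces the old
     AND (with the repeated 0xf00f mask) plus ADDP pair.  */
  uint8x8_t nibbles = vshrn_n_u16 (vreinterpretq_u16_u8 (cmp), 4);
  uint64_t mask = vget_lane_u64 (vreinterpret_u64_u8 (nibbles), 0);
  if (mask == 0)
    return 16;                                    /* No match in this chunk.  */
  return (unsigned) (__builtin_ctzll (mask) >> 2); /* 4 mask bits per byte.  */
}

Counting trailing zeros (or, after an RBIT, leading zeros) and dividing by
four recovers the matching byte index, which is what the updated comments in
the quoted patch describe.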