From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <hjl.tools@gmail.com>
Received: from mail-oi1-x230.google.com (mail-oi1-x230.google.com
 [IPv6:2607:f8b0:4864:20::230])
 by sourceware.org (Postfix) with ESMTPS id 72438385780C
 for <libc-alpha@sourceware.org>; Wed, 20 Jan 2021 12:48:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 72438385780C
Received: by mail-oi1-x230.google.com with SMTP id w124so24902928oia.6
 for <libc-alpha@sourceware.org>; Wed, 20 Jan 2021 04:48:13 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=zStdIaTCNBqi3kmz+5wWEg0jzd05Q+fzAK6LTW4wHuc=;
 b=YhFXtMyGEVVM45SvpsD8Uu/a026AgcCNe8Cu4wTlX24wAK18Jcq3/xxUIJimzbwkP8
 gz8lCeT0RNTXQuRsVoybgoEIinUP0EsyuEzn1x/RrodyQr+DNmY/UBAnGgehTIgy+iZ/
 nupRztLrMmM1o2z580QsbCRpCIHefk18YKo/j40dKp3tgx0sBeaJUeBDDsUG3FV/7w03
 303NxailUmS8SNfgOQa7qJ1216ZcSkrozEu3aBrrkSGGXQuUZz7zfwTegR0HPGHeGlYU
 Eww5xu5a/9GSno9DL6fsIye8f5A8+cPJfLD8Xl2Ky7OLSkyGvrttVL4vj64MPykYJpmE
 0NhQ==
X-Gm-Message-State: AOAM53304Q0V3DaKMh1eLPksw2F6TkRdi8j/N2RDA7UdOruY25qOL83R
 h9KxMdZU69qc0hKcOMplhMyTaIqwsHUO+tuPnq2Zhu+EwY8=
X-Google-Smtp-Source: ABdhPJznWfxJOWhlOq+YgOsvOaQ5LvrwMptYW0Hn5BZoZQDZXwbIlaES4suzQcxixytEOKsP23WgxMwm1nAOEStzjp8=
X-Received: by 2002:aca:34c2:: with SMTP id b185mr2739490oia.25.1611146892751; 
 Wed, 20 Jan 2021 04:48:12 -0800 (PST)
MIME-Version: 1.0
References: <20210120092914.256388-1-goldstein.w.n@gmail.com>
In-Reply-To: <20210120092914.256388-1-goldstein.w.n@gmail.com>
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 20 Jan 2021 04:47:36 -0800
Message-ID: <CAMe9rOoBYBxqDj_6-kPAaWo9a6wY3Eykdw9LtEiNNobqbxtdUw@mail.gmail.com>
Subject: Re: [PATCH] x86: Refactor and improve performance of strchr-avx2.S
To: noah <goldstein.w.n@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3036.3 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Jan 2021 12:48:15 -0000

On Wed, Jan 20, 2021 at 1:34 AM noah via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> No bug. Just seemed the performance could be improved a bit. Observed
> and expected behavior are unchanged. Optimized body of main
> loop. Updated page cross logic and optimized accordingly. Made a few
> minor instruction selection modifications. No regressions in test
> suite. Both test-strchrnul and test-strchr passed.

Thank you very much.

> Author: noah <goldstein.w.n@gmail.com>
> ---
> Possible improvements fall into 5 categories roughly ordered by
> importance:
>
> 1 - The main loop that handles 4 vectors as a time has 4 uops removed
>     from it. This is at the cost of additional port pressure on ports
>     0/1 as vpminub and vpminud can only run on those ports whereas
>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>     make up for that. Analysis either by latency, throughput, or
>     benchmarks shows its a performance improvement.
>
> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>     the jump label was especially confusing). The origional code
>     tested for this by checking if the load would split cache lines
>     not pages. I don't think there are any x86_64 architectures that

The original code has

        /* Check if we may cross page boundary with one vector load.  */
        andl    $(2 * VEC_SIZE - 1), %ecx
        cmpl    $VEC_SIZE, %ecx
        ja      L(cros_page_boundary)

It is just very conservative.   It never fails to cross page boundary.

>     support both AVX2 (Haskwell and newer) and don't have a minimum
>     page size of 4kb. Given that the cost of splitting a cacheline
>     appears to be low on CPUs that support AVX2
>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>     is one-off I don't see it as being critical to avoid. If it
>     critical to avoid a cacheline split load I think a branchless
>     approach might be better. Thoughts? Note: Given that the check was
>     changed to only be for page crosses, I think it is significantly
>     less likely than before so I moved the branch target from such
>     prime real estate. I am also unsure if there is a PAGE_SIZE define
>     in glibc somewhere I can use instead of defining it here.

Define PAGE_SIZE to 4096 is fine.

> 3 - What was origionally the more_4x_vec label was removed and the
>     code only does 3 vector blocks now. The reasoning is as follows;
>     there are two entry points to that code section, from a page cross
>     or fall through (no page cross). The fall through case is more
>     important to optimize for assuming the point above. In this case
>     the incoming alignments (i.e alignment of ptr in rdi) mod 128 can
>     be [0 - 31], [32 - 63], [64 - 95], [96 - 127]. Doing 4 vector
>     blocks optimizes for the [96 - 127] so that when the main 4x loop
>     is hit, a new 128 byte aligned segment can be started. Doing 3
>     vector blocks optimizes for the [0 - 32] case. I generally think
>     the string is more likely to for aligned to cache line size (or L2
>     prefetcher cache line pairs) than at [96 - 127] bytes. An
>     alternative would be to make that code do 5x vector blocks. This
>     would mean that at most 2 vector blocks where wasted when
>     realigning to 128 bytes (as opposed to 3x or 4x which can allow
>     for 3 vector blocks to be wasted). Thoughts?

I picked 4x because it was how I unrolled in other string functions.
3x is fine if it faster than 4x.

> 4 - Replace salq using the %cl partial register to sarx. This assumes
>     BMI2 which is Haskwell and newer but AVX2 implies Hashwell and
>     newer so I think it is safe.

Need to add a BMI2 check in strchr.c:

&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)

>
> 5 - in the first_vec_xN return blocks change the addq from using rax
>     as a destination to rdi as a destination. This just allows for 1
>     uop shorter latency.

Sounds reasonable.

> Benchamrks:
> I can post my benchmarking code in the email thread if that is
> appreciated. I benchmarked a variety of cases with different
> alignments, sizes, and data hotness (in L1, L2, etc...) so I can just
> give a simple number of x percentage improvement. They where also run
> on my personal computer (icelake-client). Of my 2732 test cases 1985
> saw an improvement with these changes, 747 performed better with the
> origional code. I should note, however, that my test cases had a
> disproportionate number of cases with 4kb page crosses, which as
> discussed I moved to a colder path.

Please submit a separate patch to add your workload to
benchtests/bench-strstr.c.

> In general the affects of this change are:
>
> large/medium sized strings (from any part of memory really) 10-30%
> performance boost.
> Small strings that are not page crosses by 10-20% performance boost.
> Small strings are cache line splits by 20-30%  performance boost.
> Small strings that cross a page boundary (4kb page that is) see a 10%
> performance regression.
>
> No regressions in test suite. Both test-strchrnul and test-strchr
> passed.

It is also used for wcschr and wcschrnul.  Do they work correctly with
your change?

> I would love to here you feedback and all of these points (or any that
> I missed).
>
> FSF Documentation has been signed and returned (via pub key and email
> respectively)
>
>  sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------
>  1 file changed, 87 insertions(+), 86 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
> index d416558d04..09c2df86d1 100644
> --- a/sysdeps/x86_64/multiarch/strchr-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
> @@ -27,10 +27,12 @@
>  # ifdef USE_AS_WCSCHR
>  #  define VPBROADCAST  vpbroadcastd
>  #  define VPCMPEQ      vpcmpeqd
> +#  define VPMINU       vpminud
>  #  define CHAR_REG     esi
>  # else
>  #  define VPBROADCAST  vpbroadcastb
>  #  define VPCMPEQ      vpcmpeqb
> +#  define VPMINU       vpminub
>  #  define CHAR_REG     sil
>  # endif
>
> @@ -39,7 +41,8 @@
>  # endif
>
>  # define VEC_SIZE 32
> -
> +# define PAGE_SIZE 4096
> +
>         .section .text.avx,"ax",@progbits
>  ENTRY (STRCHR)
>         movl    %edi, %ecx
> @@ -48,8 +51,8 @@ ENTRY (STRCHR)
>         vpxor   %xmm9, %xmm9, %xmm9
>         VPBROADCAST %xmm0, %ymm0
>         /* Check if we may cross page boundary with one vector load.  */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  Please update comments.
> -       andl    $(2 * VEC_SIZE - 1), %ecx
> -       cmpl    $VEC_SIZE, %ecx
> +       andl    $(PAGE_SIZE - 1), %ecx
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %ecx
>         ja      L(cros_page_boundary)

Can you also replace cros_page_boundary with cross_page_boundary?

>         /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
> @@ -63,45 +66,11 @@ ENTRY (STRCHR)
>         jnz     L(first_vec_x0)
>
>         /* Align data for aligned loads in the loop.  */
> -       addq    $VEC_SIZE, %rdi
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -
> -       jmp     L(more_4x_vec)
> -
> -       .p2align 4
> -L(cros_page_boundary):
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -       vmovdqu (%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       /* Remove the leading bytes.  */
> -       sarl    %cl, %eax
> -       testl   %eax, %eax
> -       jz      L(aligned_more)
> -       /* Found CHAR or the null byte.  */
> -       tzcntl  %eax, %eax
> -       addq    %rcx, %rax
> -# ifdef USE_AS_STRCHRNUL
> -       addq    %rdi, %rax
> -# else
> -       xorl    %edx, %edx
> -       leaq    (%rdi, %rax), %rax
> -       cmp     (%rax), %CHAR_REG
> -       cmovne  %rdx, %rax
> -# endif
> -       VZEROUPPER
> -       ret
> -
> -       .p2align 4
> +    andq       $-VEC_SIZE, %rdi
>  L(aligned_more):
>         addq    $VEC_SIZE, %rdi
>
> -L(more_4x_vec):
> -       /* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> +       /* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
>            since data is only aligned to VEC_SIZE.  */
>         vmovdqa (%rdi), %ymm8
>         VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -127,19 +96,9 @@ L(more_4x_vec):
>         testl   %eax, %eax
>         jnz     L(first_vec_x2)
>
> -       vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x3)
> -
> -       addq    $(VEC_SIZE * 4), %rdi
> -
> +
>         /* Align data to 4 * VEC_SIZE.  */
> -       movq    %rdi, %rcx
> -       andl    $(4 * VEC_SIZE - 1), %ecx
> +       addq    $(VEC_SIZE * 3), %rdi
>         andq    $-(4 * VEC_SIZE), %rdi
>
>         .p2align 4
> @@ -150,34 +109,61 @@ L(loop_4x_vec):
>         vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7
>         vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
>
> -       VPCMPEQ %ymm5, %ymm0, %ymm1
> -       VPCMPEQ %ymm6, %ymm0, %ymm2
> -       VPCMPEQ %ymm7, %ymm0, %ymm3
> -       VPCMPEQ %ymm8, %ymm0, %ymm4
> -
> -       VPCMPEQ %ymm5, %ymm9, %ymm5
> -       VPCMPEQ %ymm6, %ymm9, %ymm6
> -       VPCMPEQ %ymm7, %ymm9, %ymm7
> -       VPCMPEQ %ymm8, %ymm9, %ymm8
> +    /* Leaves only CHARS matching esi as 0.  */
> +    vpxor %ymm5, %ymm0, %ymm1
> +    vpxor %ymm6, %ymm0, %ymm2
> +    vpxor %ymm7, %ymm0, %ymm3
> +    vpxor %ymm8, %ymm0, %ymm4
>
> -       vpor    %ymm1, %ymm5, %ymm1
> -       vpor    %ymm2, %ymm6, %ymm2
> -       vpor    %ymm3, %ymm7, %ymm3
> -       vpor    %ymm4, %ymm8, %ymm4
> +       VPMINU  %ymm1, %ymm5, %ymm1
> +       VPMINU  %ymm2, %ymm6, %ymm2
> +       VPMINU  %ymm3, %ymm7, %ymm3
> +       VPMINU  %ymm4, %ymm8, %ymm4
>
> -       vpor    %ymm1, %ymm2, %ymm5
> -       vpor    %ymm3, %ymm4, %ymm6
> +       VPMINU  %ymm1, %ymm2, %ymm5
> +       VPMINU  %ymm3, %ymm4, %ymm6
>
> -       vpor    %ymm5, %ymm6, %ymm5
> +       VPMINU  %ymm5, %ymm6, %ymm5
>
> +    VPCMPEQ %ymm5, %ymm9, %ymm5
>         vpmovmskb %ymm5, %eax
> +    addq       $(VEC_SIZE * 4), %rdi
> +
>         testl   %eax, %eax
> -       jnz     L(4x_vec_end)
> +       jz      L(loop_4x_vec)
>
> -       addq    $(VEC_SIZE * 4), %rdi
> +    subq       $(VEC_SIZE * 4), %rdi
> +
> +L(4x_vec_end):
> +    VPCMPEQ %ymm1, %ymm9, %ymm1
> +       vpmovmskb %ymm1, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x0)
> +    VPCMPEQ %ymm2, %ymm9, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x1)
> +    VPCMPEQ %ymm3, %ymm9, %ymm3
> +       vpmovmskb %ymm3, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x2)
> +    VPCMPEQ %ymm4, %ymm9, %ymm4
> +       vpmovmskb %ymm4, %eax
>
> -       jmp     L(loop_4x_vec)
> +       tzcntl  %eax, %eax
> +# ifdef USE_AS_STRCHRNUL
> +       addq    $(VEC_SIZE * 3), %rdi
> +       addq    %rdi, %rax
> +# else
> +       xorl    %edx, %edx
> +       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       cmp     (%rax), %CHAR_REG
> +       cmovne  %rdx, %rax
> +# endif
> +       VZEROUPPER
> +       ret
>
> +
>         .p2align 4
>  L(first_vec_x0):
>         /* Found CHAR or the null byte.  */
> @@ -197,7 +183,7 @@ L(first_vec_x0):
>  L(first_vec_x1):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $VEC_SIZE, %rax
> +       addq    $VEC_SIZE, %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -212,7 +198,7 @@ L(first_vec_x1):
>  L(first_vec_x2):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 2), %rax
> +       addq    $(VEC_SIZE * 2), %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -223,32 +209,47 @@ L(first_vec_x2):
>         VZEROUPPER
>         ret
>
> +    /* Cold case for crossing page with first load.  */
>         .p2align 4
> -L(4x_vec_end):
> +L(cros_page_boundary):
> +       andl    $(VEC_SIZE - 1), %ecx
> +       andq    $-VEC_SIZE, %rdi
> +       vmovdqu (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
>         vpmovmskb %ymm1, %eax
> +       /* Remove the leading bits.  */
> +       sarxl   %ecx, %eax, %eax
>         testl   %eax, %eax
> -       jnz     L(first_vec_x0)
> -       vpmovmskb %ymm2, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x1)
> -       vpmovmskb %ymm3, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x2)
> -       vpmovmskb %ymm4, %eax
> +       jnz     L(cros_page_return)
> +
> +    /* Second block so that the 3 other blocks from L(aligned_more)
> +       will get to next 4 * VEC_SIZE alignment.  */
> +    andq       $-VEC_SIZE, %rdi
> +    addq       $VEC_SIZE, %rdi
> +    xorl    %ecx, %ecx
> +    vmovdqa    (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %eax
>         testl   %eax, %eax
> -L(first_vec_x3):
> +       jz      L(aligned_more)
> +
> +L(cros_page_return):
>         tzcntl  %eax, %eax
> +       addq    %rcx, %rax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 3), %rax
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> -       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       leaq    (%rdi, %rax), %rax
>         cmp     (%rax), %CHAR_REG
>         cmovne  %rdx, %rax
>  # endif
>         VZEROUPPER
>         ret
> -
> +
>  END (STRCHR)
> -#endif
> +# endif
> --
> 2.29.2
>


-- 
H.J.