From: Noah Goldstein <goldstein.w.n@gmail.com>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
	"Carlos O'Donell" <carlos@systemhalted.org>
Subject: Re: [PATCH v1 1/2] x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S
Date: Mon, 19 Apr 2021 12:27:16 -0400
Message-ID: <CAFUsyf+oiwQ=502PAZv2bMsLjE+D3ERxz93RcrbJyTGNmYvJQg@mail.gmail.com>
In-Reply-To: <YH2YpTNfJ1Xvx/ZK@gmail.com>

On Mon, Apr 19, 2021 at 10:50 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Apr 18, 2021 at 06:09:21PM -0400, Noah Goldstein wrote:
> > No bug. This commit adds an optimized case for the less_vec memset path
> > that uses the avx512vl/avx512bw mask store, avoiding the excessive
> > branches. test-memset and test-wmemset are passing.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> > Tests were run on the following CPUs:
> >
> > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
> >
> > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
> >
> >
> > All times are the geometric mean of N=20. The unit of time is
> > seconds.
> >
> > "Cur" refers to the current implementation
> > "New" refers to this patches implementation
> >
> > There are 3 cases that matter for performance.
> >
> > 1) ptr is not within VEC_SIZE of a page boundary
> > 2) ptr is within VEC_SIZE of a page boundary but the length is small
> >    enough that there is no page cross
> > 3) page cross.
> >
> > In case 1 (which should be the most common) the new implementation is a
> > near universal improvement. The only exception is the avx512 case for
> > size = [0, 15], where I believe the downclocking from avx512 is causing
> > the slowdown. It's worth noting that because bench-memset.c repeats the
> > same size, the branch-heavy case should be favored, as the branches will
> > all be predicted correctly. In a setting with unpredictable lengths,
> > this version should perform significantly better. For example, I
> > implemented something similar to this change for memmove/memcpy and
> > saw a ~40% speedup in bench-memcpy-random (but for other reasons that
> > change isn't a good fit there).
> >
> > Case 2 has a slowdown with this patch (roughly equivalent to the
> > performance improvement for case 1). I think this is probably less
> > important than the improvements for case 1, as page crosses are
> > probably rarer than non-page crosses.
> >
> > Case 3 has a very slight slowdown with this patch, but for the same
> > reason as above I think this patch is still an improvement.
> >
> > It's worth noting that the page cross check could be removed and the
> > mask store implementation would still be correct, but I'm finding that
> > the fault suppression is incredibly expensive from a performance
> > perspective, and without the branch I see a two orders of magnitude
> > performance regression on the case 2 benchmarks.
> >
> >
> ...
> >
> >  sysdeps/x86_64/multiarch/ifunc-memset.h       |  6 ++-
> >  .../multiarch/memset-avx512-unaligned-erms.S  |  2 +-
> >  .../multiarch/memset-evex-unaligned-erms.S    |  2 +-
> >  .../multiarch/memset-vec-unaligned-erms.S     | 52 +++++++++++++++----
> >  4 files changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-memset.h b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > index 502f946a84..eda5640541 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-memset.h
> > +++ b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > @@ -54,7 +54,8 @@ IFUNC_SELECTOR (void)
> >        && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > -       && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > +          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > +          && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> >       {
> >         if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> >           return OPTIMIZE (avx512_unaligned_erms);
> > @@ -68,7 +69,8 @@ IFUNC_SELECTOR (void)
> >    if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> >      {
> >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > -       && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > +          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > +          && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> >       {
> >         if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> >           return OPTIMIZE (evex_unaligned_erms);
>
> Please also update ifunc-impl-list.c.
>
> > diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > index 22e7b187c8..d03460be93 100644
> > --- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > @@ -19,6 +19,6 @@
> >  # define SECTION(p)          p##.evex512
> >  # define MEMSET_SYMBOL(p,s)  p##_avx512_##s
> >  # define WMEMSET_SYMBOL(p,s) p##_avx512_##s
> > -
> > +# define USE_LESS_VEC_MASKMOV        1
> >  # include "memset-vec-unaligned-erms.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > index ae0a4d6e46..eb3541ef60 100644
> > --- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > @@ -19,6 +19,6 @@
> >  # define SECTION(p)          p##.evex
> >  # define MEMSET_SYMBOL(p,s)  p##_evex_##s
> >  # define WMEMSET_SYMBOL(p,s) p##_evex_##s
> > -
> > +# define USE_LESS_VEC_MASKMOV        1
> >  # include "memset-vec-unaligned-erms.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 584747f1a1..6b02e87f48 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -63,6 +63,9 @@
> >  # endif
> >  #endif
> >
> > +#define PAGE_SIZE 4096
> > +#define LOG_PAGE_SIZE 12
> > +
> >  #ifndef SECTION
> >  # error SECTION is not defined!
> >  #endif
> > @@ -213,11 +216,38 @@ L(loop):
> >       cmpq    %rcx, %rdx
> >       jne     L(loop)
> >       VZEROUPPER_SHORT_RETURN
> > +
> > +     .p2align 4
> >  L(less_vec):
> >       /* Less than 1 VEC.  */
> >  # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> >  #  error Unsupported VEC_SIZE!
> >  # endif
> > +# ifdef USE_LESS_VEC_MASKMOV
> > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > +        cross check. Using sall instead of andl saves 3 bytes. Note
> > +        that we are using rax which is set in
> > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > +     sall    $(32 - LOG_PAGE_SIZE), %edi
>
> Why is SAL needed here?

To clear the upper bits from edi before checking for a page cross.
We only want to check edi[11:0] to see if one vec load will cross a page.
AFAIK this is the most efficient way to do that.
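
Roughly, in C, the SAL + CMP pair implements the same predicate as an AND
against (PAGE_SIZE - 1); shifting the page-offset bits to the top of the
register just lets one unsigned compare against a pre-shifted constant do
the job. A minimal sketch, assuming the evex (VEC_SIZE = 32) build; the
function names below are only illustrative, not part of the patch:

    #include <stdint.h>

    #define PAGE_SIZE     4096
    #define LOG_PAGE_SIZE 12
    #define VEC_SIZE      32   /* assumption: evex/YMM build */

    /* "Will [ptr, ptr + VEC_SIZE) cross a page?"  Conservative: it uses
       VEC_SIZE rather than the actual length, which is what lets the
       check run before the length is even looked at.  */
    static inline int
    crosses_page_and (uint32_t edi)
    {
      return (edi & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE);
    }

    /* Same predicate as the patch writes it: sall discards everything
       above bit 11, then one cmpl/ja against the pre-shifted constant
       decides the branch.  */
    static inline int
    crosses_page_shift (uint32_t edi)
    {
      return (edi << (32 - LOG_PAGE_SIZE))
             > ((uint32_t) (PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE));
    }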

>
> > +     /* Check if VEC_SIZE load cross page. Mask loads suffer serious
> > +        performance degradation when it has to fault suppress.  */
> > +     cmpl    $((PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE)), %edi
>
> Can you use only CMP, not SAL, for masked store?

What do you mean? This comparison is to check if there will be a page cross
on the store. It's not necessary for correctness, but it is for performance,
as fault suppression appears to be really expensive. The code would still be
correct with no SAL and no CMP.
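
To make that concrete, here is a rough intrinsics sketch (illustrative only,
not the patch itself, and assuming the VEC_SIZE = 32 evex build with
AVX512VL, AVX512BW and BMI2 available, as the ifunc change now requires) of
what the bzhi + kmov + vmovdqu8 sequence does:

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical helper mirroring the less_vec tail for len < 32.
       Build with -mavx512vl -mavx512bw -mbmi2.  */
    static void
    memset_less_vec_32 (char *dst, int c, size_t len)
    {
      __m256i v = _mm256_set1_epi8 ((char) c);
      /* bzhi keeps the low LEN bits of an all-ones value: one mask bit
         per byte that should actually be written (len == 0 gives an
         empty mask).  */
      __mmask32 k = (__mmask32) _bzhi_u32 (0xffffffffU, (unsigned int) len);
      /* Masked vmovdqu8: bytes whose mask bit is clear are not written
         and do not fault architecturally.  */
      _mm256_mask_storeu_epi8 (dst, k, v);
    }

So the masked store alone is always correct; the SAL/CMP/ja pair only exists
because, when dst is close enough to the end of its page that masked-out
bytes fall on the next (possibly unmapped) page, the hardware's fault
suppression is slow enough that falling back to the old branchy code wins.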

>
> > +     ja      L(cross_page)
> > +# if VEC_SIZE > 32
> > +     movq    $-1, %rcx
> > +     bzhiq   %rdx, %rcx, %rcx
> > +     kmovq   %rcx, %k1
> > +# else
> > +     movl    $-1, %ecx
> > +     bzhil   %edx, %ecx, %ecx
> > +     kmovd   %ecx, %k1
> > +# endif
> > +     vmovdqu8        %VEC(0), (%rax) {%k1}
> > +     VZEROUPPER_RETURN
> > +
> > +     .p2align 4
> > +L(cross_page):
> > +# endif
> >  # if VEC_SIZE > 32
> >       cmpb    $32, %dl
> >       jae     L(between_32_63)
> > @@ -234,36 +264,36 @@ L(less_vec):
> >       cmpb    $1, %dl
> >       ja      L(between_2_3)
> >       jb      1f
> > -     movb    %cl, (%rdi)
> > +     movb    %cl, (%rax)
> >  1:
> >       VZEROUPPER_RETURN
> >  # if VEC_SIZE > 32
> >       /* From 32 to 63.  No branch when size == 32.  */
> >  L(between_32_63):
> > -     VMOVU   %YMM0, -32(%rdi,%rdx)
> > -     VMOVU   %YMM0, (%rdi)
> > +     VMOVU   %YMM0, -32(%rax,%rdx)
> > +     VMOVU   %YMM0, (%rax)
> >       VZEROUPPER_RETURN
> >  # endif
> >  # if VEC_SIZE > 16
> >       /* From 16 to 31.  No branch when size == 16.  */
> >  L(between_16_31):
> > -     VMOVU   %XMM0, -16(%rdi,%rdx)
> > -     VMOVU   %XMM0, (%rdi)
> > +     VMOVU   %XMM0, -16(%rax,%rdx)
> > +     VMOVU   %XMM0, (%rax)
> >       VZEROUPPER_RETURN
> >  # endif
> >       /* From 8 to 15.  No branch when size == 8.  */
> >  L(between_8_15):
> > -     movq    %rcx, -8(%rdi,%rdx)
> > -     movq    %rcx, (%rdi)
> > +     movq    %rcx, -8(%rax,%rdx)
> > +     movq    %rcx, (%rax)
> >       VZEROUPPER_RETURN
> >  L(between_4_7):
> >       /* From 4 to 7.  No branch when size == 4.  */
> > -     movl    %ecx, -4(%rdi,%rdx)
> > -     movl    %ecx, (%rdi)
> > +     movl    %ecx, -4(%rax,%rdx)
> > +     movl    %ecx, (%rax)
> >       VZEROUPPER_RETURN
> >  L(between_2_3):
> >       /* From 2 to 3.  No branch when size == 2.  */
> > -     movw    %cx, -2(%rdi,%rdx)
> > -     movw    %cx, (%rdi)
> > +     movw    %cx, -2(%rax,%rdx)
> > +     movw    %cx, (%rax)
> >       VZEROUPPER_RETURN
> >  END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > --
> > 2.29.2
> >
>
> H.J.

