public inbox for libc-stable@sourceware.org
* Re: [PATCH v1 2/3] x86: Modify ENTRY in sysdep.h so that p2align can be specified
       [not found]     ` <CAFUsyfKidMCHU9MJLGpzXL-CXmXhR43B4dfnEESDfaMruW2zkg@mail.gmail.com>
@ 2022-04-23  1:13       ` Sunil Pandey
  0 siblings, 0 replies; 2+ messages in thread
From: Sunil Pandey @ 2022-04-23  1:13 UTC (permalink / raw)
  To: Noah Goldstein, libc-stable; +Cc: H.J. Lu, GNU C Library

On Fri, Oct 8, 2021 at 10:20 AM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Fri, Oct 8, 2021 at 9:27 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> > On Sun, Sep 26, 2021 at 1:53 PM Noah Goldstein <goldstein.w.n@gmail.com>
> > wrote:
> > >
> > > No bug.
> > >
> > > This change adds a new macro, ENTRY_P2ALIGN, which takes a second
> > > argument: the log2 of the desired function alignment.
> > >
> > > The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4), so this
> > > doesn't affect any existing functionality.
> > >
> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > ---
> > > Note this is a duplicate of:
> > > [v2,1/2] x86: Modify ENTRY in sysdep.h so that p2align can be specified
> > >
> > https://patchwork.sourceware.org/project/glibc/patch/20210922051657.1655745-1-goldstein.w.n@gmail.com/
> > >
> > >  sysdeps/x86/sysdep.h | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h
> > > index cac1d762fb..937180c1bd 100644
> > > --- a/sysdeps/x86/sysdep.h
> > > +++ b/sysdeps/x86/sysdep.h
> > > @@ -78,15 +78,18 @@ enum cf_protection_level
> > >  #define ASM_SIZE_DIRECTIVE(name) .size name,.-name;
> > >
> > >  /* Define an entry point visible from C.  */
> > > -#define        ENTRY(name)                                             \
> > > +#define        ENTRY_P2ALIGN(name, alignment)                          \
> > >    .globl C_SYMBOL_NAME(name);                                          \
> > >    .type C_SYMBOL_NAME(name),@function;                                 \
> > > -  .align ALIGNARG(4);                                                  \
> > > +  .align ALIGNARG(alignment);                                          \
> > >    C_LABEL(name)                                                        \
> > >    cfi_startproc;                                                       \
> > >    _CET_ENDBR;                                                          \
> > >    CALL_MCOUNT
> > >
> > > +/* Common entry 16 byte aligns.  */
> > > +#define ENTRY(name) ENTRY_P2ALIGN (name, 4)
> > > +
> > >  #undef END
> > >  #define END(name)                                                      \
> > >    cfi_endproc;                                                         \
> > > --
> > > 2.25.1
> > >
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Thanks.
> >
>
> Thanks. Pushed.
>
>
> >
> > --
> > H.J.
> >

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
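
As context for the backport, the new macro is a drop-in generalization of
ENTRY; a minimal usage sketch (the symbol `dummy_fn` is hypothetical, not
part of the patch):

#include <sysdep.h>

/* Request 2^6 = 64-byte (cache-line) alignment instead of the 2^4 =
   16-byte default that plain ENTRY provides.  */
ENTRY_P2ALIGN (dummy_fn, 6)
	ret
END (dummy_fn)

Patch 3/3 in this series uses exactly this form via
ENTRY_P2ALIGN (MEMSET_SYMBOL (__memset, unaligned_erms), 6).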


* Re: [PATCH v1 3/3] x86: Optimize memset-vec-unaligned-erms.S
       [not found]     ` <CAFUsyfKhtVfqbpbs-S2b2XjfOh9rDh58wpUuy4+oYgPx1Ecbhw@mail.gmail.com>
@ 2022-04-23  1:19       ` Sunil Pandey
  0 siblings, 0 replies; 2+ messages in thread
From: Sunil Pandey @ 2022-04-23  1:19 UTC (permalink / raw)
  To: Noah Goldstein, libc-stable; +Cc: H.J. Lu, GNU C Library

On Tue, Oct 12, 2021 at 12:02 PM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Tue, Oct 12, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sun, Sep 26, 2021 at 1:54 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > No bug.
> > >
> > > Optimizations are:
> > >
> > > 1. Change control flow for L(more_2x_vec) to fall through to the loop
> > >    and jump for L(less_4x_vec) and L(less_8x_vec). This reduces code
> > >    size and saves jumps for length > 4x VEC_SIZE.
> > >
> > > 2. For EVEX/AVX512 move L(less_vec) closer to entry.
> > >
> > > 3. Avoid complex address mode for length > 2x VEC_SIZE.
> > >
> > > 4. Slightly better alignment of the loop code from the perspective of
> > >    code size and uops.
> > >
> > > 5. Align targets so they make full use of their fetch block and, if
> > >    possible, their cache line.
> > >
> > > 6. Try to reduce the total number of icache lines that will need to
> > >    be pulled in for a given length.
> > >
> > > 7. Include a "local" version of the stosb target. For
> > >    AVX2/EVEX/AVX512, jumping to the stosb target in the sse2 code
> > >    section will almost certainly be to a new page. The new version
> > >    increases code size marginally by duplicating the target but
> > >    should get better iTLB behavior as a result.
> > >
> > > test-memset, test-wmemset, and test-bzero are all passing.
> > >
> > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > > ---
> > > Performance numbers attached in reply.
> > >
> > > Notes on testing.
> > >
> > > I page-aligned the start and end of MEMSET so that the bench-memset.c
> > > code would not be affected by code size changes.
> > >
> > > I replaced `__x86_rep_stosb_threshold` with `$2048`. Note this takes
> > > exactly the same code size. I did this because the load of
> > > `__x86_rep_stosb_threshold` can stall due to 4k aliasing from stores
> > > during the benchmark. I don't think that is what we intend to
> > > benchmark, and I believe it adds more confusion/noise to the
> > > results. (Not to say there is no reason to benchmark that case, just
> > > that it should not be accidental.)
> > >
> > > Notes on numbers:
> > >
> > > There are degradations in the (2x, 4x] VEC_SIZE range. This is
> > > expected, because the current version gives a fallthrough path to
> > > that size range whereas the new version has a jump. Generally this is
> > > also conditional on the previous ENTRY_ALIGNMENT % 64, with
> > > ENTRY_ALIGNMENT % 64 == 16 being the best for that size range in the
> > > current implementation. The new implementation hopefully makes up for
> > > this with the higher-magnitude gains in the (4x, stosb_threshold)
> > > VEC_SIZE range. Is this a blocker?
> > >
> > > For Skylake there are two other degradations.
> > >
> > > One is where length = 416 with alignment = 416. I have not been able
> > > to find the root cause of this. All other size/alignment pairs in the
> > > medium-sized range seem to perform better. Does anyone have an
> > > explanation for this, or think it's a blocker?
> > >
> > > The other is conditional on length in [1x, 2x] VEC_SIZE. I believe
> > > this is related to reordering the two overlapping stores. The
> > > previous version had the complex address mode first. It's possible
> > > that the simple address mode is taking a p23 AGU, stalling the
> > > complex address calculation for the second store; having the complex
> > > address mode first prevents this. This only appears to be an issue
> > > for the first set of tests (c=-65) and only appears to sometimes
> > > matter. My general feeling is that, if possible, ordering for
> > > sequential strided memory access makes the most sense, hence the
> > > reordering. Does anyone think we should keep the original order?
> > >
> > > Other than those three cases, the new version wins out in all other
> > > cases against all prior alignments.
> > >
> > >  sysdeps/x86_64/memset.S                       |  10 +-
> > >  .../multiarch/memset-avx2-unaligned-erms.S    |  10 +-
> > >  .../multiarch/memset-avx512-unaligned-erms.S  |  11 +-
> > >  .../multiarch/memset-evex-unaligned-erms.S    |  11 +-
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 285 ++++++++++++------
> > >  5 files changed, 232 insertions(+), 95 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/memset.S b/sysdeps/x86_64/memset.S
> > > index 7d4a327eba..0137eba4cd 100644
> > > --- a/sysdeps/x86_64/memset.S
> > > +++ b/sysdeps/x86_64/memset.S
> > > @@ -18,13 +18,15 @@
> > >     <https://www.gnu.org/licenses/>.  */
> > >
> > >  #include <sysdep.h>
> > > +#define USE_WITH_SSE2  1
> > >
> > >  #define VEC_SIZE       16
> > > +#define MOV_SIZE       3
> > > +#define RET_SIZE       1
> > > +
> > >  #define VEC(i)         xmm##i
> > > -/* Don't use movups and movaps since it will get larger nop paddings for
> > > -   alignment.  */
> > > -#define VMOVU          movdqu
> > > -#define VMOVA          movdqa
> > > +#define VMOVU     movups
> > > +#define VMOVA     movaps
> > >
> > >  #define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > >    movd d, %xmm0; \
> > > diff --git a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > > index ae0860f36a..1af668af0a 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > > @@ -1,8 +1,14 @@
> > >  #if IS_IN (libc)
> > > +# define USE_WITH_AVX2 1
> > > +
> > >  # define VEC_SIZE      32
> > > +# define MOV_SIZE      4
> > > +# define RET_SIZE      4
> > > +
> > >  # define VEC(i)                ymm##i
> > > -# define VMOVU         vmovdqu
> > > -# define VMOVA         vmovdqa
> > > +
> > > +# define VMOVU     vmovdqu
> > > +# define VMOVA     vmovdqa
> > >
> > >  # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > >    vmovd d, %xmm0; \
> > > diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > index 8ad842fc2f..5937a974bf 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > @@ -1,11 +1,18 @@
> > >  #if IS_IN (libc)
> > > +# define USE_WITH_AVX512       1
> > > +
> > >  # define VEC_SIZE      64
> > > +# define MOV_SIZE      6
> > > +# define RET_SIZE      1
> > > +
> > >  # define XMM0          xmm16
> > >  # define YMM0          ymm16
> > >  # define VEC0          zmm16
> > >  # define VEC(i)                VEC##i
> > > -# define VMOVU         vmovdqu64
> > > -# define VMOVA         vmovdqa64
> > > +
> > > +# define VMOVU     vmovdqu64
> > > +# define VMOVA     vmovdqa64
> > > +
> > >  # define VZEROUPPER
> > >
> > >  # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > > diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > index 640f092903..64b09e77cc 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > @@ -1,11 +1,18 @@
> > >  #if IS_IN (libc)
> > > +# define USE_WITH_EVEX 1
> > > +
> > >  # define VEC_SIZE      32
> > > +# define MOV_SIZE      6
> > > +# define RET_SIZE      1
> > > +
> > >  # define XMM0          xmm16
> > >  # define YMM0          ymm16
> > >  # define VEC0          ymm16
> > >  # define VEC(i)                VEC##i
> > > -# define VMOVU         vmovdqu64
> > > -# define VMOVA         vmovdqa64
> > > +
> > > +# define VMOVU     vmovdqu64
> > > +# define VMOVA     vmovdqa64
> > > +
> > >  # define VZEROUPPER
> > >
> > >  # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index ff196844a0..e723413a66 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -63,8 +63,27 @@
> > >  # endif
> > >  #endif
> > >
> > > +#if VEC_SIZE == 64
> > > +# define LOOP_4X_OFFSET        (VEC_SIZE * 4)
> > > +#else
> > > +# define LOOP_4X_OFFSET        (0)
> > > +#endif
> > > +
> > > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > > +# define END_REG       rcx
> > > +# define LOOP_REG      rdi
> > > +#else
> > > +# define END_REG       rdi
> > > +# define LOOP_REG      rdx
> > > +#endif
> > > +
> > >  #define PAGE_SIZE 4096
> > >
> > > +/* Macro to calculate size of small memset block for aligning
> > > +   purposes.  */
> > > +#define SMALL_MEMSET_ALIGN(mov_sz,     ret_sz) (2 * (mov_sz) + (ret_sz) + 1)
> > > +
> > > +
> > >  #ifndef SECTION
> > >  # error SECTION is not defined!
> > >  #endif
> > > @@ -74,6 +93,7 @@
> > >  ENTRY (__bzero)
> > >         mov     %RDI_LP, %RAX_LP /* Set return value.  */
> > >         mov     %RSI_LP, %RDX_LP /* Set n.  */
> > > +       xorl    %esi, %esi
> > >         pxor    %XMM0, %XMM0
> > >         jmp     L(entry_from_bzero)
> > >  END (__bzero)
> > > @@ -158,7 +178,7 @@ ENTRY_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned_erms))
> > >  END_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned_erms))
> > >  # endif
> > >
> > > -ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > +ENTRY_P2ALIGN (MEMSET_SYMBOL (__memset, unaligned_erms), 6)
> > >         MEMSET_VDUP_TO_VEC0_AND_SET_RETURN (%esi, %rdi)
> > >  # ifdef __ILP32__
> > >         /* Clear the upper 32 bits.  */
> > > @@ -168,75 +188,43 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
> > >         jb      L(less_vec)
> > >         cmp     $(VEC_SIZE * 2), %RDX_LP
> > >         ja      L(stosb_more_2x_vec)
> > > -       /* From VEC and to 2 * VEC.  No branch when size == VEC_SIZE.  */
> > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)
> > > -       VMOVU   %VEC(0), (%rdi)
> > > +       /* From VEC and to 2 * VEC.  No branch when size == VEC_SIZE.
> > > +        */
> > > +       VMOVU   %VEC(0), (%rax)
> > > +       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > > -
> > > -       .p2align 4
> > > -L(stosb_more_2x_vec):
> > > -       cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > -       ja      L(stosb)
> > > -#else
> > > -       .p2align 4
> > >  #endif
> > > -L(more_2x_vec):
> > > -       /* Stores to first 2x VEC before cmp as any path forward will
> > > -          require it.  */
> > > -       VMOVU   %VEC(0), (%rdi)
> > > -       VMOVU   %VEC(0), VEC_SIZE(%rdi)
> > > -       cmpq    $(VEC_SIZE * 4), %rdx
> > > -       ja      L(loop_start)
> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
> > > -       VMOVU   %VEC(0), -VEC_SIZE(%rdi,%rdx)
> > > -L(return):
> > > -#if VEC_SIZE > 16
> > > -       ZERO_UPPER_VEC_REGISTERS_RETURN
> > > +
> > > +       .p2align 4,, 10
> > > +L(last_2x_vec):
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%rcx)
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%rcx)
> > >  #else
> > > -       ret
> > > +       VMOVU   %VEC(0), (VEC_SIZE * -2)(%rdi)
> > > +       VMOVU   %VEC(0), (VEC_SIZE * -1)(%rdi)
> > >  #endif
> > > +       VZEROUPPER_RETURN
> > >
> > > -L(loop_start):
> > > -       VMOVU   %VEC(0), (VEC_SIZE * 2)(%rdi)
> > > -       VMOVU   %VEC(0), (VEC_SIZE * 3)(%rdi)
> > > -       cmpq    $(VEC_SIZE * 8), %rdx
> > > -       jbe     L(loop_end)
> > > -       andq    $-(VEC_SIZE * 2), %rdi
> > > -       subq    $-(VEC_SIZE * 4), %rdi
> > > -       leaq    -(VEC_SIZE * 4)(%rax, %rdx), %rcx
> > > -       .p2align 4
> > > -L(loop):
> > > -       VMOVA   %VEC(0), (%rdi)
> > > -       VMOVA   %VEC(0), VEC_SIZE(%rdi)
> > > -       VMOVA   %VEC(0), (VEC_SIZE * 2)(%rdi)
> > > -       VMOVA   %VEC(0), (VEC_SIZE * 3)(%rdi)
> > > -       subq    $-(VEC_SIZE * 4), %rdi
> > > -       cmpq    %rcx, %rdi
> > > -       jb      L(loop)
> > > -L(loop_end):
> > > -       /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
> > > -              rdx as length is also unchanged.  */
> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
> > > -       VMOVU   %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
> > > -       VMOVU   %VEC(0), -VEC_SIZE(%rax, %rdx)
> > > -       VZEROUPPER_SHORT_RETURN
> > > -
> > > -       .p2align 4
> > > +       /* If have AVX512 mask instructions put L(less_vec) close to
> > > +          entry as it doesn't take much space and is likely a hot target.
> > > +        */
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > +       .p2align 4,, 10
> > >  L(less_vec):
> > >         /* Less than 1 VEC.  */
> > >  # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > >  #  error Unsupported VEC_SIZE!
> > >  # endif
> > > -# ifdef USE_LESS_VEC_MASK_STORE
> > >         /* Clear high bits from edi. Only keeping bits relevant to page
> > >            cross check. Note that we are using rax which is set in
> > > -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.
> > > -        */
> > > +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > >         andl    $(PAGE_SIZE - 1), %edi
> > > -       /* Check if VEC_SIZE store cross page. Mask stores suffer serious
> > > -          performance degradation when it has to fault supress.  */
> > > +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > +          serious performance degradation when it has to fault supress.
> > > +        */
> > >         cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > +       /* This is generally considered a cold target.  */
> > >         ja      L(cross_page)
> > >  # if VEC_SIZE > 32
> > >         movq    $-1, %rcx
> > > @@ -247,58 +235,185 @@ L(less_vec):
> > >         bzhil   %edx, %ecx, %ecx
> > >         kmovd   %ecx, %k1
> > >  # endif
> > > -       vmovdqu8        %VEC(0), (%rax) {%k1}
> > > +       vmovdqu8 %VEC(0), (%rax){%k1}
> > >         VZEROUPPER_RETURN
> > >
> > > +# if defined USE_MULTIARCH && IS_IN (libc)
> > > +       /* Include L(stosb_local) here if including L(less_vec) between
> > > +          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > +          L(stosb_more_2x_vec) target.  */
> > > +       .p2align 4,, 10
> > > +L(stosb_local):
> > > +       movzbl  %sil, %eax
> > > +       mov     %RDX_LP, %RCX_LP
> > > +       mov     %RDI_LP, %RDX_LP
> > > +       rep     stosb
> > > +       mov     %RDX_LP, %RAX_LP
> > > +       VZEROUPPER_RETURN
> > > +# endif
> > > +#endif
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > >         .p2align 4
> > > -L(cross_page):
> > > +L(stosb_more_2x_vec):
> > > +       cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > +       ja      L(stosb_local)
> > > +#endif
> > > +       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > +          and (4x, 8x] jump to target.  */
> > > +L(more_2x_vec):
> > > +
> > > +       /* Two different methods of setting up pointers / compare. The
> > > +          two methods are based on the fact that EVEX/AVX512 mov
> > > +          instructions take more bytes then AVX2/SSE2 mov instructions. As
> > > +          well that EVEX/AVX512 machines also have fast LEA_BID. Both
> > > +          setup and END_REG to avoid complex address mode. For EVEX/AVX512
> > > +          this saves code size and keeps a few targets in one fetch block.
> > > +          For AVX2/SSE2 this helps prevent AGU bottlenecks.  */
> > > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > > +       /* If EVEX/AVX512 compute END_REG - (VEC_SIZE * 4 +
> > > +          LOOP_4X_OFFSET) with LEA_BID.  */
> > > +
> > > +       /* END_REG is rcx for EVEX/AVX512.  */
> > > +       leaq    -(VEC_SIZE * 4 + LOOP_4X_OFFSET)(%rdi, %rdx), %END_REG
> > > +#endif
> > > +
> > > +       /* Stores to first 2x VEC before cmp as any path forward will
> > > +          require it.  */
> > > +       VMOVU   %VEC(0), (%rax)
> > > +       VMOVU   %VEC(0), VEC_SIZE(%rax)
> > > +
> > > +
> > > +#if !(defined USE_WITH_EVEX || defined USE_WITH_AVX512)
> > > +       /* If AVX2/SSE2 compute END_REG (rdi) with ALU.  */
> > > +       addq    %rdx, %END_REG
> > > +#endif
> > > +
> > > +       cmpq    $(VEC_SIZE * 4), %rdx
> > > +       jbe     L(last_2x_vec)
> > > +
> > > +       /* Store next 2x vec regardless.  */
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 2)(%rax)
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 3)(%rax)
> > > +
> > > +
> > > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > > +       /* If LOOP_4X_OFFSET don't readjust LOOP_REG (rdi), just add
> > > +          extra offset to addresses in loop. Used for AVX512 to save space
> > > +          as no way to get (VEC_SIZE * 4) in imm8.  */
> > > +# if LOOP_4X_OFFSET == 0
> > > +       subq    $-(VEC_SIZE * 4), %LOOP_REG
> > >  # endif
> > > -# if VEC_SIZE > 32
> > > -       cmpb    $32, %dl
> > > -       jae     L(between_32_63)
> > > +       /* Avoid imm32 compare here to save code size.  */
> > > +       cmpq    %rdi, %rcx
> > > +#else
> > > +       addq    $-(VEC_SIZE * 4), %END_REG
> > > +       cmpq    $(VEC_SIZE * 8), %rdx
> > > +#endif
> > > +       jbe     L(last_4x_vec)
> > > +#if !(defined USE_WITH_EVEX || defined USE_WITH_AVX512)
> > > +       /* Set LOOP_REG (rdx).  */
> > > +       leaq    (VEC_SIZE * 4)(%rax), %LOOP_REG
> > > +#endif
> > > +       /* Align dst for loop.  */
> > > +       andq    $(VEC_SIZE * -2), %LOOP_REG
> > > +       .p2align 4
> > > +L(loop):
> > > +       VMOVA   %VEC(0), LOOP_4X_OFFSET(%LOOP_REG)
> > > +       VMOVA   %VEC(0), (VEC_SIZE + LOOP_4X_OFFSET)(%LOOP_REG)
> > > +       VMOVA   %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%LOOP_REG)
> > > +       VMOVA   %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%LOOP_REG)
> > > +       subq    $-(VEC_SIZE * 4), %LOOP_REG
> > > +       cmpq    %END_REG, %LOOP_REG
> > > +       jb      L(loop)
> > > +       .p2align 4,, MOV_SIZE
> > > +L(last_4x_vec):
> > > +       VMOVU   %VEC(0), LOOP_4X_OFFSET(%END_REG)
> > > +       VMOVU   %VEC(0), (VEC_SIZE + LOOP_4X_OFFSET)(%END_REG)
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%END_REG)
> > > +       VMOVU   %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%END_REG)
> > > +L(return):
> > > +#if VEC_SIZE > 16
> > > +       ZERO_UPPER_VEC_REGISTERS_RETURN
> > > +#else
> > > +       ret
> > > +#endif
> > > +
> > > +       .p2align 4,, 10
> > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > +# if defined USE_MULTIARCH && IS_IN (libc)
> > > +       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > +          range for 2-byte jump encoding.  */
> > > +L(stosb_local):
> > > +       movzbl  %sil, %eax
> > > +       mov     %RDX_LP, %RCX_LP
> > > +       mov     %RDI_LP, %RDX_LP
> > > +       rep     stosb
> > > +       mov     %RDX_LP, %RAX_LP
> > > +       VZEROUPPER_RETURN
> > >  # endif
> > > -# if VEC_SIZE > 16
> > > -       cmpb    $16, %dl
> > > +       /* Define L(less_vec) only if not otherwise defined.  */
> > > +       .p2align 4
> > > +L(less_vec):
> > > +#endif
> > > +L(cross_page):
> > > +#if VEC_SIZE > 32
> > > +       cmpl    $32, %edx
> > > +       jae     L(between_32_63)
> > > +#endif
> > > +#if VEC_SIZE > 16
> > > +       cmpl    $16, %edx
> > >         jae     L(between_16_31)
> > > -# endif
> > > -       MOVQ    %XMM0, %rcx
> > > -       cmpb    $8, %dl
> > > +#endif
> > > +       MOVQ    %XMM0, %rdi
> > > +       cmpl    $8, %edx
> > >         jae     L(between_8_15)
> > > -       cmpb    $4, %dl
> > > +       cmpl    $4, %edx
> > >         jae     L(between_4_7)
> > > -       cmpb    $1, %dl
> > > +       cmpl    $1, %edx
> > >         ja      L(between_2_3)
> > > -       jb      1f
> > > -       movb    %cl, (%rax)
> > > -1:
> > > +       jb      L(return)
> > > +       movb    %sil, (%rax)
> > >         VZEROUPPER_RETURN
> > > -# if VEC_SIZE > 32
> > > +
> > > +       /* Align small targets only if not doing so would cross a fetch
> > > +          line.  */
> > > +#if VEC_SIZE > 32
> > > +       .p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)
> > >         /* From 32 to 63.  No branch when size == 32.  */
> > >  L(between_32_63):
> > > -       VMOVU   %YMM0, -32(%rax,%rdx)
> > >         VMOVU   %YMM0, (%rax)
> > > +       VMOVU   %YMM0, -32(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > > -# endif
> > > -# if VEC_SIZE > 16
> > > -       /* From 16 to 31.  No branch when size == 16.  */
> > > +#endif
> > > +
> > > +#if VEC_SIZE >= 32
> > > +       .p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)
> > >  L(between_16_31):
> > > -       VMOVU   %XMM0, -16(%rax,%rdx)
> > > +       /* From 16 to 31.  No branch when size == 16.  */
> > >         VMOVU   %XMM0, (%rax)
> > > +       VMOVU   %XMM0, -16(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > > -# endif
> > > -       /* From 8 to 15.  No branch when size == 8.  */
> > > +#endif
> > > +
> > > +       .p2align 4,, SMALL_MEMSET_ALIGN(3, RET_SIZE)
> > >  L(between_8_15):
> > > -       movq    %rcx, -8(%rax,%rdx)
> > > -       movq    %rcx, (%rax)
> > > +       /* From 8 to 15.  No branch when size == 8.  */
> > > +       movq    %rdi, (%rax)
> > > +       movq    %rdi, -8(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > > +
> > > +       .p2align 4,, SMALL_MEMSET_ALIGN(2, RET_SIZE)
> > >  L(between_4_7):
> > >         /* From 4 to 7.  No branch when size == 4.  */
> > > -       movl    %ecx, -4(%rax,%rdx)
> > > -       movl    %ecx, (%rax)
> > > +       movl    %edi, (%rax)
> > > +       movl    %edi, -4(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > > +
> > > +       .p2align 4,, SMALL_MEMSET_ALIGN(3, RET_SIZE)
> > >  L(between_2_3):
> > >         /* From 2 to 3.  No branch when size == 2.  */
> > > -       movw    %cx, -2(%rax,%rdx)
> > > -       movw    %cx, (%rax)
> > > +       movw    %di, (%rax)
> > > +       movb    %dil, -1(%rax, %rdx)
> > >         VZEROUPPER_RETURN
> > >  END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > --
> > > 2.25.1
> > >
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Thanks.
> >
> > --
> > H.J.
>
> Thanks! Pushed.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
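
One idiom worth noting when reviewing the backport: the alignment
directives above use GAS's three-operand .p2align, whose last operand
bounds the padding. A minimal sketch (the label is hypothetical, not
from the patch):

	/* Align to 1 << 4 = 16 bytes, but only if that takes at most 10
	   bytes of padding; if more would be needed, emit no padding and
	   leave the target where it is.  */
	.p2align 4,, 10
example_target:
	ret

The patch applies this as
`.p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)`, bounding the
padding by the size of the small memset block itself (2 * mov_sz +
ret_sz + 1); as the comment in the patch puts it, "Align small targets
only if not doing so would cross a fetch line."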


end of thread

Thread overview: 2+ messages
     [not found] <20210926205306.900081-1-goldstein.w.n@gmail.com>
     [not found] ` <20210926205306.900081-2-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOqaYwc=TujPmZ+KEicKBEUachMdXg5ixKjKxcgBYN=h0w@mail.gmail.com>
     [not found]     ` <CAFUsyfKidMCHU9MJLGpzXL-CXmXhR43B4dfnEESDfaMruW2zkg@mail.gmail.com>
2022-04-23  1:13       ` [PATCH v1 2/3] x86: Modify ENTRY in sysdep.h so that p2align can be specified Sunil Pandey
     [not found] ` <20210926205306.900081-3-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOo-qJRkqYjwuO_pqQySfDFHtg6wSTcM1W=APwCCFpPiWg@mail.gmail.com>
     [not found]     ` <CAFUsyfKhtVfqbpbs-S2b2XjfOh9rDh58wpUuy4+oYgPx1Ecbhw@mail.gmail.com>
2022-04-23  1:19       ` [PATCH v1 3/3] x86: Optimize memset-vec-unaligned-erms.S Sunil Pandey
