From: Noah Goldstein <goldstein.w.n@gmail.com>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
"Carlos O'Donell" <carlos@systemhalted.org>
Subject: Re: [PATCH v1 3/3] x86: Optimize memset-vec-unaligned-erms.S
Date: Tue, 12 Oct 2021 14:01:42 -0500 [thread overview]
Message-ID: <CAFUsyfKhtVfqbpbs-S2b2XjfOh9rDh58wpUuy4+oYgPx1Ecbhw@mail.gmail.com> (raw)
In-Reply-To: <CAMe9rOo-qJRkqYjwuO_pqQySfDFHtg6wSTcM1W=APwCCFpPiWg@mail.gmail.com>
On Tue, Oct 12, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Sep 26, 2021 at 1:54 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > No bug.
> >
> > Optimizations are:
> >
> > 1. Change control flow for L(more_2x_vec) to fall through to the loop
> > and jump for L(less_4x_vec) and L(less_8x_vec). This reduces code
> > size and saves jumps for lengths > 4x VEC_SIZE.
> >
> > 2. For EVEX/AVX512 move L(less_vec) closer to entry.
> >
> > 3. Avoid complex address modes for length > 2x VEC_SIZE.
> >
> > 4. Slightly better alignment code for the loop from the perspective of
> > code size and uops.
> >
> > 5. Align targets so they make full use of their fetch block and, if
> > possible, their cache line (see the sketch after this list).
> >
> > 6. Try to reduce the total number of icache lines that will need to
> > be pulled in for a given length.
> >
> > 7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
> > jumping to the stosb target in the sse2 code section will almost
> > certainly be to a new page. The new version does increase code size
> > marginally by duplicating the target but should get better iTLB
> > behavior as a result.
> >
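> > As a rough sketch of item 5, the small-size targets in the patch are
> > padded with the SMALL_MEMSET_ALIGN macro; the accounting below is my
> > reading of it and is not spelled out in the patch itself:
> >
> >     /* A small memset target is two stores plus a return, so it spans
> >        roughly 2 * mov_sz + ret_sz (+ 1 byte of slack).  ".p2align 4,, N"
> >        pads to a 16-byte boundary only if at most N padding bytes are
> >        needed, so the target is aligned only when leaving it unaligned
> >        would split it across a fetch line.  */
> >     .p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)
> > L(between_32_63):
> >     /* From 32 to 63.  No branch when size == 32.  */
> >     VMOVU   %YMM0, (%rax)
> >     VMOVU   %YMM0, -32(%rax, %rdx)
> >     VZEROUPPER_RETURN
> >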
> > test-memset, test-wmemset, and test-bzero are all passing.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> > Performance numbers attached in reply.
> >
> > Notes on testing.
> >
> > I page-aligned the start and end of MEMSET so that the bench-memset.c
> > measurements would not be affected by code size changes.
> >
> > I replaced `_x86_rep_stosb_threshold` with `$2048`. Note that this has
> > exactly the same code size. I did this because the load of
> > `_x86_rep_stosb_threshold` can stall due to 4k aliasing with stores
> > made during the benchmark. I don't think this is what we intend to
> > benchmark, and I believe it adds confusion/noise to the
> > results. (Not to say there is no reason to benchmark that case, just
> > that it should not happen accidentally.)
> >
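> > Roughly, the benchmarking substitution described above was (a sketch;
> > with a 64-bit length register both forms should encode to 7 bytes, so
> > the layout of the code after the compare is unchanged):
> >
> >     -	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
> >     +	cmp	$2048, %RDX_LP
> >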
> > Notes on numbers:
> >
> > There are degradations in the (2x, 4x] VEC_SIZE range. This is
> > expected because the current version gives a fallthrough path to
> > that size range whereas the new version has a jump. Generally this is
> > also conditional on the previous ENTRY_ALIGNMENT % 64, with
> > ENTRY_ALIGNMENT % 64 == 16 being the best for that size range in the
> > current implementation. The new implementation hopefully makes up for
> > this with the higher-magnitude gains in the (4x, stosb_threshold)
> > VEC_SIZE range. Is this a blocker?
> >
> > For Skylake there are two other degradations.
> >
> > One is where length = 416 with alignment = 416. I have not been able
> > to find the root cause of this. All other size/alignment pairs in the
> > medium size range seem to perform better. Does anyone have an
> > explanation for this or think it's a blocker?
> >
> > The other is conditional on length in [1x, 2x] VEC_SIZE. I believe
> > this is related to reordering the two overlapping stores. The previous
> > version had the complex address mode first. It's possible that the
> > simple address mode store is taking the p23 AGU, stalling the complex
> > address calculation for the second store. Having the complex address
> > mode first prevents this. This only appears to be an issue for the
> > first set of tests (c=-65) and only appears to sometimes matter. My
> > general feeling is that, where possible, ordering the stores for
> > sequential strided memory access makes the most sense, hence the
> > reordering. Does anyone think we should keep the original order?
> >
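> > Concretely, the overlapping stores for lengths in [1x, 2x] VEC_SIZE went
> > from complex-address-mode first to simple-address-mode first (this is
> > the corresponding hunk in memset-vec-unaligned-erms.S below):
> >
> >     /* Old order: complex (base + index) address mode store first.  */
> >     	VMOVU	%VEC(0), -VEC_SIZE(%rdi, %rdx)
> >     	VMOVU	%VEC(0), (%rdi)
> >
> >     /* New order: simple (base only) address mode store first.  */
> >     	VMOVU	%VEC(0), (%rax)
> >     	VMOVU	%VEC(0), -VEC_SIZE(%rax, %rdx)
> >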
> > Other than those 1/3 cases the new version wins out in all other cases
> > against all prior alignments.
> >
> > sysdeps/x86_64/memset.S | 10 +-
> > .../multiarch/memset-avx2-unaligned-erms.S | 10 +-
> > .../multiarch/memset-avx512-unaligned-erms.S | 11 +-
> > .../multiarch/memset-evex-unaligned-erms.S | 11 +-
> > .../multiarch/memset-vec-unaligned-erms.S | 285 ++++++++++++------
> > 5 files changed, 232 insertions(+), 95 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/memset.S b/sysdeps/x86_64/memset.S
> > index 7d4a327eba..0137eba4cd 100644
> > --- a/sysdeps/x86_64/memset.S
> > +++ b/sysdeps/x86_64/memset.S
> > @@ -18,13 +18,15 @@
> > <https://www.gnu.org/licenses/>. */
> >
> > #include <sysdep.h>
> > +#define USE_WITH_SSE2 1
> >
> > #define VEC_SIZE 16
> > +#define MOV_SIZE 3
> > +#define RET_SIZE 1
> > +
> > #define VEC(i) xmm##i
> > -/* Don't use movups and movaps since it will get larger nop paddings for
> > - alignment. */
> > -#define VMOVU movdqu
> > -#define VMOVA movdqa
> > +#define VMOVU movups
> > +#define VMOVA movaps
> >
> > #define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > movd d, %xmm0; \
> > diff --git a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > index ae0860f36a..1af668af0a 100644
> > --- a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
> > @@ -1,8 +1,14 @@
> > #if IS_IN (libc)
> > +# define USE_WITH_AVX2 1
> > +
> > # define VEC_SIZE 32
> > +# define MOV_SIZE 4
> > +# define RET_SIZE 4
> > +
> > # define VEC(i) ymm##i
> > -# define VMOVU vmovdqu
> > -# define VMOVA vmovdqa
> > +
> > +# define VMOVU vmovdqu
> > +# define VMOVA vmovdqa
> >
> > # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > vmovd d, %xmm0; \
> > diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > index 8ad842fc2f..5937a974bf 100644
> > --- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > @@ -1,11 +1,18 @@
> > #if IS_IN (libc)
> > +# define USE_WITH_AVX512 1
> > +
> > # define VEC_SIZE 64
> > +# define MOV_SIZE 6
> > +# define RET_SIZE 1
> > +
> > # define XMM0 xmm16
> > # define YMM0 ymm16
> > # define VEC0 zmm16
> > # define VEC(i) VEC##i
> > -# define VMOVU vmovdqu64
> > -# define VMOVA vmovdqa64
> > +
> > +# define VMOVU vmovdqu64
> > +# define VMOVA vmovdqa64
> > +
> > # define VZEROUPPER
> >
> > # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > index 640f092903..64b09e77cc 100644
> > --- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > @@ -1,11 +1,18 @@
> > #if IS_IN (libc)
> > +# define USE_WITH_EVEX 1
> > +
> > # define VEC_SIZE 32
> > +# define MOV_SIZE 6
> > +# define RET_SIZE 1
> > +
> > # define XMM0 xmm16
> > # define YMM0 ymm16
> > # define VEC0 ymm16
> > # define VEC(i) VEC##i
> > -# define VMOVU vmovdqu64
> > -# define VMOVA vmovdqa64
> > +
> > +# define VMOVU vmovdqu64
> > +# define VMOVA vmovdqa64
> > +
> > # define VZEROUPPER
> >
> > # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index ff196844a0..e723413a66 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -63,8 +63,27 @@
> > # endif
> > #endif
> >
> > +#if VEC_SIZE == 64
> > +# define LOOP_4X_OFFSET (VEC_SIZE * 4)
> > +#else
> > +# define LOOP_4X_OFFSET (0)
> > +#endif
> > +
> > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > +# define END_REG rcx
> > +# define LOOP_REG rdi
> > +#else
> > +# define END_REG rdi
> > +# define LOOP_REG rdx
> > +#endif
> > +
> > #define PAGE_SIZE 4096
> >
> > +/* Macro to calculate size of small memset block for aligning
> > + purposes. */
> > +#define SMALL_MEMSET_ALIGN(mov_sz, ret_sz) (2 * (mov_sz) + (ret_sz) + 1)
> > +
> > +
> > #ifndef SECTION
> > # error SECTION is not defined!
> > #endif
> > @@ -74,6 +93,7 @@
> > ENTRY (__bzero)
> > mov %RDI_LP, %RAX_LP /* Set return value. */
> > mov %RSI_LP, %RDX_LP /* Set n. */
> > + xorl %esi, %esi
> > pxor %XMM0, %XMM0
> > jmp L(entry_from_bzero)
> > END (__bzero)
> > @@ -158,7 +178,7 @@ ENTRY_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned_erms))
> > END_CHK (MEMSET_CHK_SYMBOL (__memset_chk, unaligned_erms))
> > # endif
> >
> > -ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
> > +ENTRY_P2ALIGN (MEMSET_SYMBOL (__memset, unaligned_erms), 6)
> > MEMSET_VDUP_TO_VEC0_AND_SET_RETURN (%esi, %rdi)
> > # ifdef __ILP32__
> > /* Clear the upper 32 bits. */
> > @@ -168,75 +188,43 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
> > jb L(less_vec)
> > cmp $(VEC_SIZE * 2), %RDX_LP
> > ja L(stosb_more_2x_vec)
> > - /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */
> > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx)
> > - VMOVU %VEC(0), (%rdi)
> > + /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE.
> > + */
> > + VMOVU %VEC(0), (%rax)
> > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx)
> > VZEROUPPER_RETURN
> > -
> > - .p2align 4
> > -L(stosb_more_2x_vec):
> > - cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > - ja L(stosb)
> > -#else
> > - .p2align 4
> > #endif
> > -L(more_2x_vec):
> > - /* Stores to first 2x VEC before cmp as any path forward will
> > - require it. */
> > - VMOVU %VEC(0), (%rdi)
> > - VMOVU %VEC(0), VEC_SIZE(%rdi)
> > - cmpq $(VEC_SIZE * 4), %rdx
> > - ja L(loop_start)
> > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
> > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx)
> > -L(return):
> > -#if VEC_SIZE > 16
> > - ZERO_UPPER_VEC_REGISTERS_RETURN
> > +
> > + .p2align 4,, 10
> > +L(last_2x_vec):
> > +#ifdef USE_LESS_VEC_MASK_STORE
> > + VMOVU %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%rcx)
> > + VMOVU %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%rcx)
> > #else
> > - ret
> > + VMOVU %VEC(0), (VEC_SIZE * -2)(%rdi)
> > + VMOVU %VEC(0), (VEC_SIZE * -1)(%rdi)
> > #endif
> > + VZEROUPPER_RETURN
> >
> > -L(loop_start):
> > - VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi)
> > - VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi)
> > - cmpq $(VEC_SIZE * 8), %rdx
> > - jbe L(loop_end)
> > - andq $-(VEC_SIZE * 2), %rdi
> > - subq $-(VEC_SIZE * 4), %rdi
> > - leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx
> > - .p2align 4
> > -L(loop):
> > - VMOVA %VEC(0), (%rdi)
> > - VMOVA %VEC(0), VEC_SIZE(%rdi)
> > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi)
> > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi)
> > - subq $-(VEC_SIZE * 4), %rdi
> > - cmpq %rcx, %rdi
> > - jb L(loop)
> > -L(loop_end):
> > - /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
> > - rdx as length is also unchanged. */
> > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
> > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
> > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
> > - VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx)
> > - VZEROUPPER_SHORT_RETURN
> > -
> > - .p2align 4
> > + /* If we have AVX512 mask instructions, put L(less_vec) close to the
> > + entry as it doesn't take much space and is likely a hot target.
> > + */
> > +#ifdef USE_LESS_VEC_MASK_STORE
> > + .p2align 4,, 10
> > L(less_vec):
> > /* Less than 1 VEC. */
> > # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > # error Unsupported VEC_SIZE!
> > # endif
> > -# ifdef USE_LESS_VEC_MASK_STORE
> > /* Clear high bits from edi. Only keeping bits relevant to page
> > cross check. Note that we are using rax which is set in
> > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.
> > - */
> > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > andl $(PAGE_SIZE - 1), %edi
> > - /* Check if VEC_SIZE store cross page. Mask stores suffer serious
> > - performance degradation when it has to fault supress. */
> > + /* Check if the VEC_SIZE store crosses a page. Mask stores suffer
> > + serious performance degradation when they have to fault suppress.
> > + */
> > cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > + /* This is generally considered a cold target. */
> > ja L(cross_page)
> > # if VEC_SIZE > 32
> > movq $-1, %rcx
> > @@ -247,58 +235,185 @@ L(less_vec):
> > bzhil %edx, %ecx, %ecx
> > kmovd %ecx, %k1
> > # endif
> > - vmovdqu8 %VEC(0), (%rax) {%k1}
> > + vmovdqu8 %VEC(0), (%rax){%k1}
> > VZEROUPPER_RETURN
> >
> > +# if defined USE_MULTIARCH && IS_IN (libc)
> > + /* Include L(stosb_local) here if including L(less_vec) between
> > + L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > + L(stosb_more_2x_vec) target. */
> > + .p2align 4,, 10
> > +L(stosb_local):
> > + movzbl %sil, %eax
> > + mov %RDX_LP, %RCX_LP
> > + mov %RDI_LP, %RDX_LP
> > + rep stosb
> > + mov %RDX_LP, %RAX_LP
> > + VZEROUPPER_RETURN
> > +# endif
> > +#endif
> > +
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > .p2align 4
> > -L(cross_page):
> > +L(stosb_more_2x_vec):
> > + cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > + ja L(stosb_local)
> > +#endif
> > + /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > + and (4x, 8x] jump to target. */
> > +L(more_2x_vec):
> > +
> > + /* Two different methods of setting up pointers / compare. The
> > + two methods are based on the fact that EVEX/AVX512 mov
> > + instructions take more bytes than AVX2/SSE2 mov instructions. As
> > + well that EVEX/AVX512 machines also have fast LEA_BID. Both
> > + setup and END_REG to avoid complex address mode. For EVEX/AVX512
> > + this saves code size and keeps a few targets in one fetch block.
> > + For AVX2/SSE2 this helps prevent AGU bottlenecks. */
> > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > + /* If EVEX/AVX512 compute END_REG - (VEC_SIZE * 4 +
> > + LOOP_4X_OFFSET) with LEA_BID. */
> > +
> > + /* END_REG is rcx for EVEX/AVX512. */
> > + leaq -(VEC_SIZE * 4 + LOOP_4X_OFFSET)(%rdi, %rdx), %END_REG
> > +#endif
> > +
> > + /* Stores to first 2x VEC before cmp as any path forward will
> > + require it. */
> > + VMOVU %VEC(0), (%rax)
> > + VMOVU %VEC(0), VEC_SIZE(%rax)
> > +
> > +
> > +#if !(defined USE_WITH_EVEX || defined USE_WITH_AVX512)
> > + /* If AVX2/SSE2 compute END_REG (rdi) with ALU. */
> > + addq %rdx, %END_REG
> > +#endif
> > +
> > + cmpq $(VEC_SIZE * 4), %rdx
> > + jbe L(last_2x_vec)
> > +
> > + /* Store next 2x vec regardless. */
> > + VMOVU %VEC(0), (VEC_SIZE * 2)(%rax)
> > + VMOVU %VEC(0), (VEC_SIZE * 3)(%rax)
> > +
> > +
> > +#if defined USE_WITH_EVEX || defined USE_WITH_AVX512
> > + /* If LOOP_4X_OFFSET don't readjust LOOP_REG (rdi), just add
> > + extra offset to addresses in loop. Used for AVX512 to save space
> > + as no way to get (VEC_SIZE * 4) in imm8. */
> > +# if LOOP_4X_OFFSET == 0
> > + subq $-(VEC_SIZE * 4), %LOOP_REG
> > # endif
> > -# if VEC_SIZE > 32
> > - cmpb $32, %dl
> > - jae L(between_32_63)
> > + /* Avoid imm32 compare here to save code size. */
> > + cmpq %rdi, %rcx
> > +#else
> > + addq $-(VEC_SIZE * 4), %END_REG
> > + cmpq $(VEC_SIZE * 8), %rdx
> > +#endif
> > + jbe L(last_4x_vec)
> > +#if !(defined USE_WITH_EVEX || defined USE_WITH_AVX512)
> > + /* Set LOOP_REG (rdx). */
> > + leaq (VEC_SIZE * 4)(%rax), %LOOP_REG
> > +#endif
> > + /* Align dst for loop. */
> > + andq $(VEC_SIZE * -2), %LOOP_REG
> > + .p2align 4
> > +L(loop):
> > + VMOVA %VEC(0), LOOP_4X_OFFSET(%LOOP_REG)
> > + VMOVA %VEC(0), (VEC_SIZE + LOOP_4X_OFFSET)(%LOOP_REG)
> > + VMOVA %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%LOOP_REG)
> > + VMOVA %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%LOOP_REG)
> > + subq $-(VEC_SIZE * 4), %LOOP_REG
> > + cmpq %END_REG, %LOOP_REG
> > + jb L(loop)
> > + .p2align 4,, MOV_SIZE
> > +L(last_4x_vec):
> > + VMOVU %VEC(0), LOOP_4X_OFFSET(%END_REG)
> > + VMOVU %VEC(0), (VEC_SIZE + LOOP_4X_OFFSET)(%END_REG)
> > + VMOVU %VEC(0), (VEC_SIZE * 2 + LOOP_4X_OFFSET)(%END_REG)
> > + VMOVU %VEC(0), (VEC_SIZE * 3 + LOOP_4X_OFFSET)(%END_REG)
> > +L(return):
> > +#if VEC_SIZE > 16
> > + ZERO_UPPER_VEC_REGISTERS_RETURN
> > +#else
> > + ret
> > +#endif
> > +
> > + .p2align 4,, 10
> > +#ifndef USE_LESS_VEC_MASK_STORE
> > +# if defined USE_MULTIARCH && IS_IN (libc)
> > + /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > + range for 2-byte jump encoding. */
> > +L(stosb_local):
> > + movzbl %sil, %eax
> > + mov %RDX_LP, %RCX_LP
> > + mov %RDI_LP, %RDX_LP
> > + rep stosb
> > + mov %RDX_LP, %RAX_LP
> > + VZEROUPPER_RETURN
> > # endif
> > -# if VEC_SIZE > 16
> > - cmpb $16, %dl
> > + /* Define L(less_vec) only if not otherwise defined. */
> > + .p2align 4
> > +L(less_vec):
> > +#endif
> > +L(cross_page):
> > +#if VEC_SIZE > 32
> > + cmpl $32, %edx
> > + jae L(between_32_63)
> > +#endif
> > +#if VEC_SIZE > 16
> > + cmpl $16, %edx
> > jae L(between_16_31)
> > -# endif
> > - MOVQ %XMM0, %rcx
> > - cmpb $8, %dl
> > +#endif
> > + MOVQ %XMM0, %rdi
> > + cmpl $8, %edx
> > jae L(between_8_15)
> > - cmpb $4, %dl
> > + cmpl $4, %edx
> > jae L(between_4_7)
> > - cmpb $1, %dl
> > + cmpl $1, %edx
> > ja L(between_2_3)
> > - jb 1f
> > - movb %cl, (%rax)
> > -1:
> > + jb L(return)
> > + movb %sil, (%rax)
> > VZEROUPPER_RETURN
> > -# if VEC_SIZE > 32
> > +
> > + /* Align small targets only if not doing so would cross a fetch
> > + line. */
> > +#if VEC_SIZE > 32
> > + .p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)
> > /* From 32 to 63. No branch when size == 32. */
> > L(between_32_63):
> > - VMOVU %YMM0, -32(%rax,%rdx)
> > VMOVU %YMM0, (%rax)
> > + VMOVU %YMM0, -32(%rax, %rdx)
> > VZEROUPPER_RETURN
> > -# endif
> > -# if VEC_SIZE > 16
> > - /* From 16 to 31. No branch when size == 16. */
> > +#endif
> > +
> > +#if VEC_SIZE >= 32
> > + .p2align 4,, SMALL_MEMSET_ALIGN(MOV_SIZE, RET_SIZE)
> > L(between_16_31):
> > - VMOVU %XMM0, -16(%rax,%rdx)
> > + /* From 16 to 31. No branch when size == 16. */
> > VMOVU %XMM0, (%rax)
> > + VMOVU %XMM0, -16(%rax, %rdx)
> > VZEROUPPER_RETURN
> > -# endif
> > - /* From 8 to 15. No branch when size == 8. */
> > +#endif
> > +
> > + .p2align 4,, SMALL_MEMSET_ALIGN(3, RET_SIZE)
> > L(between_8_15):
> > - movq %rcx, -8(%rax,%rdx)
> > - movq %rcx, (%rax)
> > + /* From 8 to 15. No branch when size == 8. */
> > + movq %rdi, (%rax)
> > + movq %rdi, -8(%rax, %rdx)
> > VZEROUPPER_RETURN
> > +
> > + .p2align 4,, SMALL_MEMSET_ALIGN(2, RET_SIZE)
> > L(between_4_7):
> > /* From 4 to 7. No branch when size == 4. */
> > - movl %ecx, -4(%rax,%rdx)
> > - movl %ecx, (%rax)
> > + movl %edi, (%rax)
> > + movl %edi, -4(%rax, %rdx)
> > VZEROUPPER_RETURN
> > +
> > + .p2align 4,, SMALL_MEMSET_ALIGN(3, RET_SIZE)
> > L(between_2_3):
> > /* From 2 to 3. No branch when size == 2. */
> > - movw %cx, -2(%rax,%rdx)
> > - movw %cx, (%rax)
> > + movw %di, (%rax)
> > + movb %dil, -1(%rax, %rdx)
> > VZEROUPPER_RETURN
> > END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
Thanks! Pushed.
Thread overview: 14+ messages
2021-09-26 20:53 [PATCH v1 1/3] benchtests: Add medium cases and increase iters in bench-memset.c Noah Goldstein
2021-09-26 20:53 ` [PATCH v1 2/3] x86: Modify ENTRY in sysdep.h so that p2align can be specified Noah Goldstein
2021-09-27 16:07 ` Noah Goldstein
2021-10-08 14:27 ` H.J. Lu
2021-10-08 17:20 ` Noah Goldstein
2022-04-23 1:13 ` Sunil Pandey
2021-09-26 20:53 ` [PATCH v1 3/3] x86: Optimize memset-vec-unaligned-erms.S Noah Goldstein
2021-09-26 20:55 ` Noah Goldstein
2021-10-12 15:23 ` H.J. Lu
2021-10-12 19:01 ` Noah Goldstein [this message]
2022-04-23 1:19 ` Sunil Pandey
2021-10-05 16:33 ` [PATCH v1 1/3] benchtests: Add medium cases and increase iters in bench-memset.c Noah Goldstein
2021-10-08 19:39 ` H.J. Lu
2021-10-08 21:04 ` Noah Goldstein