From: "H.J. Lu"
Date: Thu, 1 Apr 2021 18:53:57 -0700
Subject: Re: [PATCH v4 1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
To: noah
Cc: GNU C Library, "Carlos O'Donell"
In-Reply-To: <20210330213807.492074-1-goldstein.w.n@gmail.com>
List-Id: Libc-alpha mailing list

On Tue, Mar 30, 2021 at 2:38 PM noah wrote:
>
> No Bug. This commit updates the large memcpy case (no overlap). The
> update is to perform memcpy on either 2 or 4 contiguous pages at
> once. This 1) helps to alleviate the effects of false memory aliasing
> when destination and source have a close 4k alignment and 2) in most
> cases and for most DRAM units is a modestly more efficient access
> pattern. These changes are a clear performance improvement for
> VEC_SIZE = 16/32, though more ambiguous for VEC_SIZE = 64. test-memcpy,
> test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
> pass.
>
> Signed-off-by: noah
> ---
>  .../multiarch/memmove-vec-unaligned-erms.S | 326 ++++++++++++++----
>  1 file changed, 258 insertions(+), 68 deletions(-)

1. There are many trailing whitespaces.

2. Replace "jccSPACElabel" with "jccTABlabel".

3. Replace "insnSPACEoperand" with "insnTABoperand" if needed.
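A rough C model of the size/aliasing dispatch described above may help when
reading the hunks below. This is only an illustration, not glibc code: the
constants, the helper name use_large_memcpy_4x, and the plain modulo
arithmetic are stand-ins for what the assembly actually does with %rcx,
LOG_4X_MEMCPY_THRESH and __x86_shared_non_temporal_threshold.

#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the real values come from the build (VEC_SIZE)
   and from the tunable __x86_shared_non_temporal_threshold.  */
#define PAGE_SIZE 4096
#define VEC_SIZE 32
#define LOG_4X_MEMCPY_THRESH 4

/* Return nonzero when a copy of SIZE bytes from SRC to DST should take
   the 4-page path: either the two buffers land within 8 * VEC_SIZE of
   the same page offset ("page aliasing"), or the copy is at least
   16 * the non-temporal threshold.  Otherwise the 2-page path is used.  */
static int
use_large_memcpy_4x (const void *dst, const void *src, size_t size,
                     size_t non_temporal_threshold)
{
  size_t dst_off = (uintptr_t) dst % PAGE_SIZE;
  size_t src_off = (uintptr_t) src % PAGE_SIZE;
  int page_alias = (dst_off - src_off) % PAGE_SIZE < 8 * VEC_SIZE;
  int very_large = (size >> LOG_4X_MEMCPY_THRESH) >= non_temporal_threshold;
  return page_alias || very_large;
}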
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index 897a3d9762..dae3e2bac5 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -35,7 +35,16 @@
>        __x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
>     7. If size >= __x86_shared_non_temporal_threshold and there is no
>        overlap between destination and source, use non-temporal store
> -      instead of aligned store. */
> +      instead of aligned store copying from either 2 or 4 pages at
> +      once.
> +   8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
> +      and source and destination do not page alias, copy from 2 pages
> +      at once using non-temporal stores. Page aliasing in this case is
> +      considered true if destination's page alignment - sources' page
> +      alignment is less than 8 * VEC_SIZE.
> +   9. If size >= 16 * __x86_shared_non_temporal_threshold or source
> +      and destination do page alias copy from 4 pages at once using
> +      non-temporal stores. */
>
>  #include
>
> @@ -67,6 +76,35 @@
>  # endif
>  #endif
>
> +#ifndef PAGE_SIZE
> +# define PAGE_SIZE 4096
> +#endif
>
> +#if PAGE_SIZE != 4096
> +# error Unsupported PAGE_SIZE
> +#endif
>
> +#ifndef LOG_PAGE_SIZE
> +# define LOG_PAGE_SIZE 12
> +#endif
>
> +#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
> +# error Invalid LOG_PAGE_SIZE
> +#endif
>
> +/* Byte per page for large_memcpy inner loop. */
> +#if VEC_SIZE == 64
> +# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
> +#else
> +# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> +#endif
> +
> +
> +/* Amount to shift rdx by to compare for memcpy_large_4x. */
> +#ifndef LOG_4X_MEMCPY_THRESH
> +# define LOG_4X_MEMCPY_THRESH 4
> +#endif
> +
>  /* Avoid short distance rep movsb only with non-SSE vector. */
>  #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
>  # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
> @@ -106,6 +144,28 @@
>  # error Unsupported PREFETCH_SIZE!
>  #endif
>
> +#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
> +# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
> +	VMOVU (offset)base, vec0; \
> +	VMOVU ((offset) + VEC_SIZE)base, vec1;
> +# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
> +	VMOVNT vec0, (offset)base; \
> +	VMOVNT vec1, ((offset) + VEC_SIZE)base;
> +#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
> +# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
> +	VMOVU (offset)base, vec0; \
> +	VMOVU ((offset) + VEC_SIZE)base, vec1; \
> +	VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \
> +	VMOVU ((offset) + VEC_SIZE * 3)base, vec3;
> +# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
> +	VMOVNT vec0, (offset)base; \
> +	VMOVNT vec1, ((offset) + VEC_SIZE)base; \
> +	VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \
> +	VMOVNT vec3, ((offset) + VEC_SIZE * 3)base;
> +#else
> +# error Invalid LARGE_LOAD_SIZE
> +#endif
> +
>  #ifndef SECTION
>  # error SECTION is not defined!
>  #endif
> @@ -393,6 +453,15 @@ L(last_4x_vec):
>  	VZEROUPPER_RETURN
>
>  L(more_8x_vec):
> +	/* Check if non-temporal move candidate. */
> +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> +	/* Check non-temporal store threshold. */
> +	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +	ja L(large_memcpy_2x)
> +#endif
> +	/* Entry if rdx is greater than non-temporal threshold but there
> +	   is overlap. */
> +L(more_8x_vec_check):
>  	cmpq %rsi, %rdi
>  	ja L(more_8x_vec_backward)
>  	/* Source == destination is less common. */
> @@ -419,11 +488,6 @@ L(more_8x_vec):
>  	subq %r8, %rdi
>  	/* Adjust length. */
>  	addq %r8, %rdx
> -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> -	/* Check non-temporal store threshold. */
> -	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> -	ja L(large_forward)
> -#endif
>  L(loop_4x_vec_forward):
>  	/* Copy 4 * VEC a time forward. */
>  	VMOVU (%rsi), %VEC(0)
> @@ -470,11 +534,6 @@ L(more_8x_vec_backward):
>  	subq %r8, %r9
>  	/* Adjust length. */
>  	subq %r8, %rdx
> -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> -	/* Check non-temporal store threshold. */
> -	cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> -	ja L(large_backward)
> -#endif
>  L(loop_4x_vec_backward):
>  	/* Copy 4 * VEC a time backward. */
>  	VMOVU (%rcx), %VEC(0)
> @@ -500,72 +559,203 @@ L(loop_4x_vec_backward):
>  	VZEROUPPER_RETURN
>
>  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> -L(large_forward):
> +L(large_memcpy_2x):
> +	/* Compute absolute value of difference between source and
> +	   destination. */
> +	movq %rdi, %r9
> +	subq %rsi, %r9
> +	movq %r9, %r8
> +	leaq -1(%r9), %rcx
> +	sarq $63, %r8
> +	xorq %r8, %r9
> +	subq %r8, %r9
>  	/* Don't use non-temporal store if there is overlap between
> -	   destination and source since destination may be in cache
> -	   when source is loaded. */
> -	leaq (%rdi, %rdx), %r10
> -	cmpq %r10, %rsi
> -	jb L(loop_4x_vec_forward)
> -L(loop_large_forward):
> -	/* Copy 4 * VEC a time forward with non-temporal stores. */
> -	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
> -	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
> +	   destination and source since destination may be in cache when
> +	   source is loaded. */
> +	cmpq %r9, %rdx
> +	ja L(more_8x_vec_check)
> +
> +	/* Cache align destination. First store the first 64 bytes then
> +	   adjust alignments. */
> +	VMOVU (%rsi), %VEC(8)
> +#if VEC_SIZE < 64
> +	VMOVU VEC_SIZE(%rsi), %VEC(9)
> +#if VEC_SIZE < 32
> +	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(10)
> +	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(11)
> +#endif
> +#endif
> +	VMOVU %VEC(8), (%rdi)
> +#if VEC_SIZE < 64
> +	VMOVU %VEC(9), VEC_SIZE(%rdi)
> +#if VEC_SIZE < 32
> +	VMOVU %VEC(10), (VEC_SIZE * 2)(%rdi)
> +	VMOVU %VEC(11), (VEC_SIZE * 3)(%rdi)
> +#endif
> +#endif
> +	/* Adjust source, destination, and size. */
> +	MOVQ %rdi, %r8

movq

> +	andq $63, %r8
> +	/* Get the negative of offset for alignment. */
> +	subq $64, %r8
> +	/* Adjust source. */
> +	subq %r8, %rsi
> +	/* Adjust destination which should be aligned now. */
> +	subq %r8, %rdi
> +	/* Adjust length. */
> +	addq %r8, %rdx
> +
> +	/* Test if source and destination addresses will alias. If they do
> +	   the larger pipeline in large_memcpy_4x alleviated the
> +	   performance drop. */
> +	testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> +	jz L(large_memcpy_4x)
> +
> +	movq %rdx, %r10
> +	shrq $LOG_4X_MEMCPY_THRESH, %r10
> +	cmp __x86_shared_non_temporal_threshold(%rip), %r10
> +	jae L(large_memcpy_4x)
> +
> +	/* edx will store remainder size for copying tail. */
> +	andl $(PAGE_SIZE * 2 - 1), %edx
> +	/* r10 stores outer loop counter. */
> +	shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> +	/* Copy 4x VEC at a time from 2 pages. */
> +	.p2align 4

If you drop .p2align, will it show up on the glibc benchtest?

> +L(loop_large_memcpy_2x_outer):
> +	/* ecx stores inner loop counter. */
> +	movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
> +L(loop_large_memcpy_2x_inner):
> +	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
> +	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
> +	/* Load vectors from rsi. */
> +	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> +	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> +	addq $LARGE_LOAD_SIZE, %rsi
> +	/* Non-temporal store vectors to rdi. */
> +	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> +	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> +	addq $LARGE_LOAD_SIZE, %rdi
> +	decl %ecx
> +	jnz L(loop_large_memcpy_2x_inner)
> +	addq $PAGE_SIZE, %rdi
> +	addq $PAGE_SIZE, %rsi
> +	decq %r10
> +	jne L(loop_large_memcpy_2x_outer)
> +	sfence
> +
> +	/* Check if only last 4 loads are needed. */
> +	cmpl $(VEC_SIZE * 4), %edx
> +	jbe L(large_memcpy_2x_end)
> +
> +	/* Handle the last 2 * PAGE_SIZE bytes. Use temporal stores
> +	   here. The region will fit in cache and it should fit user
> +	   expectations for the tail of the memcpy region to be hot. */
> +L(loop_large_memcpy_2x_tail):
> +	/* Copy 4 * VEC a time forward with temporal stores. */
> +	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
>  	VMOVU (%rsi), %VEC(0)
>  	VMOVU VEC_SIZE(%rsi), %VEC(1)
>  	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
>  	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
> -	addq $PREFETCHED_LOAD_SIZE, %rsi
> -	subq $PREFETCHED_LOAD_SIZE, %rdx
> -	VMOVNT %VEC(0), (%rdi)
> -	VMOVNT %VEC(1), VEC_SIZE(%rdi)
> -	VMOVNT %VEC(2), (VEC_SIZE * 2)(%rdi)
> -	VMOVNT %VEC(3), (VEC_SIZE * 3)(%rdi)
> -	addq $PREFETCHED_LOAD_SIZE, %rdi
> -	cmpq $PREFETCHED_LOAD_SIZE, %rdx
> -	ja L(loop_large_forward)
> -	sfence
> +	addq $(VEC_SIZE * 4), %rsi
> +	subl $(VEC_SIZE * 4), %edx
> +	VMOVA %VEC(0), (%rdi)
> +	VMOVA %VEC(1), VEC_SIZE(%rdi)
> +	VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi)
> +	VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi)
> +	addq $(VEC_SIZE * 4), %rdi
> +	cmpl $(VEC_SIZE * 4), %edx
> +	ja L(loop_large_memcpy_2x_tail)
> +
> +L(large_memcpy_2x_end):
>  	/* Store the last 4 * VEC. */
> -	VMOVU %VEC(5), (%rcx)
> -	VMOVU %VEC(6), -VEC_SIZE(%rcx)
> -	VMOVU %VEC(7), -(VEC_SIZE * 2)(%rcx)
> -	VMOVU %VEC(8), -(VEC_SIZE * 3)(%rcx)
> -	/* Store the first VEC. */
> -	VMOVU %VEC(4), (%r11)
> +	VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
> +	VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
> +	VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
> +	VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3)
> +
> +	VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
> +	VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
> +	VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
> +	VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx)
>  	VZEROUPPER_RETURN
>
> -L(large_backward):
> -	/* Don't use non-temporal store if there is overlap between
> -	   destination and source since destination may be in cache
> -	   when source is loaded. */
> -	leaq (%rcx, %rdx), %r10
> -	cmpq %r10, %r9
> -	jb L(loop_4x_vec_backward)
> -L(loop_large_backward):
> -	/* Copy 4 * VEC a time backward with non-temporal stores. */
> -	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
> -	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
> -	VMOVU (%rcx), %VEC(0)
> -	VMOVU -VEC_SIZE(%rcx), %VEC(1)
> -	VMOVU -(VEC_SIZE * 2)(%rcx), %VEC(2)
> -	VMOVU -(VEC_SIZE * 3)(%rcx), %VEC(3)
> -	subq $PREFETCHED_LOAD_SIZE, %rcx
> -	subq $PREFETCHED_LOAD_SIZE, %rdx
> -	VMOVNT %VEC(0), (%r9)
> -	VMOVNT %VEC(1), -VEC_SIZE(%r9)
> -	VMOVNT %VEC(2), -(VEC_SIZE * 2)(%r9)
> -	VMOVNT %VEC(3), -(VEC_SIZE * 3)(%r9)
> -	subq $PREFETCHED_LOAD_SIZE, %r9
> -	cmpq $PREFETCHED_LOAD_SIZE, %rdx
> -	ja L(loop_large_backward)
> +L(large_memcpy_4x):
> +	movq %rdx, %r10
> +	/* edx will store remainder size for copying tail. */
> +	andl $(PAGE_SIZE * 4 - 1), %edx
> +	/* r10 stores outer loop counter. */
> +	shrq $(LOG_PAGE_SIZE + 2), %r10
> +	/* Copy 4x VEC at a time from 4 pages. */
> +	.p2align 4

If you drop .p2align, will it show up on the glibc benchtest?

> +L(loop_large_memcpy_4x_outer):
> +	/* ecx stores inner loop counter. */
> +	movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
> +L(loop_large_memcpy_4x_inner):
> +	/* Only one prefetch set per page as doing 4 pages give more time
> +	   for prefetcher to keep up. */
> +	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
> +	/* Load vectors from rsi. */
> +	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> +	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> +	LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
> +	LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
> +	addq $LARGE_LOAD_SIZE, %rsi
> +	/* Non-temporal store vectors to rdi. */
> +	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
> +	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
> +	STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
> +	STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
> +	addq $LARGE_LOAD_SIZE, %rdi
> +	decl %ecx
> +	jnz L(loop_large_memcpy_4x_inner)
> +	addq $(PAGE_SIZE * 3), %rdi
> +	addq $(PAGE_SIZE * 3), %rsi
> +	decq %r10
> +	jne L(loop_large_memcpy_4x_outer)
>  	sfence
> -	/* Store the first 4 * VEC. */
> -	VMOVU %VEC(4), (%rdi)
> -	VMOVU %VEC(5), VEC_SIZE(%rdi)
> -	VMOVU %VEC(6), (VEC_SIZE * 2)(%rdi)
> -	VMOVU %VEC(7), (VEC_SIZE * 3)(%rdi)
> -	/* Store the last VEC. */
> -	VMOVU %VEC(8), (%r11)
> +
> +	/* Check if only last 4 loads are needed. */
> +	cmpl $(VEC_SIZE * 4), %edx
> +	jbe L(large_memcpy_4x_end)
> +
> +	/* Handle the last 4 * PAGE_SIZE bytes. */
> +L(loop_large_memcpy_4x_tail):
> +	/* Copy 4 * VEC a time forward with temporal stores. */
> +	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
> +	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
> +	VMOVU (%rsi), %VEC(0)
> +	VMOVU VEC_SIZE(%rsi), %VEC(1)
> +	VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
> +	VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
> +	addq $(VEC_SIZE * 4), %rsi
> +	subl $(VEC_SIZE * 4), %edx
> +	VMOVA %VEC(0), (%rdi)
> +	VMOVA %VEC(1), VEC_SIZE(%rdi)
> +	VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi)
> +	VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi)
> +	addq $(VEC_SIZE * 4), %rdi
> +	cmpl $(VEC_SIZE * 4), %edx
> +	ja L(loop_large_memcpy_4x_tail)
> +
> +L(large_memcpy_4x_end):
> +	/* Store the last 4 * VEC. */
> +	VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
> +	VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
> +	VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
> +	VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3)
> +
> +	VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
> +	VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
> +	VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
> +	VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx)
>  	VZEROUPPER_RETURN
>  #endif
>  END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
> --
> 2.29.2
>

--
H.J.
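For readers who prefer C to assembly, the interleaved two-page streaming loop
in the patch can be pictured roughly as below. This is a simplified sketch of
the access pattern only, written with SSE2 intrinsics for the VEC_SIZE == 16
case; the function name copy_2x_pages and its interface are invented for the
example, and the real code additionally aligns the destination, prefetches
ahead, copies the remainder with temporal stores, and has a 4-page variant.

#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence.  */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Copy page_pairs * 2 * PAGE_SIZE bytes.  Each outer iteration handles two
   consecutive pages; each inner step moves 64 bytes from page 0 and 64 bytes
   from page 1 with non-temporal (streaming) stores, which is the interleaving
   done by LOAD_ONE_SET/STORE_ONE_SET above.  dst is assumed to be at least
   16-byte aligned here (the assembly aligns it to 64 bytes first).  */
static void
copy_2x_pages (char *dst, const char *src, size_t page_pairs)
{
  while (page_pairs--)
    {
      for (size_t off = 0; off < PAGE_SIZE; off += 64)
        for (size_t page = 0; page < 2 * PAGE_SIZE; page += PAGE_SIZE)
          {
            const char *s = src + page + off;
            char *d = dst + page + off;
            __m128i v0 = _mm_loadu_si128 ((const __m128i *) (s + 0));
            __m128i v1 = _mm_loadu_si128 ((const __m128i *) (s + 16));
            __m128i v2 = _mm_loadu_si128 ((const __m128i *) (s + 32));
            __m128i v3 = _mm_loadu_si128 ((const __m128i *) (s + 48));
            _mm_stream_si128 ((__m128i *) (d + 0), v0);
            _mm_stream_si128 ((__m128i *) (d + 16), v1);
            _mm_stream_si128 ((__m128i *) (d + 32), v2);
            _mm_stream_si128 ((__m128i *) (d + 48), v3);
          }
      dst += 2 * PAGE_SIZE;
      src += 2 * PAGE_SIZE;
    }
  _mm_sfence ();  /* Order the streaming stores before returning.  */
}

The 4-page variant applies the same idea to four page-sized streams per inner
step, which, as the commit message argues, behaves better when source and
destination would otherwise keep hitting the same 4k-relative offsets.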