From: "H.J. Lu"
Date: Wed, 29 May 2024 15:52:53 -0700
Subject: Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
In-Reply-To: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
To: Noah Goldstein
Cc: GNU C Library, "Carlos O'Donell"
Lu" Date: Wed, 29 May 2024 15:52:53 -0700 Message-ID: Subject: Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] To: Noah Goldstein Cc: GNU C Library , "Carlos O'Donell" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3018.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, May 24, 2024 at 10:38=E2=80=AFAM Noah Goldstein wrote: > > Previously we use `rep stosb` for all medium/large memsets. This is > notably worse than non-temporal stores for large (above a > few MBs) memsets. > See: > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2u= MjjQGLbLcU/edit?usp=3Dsharing > For data using different stategies for large memset on ICX and SKX. > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > on SKX. Historically, these numbers would not have been so good > because of the zero-over-zero writeback optimization that `rep stosb` > is able to do. But, the zero-over-zero writeback optimization has been > removed as a potential side-channel attack, so there is no longer any > good reason to only rely on `rep stosb` for large memsets. On the flip > size, non-temporal writes can avoid data in their RFO requests saving > memory bandwidth. > > All of the other changes to the file are to re-organize the > code-blocks to maintain "good" alignment given the new code added in > the `L(stosb_local)` case. > > The results from running the GLIBC memset benchmarks on TGL-client for > N=3D20 runs: > > Geometric Mean across the suite New / Old EXEX256: 0.979 > Geometric Mean across the suite New / Old EXEX512: 0.979 > Geometric Mean across the suite New / Old AVX2 : 0.986 > Geometric Mean across the suite New / Old SSE2 : 0.979 > > Most of the cases are essentially unchanged, this is mostly to show > that adding the non-temporal case didn't add any regressions to the > other cases. > > The results on the memset-large benchmark suite on TGL-client for N=3D20 > runs: > > Geometric Mean across the suite New / Old EXEX256: 0.926 > Geometric Mean across the suite New / Old EXEX512: 0.925 > Geometric Mean across the suite New / Old AVX2 : 0.928 > Geometric Mean across the suite New / Old SSE2 : 0.924 > > So roughly a 7.5% speedup. This is lower than what we see on servers > (likely because clients typically have faster single-core bandwidth so > saving bandwidth on RFOs is less impactful), but still advantageous. > > Full test-suite passes on x86_64 w/ and w/o multiarch. > --- > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > 1 file changed, 91 insertions(+), 58 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysde= ps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 97839a2248..637caadb40 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -21,10 +21,13 @@ > 2. If size is less than VEC, use integer register stores. > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > - 5. On machines ERMS feature, if size is greater or equal than > - __x86_rep_stosb_threshold then REP STOSB will be used. 
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +      then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +      non-temporal stores.  */
>
>  #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>         VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
>         VMOVU   %VMM(0), (%rdi)
>         VZEROUPPER_RETURN
> +
> +       /* If have AVX512 mask instructions put L(less_vec) close to
> +          entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +       /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +       .p2align 6,, 47
> +       .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +       /* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +       /* Clear high bits from edi. Only keeping bits relevant to page
> +          cross check. Note that we are using rax which is set in
> +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +       andl    $(PAGE_SIZE - 1), %edi
> +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> +          serious performance degradation when it has to fault suppress.  */
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> +       /* This is generally considered a cold target.  */
> +       ja      L(cross_page)
> +# if VEC_SIZE > 32
> +       movq    $-1, %rcx
> +       bzhiq   %rdx, %rcx, %rcx
> +       kmovq   %rcx, %k1
> +# else
> +       movl    $-1, %ecx
> +       bzhil   %edx, %ecx, %ecx
> +       kmovd   %ecx, %k1
> +# endif
> +       vmovdqu8 %VMM(0), (%rax){%k1}
> +       VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>         VZEROUPPER_RETURN
>
> -       /* If have AVX512 mask instructions put L(less_vec) close to
> -          entry as it doesn't take much space and is likely a hot target.
> -        */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -       .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -       /* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -       /* Clear high bits from edi. Only keeping bits relevant to page
> -          cross check. Note that we are using rax which is set in
> -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -       andl    $(PAGE_SIZE - 1), %edi
> -       /* Check if VEC_SIZE store cross page. Mask stores suffer
> -          serious performance degradation when it has to fault suppress.
> -        */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> -       /* This is generally considered a cold target.  */
> -       ja      L(cross_page)
> -# if VEC_SIZE > 32
> -       movq    $-1, %rcx
> -       bzhiq   %rdx, %rcx, %rcx
> -       kmovq   %rcx, %k1
> -# else
> -       movl    $-1, %ecx
> -       bzhil   %edx, %ecx, %ecx
> -       kmovd   %ecx, %k1
> -# endif
> -       vmovdqu8 %VMM(0), (%rax){%k1}
> -       VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -       /* Include L(stosb_local) here if including L(less_vec) between
> -          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -          L(stosb_more_2x_vec) target.  */
> -       .p2align 4,, 10
> -L(stosb_local):
> -       movzbl  %sil, %eax
> -       mov     %RDX_LP, %RCX_LP
> -       mov     %RDI_LP, %RDX_LP
> -       rep     stosb
> -       mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>         .p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>         ret
>  #endif
>
> -       .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +       .p2align 4
> +#else
> +       .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
>         mov     %RDI_LP, %RDX_LP
>         rep     stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +       /* Use xchg to save 1-byte (this helps align targets below).  */
> +       xchg    %RDX_LP, %RAX_LP
> +# else
>         mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
>  # endif
> +       VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>         /* Define L(less_vec) only if not otherwise defined.  */
> -       .p2align 4
> +       .p2align 4,, 12
>  L(less_vec):
>         /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>            xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>         movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>         ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +       /* Force align so the loop doesn't cross a cache-line.  */
> +       .p2align 4
> +# endif
> +       .p2align 4,, 7
> +       /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +       /* Align DST.  */
> +       orq     $(VEC_SIZE * 1 - 1), %rdi
> +       incq    %rdi
> +       .p2align 4,, 7
> +L(nt_loop):
> +       VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> +       subq    $(VEC_SIZE * -4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(nt_loop)
> +       sfence
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> +       VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
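For reference, the bzhi/kmov/vmovdqu8 sequence in the relocated
L(less_vec) block above is the entire sub-VEC_SIZE store. A rough C
equivalent is below; the helper name is hypothetical, it requires
AVX512BW plus BMI2, and it omits the page-cross check that the
assembly performs first:

    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative only: memset of n < 64 bytes in one masked store.
       Compile with -mavx512bw -mbmi2.  */
    static void small_memset_avx512(void *dst, int c, size_t n)
    {
        /* bzhi(~0, n) keeps only the low n bits set: one mask bit
           per byte that should be written.  */
        __mmask64 k = (__mmask64)_bzhi_u64(~0ULL, (unsigned int)n);
        _mm512_mask_storeu_epi8(dst, k, _mm512_set1_epi8((char)c));
    }

The page-cross check matters because a masked store whose masked-off
bytes extend into an unmapped page does not fault, but the fault
suppression it triggers is very slow; hence the branch to
L(cross_page) instead.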
LGTM.

Reviewed-by: H.J. Lu

Thanks.

--
H.J.