From mboxrd@z Thu Jan  1 00:00:00 1970
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon, 20 May 2024 17:50:14 -0500
Subject: Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
To: Adhemerval Zanella Netto
Cc: libc-alpha@sourceware.org, hjl.tools@gmail.com
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com> <18e510fd-6d49-40fe-91f0-aa59605fe3c0@linaro.org>

On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto wrote:
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we used `rep stosb` for all medium/large memsets.  This is
> > > notably worse than non-temporal stores for large (above a few MBs)
> > > memsets.  See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > for data using different strategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
>
> Ah sorry, I put the benchmark data in a repo.  See the data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
>
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX.  Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do.  But the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to rely only on `rep stosb` for large memsets.  On the flip
> > > side, non-temporal writes avoid pulling the destination data in with
> > > their RFO requests, saving memory bandwidth.
> >
> > Any chance on how this plays in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.

What we could do is use a separate tunable for the memset NT threshold and
just make it SIZE_MAX for AMD.
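Something along these lines at init time (a hypothetical C sketch; the names
here are made up for illustration and are not glibc's actual tunable
plumbing):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-memset NT threshold.  On AMD, default it to
   SIZE_MAX so the non-temporal path is never taken until someone
   benchmarks it there.  */
static size_t memset_non_temporal_threshold;

static void
init_memset_nt_threshold (bool is_amd, size_t shared_nt_threshold)
{
  memset_non_temporal_threshold
    = is_amd ? SIZE_MAX : shared_nt_threshold;
}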
> > > All of the other changes to the file are to re-organize the
> > > code-blocks to maintain "good" alignment given the new code added in
> > > the `L(stosb_local)` case.
> > >
> > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > N=20 runs:
> > >
> > > Geometric Mean across the suite New / Old EVEX256: 0.979
> > > Geometric Mean across the suite New / Old EVEX512: 0.979
> > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > >
> > > Most of the cases are essentially unchanged; this is mostly to show
> > > that adding the non-temporal case didn't add any regressions to the
> > > other cases.
> > >
> > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > runs:
> > >
> > > Geometric Mean across the suite New / Old EVEX256: 0.926
> > > Geometric Mean across the suite New / Old EVEX512: 0.925
> > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > >
> > > So roughly a 7.5% speedup.  This is lower than what we see on servers
> > > (likely because clients typically have faster single-core bandwidth, so
> > > saving bandwidth on RFOs is less impactful), but still advantageous.
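(For anyone reproducing the aggregation: the "Geometric Mean New / Old"
figures are the n-th root of the product of the per-benchmark time ratios,
i.e. exp of the mean log ratio.  A minimal C sketch of that computation,
independent of the actual benchmark harness:)

#include <math.h>
#include <stddef.h>

/* Geometric mean of per-benchmark New/Old time ratios, computed as
   exp(mean(log(ratio))) for numerical stability.  */
double
geomean_new_over_old (const double *new_ns, const double *old_ns, size_t n)
{
  double sum_log = 0.0;
  for (size_t i = 0; i < n; i++)
    sum_log += log (new_ns[i] / old_ns[i]);
  return exp (sum_log / (double) n);
}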
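(And for readers skimming the thread, the core technique in C intrinsics
terms.  This is a minimal sketch assuming SSE2, not the tuned assembly in
the patch below, which handles alignment, tails, and vector-size dispatch
far more carefully:)

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

void *
memset_nt_sketch (void *dst, int c, size_t len)
{
  unsigned char *p = dst;
  __m128i v = _mm_set1_epi8 ((char) c);

  /* Head: plain byte stores up to 16-byte alignment.  */
  while (((uintptr_t) p & 15) && len)
    {
      *p++ = (unsigned char) c;
      len--;
    }

  /* Body: streaming stores bypass the cache, so no destination data
     is pulled in by RFO requests.  */
  for (; len >= 16; p += 16, len -= 16)
    _mm_stream_si128 ((__m128i *) p, v);

  /* Order the NT stores before any subsequent loads/stores.  */
  _mm_sfence ();

  /* Tail.  */
  memset (p, c, len);
  return dst;
}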
> > >
> > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > ---
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 97839a2248..637caadb40 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -21,10 +21,13 @@
> > >     2. If size is less than VEC, use integer register stores.
> > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > -   5. On machines ERMS feature, if size is greater or equal than
> > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > +   6. On machines ERMS feature, if size is range
> > > +      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > +      then REP STOSB will be used.
> > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > +      non-temporal stores.  */
> > >
> > >  #include <sysdep.h>
> > >
> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > >  	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
> > >  	VMOVU	%VMM(0), (%rdi)
> > >  	VZEROUPPER_RETURN
> > > +
> > > +	/* If have AVX512 mask instructions put L(less_vec) close to
> > > +	   entry as it doesn't take much space and is likely a hot target.  */
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > +	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > +	.p2align 6,, 47
> > > +	.p2align 4
> > > +L(less_vec):
> > > +L(less_vec_from_wmemset):
> > > +	/* Less than 1 VEC.  */
> > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > +#  error Unsupported VEC_SIZE!
> > > +# endif
> > > +	/* Clear high bits from edi. Only keeping bits relevant to page
> > > +	   cross check. Note that we are using rax which is set in
> > > +	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > +	andl	$(PAGE_SIZE - 1), %edi
> > > +	/* Check if VEC_SIZE store cross page. Mask stores suffer
> > > +	   serious performance degradation when it has to fault suppress.  */
> > > +	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> > > +	/* This is generally considered a cold target.  */
> > > +	ja	L(cross_page)
> > > +# if VEC_SIZE > 32
> > > +	movq	$-1, %rcx
> > > +	bzhiq	%rdx, %rcx, %rcx
> > > +	kmovq	%rcx, %k1
> > > +# else
> > > +	movl	$-1, %ecx
> > > +	bzhil	%edx, %ecx, %ecx
> > > +	kmovd	%ecx, %k1
> > > +# endif
> > > +	vmovdqu8 %VMM(0), (%rax){%k1}
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >  END (MEMSET_SYMBOL (__memset, unaligned))
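(Aside for reviewers new to the mask-store trick above: bzhi builds a mask
with the low `len` bits set, and vmovdqu8 then stores only those bytes.  A
rough intrinsics equivalent, assuming AVX512BW + BMI2 and len <= 64; note
the real code above additionally refuses masked stores that would cross a
page, since fault suppression is slow:)

#include <immintrin.h>
#include <stddef.h>

/* Store exactly `len' copies of `c' at dst using one masked store.  */
void
memset_tail_avx512 (void *dst, int c, size_t len)
{
  __m512i v = _mm512_set1_epi8 ((char) c);
  /* bzhi (~0, len) keeps the low `len' bits; for len == 64 all bits
     survive, matching a full-width store.  */
  __mmask64 k = _cvtu64_mask64 (_bzhi_u64 (~0ULL, len));
  _mm512_mask_storeu_epi8 (dst, k, v);
}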
> > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > >  #endif
> > >  	VZEROUPPER_RETURN
> > >
> > > -	/* If have AVX512 mask instructions put L(less_vec) close to
> > > -	   entry as it doesn't take much space and is likely a hot target.
> > > -	 */
> > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > -	.p2align 4,, 10
> > > -L(less_vec):
> > > -L(less_vec_from_wmemset):
> > > -	/* Less than 1 VEC.  */
> > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > -#  error Unsupported VEC_SIZE!
> > > -# endif
> > > -	/* Clear high bits from edi. Only keeping bits relevant to page
> > > -	   cross check. Note that we are using rax which is set in
> > > -	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > -	andl	$(PAGE_SIZE - 1), %edi
> > > -	/* Check if VEC_SIZE store cross page. Mask stores suffer
> > > -	   serious performance degradation when it has to fault suppress.
> > > -	 */
> > > -	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> > > -	/* This is generally considered a cold target.  */
> > > -	ja	L(cross_page)
> > > -# if VEC_SIZE > 32
> > > -	movq	$-1, %rcx
> > > -	bzhiq	%rdx, %rcx, %rcx
> > > -	kmovq	%rcx, %k1
> > > -# else
> > > -	movl	$-1, %ecx
> > > -	bzhil	%edx, %ecx, %ecx
> > > -	kmovd	%ecx, %k1
> > > -# endif
> > > -	vmovdqu8 %VMM(0), (%rax){%k1}
> > > -	VZEROUPPER_RETURN
> > > -
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > -	/* Include L(stosb_local) here if including L(less_vec) between
> > > -	   L(stosb_more_2x_vec) and ENTRY.  This is to cache align the
> > > -	   L(stosb_more_2x_vec) target.  */
> > > -	.p2align 4,, 10
> > > -L(stosb_local):
> > > -	movzbl	%sil, %eax
> > > -	mov	%RDX_LP, %RCX_LP
> > > -	mov	%RDI_LP, %RDX_LP
> > > -	rep	stosb
> > > -	mov	%RDX_LP, %RAX_LP
> > > -	VZEROUPPER_RETURN
> > > -# endif
> > > -#endif
> > > -
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >  	.p2align 4
> > >  L(stosb_more_2x_vec):
> > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > >  	ret
> > >  #endif
> > >
> > > -	.p2align 4,, 10
> > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > +#ifdef USE_WITH_AVX2
> > > +	.p2align 4
> > > +#else
> > > +	.p2align 4,, 4
> > > +#endif
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > >  	/* If no USE_LESS_VEC_MASK put L(stosb_local) here.  Will be in
> > >  	   range for 2-byte jump encoding.  */
> > >  L(stosb_local):
> > > +	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > +	jae	L(nt_memset)
> > >  	movzbl	%sil, %eax
> > >  	mov	%RDX_LP, %RCX_LP
> > >  	mov	%RDI_LP, %RDX_LP
> > >  	rep	stosb
> > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > +	/* Use xchg to save 1-byte (this helps align targets below).  */
> > > +	xchg	%RDX_LP, %RAX_LP
> > > +# else
> > >  	mov	%RDX_LP, %RAX_LP
> > > -	VZEROUPPER_RETURN
> > >  # endif
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +#ifndef USE_LESS_VEC_MASK_STORE
> > >  	/* Define L(less_vec) only if not otherwise defined.  */
> > > -	.p2align 4
> > > +	.p2align 4,, 12
> > >  L(less_vec):
> > >  	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > >  	   xmm). This is only does anything for AVX2.  */
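(The new dispatch in L(stosb_local) above, restated as a C sketch;
`nt_threshold' and `memset_nt' are illustrative stand-ins, not glibc
internals:)

#include <stddef.h>

extern size_t nt_threshold;          /* __x86_shared_non_temporal_threshold */
extern void *memset_nt (void *, int, size_t);

void *
memset_large (void *dst, int c, size_t len)
{
  /* Sizes at or above the NT threshold skip `rep stosb` entirely,
     mirroring the cmp/jae pair added above.  */
  if (len >= nt_threshold)
    return memset_nt (dst, c, len);

  void *p = dst;
  size_t n = len;
  /* rep stosb: AL = fill byte, RDI = destination, RCX = count.  */
  __asm__ volatile ("rep stosb"
                    : "+D" (p), "+c" (n)
                    : "a" (c)
                    : "memory");
  return dst;
}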
> > > @@ -423,4 +425,35 @@ L(between_2_3):
> > >  	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > >  #endif
> > >  	ret
> > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > +# ifdef USE_WITH_AVX512
> > > +	/* Force align so the loop doesn't cross a cache-line.  */
> > > +	.p2align 4
> > > +# endif
> > > +	.p2align 4,, 7
> > > +	/* Memset using non-temporal stores.  */
> > > +L(nt_memset):
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > +	/* Align DST.  */
> > > +	orq	$(VEC_SIZE * 1 - 1), %rdi
> > > +	incq	%rdi
> > > +	.p2align 4,, 7
> > > +L(nt_loop):
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
> > > +	subq	$(VEC_SIZE * -4), %rdi
> > > +	cmpq	%rdx, %rdi
> > > +	jb	L(nt_loop)
> > > +	sfence
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +
> > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
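One small note on the `orq`/`incq` pair in L(nt_memset): it rounds DST up
to the next VEC_SIZE boundary strictly above the current pointer, which is
safe because the first VEC was already stored unaligned at the original
DST.  Roughly, in C (a sketch):

#include <stdint.h>

/* Round p up to the next multiple of `vec' strictly greater than p,
   mirroring `orq $(VEC_SIZE * 1 - 1), %rdi; incq %rdi'.  Safe here
   because the unaligned head store already covered [p, p + vec).  */
static inline uintptr_t
align_up_next (uintptr_t p, uintptr_t vec)
{
  return (p | (vec - 1)) + 1;
}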