From mboxrd@z Thu Jan  1 00:00:00 1970
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon, 20 May 2024 17:50:14 -0500
Subject: Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
To: Adhemerval Zanella Netto
Cc: libc-alpha@sourceware.org, hjl.tools@gmail.com
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com> <18e510fd-6d49-40fe-91f0-aa59605fe3c0@linaro.org>

On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto wrote:
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we used `rep stosb` for all medium/large memsets.  This is
> > > notably worse than non-temporal stores for large (above a few MBs)
> > > memsets.  See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > for data using different strategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
>
> Ah sorry, I put the benchmark data in a repo.  See the data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
>
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX.  Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do.  But the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to rely only on `rep stosb` for large memsets.  On the flip
> > > side, non-temporal writes avoid pulling the destination data in with
> > > their RFO requests, saving memory bandwidth.
> >
> > Any chance on how this plays in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.

What we could do is use a separate tunable for the memset NT threshold and
just make it SIZE_MAX for AMD.
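Something along these lines at init time (a hypothetical C sketch; the names
here are made up for illustration and are not glibc's actual tunable
plumbing):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-memset NT threshold.  On AMD, default it to
   SIZE_MAX so the non-temporal path is never taken until someone
   benchmarks it there.  */
static size_t memset_non_temporal_threshold;

static void
init_memset_nt_threshold (bool is_amd, size_t shared_nt_threshold)
{
  memset_non_temporal_threshold
    = is_amd ? SIZE_MAX : shared_nt_threshold;
}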
> > > All of the other changes to the file are to re-organize the
> > > code-blocks to maintain "good" alignment given the new code added in
> > > the `L(stosb_local)` case.
> > >
> > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > N=20 runs:
> > >
> > > Geometric Mean across the suite New / Old EVEX256: 0.979
> > > Geometric Mean across the suite New / Old EVEX512: 0.979
> > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > >
> > > Most of the cases are essentially unchanged; this is mostly to show
> > > that adding the non-temporal case didn't add any regressions to the
> > > other cases.
> > >
> > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > runs:
> > >
> > > Geometric Mean across the suite New / Old EVEX256: 0.926
> > > Geometric Mean across the suite New / Old EVEX512: 0.925
> > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > >
> > > So roughly a 7.5% speedup.  This is lower than what we see on servers
> > > (likely because clients typically have faster single-core bandwidth, so
> > > saving bandwidth on RFOs is less impactful), but still advantageous.
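(For anyone reproducing the aggregation: the "Geometric Mean New / Old"
figures are the n-th root of the product of the per-benchmark time ratios,
i.e. exp of the mean log ratio.  A minimal C sketch of that computation,
independent of the actual benchmark harness:)

#include <math.h>
#include <stddef.h>

/* Geometric mean of per-benchmark New/Old time ratios, computed as
   exp(mean(log(ratio))) for numerical stability.  */
double
geomean_new_over_old (const double *new_ns, const double *old_ns, size_t n)
{
  double sum_log = 0.0;
  for (size_t i = 0; i < n; i++)
    sum_log += log (new_ns[i] / old_ns[i]);
  return exp (sum_log / (double) n);
}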
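(And for readers skimming the thread, the core technique in C intrinsics
terms.  This is a minimal sketch assuming SSE2, not the tuned assembly in
the patch below, which handles alignment, tails, and vector-size dispatch
far more carefully:)

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

void *
memset_nt_sketch (void *dst, int c, size_t len)
{
  unsigned char *p = dst;
  __m128i v = _mm_set1_epi8 ((char) c);

  /* Head: plain byte stores up to 16-byte alignment.  */
  while (((uintptr_t) p & 15) && len)
    {
      *p++ = (unsigned char) c;
      len--;
    }

  /* Body: streaming stores bypass the cache, so no destination data
     is pulled in by RFO requests.  */
  for (; len >= 16; p += 16, len -= 16)
    _mm_stream_si128 ((__m128i *) p, v);

  /* Order the NT stores before any subsequent loads/stores.  */
  _mm_sfence ();

  /* Tail.  */
  memset (p, c, len);
  return dst;
}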
> > >
> > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > ---
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 97839a2248..637caadb40 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -21,10 +21,13 @@
> > >     2. If size is less than VEC, use integer register stores.
> > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > -   5. On machines ERMS feature, if size is greater or equal than
> > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > +   6. On machines ERMS feature, if size is range
> > > +      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > +      then REP STOSB will be used.
> > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > +      non-temporal stores.  */
> > >
> > >  #include <sysdep.h>
> > >
> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > >  	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
> > >  	VMOVU	%VMM(0), (%rdi)
> > >  	VZEROUPPER_RETURN
> > > +
> > > +	/* If have AVX512 mask instructions put L(less_vec) close to
> > > +	   entry as it doesn't take much space and is likely a hot target.  */
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > +	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > +	.p2align 6,, 47
> > > +	.p2align 4
> > > +L(less_vec):
> > > +L(less_vec_from_wmemset):
> > > +	/* Less than 1 VEC.  */
> > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > +#  error Unsupported VEC_SIZE!
> > > +# endif
> > > +	/* Clear high bits from edi. Only keeping bits relevant to page
> > > +	   cross check. Note that we are using rax which is set in
> > > +	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > +	andl	$(PAGE_SIZE - 1), %edi
> > > +	/* Check if VEC_SIZE store cross page. Mask stores suffer
> > > +	   serious performance degradation when it has to fault suppress.  */
> > > +	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> > > +	/* This is generally considered a cold target.  */
> > > +	ja	L(cross_page)
> > > +# if VEC_SIZE > 32
> > > +	movq	$-1, %rcx
> > > +	bzhiq	%rdx, %rcx, %rcx
> > > +	kmovq	%rcx, %k1
> > > +# else
> > > +	movl	$-1, %ecx
> > > +	bzhil	%edx, %ecx, %ecx
> > > +	kmovd	%ecx, %k1
> > > +# endif
> > > +	vmovdqu8 %VMM(0), (%rax){%k1}
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >  END (MEMSET_SYMBOL (__memset, unaligned))
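(Aside for reviewers new to the mask-store trick above: bzhi builds a mask
with the low `len` bits set, and vmovdqu8 then stores only those bytes.  A
rough intrinsics equivalent, assuming AVX512BW + BMI2 and len <= 64; note
the real code above additionally refuses masked stores that would cross a
page, since fault suppression is slow:)

#include <immintrin.h>
#include <stddef.h>

/* Store exactly `len' copies of `c' at dst using one masked store.  */
void
memset_tail_avx512 (void *dst, int c, size_t len)
{
  __m512i v = _mm512_set1_epi8 ((char) c);
  /* bzhi (~0, len) keeps the low `len' bits; for len == 64 all bits
     survive, matching a full-width store.  */
  __mmask64 k = _cvtu64_mask64 (_bzhi_u64 (~0ULL, len));
  _mm512_mask_storeu_epi8 (dst, k, v);
}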
> > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > >  #endif
> > >  	VZEROUPPER_RETURN
> > >
> > > -	/* If have AVX512 mask instructions put L(less_vec) close to
> > > -	   entry as it doesn't take much space and is likely a hot target.
> > > -	 */
> > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > -	.p2align 4,, 10
> > > -L(less_vec):
> > > -L(less_vec_from_wmemset):
> > > -	/* Less than 1 VEC.  */
> > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > -#  error Unsupported VEC_SIZE!
> > > -# endif
> > > -	/* Clear high bits from edi. Only keeping bits relevant to page
> > > -	   cross check. Note that we are using rax which is set in
> > > -	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > -	andl	$(PAGE_SIZE - 1), %edi
> > > -	/* Check if VEC_SIZE store cross page. Mask stores suffer
> > > -	   serious performance degradation when it has to fault suppress.
> > > -	 */
> > > -	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> > > -	/* This is generally considered a cold target.  */
> > > -	ja	L(cross_page)
> > > -# if VEC_SIZE > 32
> > > -	movq	$-1, %rcx
> > > -	bzhiq	%rdx, %rcx, %rcx
> > > -	kmovq	%rcx, %k1
> > > -# else
> > > -	movl	$-1, %ecx
> > > -	bzhil	%edx, %ecx, %ecx
> > > -	kmovd	%ecx, %k1
> > > -# endif
> > > -	vmovdqu8 %VMM(0), (%rax){%k1}
> > > -	VZEROUPPER_RETURN
> > > -
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > -	/* Include L(stosb_local) here if including L(less_vec) between
> > > -	   L(stosb_more_2x_vec) and ENTRY.  This is to cache align the
> > > -	   L(stosb_more_2x_vec) target.  */
> > > -	.p2align 4,, 10
> > > -L(stosb_local):
> > > -	movzbl	%sil, %eax
> > > -	mov	%RDX_LP, %RCX_LP
> > > -	mov	%RDI_LP, %RDX_LP
> > > -	rep	stosb
> > > -	mov	%RDX_LP, %RAX_LP
> > > -	VZEROUPPER_RETURN
> > > -# endif
> > > -#endif
> > > -
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >  	.p2align 4
> > >  L(stosb_more_2x_vec):
> > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > >  	ret
> > >  #endif
> > >
> > > -	.p2align 4,, 10
> > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > +#ifdef USE_WITH_AVX2
> > > +	.p2align 4
> > > +#else
> > > +	.p2align 4,, 4
> > > +#endif
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > >  	/* If no USE_LESS_VEC_MASK put L(stosb_local) here.  Will be in
> > >  	   range for 2-byte jump encoding.  */
> > >  L(stosb_local):
> > > +	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > +	jae	L(nt_memset)
> > >  	movzbl	%sil, %eax
> > >  	mov	%RDX_LP, %RCX_LP
> > >  	mov	%RDI_LP, %RDX_LP
> > >  	rep	stosb
> > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > +	/* Use xchg to save 1-byte (this helps align targets below).  */
> > > +	xchg	%RDX_LP, %RAX_LP
> > > +# else
> > >  	mov	%RDX_LP, %RAX_LP
> > > -	VZEROUPPER_RETURN
> > >  # endif
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +#ifndef USE_LESS_VEC_MASK_STORE
> > >  	/* Define L(less_vec) only if not otherwise defined.  */
> > > -	.p2align 4
> > > +	.p2align 4,, 12
> > >  L(less_vec):
> > >  	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > >  	   xmm). This is only does anything for AVX2.  */
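(The new dispatch in L(stosb_local) above, restated as a C sketch;
`nt_threshold' and `memset_nt' are illustrative stand-ins, not glibc
internals:)

#include <stddef.h>

extern size_t nt_threshold;          /* __x86_shared_non_temporal_threshold */
extern void *memset_nt (void *, int, size_t);

void *
memset_large (void *dst, int c, size_t len)
{
  /* Sizes at or above the NT threshold skip `rep stosb` entirely,
     mirroring the cmp/jae pair added above.  */
  if (len >= nt_threshold)
    return memset_nt (dst, c, len);

  void *p = dst;
  size_t n = len;
  /* rep stosb: AL = fill byte, RDI = destination, RCX = count.  */
  __asm__ volatile ("rep stosb"
                    : "+D" (p), "+c" (n)
                    : "a" (c)
                    : "memory");
  return dst;
}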
> > > @@ -423,4 +425,35 @@ L(between_2_3):
> > >  	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > >  #endif
> > >  	ret
> > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > +# ifdef USE_WITH_AVX512
> > > +	/* Force align so the loop doesn't cross a cache-line.  */
> > > +	.p2align 4
> > > +# endif
> > > +	.p2align 4,, 7
> > > +	/* Memset using non-temporal stores.  */
> > > +L(nt_memset):
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > +	/* Align DST.  */
> > > +	orq	$(VEC_SIZE * 1 - 1), %rdi
> > > +	incq	%rdi
> > > +	.p2align 4,, 7
> > > +L(nt_loop):
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
> > > +	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
> > > +	subq	$(VEC_SIZE * -4), %rdi
> > > +	cmpq	%rdx, %rdi
> > > +	jb	L(nt_loop)
> > > +	sfence
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
> > > +	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
> > > +	VZEROUPPER_RETURN
> > > +#endif
> > > +
> > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
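One small note on the `orq`/`incq` pair in L(nt_memset): it rounds DST up
to the next VEC_SIZE boundary strictly above the current pointer, which is
safe because the first VEC was already stored unaligned at the original
DST.  Roughly, in C (a sketch):

#include <stdint.h>

/* Round p up to the next multiple of `vec' strictly greater than p,
   mirroring `orq $(VEC_SIZE * 1 - 1), %rdi; incq %rdi'.  Safe here
   because the unaligned head store already covered [p, p + vec).  */
static inline uintptr_t
align_up_next (uintptr_t p, uintptr_t vec)
{
  return (p | (vec - 1)) + 1;
}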