From: "H.J. Lu"
Date: Wed, 29 May 2024 15:52:53 -0700
Subject: Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
In-Reply-To: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
To: Noah Goldstein
Cc: GNU C Library, "Carlos O'Donell"
Lu" Date: Wed, 29 May 2024 15:52:53 -0700 Message-ID: Subject: Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] To: Noah Goldstein Cc: GNU C Library , "Carlos O'Donell" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3018.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, May 24, 2024 at 10:38=E2=80=AFAM Noah Goldstein wrote: > > Previously we use `rep stosb` for all medium/large memsets. This is > notably worse than non-temporal stores for large (above a > few MBs) memsets. > See: > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2u= MjjQGLbLcU/edit?usp=3Dsharing > For data using different stategies for large memset on ICX and SKX. > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > on SKX. Historically, these numbers would not have been so good > because of the zero-over-zero writeback optimization that `rep stosb` > is able to do. But, the zero-over-zero writeback optimization has been > removed as a potential side-channel attack, so there is no longer any > good reason to only rely on `rep stosb` for large memsets. On the flip > size, non-temporal writes can avoid data in their RFO requests saving > memory bandwidth. > > All of the other changes to the file are to re-organize the > code-blocks to maintain "good" alignment given the new code added in > the `L(stosb_local)` case. > > The results from running the GLIBC memset benchmarks on TGL-client for > N=3D20 runs: > > Geometric Mean across the suite New / Old EXEX256: 0.979 > Geometric Mean across the suite New / Old EXEX512: 0.979 > Geometric Mean across the suite New / Old AVX2 : 0.986 > Geometric Mean across the suite New / Old SSE2 : 0.979 > > Most of the cases are essentially unchanged, this is mostly to show > that adding the non-temporal case didn't add any regressions to the > other cases. > > The results on the memset-large benchmark suite on TGL-client for N=3D20 > runs: > > Geometric Mean across the suite New / Old EXEX256: 0.926 > Geometric Mean across the suite New / Old EXEX512: 0.925 > Geometric Mean across the suite New / Old AVX2 : 0.928 > Geometric Mean across the suite New / Old SSE2 : 0.924 > > So roughly a 7.5% speedup. This is lower than what we see on servers > (likely because clients typically have faster single-core bandwidth so > saving bandwidth on RFOs is less impactful), but still advantageous. > > Full test-suite passes on x86_64 w/ and w/o multiarch. > --- > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > 1 file changed, 91 insertions(+), 58 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysde= ps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 97839a2248..637caadb40 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -21,10 +21,13 @@ > 2. If size is less than VEC, use integer register stores. > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > - 5. On machines ERMS feature, if size is greater or equal than > - __x86_rep_stosb_threshold then REP STOSB will be used. 
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +      then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +      non-temporal stores.  */
>
>  #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>         VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
>         VMOVU   %VMM(0), (%rdi)
>         VZEROUPPER_RETURN
> +
> +       /* If have AVX512 mask instructions put L(less_vec) close to
> +          entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +       /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +       .p2align 6,, 47
> +       .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +       /* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +       /* Clear high bits from edi. Only keeping bits relevant to page
> +          cross check. Note that we are using rax which is set in
> +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +       andl    $(PAGE_SIZE - 1), %edi
> +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> +          serious performance degradation when it has to fault suppress.  */
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> +       /* This is generally considered a cold target.  */
> +       ja      L(cross_page)
> +# if VEC_SIZE > 32
> +       movq    $-1, %rcx
> +       bzhiq   %rdx, %rcx, %rcx
> +       kmovq   %rcx, %k1
> +# else
> +       movl    $-1, %ecx
> +       bzhil   %edx, %ecx, %ecx
> +       kmovd   %ecx, %k1
> +# endif
> +       vmovdqu8 %VMM(0), (%rax){%k1}
> +       VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>         VZEROUPPER_RETURN
>
> -       /* If have AVX512 mask instructions put L(less_vec) close to
> -          entry as it doesn't take much space and is likely a hot target.
> -        */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -       .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -       /* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -       /* Clear high bits from edi. Only keeping bits relevant to page
> -          cross check. Note that we are using rax which is set in
> -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -       andl    $(PAGE_SIZE - 1), %edi
> -       /* Check if VEC_SIZE store cross page. Mask stores suffer
> -          serious performance degradation when it has to fault suppress.
> -        */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> -       /* This is generally considered a cold target.  */
> -       ja      L(cross_page)
> -# if VEC_SIZE > 32
> -       movq    $-1, %rcx
> -       bzhiq   %rdx, %rcx, %rcx
> -       kmovq   %rcx, %k1
> -# else
> -       movl    $-1, %ecx
> -       bzhil   %edx, %ecx, %ecx
> -       kmovd   %ecx, %k1
> -# endif
> -       vmovdqu8 %VMM(0), (%rax){%k1}
> -       VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -       /* Include L(stosb_local) here if including L(less_vec) between
> -          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -          L(stosb_more_2x_vec) target.  */
> -       .p2align 4,, 10
> -L(stosb_local):
> -       movzbl  %sil, %eax
> -       mov     %RDX_LP, %RCX_LP
> -       mov     %RDI_LP, %RDX_LP
> -       rep     stosb
> -       mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>         .p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>         ret
>  #endif
>
> -       .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +       .p2align 4
> +#else
> +       .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
>         mov     %RDI_LP, %RDX_LP
>         rep     stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +       /* Use xchg to save 1-byte (this helps align targets below).  */
> +       xchg    %RDX_LP, %RAX_LP
> +# else
>         mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
>  # endif
> +       VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>         /* Define L(less_vec) only if not otherwise defined.  */
> -       .p2align 4
> +       .p2align 4,, 12
>  L(less_vec):
>         /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>            xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>         movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>         ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +       /* Force align so the loop doesn't cross a cache-line.  */
> +       .p2align 4
> +# endif
> +       .p2align 4,, 7
> +       /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +       /* Align DST.  */
> +       orq     $(VEC_SIZE * 1 - 1), %rdi
> +       incq    %rdi
> +       .p2align 4,, 7
> +L(nt_loop):
> +       VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> +       subq    $(VEC_SIZE * -4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(nt_loop)
> +       sfence
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> +       VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
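For reference, the bzhi/kmov/vmovdqu8 sequence in the relocated
L(less_vec) block above is the entire sub-VEC_SIZE store. A rough C
equivalent is below; the helper name is hypothetical, it requires
AVX512BW plus BMI2, and it omits the page-cross check that the
assembly performs first:

    #include <immintrin.h>
    #include <stddef.h>

    /* Illustrative only: memset of n < 64 bytes in one masked store.
       Compile with -mavx512bw -mbmi2.  */
    static void small_memset_avx512(void *dst, int c, size_t n)
    {
        /* bzhi(~0, n) keeps only the low n bits set: one mask bit
           per byte that should be written.  */
        __mmask64 k = (__mmask64)_bzhi_u64(~0ULL, (unsigned int)n);
        _mm512_mask_storeu_epi8(dst, k, _mm512_set1_epi8((char)c));
    }

The page-cross check matters because a masked store whose masked-off
bytes extend into an unmapped page does not fault, but the fault
suppression it triggers is very slow; hence the branch to
L(cross_page) instead.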
LGTM.

Reviewed-by: H.J. Lu

Thanks.

--
H.J.