From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20210419163025.2285675-1-goldstein.w.n@gmail.com>
In-Reply-To:
From: "H.J. Lu"
Date: Mon, 19 Apr 2021 13:39:03 -0700
Message-ID:
Subject: Re: [PATCH v2 1/2] x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S
To: Noah Goldstein
Cc: GNU C Library , "Carlos O'Donell"
Content-Type: text/plain; charset="UTF-8"
List-Id: Libc-alpha mailing list

On Mon, Apr 19, 2021 at 12:35 PM Noah Goldstein wrote:
>
> On Mon, Apr 19, 2021 at 2:45 PM H.J. Lu wrote:
> >
> > On Mon, Apr 19, 2021 at 9:30 AM Noah Goldstein wrote:
> > >
> > > No bug. This commit adds an optimized case for the less_vec memset path
> > > that uses the avx512vl/avx512bw mask store, avoiding the excessive
> > > branches. test-memset and test-wmemset are passing.
> > >
> > > Signed-off-by: Noah Goldstein
> > > ---
> > >  sysdeps/x86_64/multiarch/ifunc-memset.h       |  6 ++-
> > >  .../multiarch/memset-avx512-unaligned-erms.S  |  2 +-
> > >  .../multiarch/memset-evex-unaligned-erms.S    |  2 +-
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 52 +++++++++++++++----
> > >  4 files changed, 47 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/ifunc-memset.h b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > > index 502f946a84..eda5640541 100644
> > > --- a/sysdeps/x86_64/multiarch/ifunc-memset.h
> > > +++ b/sysdeps/x86_64/multiarch/ifunc-memset.h
> > > @@ -54,7 +54,8 @@ IFUNC_SELECTOR (void)
> > >        && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
> > >      {
> > >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > > -          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > > +          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > > +          && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> > >          {
> > >            if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > >              return OPTIMIZE (avx512_unaligned_erms);
> > > @@ -68,7 +69,8 @@ IFUNC_SELECTOR (void)
> > >    if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> > >      {
> > >        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> > > -          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> > > +          && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > > +          && CPU_FEATURE_USABLE_P (cpu_features, BMI2))
> >
> > Please also update ifunc-impl-list.c.
>
> Done.
>
> >
> > >          {
> > >            if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > >              return OPTIMIZE (evex_unaligned_erms);
> > > diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > index 22e7b187c8..d03460be93 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
> > > @@ -19,6 +19,6 @@
> > >  # define SECTION(p)		p##.evex512
> > >  # define MEMSET_SYMBOL(p,s)	p##_avx512_##s
> > >  # define WMEMSET_SYMBOL(p,s)	p##_avx512_##s
> > > -
> > > +# define USE_LESS_VEC_MASKMOV	1
> >
> > USE_LESS_VEC_MASKED_STORE
>
> Done.
>
> >
> > >  # include "memset-vec-unaligned-erms.S"
> > >  #endif
> > > diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > index ae0a4d6e46..eb3541ef60 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
> > > @@ -19,6 +19,6 @@
> > >  # define SECTION(p)		p##.evex
> > >  # define MEMSET_SYMBOL(p,s)	p##_evex_##s
> > >  # define WMEMSET_SYMBOL(p,s)	p##_evex_##s
> > > -
> > > +# define USE_LESS_VEC_MASKMOV	1
> > >  # include "memset-vec-unaligned-erms.S"
> > >  #endif
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 584747f1a1..6b02e87f48 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -63,6 +63,9 @@
> > >  # endif
> > >  #endif
> > >
> > > +#define PAGE_SIZE 4096
> > > +#define LOG_PAGE_SIZE 12
> > > +
> > >  #ifndef SECTION
> > >  # error SECTION is not defined!
> > >  #endif
> > > @@ -213,11 +216,38 @@ L(loop):
> > >  	cmpq	%rcx, %rdx
> > >  	jne	L(loop)
> > >  	VZEROUPPER_SHORT_RETURN
> > > +
> > > +	.p2align 4
> > >  L(less_vec):
> > >  	/* Less than 1 VEC.  */
> > >  # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > >  # error Unsupported VEC_SIZE!
> > >  # endif
> > > +# ifdef USE_LESS_VEC_MASKMOV
> > > +	/* Clear high bits from edi. Only keeping bits relevant to page
> > > +	   cross check. Using sall instead of andl saves 3 bytes. Note
> > > +	   that we are using rax which is set in
> > > +	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > +	sall	$(32 - LOG_PAGE_SIZE), %edi
> > > +	/* Check if VEC_SIZE load cross page. Mask loads suffer serious
> > > +	   performance degradation when it has to fault supress. */
> > > +	cmpl	$((PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE)), %edi
> >
> > Please use AND and CMP since AND has higher throughput.
>
> AND uses more code size for VEC_SIZE=16/32 and just barely pushes the
> L(cross_page) to the next 16 byte chunk so the extra 3 bytes from AND
> end up costing 16 bytes. Not aligning L(cross_page) to 16 also
> introduces higher variance to benchmarks so I think it has to be all 16 bytes.
>
> As is I don't think throughput of AND / SAL is on the critical
> path so code size should win out. (We can also decode MOV -1, ecx
> first cycle with SAL as opposed to AND).
>
> What do you think?

I prefer AND over SAL.  Something like

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 3a59d39267..763fb907b9 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -217,21 +217,17 @@ L(loop):
 	jne	L(loop)
 	VZEROUPPER_SHORT_RETURN
 
-	.p2align 4
+	/* NB: Don't align this branch target to reduce code size.  */
 L(less_vec):
 	/* Less than 1 VEC.  */
 # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
 # error Unsupported VEC_SIZE!
 # endif
 # ifdef USE_LESS_VEC_MASK_STORE
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Using sall instead of andl saves 3 bytes. Note
-	   that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
-	sall	$(32 - LOG_PAGE_SIZE), %edi
-	/* Check if VEC_SIZE load cross page. Mask loads suffer serious
+	/* Check if VEC_SIZE store cross page. Mask stores suffer serious
 	   performance degradation when it has to fault supress. */
-	cmpl	$((PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE)), %edi
+	andl	$(PAGE_SIZE - 1), %edi
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
 	ja	L(cross_page)
 # if VEC_SIZE > 32
 	movq	$-1, %rcx

Thanks.

-- 
H.J.
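
For reference, the two page-cross checks being debated above compute the same
condition; the disagreement is only about encoding size and branch-target
alignment. A rough C rendering (illustrative only, not code from either
version of the patch), assuming a 32-byte VEC_SIZE and 4 KiB pages:

#include <stdint.h>

#define PAGE_SIZE     4096
#define LOG_PAGE_SIZE 12
#define VEC_SIZE      32

/* Both return nonzero when a VEC_SIZE-wide store at ptr would cross a page
   boundary, i.e. when the page offset is above PAGE_SIZE - VEC_SIZE.  */

static int
crosses_page_sal (uintptr_t ptr)
{
  /* sall $(32 - LOG_PAGE_SIZE), %edi; cmpl ...; ja -- shift the page-offset
     bits to the top of the register and compare them there.  */
  uint32_t shifted = (uint32_t) ptr << (32 - LOG_PAGE_SIZE);
  return shifted > ((uint32_t) (PAGE_SIZE - VEC_SIZE) << (32 - LOG_PAGE_SIZE));
}

static int
crosses_page_and (uintptr_t ptr)
{
  /* andl $(PAGE_SIZE - 1), %edi; cmpl ...; ja -- mask out everything except
     the page offset and compare it in place.  */
  return ((uint32_t) ptr & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE);
}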
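
The masked-store idea at the core of the patch can likewise be sketched in C
intrinsics. The helper below is hypothetical (not taken from glibc); it
assumes AVX512VL, AVX512BW and BMI2 are available and uses the 32-byte evex
vector size:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define VEC_SIZE  32

/* Hypothetical sketch of the less_vec path for n < VEC_SIZE bytes.  */
static void *
small_memset_masked (void *dst, int c, size_t n)
{
  /* Build a mask with the low n bits set so the store writes exactly n
     bytes.  _bzhi_u32 is a BMI2 instruction, which is presumably why the
     ifunc selector gains the BMI2 check.  */
  __mmask32 m = _bzhi_u32 (-1U, (unsigned int) n);
  __m256i v = _mm256_set1_epi8 ((char) c);

  /* Only use the masked store when the full VEC_SIZE window stays within
     one page; a masked store whose address range crosses into an unmapped
     page has to suppress the fault, which is very slow.  */
  if (((uintptr_t) dst & (PAGE_SIZE - 1)) <= PAGE_SIZE - VEC_SIZE)
    _mm256_mask_storeu_epi8 (dst, m, v);
  else
    memset (dst, c, n);	/* Fall back; the real code branches on size here.  */
  return dst;
}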