From: "H.J. Lu"
Date: Fri, 15 Apr 2022 10:20:34 -0700
Subject: Re: [PATCH v1 2/3] x86: Remove memcmp-sse4.S
To: Noah Goldstein
Cc: GNU C Library, "Carlos O'Donell"
References: <20220415055132.1257272-1-goldstein.w.n@gmail.com> <20220415055132.1257272-2-goldstein.w.n@gmail.com>
In-Reply-To: <20220415055132.1257272-2-goldstein.w.n@gmail.com>

On Thu, Apr 14, 2022 at 10:51 PM Noah Goldstein wrote:
>
> Code didn't actually use any sse4 instructions. The new memcmp-sse2
> implementation is also faster.

Please mention that the SSE4.1 ptest instruction was removed by

commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
Author: Noah Goldstein
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

> geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905
>
> Note there are two regressions preferring SSE2 for Size = 1 and Size =
> 65.
>
> Size = 1:
> size, align0, align1, ret, New Time/Old Time
> 1, 1, 1, 0, 1.2
> 1, 1, 1, 1, 1.197
> 1, 1, 1, -1, 1.2
>
> This is intentional. Size == 1 is significantly less hot based on
> profiles of GCC11 and Python3 than sizes [4, 8] (which is made
> hotter).
>
> Python3 Size = 1 -> 13.64%
> Python3 Size = [4, 8] -> 60.92%
>
> GCC11 Size = 1 -> 1.29%
> GCC11 Size = [4, 8] -> 33.86%
>
> size, align0, align1, ret, New Time/Old Time
> 4, 4, 4, 0, 0.622
> 4, 4, 4, 1, 0.797
> 4, 4, 4, -1, 0.805
> 5, 5, 5, 0, 0.623
> 5, 5, 5, 1, 0.777
> 5, 5, 5, -1, 0.802
> 6, 6, 6, 0, 0.625
> 6, 6, 6, 1, 0.813
> 6, 6, 6, -1, 0.788
> 7, 7, 7, 0, 0.625
> 7, 7, 7, 1, 0.799
> 7, 7, 7, -1, 0.795
> 8, 8, 8, 0, 0.625
> 8, 8, 8, 1, 0.848
> 8, 8, 8, -1, 0.914
> 9, 9, 9, 0, 0.625
>
> Size = 65:
> size, align0, align1, ret, New Time/Old Time
> 65, 0, 0, 0, 1.103
> 65, 0, 0, 1, 1.216
> 65, 0, 0, -1, 1.227
> 65, 65, 0, 0, 1.091
> 65, 0, 65, 1, 1.19
> 65, 65, 65, -1, 1.215
>
> This is because A) the checks in range [65, 96] are now unrolled 2x
> and B) because smaller values <= 16 are now given a hotter path. By
> contrast the SSE4 version has a branch for Size = 80. The unrolled
> version gets better performance for returns which need both
> comparisons.
>
> size, align0, align1, ret, New Time/Old Time
> 128, 4, 8, 0, 0.858
> 128, 4, 8, 1, 0.879
> 128, 4, 8, -1, 0.888
>
> As well, outside of microbenchmark environments, which are not fully
> predictable, the branch will have a real cost.
> ---
>  sysdeps/x86_64/multiarch/Makefile          | 2 --
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c | 4 ----
>  sysdeps/x86_64/multiarch/ifunc-memcmp.h    | 4 ----
>  3 files changed, 10 deletions(-)
>

Please also remove memcmp-sse4.S.

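For readers following the size-1 versus [4, 8] trade-off in the quoted commit message, here is a rough back-of-the-envelope sketch, not part of the patch. It weights the quoted New/Old time ratios by the quoted profile percentages. The assumptions are mine: call-frequency share is treated as a time weight, only these two size buckets are considered, the size-1 bucket uses the worst quoted ratio (1.2), and the [4, 8] bucket uses the worst quoted ratio for sizes 4 to 8 (0.914). The names `demo_bucket` and `demo_weighted_ratio` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one size bucket from the quoted profiles.  */
struct demo_bucket
{
  double weight;  /* Share of calls, in percent (from the quoted profile).  */
  double ratio;   /* New Time / Old Time (from the quoted tables).  */
};

/* Weighted arithmetic mean of the ratios; a value below 1.0 means the
   two buckets together got faster under this simplified model.  */
static double
demo_weighted_ratio (const struct demo_bucket *b, size_t n)
{
  double weighted_sum = 0.0;
  double weight_total = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      weighted_sum += b[i].weight * b[i].ratio;
      weight_total += b[i].weight;
    }
  assert (weight_total > 0.0);
  return weighted_sum / weight_total;
}

/* Python3 profile: Size = 1 -> 13.64%, Size = [4, 8] -> 60.92%.  */
static double
demo_python3_estimate (void)
{
  static const struct demo_bucket b[] = { { 13.64, 1.2 }, { 60.92, 0.914 } };
  return demo_weighted_ratio (b, sizeof b / sizeof b[0]);
}

/* GCC11 profile: Size = 1 -> 1.29%, Size = [4, 8] -> 33.86%.  */
static double
demo_gcc11_estimate (void)
{
  static const struct demo_bucket b[] = { { 1.29, 1.2 }, { 33.86, 0.914 } };
  return demo_weighted_ratio (b, sizeof b / sizeof b[0]);
}
```

Even with the pessimistic ratio chosen for [4, 8], both estimates come out below 1.0 under these assumptions, which is consistent with the argument that the size-1 regression is outweighed by the hotter [4, 8] path.
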
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index b573966966..0400ea332b 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -11,7 +11,6 @@ sysdep_routines += \
>    memcmp-avx2-movbe-rtm \
>    memcmp-evex-movbe \
>    memcmp-sse2 \
> -  memcmp-sse4 \
>    memcmpeq-avx2 \
>    memcmpeq-avx2-rtm \
>    memcmpeq-evex \
> @@ -164,7 +163,6 @@ sysdep_routines += \
>    wmemcmp-avx2-movbe-rtm \
>    wmemcmp-evex-movbe \
>    wmemcmp-sse2 \
> -  wmemcmp-sse4 \
>  # sysdep_routines
>  endif
>
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index c6008a73ed..a8afcf81bb 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -96,8 +96,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>                                && CPU_FEATURE_USABLE (BMI2)
>                                && CPU_FEATURE_USABLE (MOVBE)),
>                               __memcmp_evex_movbe)
> -             IFUNC_IMPL_ADD (array, i, memcmp, CPU_FEATURE_USABLE (SSE4_1),
> -                             __memcmp_sse4_1)
>               IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_sse2))
>
>  #ifdef SHARED
> @@ -809,8 +807,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>                                && CPU_FEATURE_USABLE (BMI2)
>                                && CPU_FEATURE_USABLE (MOVBE)),
>                               __wmemcmp_evex_movbe)
> -             IFUNC_IMPL_ADD (array, i, wmemcmp, CPU_FEATURE_USABLE (SSE4_1),
> -                             __wmemcmp_sse4_1)
>               IFUNC_IMPL_ADD (array, i, wmemcmp, 1, __wmemcmp_sse2))
>
>    /* Support sysdeps/x86_64/multiarch/wmemset.c.  */
> diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmp.h b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
> index 44759a3ad5..c743970fe3 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-memcmp.h
> +++ b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
> @@ -20,7 +20,6 @@
>  # include
>
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> -extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe_rtm) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_movbe) attribute_hidden;
> @@ -46,8 +45,5 @@ IFUNC_SELECTOR (void)
>        return OPTIMIZE (avx2_movbe);
>      }
>
> -  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_1))
> -    return OPTIMIZE (sse4_1);
> -
>    return OPTIMIZE (sse2);
>  }
> --
> 2.25.1
>

Thanks.

--
H.J.
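
For illustration only, the effect of the ifunc-memcmp.h hunk on implementation selection can be modelled with the simplified sketch below. It is not glibc code: `demo_features`, `demo_impl`, `demo_select_old` and `demo_select_new` are hypothetical names, the real selector queries `cpu_features` through `CPU_FEATURE_USABLE_P`, and all tiers above SSE4.1 (EVEX, AVX2 with or without RTM) are collapsed here into a single AVX2+MOVBE check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for glibc's cpu_features; not the real struct.  */
struct demo_features
{
  bool avx2;
  bool movbe;
  bool sse4_1;
};

/* Hypothetical tags for the implementations named in the patch.  */
enum demo_impl
{
  DEMO_SSE2,
  DEMO_SSE4_1,
  DEMO_AVX2_MOVBE
};

/* Before the patch: SSE4.1 is tried after the wider variants and
   before the SSE2 baseline.  */
static enum demo_impl
demo_select_old (const struct demo_features *f)
{
  assert (f != NULL);
  if (f->avx2 && f->movbe)
    return DEMO_AVX2_MOVBE;
  if (f->sse4_1)
    return DEMO_SSE4_1;
  return DEMO_SSE2;
}

/* After the patch: the SSE4.1 tier is gone, so every CPU without the
   wider variants falls through to SSE2.  */
static enum demo_impl
demo_select_new (const struct demo_features *f)
{
  assert (f != NULL);
  if (f->avx2 && f->movbe)
    return DEMO_AVX2_MOVBE;
  return DEMO_SSE2;
}
```

In this model a CPU with SSE4.1 but without AVX2 selects `DEMO_SSE4_1` before the change and `DEMO_SSE2` after it, while CPUs that qualify for the wider variants are unaffected.
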