References: <89bb3f1942814671ae858dcef4b3b870@zhaoxin.com> <09816f3ba25043339d57121bbae3d991@zhaoxin.com> <239c5445e4ea4c55b85a5b7db70983a2@zhaoxin.com>
In-Reply-To: <239c5445e4ea4c55b85a5b7db70983a2@zhaoxin.com>
From: Noah Goldstein
Date: Wed, 30 Mar 2022 11:45:58 -0500
Subject: Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
To: Mayshao-oc
Cc: "H.J. Lu", GNU C Library, Florian Weimer, "Carlos O'Donell", "Louis Qi(BJ-RD)"
List-Id: Libc-alpha mailing list

On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc wrote:
>
> On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein wrote:
> >
> >On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc wrote:
> > >
> > > On Mon, Mar 28, 2022 at 9:07 PM H.J. Lu wrote:
> > >
> > >
> > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc wrote:
> > > > >
> > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein wrote:
> > > > >
> > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > SSSE3. As a result its no longer with the code size cost.
> > > > > > ---
> > > > > >  sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > >  sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > >  sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > >  sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > >  5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > >
> > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > is better than that of AVX2, and the current computer system has sufficient
> > > > > disk capacity and memory capacity.
> > > >
> > > > How does the SSSE3 version compare against the SSE2 version?
> > >
> > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > 10% higher than that of SSE2.
> > >
> > >
> > > Best Regards,
> > > May Shao
> >
> > Any chance you can post the result from running `bench-memset` or some
> > equivalent benchmark? Curious where the regressions are. Ideally we would
> > fix the SSE2 version so its optimal.
>
> Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <=4 or
> length >= 128, memcpy SSSE3 can achieve an average performance improvement
> of 25% compared to SSSE2.

Thanks.

The size <= 4 regression is expected, as profiles of SPEC show the [5, 32]
sized copies to be significantly hotter.

Regarding the large sizes, it seems to be because the SSSE3 version avoids
unaligned loads/stores much more aggressively.

For now we will keep the function. I will look into a replacement that isn't
so costly in code size.

Out of curiosity, is bench-memcpy-random performance also improved with
SSSE3? The jump table / branches generally look really nice in
micro-benchmarks, but that may not be fully indicative of how it will
perform in an application.

>
> I have attached the test results, hope this is what you want to see.
>
> > > > > It is strongly recommended to keep the SSSE3 version.
> > > > >
> > > >
> > > >
> > > > --
> > > > H.J.
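
To make the micro-benchmark concern in the reply concrete: a loop that copies the same length every time lets the branch predictor learn the size-dispatch path, while randomly drawn lengths do not. Below is a rough sketch of that contrast. It is not glibc's bench-memcpy or bench-memcpy-random; the helper names (`fill_fixed`, `fill_random`, `time_copies`), the xorshift generator, and the 256-byte cap are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Assumed cap on copy length; chosen only for this sketch. */
enum { MAX_COPY = 256 };

/* Hypothetical helper: small xorshift PRNG so size sequences are repeatable. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Every copy uses the same length, so dispatch branches are predictable. */
static void fill_fixed(size_t *sizes, size_t n, size_t len)
{
    for (size_t i = 0; i < n; i++)
        sizes[i] = len;
}

/* Lengths drawn from [1, max_len], so dispatch branches vary per call. */
static void fill_random(size_t *sizes, size_t n, size_t max_len, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++)
        sizes[i] = 1 + next_rand(&state) % max_len;
}

/* Runs n copies with the given lengths and returns elapsed CPU seconds.
   The checksum reads back one copied byte per call so the copies are
   harder for the compiler to discard. */
static double time_copies(const size_t *sizes, size_t n, unsigned long *checksum)
{
    static unsigned char src[MAX_COPY], dst[MAX_COPY];
    unsigned long sum = 0;

    /* Validate outside the timed region. */
    for (size_t i = 0; i < n; i++)
        assert(sizes[i] >= 1 && sizes[i] <= (size_t)MAX_COPY);

    memset(src, 1, sizeof src);
    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        memcpy(dst, src, sizes[i]);
        sum += dst[sizes[i] - 1];
    }
    clock_t end = clock();

    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To use it, fill one array with `fill_fixed` and another with `fill_random`, pass each to `time_copies` with a large `n`, and compare the two timings per implementation. `clock()` is coarse, so this only shows the shape of the experiment; real measurements should come from the glibc benchtests.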