From: Noah Goldstein
Date: Wed, 30 Mar 2022 11:54:16 -0500
Subject: Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
To: Mayshao-oc
Cc: "H.J. Lu", GNU C Library, Florian Weimer, "Carlos O'Donell", "Louis Qi(BJ-RD)"
References: <89bb3f1942814671ae858dcef4b3b870@zhaoxin.com> <09816f3ba25043339d57121bbae3d991@zhaoxin.com> <239c5445e4ea4c55b85a5b7db70983a2@zhaoxin.com>
List-Id: Libc-alpha mailing list

On Wed, Mar 30, 2022 at 11:45 AM Noah Goldstein wrote:
>
> On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc wrote:
> >
> > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein wrote:
> >
> >
> > > On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at 9:07 PM H.J. Lu wrote:
> > > >
> > > >
> > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc wrote:
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein wrote:
> > > > > >
> > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions, very few targets prefer
> > > > > > > SSSE3. As a result, it's no longer worth the code size cost.
> > > > > > > ---
> > > > > > >  sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > >  sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > >  sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > >  sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > >  5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > >
> > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > is better than that of AVX2, and current computer systems have
> > > > > > sufficient disk and memory capacity.
> > > > >
> > > > > How does the SSSE3 version compare against the SSE2 version?
> > > >
> > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > 10% higher than that of SSE2.
> > > >
> > > >
> > > > Best Regards,
> > > > May Shao
> > >
> > > Any chance you can post the result from running `bench-memset` or some
> > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > fix the SSE2 version so it's optimal.
> >
> > Bench-memcpy on the Zhaoxin KX-6000 processor shows that, when length <= 4 or
> > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > of 25% compared to SSE2.
>
> Thanks
>
> The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> copies to be significantly hotter.
>
> Regarding the large sizes, it seems to be because the SSSE3 version avoids
> unaligned loads/stores much more aggressively.
>
> For now we will keep the function. Will look into a replacement that isn't so
> costly to code size.
>
> Out of curiosity, is bench-memcpy-random performance also improved with
> SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> but that may not be fully indicative of how it will perform in an
> application.
> >
> > I have attached the test results; I hope this is what you want to see.
> >
> > > > > > It is strongly recommended to keep the SSSE3 version.

Will you guys have any issues if we upgrade the unaligned memcpy to SSE4.1?
That will allow us to use `pshufb` and get rid of the jump table and excessive
code size.

> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > H.J.
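
For readers following the point about large sizes: below is a minimal, portable C sketch of what "avoiding unaligned stores" can look like, assuming non-overlapping buffers. It is not the glibc SSSE3 code. `copy_dst_aligned` is a hypothetical helper made up for illustration, the 16-byte block size is an assumption chosen to match an SSE register, and the sketch only aligns the destination. It says nothing about how the real implementation handles a misaligned source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not glibc code): copy n bytes so that every
   16-byte block store lands on a 16-byte-aligned destination address.
   Buffers must not overlap, as with memcpy.  */
static void *copy_dst_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: number of bytes needed to bring d up to a 16-byte boundary.  */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Body: d is now 16-byte aligned (or n is 0); s may still be
       misaligned.  Compilers typically lower each fixed-size memcpy
       to a single vector load and store.  */
    while (n >= 16) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Tail: fewer than 16 bytes remain.  */
    memcpy(d, s, n);
    return dst;
}
```

Calling `copy_dst_aligned(dst + 3, src + 1, 70)` copies the same bytes as `memcpy` would; only the order and alignment of the stores differ. Whether this helps on a given core is exactly the kind of question bench-memcpy is meant to answer.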