From: Mayshao-oc
To: Noah Goldstein
CC: "H.J. Lu", GNU C Library, Florian Weimer, "Carlos O'Donell", "Louis Qi(BJ-RD)"
Subject: Re:Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
Date: Thu, 31 Mar 2022 03:34:43 +0000

On Thur, Mar 31, 2022 at 12:45 AM Noah Goldstein wrote:
> On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc wrote:
> >
> > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein wrote:
> > >
> > > On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at 9:07 PM H.J. Lu wrote:
> > > > >
> > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc wrote:
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein wrote:
> > > > > > >
> > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions, very few targets prefer
> > > > > > > SSSE3. As a result, it's no longer worth the code size cost.
> > > > > > > ---
> > > > > > >  sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > >  sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > >  sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > >  sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > >  5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > >
> > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > is better than that of AVX2, and current systems have ample
> > > > > > disk and memory capacity.
> > > > >
> > > > > How does the SSSE3 version compare against the SSE2 version?
> > > >
> > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > 10% higher than that of SSE2.
> > > >
> > > >
> > > > Best Regards,
> > > > May Shao
> > >
> > > Any chance you can post the result from running `bench-memset` or some
> > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > fix the SSE2 version so it's optimal.
> >
> > Bench-memcpy on the Zhaoxin KX-6000 processor shows that, when length <= 4 or
> > length >= 128, the SSSE3 memcpy achieves an average performance improvement
> > of 25% over SSE2.
> > Thanks
>
> The size <= 4 regression is expected, as profiles of SPEC show [5, 32] sized
> copies to be significantly hotter.
>
> Regarding the large sizes, it seems to be because the SSSE3 version avoids
> unaligned loads/stores much more aggressively.

Agree.

> For now we will keep the function. Will look into a replacement that isn't so
> costly to code size.

Thanks very much for your support.

> Out of curiosity, is bench-memcpy-random performance also improved with
> SSSE3?
> The jump table / branches generally look really nice in micro-benchmarks,
> but that may not be fully indicative of how it will perform in an
> application.

Bench-memcpy-random shows about a 5% performance drop for SSSE3 (between 4.64% and 8.51%, depending on length):

                __memcpy_sse2_unaligned  __memcpy_ssse3  Improvement (ssse3 over sse2)
length=32768                     805982          874585  -8.51%
length=65536                     885317          940458  -6.23%
length=131072                    929177          979173  -5.38%
length=262144                    980083         1033130  -5.41%
length=524288                   1042590         1095560  -5.08%
length=1048576                  1078020         1127990  -4.64%

> > > > I have attached the test results; I hope this is what you wanted to see.
> > > >
> > > > It is strongly recommended to keep the SSSE3 version.
> > > >
> > > > >
> > > > > --
> > > > > H.J.
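The "Improvement" column of the bench-memcpy-random results matches (t_sse2 - t_ssse3) / t_sse2 * 100, assuming both figures are timings where lower is better. The following is a small illustrative C sketch (not part of glibc's benchtests; `bench_row`, `improvement_pct`, and `mean_improvement_pct` are hypothetical names) that recomputes the column and its mean from the quoted numbers.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the bench-memcpy-random results quoted in this message. */
struct bench_row {
    size_t length;
    double t_sse2;   /* __memcpy_sse2_unaligned */
    double t_ssse3;  /* __memcpy_ssse3 */
};

static const struct bench_row rows[] = {
    {32768, 805982, 874585},
    {65536, 885317, 940458},
    {131072, 929177, 979173},
    {262144, 980083, 1033130},
    {524288, 1042590, 1095560},
    {1048576, 1078020, 1127990},
};

/* Percent improvement of ssse3 over sse2, treating both values as times
   (lower is better); a negative result means ssse3 is slower. */
static double improvement_pct(double t_sse2, double t_ssse3)
{
    assert(t_sse2 > 0.0);
    return (t_sse2 - t_ssse3) / t_sse2 * 100.0;
}

/* Arithmetic mean of the per-length improvements. */
static double mean_improvement_pct(const struct bench_row *r, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += improvement_pct(r[i].t_sse2, r[i].t_ssse3);
    return sum / (double)n;
}
```

With these six rows the mean comes out near -5.9%, slightly larger in magnitude than the "about 5%" quoted, while individual lengths range from -4.64% to -8.51%.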
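Noah attributes the large-size advantage to the SSSE3 version avoiding unaligned loads/stores. As we understand it, memcpy-ssse3.S does this with PALIGNR: it issues only aligned source loads and stitches each output block together from two neighbouring aligned blocks. The sketch below shows the same shift-and-merge idea in portable scalar C with 8-byte words instead of 16-byte SSE registers. It is an illustration, not the glibc code; `copy_shift_merge` is a hypothetical name. It also assumes that the aligned words surrounding `src` belong to the same buffer, because plain C cannot rely on the page-boundary reasoning that makes such over-reads safe in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not glibc code): copy nwords 8-byte words from a
   possibly misaligned src to an 8-byte-aligned dst, issuing only aligned
   8-byte loads and stores.  Precondition: the bytes from src - off through
   src + 8*nwords + 7 - off (off = src % 8) are part of the same object. */
static void copy_shift_merge(unsigned char *dst, const unsigned char *src,
                             size_t nwords)
{
    assert((uintptr_t)dst % 8 == 0);
    size_t off = (uintptr_t)src % 8;        /* source misalignment in bytes */
    if (nwords == 0)
        return;
    if (off == 0) {                         /* already aligned: plain copy */
        memcpy(dst, src, nwords * 8);
        return;
    }
    const unsigned char *base = src - off;  /* src rounded down to 8 bytes */
    unsigned lo = 8u * (unsigned)off;       /* bits dropped from current word */
    unsigned hi = 64u - lo;                 /* bits taken from the next word */
    uint64_t cur, next, out;
    memcpy(&cur, base, 8);                  /* aligned load */
    for (size_t i = 0; i < nwords; i++) {
        memcpy(&next, base + 8 * (i + 1), 8);   /* aligned load */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        out = (cur << lo) | (next >> hi);   /* big-endian merge */
#else
        out = (cur >> lo) | (next << hi);   /* little-endian merge */
#endif
        memcpy(dst + 8 * i, &out, 8);       /* aligned store */
        cur = next;
    }
}
```

Each misalignment needs its own shift amount. PALIGNR takes that amount as an immediate, so a vector version needs a separate code path per source offset. This is plausibly one reason memcpy-ssse3.S is so large and relies on the jump table Noah mentions.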