From: Noah Goldstein
Date: Wed, 30 Mar 2022 22:47:57 -0500
Subject: Re: Re: [PATCH v1 3/6] x86: Remove mem{move|cpy}-ssse3
To: Mayshao-oc
Cc: "H.J. Lu", GNU C Library, Florian Weimer, "Carlos O'Donell", "Louis Qi(BJ-RD)"
List-Id: Libc-alpha mailing list

On Wed, Mar 30, 2022 at 10:34 PM Mayshao-oc wrote:
>
> On Thur, Mar 31, 2022 at 12:45 AM Noah Goldstein wrote:
> >
> > On Wed, Mar 30, 2022 at 4:57 AM Mayshao-oc wrote:
> > >
> > > On Tue, Mar 29, 2022 at 10:57 AM Noah Goldstein wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at 9:51 PM Mayshao-oc wrote:
> > > > >
> > > > > On Mon, Mar 28, 2022 at 9:07 PM H.J. Lu wrote:
> > > > > >
> > > > > > On Mon, Mar 28, 2022 at 1:10 AM Mayshao-oc wrote:
> > > > > > >
> > > > > > > On Fri, Mar 25, 2022 at 6:36 PM Noah Goldstein wrote:
> > > > > > > >
> > > > > > > > With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
> > > > > > > > SSSE3. As a result it's no longer worth the code size cost.
> > > > > > > > ---
> > > > > > > >  sysdeps/x86_64/multiarch/Makefile          |    2 -
> > > > > > > >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   15 -
> > > > > > > >  sysdeps/x86_64/multiarch/ifunc-memmove.h   |   18 +-
> > > > > > > >  sysdeps/x86_64/multiarch/memcpy-ssse3.S    | 3151 --------------------
> > > > > > > >  sysdeps/x86_64/multiarch/memmove-ssse3.S   |    4 -
> > > > > > > >  5 files changed, 7 insertions(+), 3183 deletions(-)
> > > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memcpy-ssse3.S
> > > > > > > >  delete mode 100644 sysdeps/x86_64/multiarch/memmove-ssse3.S
> > > > > > > >
> > > > > > > On some platforms, such as Zhaoxin, the memcpy performance of SSSE3
> > > > > > > is better than that of AVX2, and current computer systems have sufficient
> > > > > > > disk and memory capacity.
> > > > > > >
> > > > > > How does the SSSE3 version compare against the SSE2 version?
> > > > > >
> > > > > On some Zhaoxin processors, the overall performance of SSSE3 is about
> > > > > 10% higher than that of SSE2.
> > > > >
> > > > > Best Regards,
> > > > > May Shao
> > > >
> > > > Any chance you can post the result from running `bench-memset` or some
> > > > equivalent benchmark? Curious where the regressions are. Ideally we would
> > > > fix the SSE2 version so it's optimal.
> > >
> > > Bench-memcpy on Zhaoxin KX-6000 processor shows that, when length <= 4 or
> > > length >= 128, memcpy SSSE3 can achieve an average performance improvement
> > > of 25% compared to SSE2.
> >
> > Thanks
> >
> > The size <= 4 regression is expected as profiles of SPEC show the [5, 32] sized
> > copies to be significantly hotter.
> >
> > Regarding the large sizes, it seems to be because the SSSE3 version avoids
> > unaligned loads/stores much more aggressively.
>
> Agree.
>
> > For now we will keep the function. Will look into a replacement that isn't so
> > costly to code size.
>
> Thanks very much for your support.

Will SSE4.1 be an issue for you?
I think the only reasonable way to fix this is with `pshufb`.

>
> > Out of curiosity, is bench-memcpy-random performance also improved with
> > SSSE3? The jump table / branches generally look really nice in micro-benchmarks
> > but that may not be fully indicative of how it will perform in an
> > application.
>
> Bench-memcpy-random shows about a 5% performance drop for SSSE3:

Thanks.

>                   __memcpy_sse2_unaligned  __memcpy_ssse3  Improvement(ssse3 over sse2)
> length=32768                       805982          874585                        -8.51%
> length=65536                       885317          940458                        -6.23%
> length=131072                      929177          979173                        -5.38%
> length=262144                      980083         1033130                        -5.41%
> length=524288                     1042590         1095560                        -5.08%
> length=1048576                    1078020         1127990                        -4.64%
>
> > > I have attached the test results, hope this is what you want to see.
> > >
> > > > > > > It is strongly recommended to keep the SSSE3 version.
> > > > > > >
> > > > > > --
> > > > > > H.J.
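
A minimal sketch of how the "Improvement(ssse3 over sse2)" column in the bench-memcpy-random table quoted above relates to the two raw columns. It assumes the raw numbers are timings where lower is better (the thread does not state the unit) and that the column is (sse2 - ssse3) / sse2; the helper name `improvement_pct` is hypothetical and used only for illustration.

```python
# Timings copied from the bench-memcpy-random table quoted in the thread.
# Assumption: lower is better; the unit is not stated in the thread.
RESULTS = {
    32768: (805982, 874585),
    65536: (885317, 940458),
    131072: (929177, 979173),
    262144: (980083, 1033130),
    524288: (1042590, 1095560),
    1048576: (1078020, 1127990),
}


def improvement_pct(sse2: int, ssse3: int) -> float:
    """Relative improvement of SSSE3 over SSE2 in percent (hypothetical helper).

    A negative value means the SSSE3 variant took longer than SSE2.
    """
    return round((sse2 - ssse3) / sse2 * 100, 2)


if __name__ == "__main__":
    for length, (sse2, ssse3) in RESULTS.items():
        print(f"length={length}: {improvement_pct(sse2, ssse3):+.2f}%")
```

Under that assumption the formula reproduces each percentage in the quoted table, so every negative entry is an SSSE3 slowdown relative to `__memcpy_sse2_unaligned`.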