From: Adhemerval Zanella Netto
Organization: Linaro
To: Christoph Muellner, libc-alpha@sourceware.org, Palmer Dabbelt,
 Darius Rad, Andrew Waterman, DJ Delorie, Vineet Gupta, Kito Cheng,
 Jeff Law, Philipp Tomsich, Heiko Stuebner
Subject: Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
Date: Tue, 7 Feb 2023 13:40:58 -0300
References: <20230207001618.458947-1-christoph.muellner@vrull.eu>
In-Reply-To: <20230207001618.458947-1-christoph.muellner@vrull.eu>

On 06/02/23 21:15, Christoph Muellner wrote:
> From: Christoph Müllner
>
> This RFC series introduces ifunc support for RISC-V and adds
> optimized routines of memset(), memcpy()/memmove(), strlen(),
> strcmp(), strncmp(), and cpu_relax().
>
> The ifunc mechanism decides based on the following hart features:
> - Available extensions
> - Cache block size
> - Fast unaligned accesses
>
> Since we don't have an interface to get this information from the
> kernel (at the moment), this patch uses environment variables instead,
> which is also why this patch should not be considered for upstream
> inclusion and is explicitly tagged as RFC.
>
> The environment variables are:
> - RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
> - RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
> - RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
> - RISCV_RT_FAST_UNALIGNED (e.g. "1")
>
> The environment variables are looked up and parsed early during
> startup, where other architectures query similar properties from
> the kernel or the CPU.
> The ifunc implementation can use test macros to select a matching
> implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

So now we have three different proposed mechanisms to provide runtime
implementation selection on riscv:

  1. The sysdep mechanism, which selects optimized routines based on the
     compiler/ABI at build time.  It is the current mechanism and it is
     also used by the rvv routines [1].

  2. An ifunc one that uses a new riscv syscall to query the kernel for
     the required information.

  3. Another ifunc one that uses riscv-specific environment variables.

Although all of them are interchangeable, in the sense that they can be
used independently, RISC-V is following MIPS in having countless minor
ABI variants because of exactly these available permutations.  This
incurs extra maintenance, extra documentation, extra testing, etc.

So I would like you, the RISC-V arch maintainers, to first figure out
which scheme you want to focus on, instead of trying to push on multiple
fronts with different ad-hoc schemes.

The first scheme, which is the oldest one and is used by architectures
like arm, powerpc, mips, etc., is the sysdep one, where you select the
variant at build time.  It has the advantage of needing no extra runtime
cost or probing, and of a slight code size reduction.
However, it is tied to the ABI used to build glibc, which means you need
multiple libc builds if you are targeting different chips/ABIs.

I recall that Red Hat and SUSE used to provide specialized glibc builds
for POWER machines to try to leverage optimizations for new chips (libm
showed some gain, especially back when ISA 2.05 added rounding
instructions and ISA 2.07 added GPR to FP special register moves).  But
I also recall that this was deprecated in favor of using ifunc to
optimize only the functions that do show a performance improvement,
since each glibc build variant required all the validation steps.

And that's why aarch64 and x86_64 initially followed the path of
avoiding the sysdeps folder, having a minimal default implementation
that works on the minimum supported ISA and providing any optimized
variant through ifunc.  And that is what I suggest you do for *rvv*.

You can follow x86_64/s390 and add an extra optimization to only build
certain variants if the ISA is high enough (for instance, if you are
targeting rbb, use it as the default).  It requires a *lot* of
boilerplate code, as you can see in the recent x86_64-vX code; but it
should integrate better with the current ldconfig and newer RPM support
(and I expect other package managers to follow as well).

And that leads us to *how* to select the ABI variants.  I am sorry, but
I will *block the new environment variables* as they are.  You might
rework them through the glibc hardware tunables [2]; nevertheless, I
*strongly* suggest you *first* figure out the kernel interface before
starting to work on providing the optimized routines in glibc.  The
glibc tunable might then work as a way to tune/test/filter the mechanism
already in place; ideally, it should not rely on user intervention by
default.

It was not clear from the 'hardware probing user interface' thread [3]
why the current Linux auxv advertising mechanism is not sufficient for
this specific interface (maybe you want something more generic, like a
cpuid-like interface).
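
To make the auxv route concrete, here is a rough sketch of selecting a
variant from the hwcap word in the auxiliary vector.  It is only an
illustration: HWCAP_EXAMPLE_FAST_STRLEN and all the function names are
made up for this example, they are not taken from the patch series or
from any kernel header, and the "optimized" variant is just a stand-in
that defers to the libc strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */

/* Hypothetical feature bit, for illustration only.  Real bit
   assignments are defined per architecture by the kernel.  */
#define HWCAP_EXAMPLE_FAST_STRLEN (1UL << 0)

typedef size_t (*strlen_fn) (const char *);

/* Baseline variant that works on the minimum supported ISA.  */
static size_t
strlen_baseline (const char *s)
{
  const char *p = s;
  while (*p != '\0')
    ++p;
  return (size_t) (p - s);
}

/* Stand-in for an optimized variant; here it defers to libc.  */
static size_t
strlen_optimized (const char *s)
{
  return strlen (s);
}

/* Pick a variant from a hwcap word, the way an ifunc resolver would.  */
static strlen_fn
select_strlen (unsigned long hwcap)
{
  return (hwcap & HWCAP_EXAMPLE_FAST_STRLEN) != 0
	 ? strlen_optimized : strlen_baseline;
}

/* Read the hwcap word the kernel placed in the auxiliary vector.  */
static strlen_fn
select_strlen_from_auxv (void)
{
  return select_strlen (getauxval (AT_HWCAP));
}

/* Every variant must return the same result for the same input.  */
static void
check_variants_agree (const char *s)
{
  assert (strlen_baseline (s) == strlen_optimized (s));
}
```

Keeping select_strlen separate from the getauxval call means the
selection logic can be exercised with arbitrary hwcap values, which is
roughly the role a tunable could play for testing and filtering.
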
The auxv mechanism works for aarch64 and powerpc, so I am not sure why
RISC-V can't start using it.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145102.html
[2] https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
[3] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/

>
> The following optimized routines exist:
> - memset

It seems that the main gains here are unaligned access, loop unrolling,
and the cache clear instruction.  Unfortunately, the current
implementation does not provide support for any of this; however, I
wonder if we could parametrize the generic implementation to allow at
least some support for fast unaligned memory access (we can factor in
cache clear as well).  I am working on refactoring memcpy, memmove,
memset, and memcmp to get rid of old code and allow working toward this.

> - memcpy/memmove

The generic implementation already does some loop unrolling, so I wonder
if we can improve it by adding a switch to assume unaligned access (so
there is no need to use the two load/merge strategy).

One advantage that is not easily reproducible in C is branching to
memcpy from memmove if the copy should be done forwards.  This is not
easily done in the generic implementation because we can't simply call
memcpy in such a case (since source and destination can overlap and it
might call a memcpy routine that does not support that).

The approach in my generic refactor is just to remove the wordcopy and
make memcpy and memmove use the same strategy, but with different code.
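
To illustrate the forward/backward decision discussed above, here is a
minimal byte-wise sketch.  It is not the glibc code nor my refactor;
sketch_memmove, copy_forward, and copy_backward are names invented for
the example.  The forward path uses a private ascending loop instead of
calling memcpy, precisely because a real memcpy is not required to
tolerate overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ascending byte copy.  If the buffers overlap, this is safe only when
   the destination starts below the source.  */
static void
copy_forward (unsigned char *d, const unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];
}

/* Descending byte copy, for a destination that overlaps the tail of
   the source.  */
static void
copy_backward (unsigned char *d, const unsigned char *s, size_t n)
{
  while (n-- > 0)
    d[n] = s[n];
}

static void *
sketch_memmove (void *dst, const void *src, size_t n)
{
  assert (n == 0 || (dst != NULL && src != NULL));
  unsigned char *d = dst;
  const unsigned char *s = src;

  /* Unsigned compare: true when dst is below src (the difference wraps
     to a large value) or when dst is at least n bytes above src.  In
     both cases an ascending copy never reads a byte it already
     overwrote.  */
  if ((uintptr_t) d - (uintptr_t) s >= n)
    copy_forward (d, s, n);
  else
    copy_backward (d, s, n);
  return dst;
}
```

An assembly implementation can simply jump into its own memcpy body on
the forward path, since it knows how that body behaves on overlap; the C
version has to keep a separate forward loop with known semantics.
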
> - strlen

The optimized routine seems quite similar to the generic one I installed
recently [4], which should use both ctz and orc.b through the RISC-V
hooks [5].

[4] https://sourceware.org/git/?p=glibc.git;a=commit;h=350d8d13661a863e6b189f02d876fa265fe71302
[5] https://sourceware.org/git/?p=glibc.git;a=commit;h=25788431c0f5264c4830415de0cdd4d9926cbad9

> - strcmp
> - strncmp

The current generic implementations [6][7] now have a small advantage in
that unaligned inputs are also improved, by first aligning one input and
then operating with a double load and merge comparison.

[6] https://sourceware.org/git/?p=glibc.git;a=commit;h=30cf54bf3072be942847400c1669bcd63aab039e
[7] https://sourceware.org/git/?p=glibc.git;a=commit;h=367c31b5d61164db97834917f5487094ebef2f58

> - cpu_relax
>
> The following optimizations have been applied:
> - excessive loop unrolling
> - Zbb's orc.b instruction
> - Zbb's ctz instruction
> - Zicboz/Zic64b ability to clear a cache block in memory
> - Fast unaligned accesses (but with keeping exception guarantees intact)
> - Fast overlapping accesses
>
> The patch was developed more than a year ago and was tested as part
> of a vendor SDK since then. One of the areas where this patchset
> was used is benchmarking (e.g. SPEC CPU2017).
> The optimized string functions have been tested with the glibc tests
> for that purpose.
>
> The first patch of the series does not strictly belong to this series,
> but was required to build and test SPEC CPU2017 benchmarks.
>
> To build a cross-toolchain that includes these patches,
> the riscv-gnu-toolchain or any other cross-toolchain
> builder can be used.
>
> Christoph Müllner (19):
>   Inhibit early libcalls before ifunc support is ready
>   riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
>   riscv: Add ENTRY_ALIGN() macro
>   riscv: Add hart feature run-time detection framework
>   riscv: Introduction of ISA extensions
>   riscv: Adding ISA string parser for environment variables
>   riscv: hart-features: Add fast_unaligned property
>   riscv: Add (empty) ifunc framework
>   riscv: Add ifunc support for memset
>   riscv: Add accelerated memset routines for RV64
>   riscv: Add ifunc support for memcpy/memmove
>   riscv: Add accelerated memcpy/memmove routines for RV64
>   riscv: Add ifunc support for strlen
>   riscv: Add accelerated strlen routine
>   riscv: Add ifunc support for strcmp
>   riscv: Add accelerated strcmp routines
>   riscv: Add ifunc support for strncmp
>   riscv: Add an optimized strncmp routine
>   riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
>
>  csu/libc-start.c                              |   1 +
>  elf/dl-support.c                              |   1 +
>  sysdeps/riscv/dl-machine.h                    |  13 +
>  sysdeps/riscv/ldsodefs.h                      |   1 +
>  sysdeps/riscv/multiarch/Makefile              |  24 +
>  sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
>  sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
>  sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
>  sysdeps/riscv/multiarch/init-arch.h           |  24 +
>  sysdeps/riscv/multiarch/memcpy.c              |  49 ++
>  sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
>  .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
>  sysdeps/riscv/multiarch/memmove.c             |  49 ++
>  sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/memset.c              |  52 ++
>  sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
>  .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
>  .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
>  sysdeps/riscv/multiarch/strcmp.c              |  47 ++
>  sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
>  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
>  sysdeps/riscv/multiarch/strlen.c              |  44 ++
>  sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
>  sysdeps/riscv/multiarch/strncmp.c             |  44 ++
>  sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
>  sysdeps/riscv/sys/asm.h                       |  14 +-
>  .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
>  sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
>  sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
>  .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
>  sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
>  .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
>  37 files changed, 2610 insertions(+), 5 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/Makefile
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
>  create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
>  create mode 100644 sysdeps/riscv/multiarch/init-arch.h
>  create mode 100644 sysdeps/riscv/multiarch/memcpy.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memmove.c
>  create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/strlen.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strncmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
>