public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
To: Christoph Muellner <christoph.muellner@vrull.eu>,
	libc-alpha@sourceware.org, Palmer Dabbelt <palmer@dabbelt.com>,
	Darius Rad <darius@bluespec.com>,
	Andrew Waterman <andrew@sifive.com>, DJ Delorie <dj@redhat.com>,
	Vineet Gupta <vineetg@rivosinc.com>,
	Kito Cheng <kito.cheng@sifive.com>,
	Jeff Law <jeffreyalaw@gmail.com>,
	Philipp Tomsich <philipp.tomsich@vrull.eu>,
	Heiko Stuebner <heiko.stuebner@vrull.eu>
Subject: Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
Date: Tue, 7 Feb 2023 13:40:58 -0300	[thread overview]
Message-ID: <fb9ab059-3087-1e25-d6f8-41a1ec6b6ee9@linaro.org> (raw)
In-Reply-To: <20230207001618.458947-1-christoph.muellner@vrull.eu>



On 06/02/23 21:15, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> This RFC series introduces ifunc support for RISC-V and adds
> optimized routines of memset(), memcpy()/memmove(), strlen(),
> strcmp(), strncmp(), and cpu_relax().
> 
> The ifunc mechanism desides based on the following hart features:
> - Available extensions
> - Cache block size
> - Fast unaligned accesses
> 
> Since we don't have an interface to get this information from the
> kernel (at the moment), this patch uses environment variables instead,
> which is also why this patch should not be considered for upstream
> inclusion and is explicitly tagged as RFC.
> 
> The environment variables are:
> - RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
> - RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
> - RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
> - RISCV_RT_FAST_UNALIGNED (e.g. "1")
> 
> The environment variables are looked up and parsed early during
> startup, where other architectures query similar properties from
> the kernel or the CPU.
> The ifunc implementation can use test macros to select a matching
> implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

So now we have 3 different proposal mechanism to provide implementation runtime 
selection on riscv:

  1. The sysdep mechanism to select optimized routines based on compiler/ABI
     done at build time.  It is the current mechanism and it is also used
     on rvv routines [1].

  2. A ifunc one using a new riscv syscall to query the kernel the required 
     information.

  3. Another ifunc one using riscv specific environment variable.

Although all of them are interchangeable in a sense they can be used independently,
RISCV is following MIPS on having uncountable minor ABI variants due this exactly
available permutations.  This incurs in extra maintanance, extra documentation, 
extra testing, etc.

So I would like you RISCV arch-maintainers to first figure out what scheme you
want focus on, instead of trying to push multiple fronts with different ad-hoc 
schemes.

The first scheme, which is the oldest one used by architectures like arm, powerpc,
mips, etc. is the sysdep where you select the variant at build time.  It has the
advantage of no need to extra runtime cost or probing, and a slight code size 
reduction.  However it ties the ABI used to build glibc, which means you need 
multiple libc build if you targeting different chips/ABIs.

I recall that Red Hat and SuSE used to provided specialized glibc build for POWER
machines to try leverage new chips optimization (libm showed some gain, specially
back when ISA 2.05 added rounding instruction, and isa 2.07 GRP to FP special
register).  But I also recall that it was deprecated over using ifunc to optimize
only the required functions that does show performance improvement, since each
glibc build variantion required all the steps to validation.

And that's why aarch64 and x86_64 initially followed the patch to avoid using 
sysdeps folder and have a minimum default implementation that works on the minimum 
support ISA and provide any optimized variant through iFUNC.  And that is what I 
suggest you to do for *rvv*.

You can follow x86_64/s390 and add an extra optimization to only build certain 
variant if the ISA is high enough (for instance, if you targeting rbb, use it as 
default).  It requires a *lot* of boilerplate code, as you can see for the
x86_64-vX code recently; but it should integrate better with current ldconfig
and newer RPM support (and I expect that other packages managers to follow as
well).

And it lead us on *how* to select the ABI variants. I am sorry, but I will *block 
the new environment variables* as is. You might rework it through glibc hardware 
tunables [2], nevertheless I *strong* suggest you to *first* figure out the kernel
interface first prior starting working on providing the optimized routine in glibc.
The glibc tunable might then work a way to tune/test/filter the already in place
mechanism, ideally it should not rely on user intervention as default.

It was not clear from the 'hardware probing user interface' thread [3] why current 
Linux auxv advertise mechanism are not suffice enough for this specific interface 
(maybe you want something more generic like a cpuid-like interface).  It works for
aarch64 and powerpc, so I am not sure why RISCV can't start using it.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145102.html
[2] https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
[3] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/

> 
> The following optimized routines exist:
> - memset

It seems that main gain here is unaligned access, loop unrolling, and cache clear 
instruction.  Unfortuantely current implementation does not provide support for 
any of this, however I wonder if we could parametrize the generic implementation 
to allow at least some support for fast unaligned memory (we can factor cache clear 
as well).

I am working on refactor memcpy, memmove, memset, and memcmp to get rid of old
code and allow to work toward it.

> - memcpy/memmove

The generic implementation already does some loop unrolling, so I wonder if we can
improve the generic implementation by adding a swtich to assume unaligned access
(so there is no need to use the two load/merge strategy).

One advantage that is not easily reproducable on C is to branch to memcpy on memmove
if the copy shoud be fone fowards.  This is not easily done on generic implementation 
because we can't simply call memcpy in such case (since source and destiny can overlap
and it might call a memcpy routine that does not support it).

My approach on my generic refactor is just to remove the wordcopy and make memcpy
and memmove using the same strategy, but with different code.

> - strlen

The optimized routine seems quite similar to the generic one I installed recently [4],
which should use both cbz and orc.b with RISCV hooks [5]

[4] https://sourceware.org/git/?p=glibc.git;a=commit;h=350d8d13661a863e6b189f02d876fa265fe71302
[5] https://sourceware.org/git/?p=glibc.git;a=commit;h=25788431c0f5264c4830415de0cdd4d9926cbad9

> - strcmp
> - strncmp

The current generic implementations [6][7] now have a small advantage where unaligned
inputs are also improved by first aligning one input and operating with a double
load and merge comparision.

[6] https://sourceware.org/git/?p=glibc.git;a=commit;h=30cf54bf3072be942847400c1669bcd63aab039e
[7] https://sourceware.org/git/?p=glibc.git;a=commit;h=367c31b5d61164db97834917f5487094ebef2f58

> - cpu_relax
> 
> The following optimizations have been applied:
> - excessive loop unrolling
> - Zbb's orc.b instruction
> - Zbb's ctz intruction
> - Zicboz/Zic64b ability to clear a cache block in memory
> - Fast unaligned accesses (but with keeping exception guarantees intact)
> - Fast overlapping accesses
> 
> The patch was developed more than a year ago and was tested as part
> of a vendor SDK since then. One of the areas where this patchset
> was used is benchmarking (e.g. SPEC CPU2017).
> The optimized string functions have been tested with the glibc tests
> for that purpose.
> 
> The first patch of the series does not strictly belong to this series,
> but was required to build and test SPEC CPU2017 benchmarks.
> 
> To build a cross-toolchain that includes these patches,
> the riscv-gnu-toolchain or any other cross-toolchain
> builder can be used.
> 
> Christoph Müllner (19):
>   Inhibit early libcalls before ifunc support is ready
>   riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
>   riscv: Add ENTRY_ALIGN() macro
>   riscv: Add hart feature run-time detection framework
>   riscv: Introduction of ISA extensions
>   riscv: Adding ISA string parser for environment variables
>   riscv: hart-features: Add fast_unaligned property
>   riscv: Add (empty) ifunc framework
>   riscv: Add ifunc support for memset
>   riscv: Add accelerated memset routines for RV64
>   riscv: Add ifunc support for memcpy/memmove
>   riscv: Add accelerated memcpy/memmove routines for RV64
>   riscv: Add ifunc support for strlen
>   riscv: Add accelerated strlen routine
>   riscv: Add ifunc support for strcmp
>   riscv: Add accelerated strcmp routines
>   riscv: Add ifunc support for strncmp
>   riscv: Add an optimized strncmp routine
>   riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
> 
>  csu/libc-start.c                              |   1 +
>  elf/dl-support.c                              |   1 +
>  sysdeps/riscv/dl-machine.h                    |  13 +
>  sysdeps/riscv/ldsodefs.h                      |   1 +
>  sysdeps/riscv/multiarch/Makefile              |  24 +
>  sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
>  sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
>  sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
>  sysdeps/riscv/multiarch/init-arch.h           |  24 +
>  sysdeps/riscv/multiarch/memcpy.c              |  49 ++
>  sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
>  .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
>  sysdeps/riscv/multiarch/memmove.c             |  49 ++
>  sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/memset.c              |  52 ++
>  sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
>  .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
>  .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
>  sysdeps/riscv/multiarch/strcmp.c              |  47 ++
>  sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
>  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
>  sysdeps/riscv/multiarch/strlen.c              |  44 ++
>  sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
>  sysdeps/riscv/multiarch/strncmp.c             |  44 ++
>  sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
>  sysdeps/riscv/sys/asm.h                       |  14 +-
>  .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
>  sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
>  sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
>  .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
>  sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
>  .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
>  37 files changed, 2610 insertions(+), 5 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/Makefile
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
>  create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
>  create mode 100644 sysdeps/riscv/multiarch/init-arch.h
>  create mode 100644 sysdeps/riscv/multiarch/memcpy.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memmove.c
>  create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/strlen.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strncmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
> 

  parent reply	other threads:[~2023-02-07 16:41 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-07  0:15 Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 02/19] riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 03/19] riscv: Add ENTRY_ALIGN() macro Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 04/19] riscv: Add hart feature run-time detection framework Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 05/19] riscv: Introduction of ISA extensions Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables Christoph Muellner
2023-02-07  6:20   ` David Abdurachmanov
2023-02-07  0:16 ` [RFC PATCH 07/19] riscv: hart-features: Add fast_unaligned property Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 08/19] riscv: Add (empty) ifunc framework Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 09/19] riscv: Add ifunc support for memset Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64 Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 11/19] riscv: Add ifunc support for memcpy/memmove Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 13/19] riscv: Add ifunc support for strlen Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 14/19] riscv: Add accelerated strlen routine Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 15/19] riscv: Add ifunc support for strcmp Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 16/19] riscv: Add accelerated strcmp routines Christoph Muellner
2023-02-07 11:57   ` Xi Ruoyao
2023-02-07 14:15     ` Christoph Müllner
2023-03-31  5:06       ` Jeff Law
2023-03-31 12:31         ` Adhemerval Zanella Netto
2023-03-31 14:30           ` Jeff Law
2023-03-31 14:48             ` Adhemerval Zanella Netto
2023-03-31 17:19               ` Palmer Dabbelt
2023-03-31 14:32       ` Jeff Law
2023-02-07  0:16 ` [RFC PATCH 17/19] riscv: Add ifunc support for strncmp Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 18/19] riscv: Add an optimized strncmp routine Christoph Muellner
2023-02-07  1:19   ` Noah Goldstein
2023-02-08 15:13     ` Philipp Tomsich
2023-02-08 17:55       ` Palmer Dabbelt
2023-02-08 19:48         ` Adhemerval Zanella Netto
2023-02-08 18:04       ` Noah Goldstein
2023-02-07  0:16 ` [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops Christoph Muellner
2023-02-07  0:23   ` Andrew Waterman
2023-02-07  0:29     ` Christoph Müllner
2023-02-07  2:59 ` [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Kito Cheng
2023-02-07 16:40 ` Adhemerval Zanella Netto [this message]
2023-02-07 17:16   ` DJ Delorie
2023-02-07 19:32     ` Philipp Tomsich
2023-02-07 21:14       ` DJ Delorie
2023-02-08 11:26         ` Christoph Müllner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fb9ab059-3087-1e25-d6f8-41a1ec6b6ee9@linaro.org \
    --to=adhemerval.zanella@linaro.org \
    --cc=andrew@sifive.com \
    --cc=christoph.muellner@vrull.eu \
    --cc=darius@bluespec.com \
    --cc=dj@redhat.com \
    --cc=heiko.stuebner@vrull.eu \
    --cc=jeffreyalaw@gmail.com \
    --cc=kito.cheng@sifive.com \
    --cc=libc-alpha@sourceware.org \
    --cc=palmer@dabbelt.com \
    --cc=philipp.tomsich@vrull.eu \
    --cc=vineetg@rivosinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).