public inbox for libc-alpha@sourceware.org
* [PATCH 0/7] Add arc4random support
@ 2022-04-13 20:23 Adhemerval Zanella
  2022-04-13 20:23 ` [PATCH 1/7] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417) Adhemerval Zanella
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha

This patch adds the arc4random, arc4random_buf, and arc4random_uniform
along with optimized versions for x86_64, aarch64, and powerpc64.

The generic implementation is based on scalar ChaCha20, with a global
cache and locking.  It uses getrandom, with /dev/urandom as a fallback,
to obtain the initial entropy, and reseeds the internal state after
every 16 MB of consumed entropy.

It maintains an internal buffer which consumes at most one page on
most systems (assuming 4k pages).  The internal buffer reduces the
number of cipher encrypt calls by amortizing them across arc4random
calls (where both function call and lock costs are the dominating
factors).

Fork detection is done by checking whether MADV_WIPEONFORK is
supported.  If it is not, the fork callback will reset the state in
the child.  It does not handle direct clone calls, nor vfork or _Fork
(arc4random is not async-signal-safe due to the internal lock usage,
although the implementation does try to handle the fork case).

The generic ChaCha20 implementation is based on RFC 8439 [1] and is a
simple memcpy-with-xor implementation.  The optimized versions for
x86_64, aarch64, and powerpc64 use vectorized instructions and are
based on libgcrypt code.

This patchset differs from the previous ones by using a much simpler
fork-detection scheme (there is no attempt to use a global shared
counter to detect direct clone usage), and by using ChaCha20 instead
of AES.  ChaCha20 was chosen because it is the standard cipher in
other arc4random implementations (the BSDs, macOS) and, recently, in
the Linux random subsystem.  It is also a much simpler implementation
than AES and shows better performance when no specialized instructions
are present.
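
As a reference point, the RFC 8439 block function fits in a few lines of
portable C.  The sketch below is standalone (the function name is
illustrative and independent of the code in this series); it can be
checked against the block-function test vector from RFC 8439 section
2.3.2:

```c
#include <stdint.h>
#include <string.h>

/* One ChaCha20 quarter round (RFC 8439 section 2.1).  */
#define QR(a, b, c, d)                              \
  (a += b, d ^= a, d = d << 16 | d >> 16,           \
   c += d, b ^= c, b = b << 12 | b >> 20,           \
   a += b, d ^= a, d = d << 8  | d >> 24,           \
   c += d, b ^= c, b = b << 7  | b >> 25)

/* Compute one 64-byte keystream block from a 16-word state
   (constants, key, counter, nonce), serialized little-endian.  */
static void
chacha20_block_sketch (const uint32_t in[16], uint8_t out[64])
{
  uint32_t x[16];
  memcpy (x, in, sizeof x);

  /* 20 rounds: 10 iterations of a column round plus a diagonal round.  */
  for (int i = 0; i < 10; i++)
    {
      QR (x[0], x[4], x[8],  x[12]);
      QR (x[1], x[5], x[9],  x[13]);
      QR (x[2], x[6], x[10], x[14]);
      QR (x[3], x[7], x[11], x[15]);
      QR (x[0], x[5], x[10], x[15]);
      QR (x[1], x[6], x[11], x[12]);
      QR (x[2], x[7], x[8],  x[13]);
      QR (x[3], x[4], x[9],  x[14]);
    }

  /* Add the input state and serialize each word little-endian.  */
  for (int i = 0; i < 16; i++)
    {
      uint32_t v = x[i] + in[i];
      out[4 * i + 0] = v;
      out[4 * i + 1] = v >> 8;
      out[4 * i + 2] = v >> 16;
      out[4 * i + 3] = v >> 24;
    }
}
```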

One possible improvement, not implemented in this patchset, is to use
a per-thread cache, since on some architectures the lock cost is
somewhat high.  Ideally it would reside in the TCB to avoid requiring
static TLS size tuning, and it would work similarly to the malloc
tcache: arc4random would initially consume any thread-local entropy,
thus avoiding any locking.

[1] https://datatracker.ietf.org/doc/html/rfc8439
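
The per-thread cache mentioned above could look roughly like the sketch
below.  It is purely illustrative: the names, the cache size, and the
refill helper are assumptions, not part of this patchset (the real
version would live in the TCB and refill from the locked global
generator):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCACHE_SIZE 64

/* Hypothetical per-thread buffer; most calls consume it without
   taking the global lock.  */
static __thread uint8_t tcache[TCACHE_SIZE];
static __thread size_t tcache_have;

/* Stand-in for the locked global generator: a deterministic filler so
   this sketch is self-contained.  The real helper would take the
   arc4random lock and run ChaCha20.  */
static void
slow_locked_fill (void *buf, size_t len)
{
  static uint8_t ctr;  /* Not thread-safe; fine for a sketch.  */
  uint8_t *p = buf;
  for (size_t i = 0; i < len; i++)
    p[i] = ctr++;
}

static uint32_t
arc4random_tcache (void)
{
  if (tcache_have < sizeof (uint32_t))
    {
      /* Slow path: refill the whole per-thread cache under the lock.  */
      slow_locked_fill (tcache, TCACHE_SIZE);
      tcache_have = TCACHE_SIZE;
    }
  uint32_t r;
  uint8_t *p = tcache + TCACHE_SIZE - tcache_have;
  memcpy (&r, p, sizeof r);
  memset (p, 0, sizeof r);  /* Erase consumed bytes, as the patch does.  */
  tcache_have -= sizeof r;
  return r;
}
```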

Adhemerval Zanella (7):
  stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ
    #4417)
  stdlib: Add arc4random tests
  benchtests: Add arc4random benchtest
  x86: Add SSSE3 optimized chacha20
  x86: Add AVX2 optimized chacha20
  aarch64: Add optimized chacha20
  powerpc64: Add optimized chacha20

 LICENSES                                      |  21 ++
 NEWS                                          |   4 +-
 benchtests/Makefile                           |   6 +-
 benchtests/bench-arc4random.c                 | 243 ++++++++++++
 include/stdlib.h                              |  13 +
 posix/fork.c                                  |   2 +
 stdlib/Makefile                               |   6 +
 stdlib/Versions                               |   5 +
 stdlib/arc4random.c                           | 242 ++++++++++++
 stdlib/arc4random_uniform.c                   | 152 ++++++++
 stdlib/chacha20.c                             | 214 +++++++++++
 stdlib/stdlib.h                               |  14 +
 stdlib/tst-arc4random-chacha20.c              | 225 +++++++++++
 stdlib/tst-arc4random-fork.c                  | 174 +++++++++
 stdlib/tst-arc4random-stats.c                 | 146 +++++++
 stdlib/tst-arc4random-thread.c                | 278 ++++++++++++++
 sysdeps/aarch64/Makefile                      |   4 +
 sysdeps/aarch64/chacha20.S                    | 357 ++++++++++++++++++
 sysdeps/aarch64/chacha20_arch.h               |  43 +++
 sysdeps/generic/chacha20_arch.h               |  24 ++
 sysdeps/generic/not-cancel.h                  |   2 +
 sysdeps/mach/hurd/i386/libc.abilist           |   3 +
 sysdeps/mach/hurd/not-cancel.h                |   3 +
 sysdeps/powerpc/powerpc64/Makefile            |   3 +
 sysdeps/powerpc/powerpc64/chacha-ppc.c        | 254 +++++++++++++
 sysdeps/powerpc/powerpc64/chacha20_arch.h     |  53 +++
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |   3 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/arc/libc.abilist      |   3 +
 sysdeps/unix/sysv/linux/arm/be/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/arm/le/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/csky/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |   3 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |   3 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |   3 +
 .../sysv/linux/microblaze/be/libc.abilist     |   3 +
 .../sysv/linux/microblaze/le/libc.abilist     |   3 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |   3 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |   3 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |   3 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/not-cancel.h          |   7 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist     |   3 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |   3 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |   3 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |   3 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |   3 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   3 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   3 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist |   3 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |   3 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist    |   3 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |   3 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |   3 +
 .../unix/sysv/linux/x86_64/64/libc.abilist    |   3 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |   3 +
 sysdeps/x86_64/Makefile                       |   7 +
 sysdeps/x86_64/chacha20-avx2.S                | 317 ++++++++++++++++
 sysdeps/x86_64/chacha20-ssse3.S               | 330 ++++++++++++++++
 sysdeps/x86_64/chacha20_arch.h                |  56 +++
 64 files changed, 3305 insertions(+), 2 deletions(-)
 create mode 100644 benchtests/bench-arc4random.c
 create mode 100644 stdlib/arc4random.c
 create mode 100644 stdlib/arc4random_uniform.c
 create mode 100644 stdlib/chacha20.c
 create mode 100644 stdlib/tst-arc4random-chacha20.c
 create mode 100644 stdlib/tst-arc4random-fork.c
 create mode 100644 stdlib/tst-arc4random-stats.c
 create mode 100644 stdlib/tst-arc4random-thread.c
 create mode 100644 sysdeps/aarch64/chacha20.S
 create mode 100644 sysdeps/aarch64/chacha20_arch.h
 create mode 100644 sysdeps/generic/chacha20_arch.h
 create mode 100644 sysdeps/powerpc/powerpc64/chacha-ppc.c
 create mode 100644 sysdeps/powerpc/powerpc64/chacha20_arch.h
 create mode 100644 sysdeps/x86_64/chacha20-avx2.S
 create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
 create mode 100644 sysdeps/x86_64/chacha20_arch.h

-- 
2.32.0



* [PATCH 1/7] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417)
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
@ 2022-04-13 20:23 ` Adhemerval Zanella
  2022-04-13 20:23 ` [PATCH 2/7] stdlib: Add arc4random tests Adhemerval Zanella
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha; +Cc: Florian Weimer

The implementation is based on scalar ChaCha20, with a global cache
and locking.  It uses getrandom, with /dev/urandom as a fallback, to
obtain the initial entropy, and reseeds the internal state after every
16 MB of consumed output.

It maintains an internal buffer which consumes at most one page on
most systems (assuming a minimum page size of 4k).  The internal
buffer reduces the number of cipher encrypt calls by amortizing them
across arc4random calls (where both function call and lock costs are
the dominating factors).

The ChaCha20 implementation is based on RFC 8439 [1] and is a simple
memcpy-with-xor implementation.

arc4random_uniform is based on previous work by Florian Weimer.

Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.

Co-authored-by: Florian Weimer <fweimer@redhat.com>

[1] https://datatracker.ietf.org/doc/html/rfc8439
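
For context, the reason arc4random_uniform exists is that the naive
`arc4random () % n` is biased whenever n does not divide 2**32.  The
sketch below shows the classic rejection approach to removing that bias;
it is not the Lumbroso algorithm this patch uses, and the xorshift
source is only a deterministic stand-in for illustration:

```c
#include <stdint.h>

/* Deterministic stand-in for the random source (xorshift32),
   purely illustrative.  */
static uint32_t rng_state = 12345;

static uint32_t
next32 (void)
{
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 17;
  rng_state ^= rng_state << 5;
  return rng_state;
}

/* Bias-free bounded generation by rejection: values in the final
   partial copy of [0, n) at the top of the 32-bit range are
   discarded, so every residue is equally likely.  */
static uint32_t
uniform_below (uint32_t n)
{
  if (n < 2)
    return 0;

  /* -n is 2**32 - n in uint32_t arithmetic, so -n % n == 2**32 mod n.
     Accepting only r >= min leaves a range whose size is a multiple
     of n.  */
  uint32_t min = -n % n;
  uint32_t r;
  do
    r = next32 ();
  while (r < min);
  return r % n;
}
```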
---
 NEWS                                          |   4 +-
 include/stdlib.h                              |  13 +
 posix/fork.c                                  |   2 +
 stdlib/Makefile                               |   2 +
 stdlib/Versions                               |   5 +
 stdlib/arc4random.c                           | 242 ++++++++++++++++++
 stdlib/arc4random_uniform.c                   | 152 +++++++++++
 stdlib/chacha20.c                             | 211 +++++++++++++++
 stdlib/stdlib.h                               |  14 +
 sysdeps/generic/not-cancel.h                  |   2 +
 sysdeps/mach/hurd/i386/libc.abilist           |   3 +
 sysdeps/mach/hurd/not-cancel.h                |   3 +
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |   3 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/arc/libc.abilist      |   3 +
 sysdeps/unix/sysv/linux/arm/be/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/arm/le/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/csky/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |   3 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |   3 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |   3 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |   3 +
 .../sysv/linux/microblaze/be/libc.abilist     |   3 +
 .../sysv/linux/microblaze/le/libc.abilist     |   3 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |   3 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |   3 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |   3 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |   3 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/not-cancel.h          |   7 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist     |   3 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |   3 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |   3 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |   3 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |   3 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   3 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   3 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist |   3 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |   3 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist    |   3 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist    |   3 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |   3 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |   3 +
 .../unix/sysv/linux/x86_64/64/libc.abilist    |   3 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |   3 +
 46 files changed, 758 insertions(+), 1 deletion(-)
 create mode 100644 stdlib/arc4random.c
 create mode 100644 stdlib/arc4random_uniform.c
 create mode 100644 stdlib/chacha20.c

diff --git a/NEWS b/NEWS
index 4b6d9de2b5..4d9d95b35b 100644
--- a/NEWS
+++ b/NEWS
@@ -9,7 +9,9 @@ Version 2.36
 
 Major new features:
 
-  [Add new features here]
+* The functions arc4random, arc4random_buf, and arc4random_uniform have been
+  added.  The functions use a cryptographic pseudo-random number generator
+  based on ChaCha20 initialized with entropy from the kernel.
 
 Deprecated and removed features, and other changes affecting compatibility:
 
diff --git a/include/stdlib.h b/include/stdlib.h
index 1c6f70b082..055f9d2965 100644
--- a/include/stdlib.h
+++ b/include/stdlib.h
@@ -144,6 +144,19 @@ libc_hidden_proto (__ptsname_r)
 libc_hidden_proto (grantpt)
 libc_hidden_proto (unlockpt)
 
+__typeof (arc4random) __arc4random;
+libc_hidden_proto (__arc4random);
+__typeof (arc4random_buf) __arc4random_buf;
+libc_hidden_proto (__arc4random_buf);
+__typeof (arc4random_uniform) __arc4random_uniform;
+libc_hidden_proto (__arc4random_uniform);
+extern void __arc4random_buf_internal (void *buffer, size_t len)
+     attribute_hidden;
+/* Called from the fork function to reinitialize the internal lock in the
+   child process.  This avoids deadlocks if fork is called in multi-threaded
+   processes.  */
+extern void __arc4random_fork_subprocess (void) attribute_hidden;
+
 extern double __strtod_internal (const char *__restrict __nptr,
 				 char **__restrict __endptr, int __group)
      __THROW __nonnull ((1)) __wur;
diff --git a/posix/fork.c b/posix/fork.c
index 6b50c091f9..87d8329b46 100644
--- a/posix/fork.c
+++ b/posix/fork.c
@@ -96,6 +96,8 @@ __libc_fork (void)
 				     &nss_database_data);
 	}
 
+      call_function_static_weak (__arc4random_fork_subprocess);
+
       /* Reset the lock the dynamic loader uses to protect its data.  */
       __rtld_lock_initialize (GL(dl_load_lock));
 
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 60fc59c12c..9f9cc1bd7f 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -53,6 +53,8 @@ routines := \
   a64l \
   abort \
   abs \
+  arc4random \
+  arc4random_uniform \
   at_quick_exit \
   atof \
   atoi \
diff --git a/stdlib/Versions b/stdlib/Versions
index 5e9099a153..d09a308fb5 100644
--- a/stdlib/Versions
+++ b/stdlib/Versions
@@ -136,6 +136,11 @@ libc {
     strtof32; strtof64; strtof32x;
     strtof32_l; strtof64_l; strtof32x_l;
   }
+  GLIBC_2.36 {
+    arc4random;
+    arc4random_buf;
+    arc4random_uniform;
+  }
   GLIBC_PRIVATE {
     # functions which have an additional interface since they are
     # are cancelable.
diff --git a/stdlib/arc4random.c b/stdlib/arc4random.c
new file mode 100644
index 0000000000..6653986cc4
--- /dev/null
+++ b/stdlib/arc4random.c
@@ -0,0 +1,242 @@
+/* Pseudo Random Number Generator based on ChaCha20.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <libc-lock.h>
+#include <not-cancel.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/param.h>
+#include <sys/random.h>
+
+#include <chacha20.c>
+
+/* Besides the cipher state 'ctx', it keeps two counters: 'have' is the
+   number of valid bytes not yet consumed in 'buf', while 'count' is the
+   maximum number of bytes until a reseed.
+
+   Both the initial seed and the reseed try to obtain entropy from the
+   kernel and abort the process if none could be obtained.
+
+   The state 'buf' improves the usage of the cipher calls, allowing
+   optimized implementations to be used (if the architecture provides one)
+   and amortizing arc4random calls (only some calls encrypt a new block).  */
+
+struct arc4random_state
+{
+  struct chacha20_state ctx;
+  size_t have;
+  size_t count;
+  uint8_t buf[CHACHA20_BUFSIZE];
+} *state;
+
+/* Indicates that MADV_WIPEONFORK is supported by the kernel, in which case
+   the fork callback does not need to clear the internal state.  */
+static bool __arc4random_wipeonfork = false;
+
+__libc_lock_define_initialized (, arc4random_lock);
+
+/* Maximum number of bytes until reseed (16 MB).  */
+#define CHACHE_RESEED_SIZE	(16 * 1024 * 1024)
+
+/* Called from the fork function to reset the state if MADV_WIPEONFORK is
+   not supported and to reinit the internal lock.  */
+void
+__arc4random_fork_subprocess (void)
+{
+  if (!__arc4random_wipeonfork && state != NULL)
+    memset (state, 0, sizeof (struct arc4random_state));
+
+  __libc_lock_init (arc4random_lock);
+}
+
+static void
+arc4random_allocate_failure (void)
+{
+  __libc_fatal ("Fatal glibc error: Cannot allocate memory for arc4random\n");
+}
+
+static void
+arc4random_getrandom_failure (void)
+{
+  __libc_fatal ("Fatal glibc error: Cannot get entropy for arc4random\n");
+}
+
+/* Fork detection is done by checking whether MADV_WIPEONFORK is supported.
+   If it is not, the fork callback will reset the state in the child.  It
+   does not handle direct clone calls, nor vfork or _Fork (arc4random is
+   not async-signal-safe due to the internal lock usage).  */
+static void
+arc4random_init (uint8_t *buf, size_t len)
+{
+  state = __mmap (NULL, sizeof (struct arc4random_state),
+		  PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+  if (state == MAP_FAILED)
+    arc4random_allocate_failure ();
+
+#ifdef MADV_WIPEONFORK
+  int r = __madvise (state, sizeof (struct arc4random_state), MADV_WIPEONFORK);
+  if (r == 0)
+    __arc4random_wipeonfork = true;
+  else if (errno != EINVAL)
+    arc4random_allocate_failure ();
+#endif
+
+  chacha20_init (&state->ctx, buf, buf + CHACHA20_KEY_SIZE);
+}
+
+#define min(x,y) (((x) > (y)) ? (y) : (x))
+
+static void
+arc4random_rekey (uint8_t *rnd, size_t rndlen)
+{
+  memset (state->buf, 0, sizeof state->buf);
+  chacha20_crypt (&state->ctx, state->buf, state->buf, sizeof state->buf);
+
+  /* Mix some extra entropy if provided.  */
+  if (rnd != NULL)
+    {
+      size_t m = min (rndlen, CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE);
+      for (size_t i = 0; i < m; i++)
+	state->buf[i] ^= rnd[i];
+    }
+
+  /* Immediately reinit for backtracking resistance.  */
+  chacha20_init (&state->ctx, state->buf, state->buf + CHACHA20_KEY_SIZE);
+  memset (state->buf, 0, CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE);
+  state->have = sizeof (state->buf) - (CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE);
+}
+
+static void
+arc4random_getentropy (uint8_t *rnd, size_t len)
+{
+  if (__getrandomn_nocancel (rnd, len, GRND_NONBLOCK) == len)
+    return;
+
+  int fd = __open64_nocancel ("/dev/urandom", O_RDONLY);
+  if (fd != -1)
+    {
+      unsigned char *p = rnd;
+      unsigned char *end = p + len;
+      do
+	{
+	  ssize_t ret = TEMP_FAILURE_RETRY (__read_nocancel (fd, p, end - p));
+	  if (ret <= 0)
+	    arc4random_getrandom_failure ();
+	  p += ret;
+	}
+      while (p < end);
+
+      if (__close_nocancel (fd) != 0)
+	return;
+    }
+  arc4random_getrandom_failure ();
+}
+
+/* Either allocate the state buffer or reinitialize it by reseeding the
+   cipher state with kernel entropy.  */
+static void
+arc4random_stir (void)
+{
+  uint8_t rnd[CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE];
+  arc4random_getentropy (rnd, sizeof rnd);
+
+  if (state == NULL)
+    arc4random_init (rnd, sizeof rnd);
+  else
+    arc4random_rekey (rnd, sizeof rnd);
+
+  explicit_bzero (rnd, sizeof rnd);
+
+  state->have = 0;
+  memset (state->buf, 0, sizeof state->buf);
+  state->count = CHACHE_RESEED_SIZE;
+}
+
+static void
+arc4random_check_stir (size_t len)
+{
+  if (state == NULL || state->count < len)
+    arc4random_stir ();
+  if (state->count <= len)
+    state->count = 0;
+  else
+    state->count -= len;
+}
+
+void
+__arc4random_buf_internal (void *buffer, size_t len)
+{
+  arc4random_check_stir (len);
+
+  while (len > 0)
+    {
+      if (state->have > 0)
+	{
+	  size_t m = min (len, state->have);
+	  uint8_t *ks = state->buf + sizeof (state->buf) - state->have;
+	  memcpy (buffer, ks, m);
+	  memset (ks, 0, m);
+	  buffer += m;
+	  len -= m;
+	  state->have -= m;
+	}
+      if (state->have == 0)
+	arc4random_rekey (NULL, 0);
+    }
+}
+
+void
+__arc4random_buf (void *buffer, size_t len)
+{
+  __libc_lock_lock (arc4random_lock);
+  __arc4random_buf_internal (buffer, len);
+  __libc_lock_unlock (arc4random_lock);
+}
+libc_hidden_def (__arc4random_buf)
+weak_alias (__arc4random_buf, arc4random_buf)
+
+
+static uint32_t
+__arc4random_internal (void)
+{
+  uint32_t r;
+
+  arc4random_check_stir (sizeof (uint32_t));
+  if (state->have < sizeof (uint32_t))
+    arc4random_rekey (NULL, 0);
+  uint8_t *ks = state->buf + sizeof (state->buf) - state->have;
+  memcpy (&r, ks, sizeof (uint32_t));
+  memset (ks, 0, sizeof (uint32_t));
+  state->have -= sizeof (uint32_t);
+
+  return r;
+}
+
+uint32_t
+__arc4random (void)
+{
+  uint32_t r;
+  __libc_lock_lock (arc4random_lock);
+  r = __arc4random_internal ();
+  __libc_lock_unlock (arc4random_lock);
+  return r;
+}
+libc_hidden_def (__arc4random)
+weak_alias (__arc4random, arc4random)
diff --git a/stdlib/arc4random_uniform.c b/stdlib/arc4random_uniform.c
new file mode 100644
index 0000000000..0cc919d8e1
--- /dev/null
+++ b/stdlib/arc4random_uniform.c
@@ -0,0 +1,152 @@
+/* Pseudo-random number generation, uniformly distributed between 0 and
+   upper_bound - 1 (inclusive).
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <endian.h>
+#include <libc-lock.h>
+#include <stdlib.h>
+#include <sys/param.h>
+
+/* Return the number of bytes which cover values up to the limit.  */
+__attribute__ ((const))
+static uint32_t
+byte_count (uint32_t n)
+{
+  if (n <= (1U << 8))
+    return 1;
+  else if (n <= (1U << 16))
+    return 2;
+  else if (n <= (1U << 24))
+    return 3;
+  else
+    return 4;
+}
+
+/* Fill the lower bits of the result with randomness, according to the
+   number of bytes requested.  */
+static void
+random_bytes (uint32_t *result, uint32_t byte_count)
+{
+  *result = 0;
+  unsigned char *ptr = (unsigned char *) result;
+  if (__BYTE_ORDER == __BIG_ENDIAN)
+    ptr += 4 - byte_count;
+  __arc4random_buf_internal (ptr, byte_count);
+}
+
+static uint32_t
+compute_uniform (uint32_t n)
+{
+  if (n <= 1)
+    /* There is no valid return value for a zero limit, and 0 is the
+       only possible result for limit 1.  */
+    return 0;
+
+  /* The bits variable serves as a source for bits.  Prefetch the
+     minimum number of bytes needed.  */
+  unsigned count = byte_count (n);
+  uint32_t bits_length = count * CHAR_BIT;
+  uint32_t bits;
+  random_bytes (&bits, count);
+
+  /* Powers of two are easy.  */
+  if (powerof2 (n))
+    return bits & (n - 1);
+
+  /* The general case.  This algorithm follows Jérémie Lumbroso,
+     Optimal Discrete Uniform Generation from Coin Flips, and
+     Applications (2013), who credits Donald E. Knuth and Andrew
+     C. Yao, The complexity of nonuniform random number generation
+     (1976), for solving the general case.
+
+     The implementation below unrolls the initialization stage of the
+     loop, where v is less than n.  */
+
+  /* Use 64-bit variables even though the intermediate results are
+     never larger than 33 bits.  This makes the code easier to
+     compile on 64-bit architectures.  */
+  uint64_t v;
+  uint64_t c;
+
+  /* Initialize v and c.  v is the smallest power of 2 which is larger
+     than n.  */
+  {
+    uint32_t log2p1 = 32 - __builtin_clz (n);
+    v = 1ULL << log2p1;
+    c = bits & (v - 1);
+    bits >>= log2p1;
+    bits_length -= log2p1;
+  }
+
+  /* At the start of the loop, c is uniformly distributed within the
+     half-open interval [0, v), and v < 2n < 2**33.  */
+  while (true)
+    {
+      if (v >= n)
+        {
+          /* If the candidate is less than n, accept it.  */
+          if (c < n)
+            /* c is uniformly distributed on [0, n).  */
+            return c;
+          else
+            {
+              /* c is uniformly distributed on [n, v).  */
+              v -= n;
+              c -= n;
+              /* The distribution was shifted, so c is uniformly
+                 distributed on [0, v) again.  */
+            }
+        }
+      /* v < n here.  */
+
+      /* Replenish the bit source if necessary.  */
+      if (bits_length == 0)
+        {
+          /* Overwrite the least significant byte.  */
+	  random_bytes (&bits, 1);
+	  bits_length = CHAR_BIT;
+        }
+
+      /* Double the range.  No overflow because v < n < 2**32.  */
+      v *= 2;
+      /* v < 2n here.  */
+
+      /* Extract a bit and append it to c.  c remains less than v and
+         thus 2**33.  */
+      c = (c << 1) | (bits & 1);
+      bits >>= 1;
+      --bits_length;
+
+      /* At this point, c is uniformly distributed on [0, v) again,
+         and v < 2n < 2**33.  */
+    }
+}
+
+__libc_lock_define (extern , arc4random_lock attribute_hidden)
+
+uint32_t
+__arc4random_uniform (uint32_t upper_bound)
+{
+  uint32_t r;
+  __libc_lock_lock (arc4random_lock);
+  r = compute_uniform (upper_bound);
+  __libc_lock_unlock (arc4random_lock);
+  return r;
+}
+libc_hidden_def (__arc4random_uniform)
+weak_alias (__arc4random_uniform, arc4random_uniform)
diff --git a/stdlib/chacha20.c b/stdlib/chacha20.c
new file mode 100644
index 0000000000..dbd87bd942
--- /dev/null
+++ b/stdlib/chacha20.c
@@ -0,0 +1,211 @@
+/* Generic ChaCha20 implementation (used by arc4random).
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <array_length.h>
+#include <endian.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+/* 32-bit stream position, then 96-bit nonce.  */
+#define CHACHA20_IV_SIZE	16
+#define CHACHA20_KEY_SIZE	32
+#define CHACHA20_KEY_WORDS	(CHACHA20_KEY_SIZE / sizeof (uint32_t))
+
+#define CHACHA20_BLOCK_SIZE     64
+#define CHACHA20_BUFSIZE        (16 * CHACHA20_BLOCK_SIZE)
+
+#define CHACHA20_STATE_LEN	16
+
+enum chacha20_constants
+{
+  CHACHA20_CONSTANT_EXPA = 0x61707865U,
+  CHACHA20_CONSTANT_ND_3 = 0x3320646eU,
+  CHACHA20_CONSTANT_2_BY = 0x79622d32U,
+  CHACHA20_CONSTANT_TE_K = 0x6b206574U
+};
+
+struct chacha20_state
+{
+  uint32_t ctx[CHACHA20_STATE_LEN];
+};
+
+#define READ_UNALIGNED_FUNC(type)		\
+  static inline uint##type##_t			\
+  read_unaligned_##type (const uint8_t *p)	\
+  {						\
+    uint##type##_t r;				\
+    memcpy (&r, p, sizeof (r));			\
+    return r;					\
+  }
+READ_UNALIGNED_FUNC(16)
+READ_UNALIGNED_FUNC(32)
+READ_UNALIGNED_FUNC(64)
+
+#define WRITE_UNALIGNED_FUNC(type)			\
+  static inline void					\
+  write_unaligned_##type (uint8_t *p, uint##type##_t v)	\
+  {							\
+    memcpy (p, &v, sizeof (v));				\
+  }
+WRITE_UNALIGNED_FUNC(16)
+WRITE_UNALIGNED_FUNC(32)
+WRITE_UNALIGNED_FUNC(64)
+
+static inline uint32_t
+read_unaligned_le32 (const uint8_t *p)
+{
+  uint32_t v = read_unaligned_32 (p);
+#if __BYTE_ORDER == __BIG_ENDIAN
+  return __builtin_bswap32 (v);
+#else
+  return v;
+#endif
+}
+
+static inline void
+write_unaligned_le32 (uint8_t *p, uint32_t v)
+{
+#if __BYTE_ORDER == __BIG_ENDIAN
+  v =  __builtin_bswap32 (v);
+#endif
+  write_unaligned_32 (p, v);
+}
+
+static inline void
+chacha20_init (struct chacha20_state *s, const uint8_t *key, const uint8_t *iv)
+{
+  s->ctx[0]  = CHACHA20_CONSTANT_EXPA;
+  s->ctx[1]  = CHACHA20_CONSTANT_ND_3;
+  s->ctx[2]  = CHACHA20_CONSTANT_2_BY;
+  s->ctx[3]  = CHACHA20_CONSTANT_TE_K;
+
+  s->ctx[4]  = read_unaligned_le32 (key + 0 * sizeof (uint32_t));
+  s->ctx[5]  = read_unaligned_le32 (key + 1 * sizeof (uint32_t));
+  s->ctx[6]  = read_unaligned_le32 (key + 2 * sizeof (uint32_t));
+  s->ctx[7]  = read_unaligned_le32 (key + 3 * sizeof (uint32_t));
+  s->ctx[8]  = read_unaligned_le32 (key + 4 * sizeof (uint32_t));
+  s->ctx[9]  = read_unaligned_le32 (key + 5 * sizeof (uint32_t));
+  s->ctx[10] = read_unaligned_le32 (key + 6 * sizeof (uint32_t));
+  s->ctx[11] = read_unaligned_le32 (key + 7 * sizeof (uint32_t));
+
+  s->ctx[12] = read_unaligned_le32 (iv + 0 * sizeof (uint32_t));
+  s->ctx[13] = read_unaligned_le32 (iv + 1 * sizeof (uint32_t));
+  s->ctx[14] = read_unaligned_le32 (iv + 2 * sizeof (uint32_t));
+  s->ctx[15] = read_unaligned_le32 (iv + 3 * sizeof (uint32_t));
+}
+
+static inline uint32_t
+rotl32 (unsigned int shift, uint32_t word)
+{
+  return (word << (shift & 31)) | (word >> ((-shift) & 31));
+}
+
+#define QROUND(x0, x1, x2, x3) 			\
+  do {						\
+   x0 = x0 + x1; x3 = rotl32 (16, (x0 ^ x3)); 	\
+   x2 = x2 + x3; x1 = rotl32 (12, (x1 ^ x2)); 	\
+   x0 = x0 + x1; x3 = rotl32 (8,  (x0 ^ x3));	\
+   x2 = x2 + x3; x1 = rotl32 (7,  (x1 ^ x2));	\
+  } while(0)
+
+static inline void
+chacha20_block (uint32_t *state, uint8_t *stream)
+{
+  uint32_t x[CHACHA20_STATE_LEN];
+  memcpy (x, state, sizeof x);
+
+  for (int i = 0; i < 20; i += 2)
+    {
+      QROUND(x[0], x[4], x[8],  x[12]);
+      QROUND(x[1], x[5], x[9],  x[13]);
+      QROUND(x[2], x[6], x[10], x[14]);
+      QROUND(x[3], x[7], x[11], x[15]);
+
+      QROUND(x[0], x[5], x[10], x[15]);
+      QROUND(x[1], x[6], x[11], x[12]);
+      QROUND(x[2], x[7], x[8],  x[13]);
+      QROUND(x[3], x[4], x[9],  x[14]);
+    }
+
+  for (int i = 0; i < CHACHA20_STATE_LEN; i++)
+    {
+      uint32_t v = x[i] + state[i];
+      write_unaligned_le32 (&stream[i * sizeof (uint32_t)], v);
+    }
+
+  state[12]++;
+}
+
+static void
+memxorcpy (uint8_t *dst, const uint8_t *src1, const uint8_t *src2, size_t len)
+{
+  while (len >= 8)
+    {
+      uint64_t l = read_unaligned_64 (src1) ^ read_unaligned_64 (src2);
+      write_unaligned_64 (dst, l);
+      dst += 8;
+      src1 += 8;
+      src2 += 8;
+      len -= 8;
+    }
+
+  if (len >= 4)
+    {
+      uint32_t l = read_unaligned_32 (src1) ^ read_unaligned_32 (src2);
+      write_unaligned_32 (dst, l);
+      dst += 4;
+      src1 += 4;
+      src2 += 4;
+      len -= 4;
+    }
+
+  if (len >= 2)
+    {
+      uint16_t l = read_unaligned_16 (src1) ^ read_unaligned_16 (src2);
+      write_unaligned_16 (dst, l);
+      dst += 2;
+      src1 += 2;
+      src2 += 2;
+      len -= 2;
+    }
+
+  if (len >= 1)
+    *dst++ = *src1++ ^ *src2++;
+}
+
+static void
+chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
+		const uint8_t *src, size_t bytes)
+{
+  uint8_t stream[CHACHA20_BLOCK_SIZE];
+
+  while (bytes >= CHACHA20_BLOCK_SIZE)
+    {
+      chacha20_block (state->ctx, stream);
+      memxorcpy (dst, src, stream, CHACHA20_BLOCK_SIZE);
+      bytes -= CHACHA20_BLOCK_SIZE;
+      dst += CHACHA20_BLOCK_SIZE;
+      src += CHACHA20_BLOCK_SIZE;
+    }
+  if (bytes != 0)
+    {
+      chacha20_block (state->ctx, stream);
+      memxorcpy (dst, src, stream, bytes);
+    }
+}
diff --git a/stdlib/stdlib.h b/stdlib/stdlib.h
index bf7cd438e1..f2b0c83c12 100644
--- a/stdlib/stdlib.h
+++ b/stdlib/stdlib.h
@@ -485,6 +485,7 @@ extern unsigned short int *seed48 (unsigned short int __seed16v[3])
 extern void lcong48 (unsigned short int __param[7]) __THROW __nonnull ((1));
 
 # ifdef __USE_MISC
+#  include <bits/stdint-uintn.h>
 /* Data structure for communication with thread safe versions.  This
    type is to be regarded as opaque.  It's only exported because users
    have to allocate objects of this type.  */
@@ -533,6 +534,19 @@ extern int seed48_r (unsigned short int __seed16v[3],
 extern int lcong48_r (unsigned short int __param[7],
 		      struct drand48_data *__buffer)
      __THROW __nonnull ((1, 2));
+
+/* Return a random integer between zero and 2**32-1 (inclusive).  */
+extern uint32_t arc4random (void)
+     __THROW __wur;
+
+/* Fill the buffer with random data.  */
+extern void arc4random_buf (void *__buf, size_t __size)
+     __THROW __nonnull ((1));
+
+/* Return a random number between zero (inclusive) and the specified
+   limit (exclusive).  */
+extern uint32_t arc4random_uniform (uint32_t __upper_bound)
+     __THROW __wur;
 # endif	/* Use misc.  */
 #endif	/* Use misc or X/Open.  */
 
diff --git a/sysdeps/generic/not-cancel.h b/sysdeps/generic/not-cancel.h
index 2104efeb54..f4882a9ffd 100644
--- a/sysdeps/generic/not-cancel.h
+++ b/sysdeps/generic/not-cancel.h
@@ -48,5 +48,7 @@
   (void) __writev (fd, iov, n)
 #define __fcntl64_nocancel(fd, cmd, ...) \
   __fcntl64 (fd, cmd, __VA_ARGS__)
+#define __getrandomn_nocancel(buf, size, flags) \
+  __getrandom (buf, size, flags)
 
 #endif /* NOT_CANCEL_H  */
diff --git a/sysdeps/mach/hurd/i386/libc.abilist b/sysdeps/mach/hurd/i386/libc.abilist
index 4dc87e9061..7bd565103b 100644
--- a/sysdeps/mach/hurd/i386/libc.abilist
+++ b/sysdeps/mach/hurd/i386/libc.abilist
@@ -2289,6 +2289,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 close_range F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/mach/hurd/not-cancel.h b/sysdeps/mach/hurd/not-cancel.h
index 6ec92ced84..39edfe76b6 100644
--- a/sysdeps/mach/hurd/not-cancel.h
+++ b/sysdeps/mach/hurd/not-cancel.h
@@ -74,6 +74,9 @@ __typeof (__fcntl) __fcntl_nocancel;
 #define __fcntl64_nocancel(...) \
   __fcntl_nocancel (__VA_ARGS__)
 
+#define __getrandomn_nocancel(buf, size, flags) \
+  __getrandom (buf, size, flags)
+
 #if IS_IN (libc)
 hidden_proto (__close_nocancel)
 hidden_proto (__close_nocancel_nostatus)
diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
index 1b63d9e447..f8f38bb205 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
@@ -2616,3 +2616,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/alpha/libc.abilist b/sysdeps/unix/sysv/linux/alpha/libc.abilist
index e7e4cf7d2a..9de1726de0 100644
--- a/sysdeps/unix/sysv/linux/alpha/libc.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libc.abilist
@@ -2713,6 +2713,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/arc/libc.abilist b/sysdeps/unix/sysv/linux/arc/libc.abilist
index bc3d228e31..16e2532838 100644
--- a/sysdeps/unix/sysv/linux/arc/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arc/libc.abilist
@@ -2377,3 +2377,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/arm/be/libc.abilist b/sysdeps/unix/sysv/linux/arm/be/libc.abilist
index db7039c4ab..ae9e465088 100644
--- a/sysdeps/unix/sysv/linux/arm/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arm/be/libc.abilist
@@ -496,6 +496,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
 GLIBC_2.4 _IO_2_1_stdin_ D 0xa0
diff --git a/sysdeps/unix/sysv/linux/arm/le/libc.abilist b/sysdeps/unix/sysv/linux/arm/le/libc.abilist
index d2add4fb49..b669f43194 100644
--- a/sysdeps/unix/sysv/linux/arm/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arm/le/libc.abilist
@@ -493,6 +493,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
 GLIBC_2.4 _IO_2_1_stdin_ D 0xa0
diff --git a/sysdeps/unix/sysv/linux/csky/libc.abilist b/sysdeps/unix/sysv/linux/csky/libc.abilist
index 355d72a30c..42daa90248 100644
--- a/sysdeps/unix/sysv/linux/csky/libc.abilist
+++ b/sysdeps/unix/sysv/linux/csky/libc.abilist
@@ -2652,3 +2652,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/hppa/libc.abilist b/sysdeps/unix/sysv/linux/hppa/libc.abilist
index 3df39bb28c..090be20f53 100644
--- a/sysdeps/unix/sysv/linux/hppa/libc.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libc.abilist
@@ -2601,6 +2601,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/i386/libc.abilist b/sysdeps/unix/sysv/linux/i386/libc.abilist
index c4da358f80..6b7cf064bb 100644
--- a/sysdeps/unix/sysv/linux/i386/libc.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libc.abilist
@@ -2785,6 +2785,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/ia64/libc.abilist b/sysdeps/unix/sysv/linux/ia64/libc.abilist
index 241bac70ea..3e766f64dd 100644
--- a/sysdeps/unix/sysv/linux/ia64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libc.abilist
@@ -2551,6 +2551,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
index 78bf372b72..c0b99199a8 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
@@ -497,6 +497,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0x98
 GLIBC_2.4 _IO_2_1_stdin_ D 0x98
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
index 00df5c901f..4d0be7c86d 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
@@ -2728,6 +2728,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/microblaze/be/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/be/libc.abilist
index e8118569c3..b944680ede 100644
--- a/sysdeps/unix/sysv/linux/microblaze/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/be/libc.abilist
@@ -2701,3 +2701,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/microblaze/le/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/le/libc.abilist
index c0d2373e64..28f7d19983 100644
--- a/sysdeps/unix/sysv/linux/microblaze/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/le/libc.abilist
@@ -2698,3 +2698,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
index 2d0fd04f54..3da7cdaca5 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
@@ -2693,6 +2693,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
index e39ccfb312..9fe87f15be 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
@@ -2691,6 +2691,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
index 1e900f86e4..c14fca2111 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
@@ -2699,6 +2699,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
index 9145ba7931..a363830226 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
@@ -2602,6 +2602,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/nios2/libc.abilist b/sysdeps/unix/sysv/linux/nios2/libc.abilist
index e95d60d926..89b6f98667 100644
--- a/sysdeps/unix/sysv/linux/nios2/libc.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libc.abilist
@@ -2740,3 +2740,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/not-cancel.h b/sysdeps/unix/sysv/linux/not-cancel.h
index 75b9e0ee1e..be5df35927 100644
--- a/sysdeps/unix/sysv/linux/not-cancel.h
+++ b/sysdeps/unix/sysv/linux/not-cancel.h
@@ -67,6 +67,13 @@ __writev_nocancel_nostatus (int fd, const struct iovec *iov, int iovcnt)
   INTERNAL_SYSCALL_CALL (writev, fd, iov, iovcnt);
 }
 
+static inline int
+__getrandomn_nocancel (void *buf, size_t buflen, unsigned int flags)
+{
+  return INTERNAL_SYSCALL_CALL (getrandom, buf, buflen, flags);
+}
+
+
 /* Uncancelable fcntl.  */
 __typeof (__fcntl) __fcntl64_nocancel;
 
diff --git a/sysdeps/unix/sysv/linux/or1k/libc.abilist b/sysdeps/unix/sysv/linux/or1k/libc.abilist
index ca934e374b..94c0ff9526 100644
--- a/sysdeps/unix/sysv/linux/or1k/libc.abilist
+++ b/sysdeps/unix/sysv/linux/or1k/libc.abilist
@@ -2123,3 +2123,6 @@ GLIBC_2.35 wprintf F
 GLIBC_2.35 write F
 GLIBC_2.35 writev F
 GLIBC_2.35 wscanf F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
index 3820b9f235..d6188de00b 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
@@ -2755,6 +2755,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
index 464dc27fcd..8201230059 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
@@ -2788,6 +2788,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
index 2f7e58747f..623505d783 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
@@ -2510,6 +2510,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
index 4f3043d913..23b0d83408 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
@@ -2812,3 +2812,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv32/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv32/libc.abilist
index 84b6ac815a..a72e8ed9cc 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv32/libc.abilist
@@ -2379,3 +2379,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
index 4d5c19c56a..f3faecc2ae 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
@@ -2579,3 +2579,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
index 7c5ee8d569..105e5a9231 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
@@ -2753,6 +2753,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
index 50de0b46cf..c08c6c8301 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
@@ -2547,6 +2547,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/sh/be/libc.abilist b/sysdeps/unix/sysv/linux/sh/be/libc.abilist
index 66fba013ca..8ec1005644 100644
--- a/sysdeps/unix/sysv/linux/sh/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sh/be/libc.abilist
@@ -2608,6 +2608,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/sh/le/libc.abilist b/sysdeps/unix/sysv/linux/sh/le/libc.abilist
index 38703f8aa0..5d776576f9 100644
--- a/sysdeps/unix/sysv/linux/sh/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sh/le/libc.abilist
@@ -2605,6 +2605,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
index 6df55eb765..f5f07f612e 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
@@ -2748,6 +2748,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
index b90569d881..be687ebe02 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
@@ -2574,6 +2574,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
index e88b0f101f..7f456fbb55 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
@@ -2525,6 +2525,9 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
index e0755272eb..c737201248 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
@@ -2631,3 +2631,6 @@ GLIBC_2.35 __memcmpeq F
 GLIBC_2.35 _dl_find_object F
 GLIBC_2.35 epoll_pwait2 F
 GLIBC_2.35 posix_spawn_file_actions_addtcsetpgrp_np F
+GLIBC_2.36 arc4random F
+GLIBC_2.36 arc4random_buf F
+GLIBC_2.36 arc4random_uniform F
-- 
2.32.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/7] stdlib: Add arc4random tests
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
  2022-04-13 20:23 ` [PATCH 1/7] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417) Adhemerval Zanella
@ 2022-04-13 20:23 ` Adhemerval Zanella
  2022-04-14 18:01   ` Noah Goldstein
  2022-04-13 20:23 ` [PATCH 3/7] benchtests: Add arc4random benchtest Adhemerval Zanella
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha; +Cc: Florian Weimer

The basic tst-arc4random-chacha20.c checks whether the output of the
ChaCha20 implementation matches the reference test vectors from RFC8439.

The tst-arc4random-fork.c checks that subprocesses generate distinct
streams of randomness (i.e. that fork handling is done correctly).

The tst-arc4random-stats.c is a statistical test of the randomness of
arc4random, arc4random_buf, and arc4random_uniform.

The tst-arc4random-thread.c checks that threads generate distinct streams
of randomness (i.e. that the functions are thread-safe).

Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.

Co-authored-by: Florian Weimer <fweimer@redhat.com>
---
 stdlib/Makefile                  |   4 +
 stdlib/tst-arc4random-chacha20.c | 225 +++++++++++++++++++++++++
 stdlib/tst-arc4random-fork.c     | 174 +++++++++++++++++++
 stdlib/tst-arc4random-stats.c    | 146 ++++++++++++++++
 stdlib/tst-arc4random-thread.c   | 278 +++++++++++++++++++++++++++++++
 5 files changed, 827 insertions(+)
 create mode 100644 stdlib/tst-arc4random-chacha20.c
 create mode 100644 stdlib/tst-arc4random-fork.c
 create mode 100644 stdlib/tst-arc4random-stats.c
 create mode 100644 stdlib/tst-arc4random-thread.c

diff --git a/stdlib/Makefile b/stdlib/Makefile
index 9f9cc1bd7f..4862d008ab 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -183,6 +183,9 @@ tests := \
   testmb2 \
   testrand \
   testsort \
+  tst-arc4random-fork \
+  tst-arc4random-stats \
+  tst-arc4random-thread \
   tst-at_quick_exit \
   tst-atexit \
   tst-atof1 \
@@ -252,6 +255,7 @@ tests-internal := \
   # tests-internal
 
 tests-static := \
+  tst-arc4random-chacha20 \
   tst-secure-getenv \
   # tests-static
 
diff --git a/stdlib/tst-arc4random-chacha20.c b/stdlib/tst-arc4random-chacha20.c
new file mode 100644
index 0000000000..c5876d3f3b
--- /dev/null
+++ b/stdlib/tst-arc4random-chacha20.c
@@ -0,0 +1,225 @@
+/* Basic tests for the ChaCha20 cipher used in arc4random.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <chacha20.c>
+#include <support/check.h>
+
+static int
+do_test (void)
+{
+  /* Reference ChaCha20 encryption test vectors from RFC8439.  */
+
+  /* Test vector #1.  */
+  {
+    struct chacha20_state state;
+
+    uint8_t key[CHACHA20_KEY_SIZE] =
+      {
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+      };
+    uint8_t iv[CHACHA20_IV_SIZE] =
+      {
+	0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+      };
+    const uint8_t plaintext[CHACHA20_BLOCK_SIZE] = { 0 };
+    uint8_t ciphertext[CHACHA20_BLOCK_SIZE];
+
+    chacha20_init (&state, key, iv);
+    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
+
+    const uint8_t expected[] =
+      {
+	0x76, 0xb8, 0xe0, 0xad, 0xa0, 0xf1, 0x3d, 0x90,
+	0x40, 0x5d, 0x6a, 0xe5, 0x53, 0x86, 0xbd, 0x28,
+	0xbd, 0xd2, 0x19, 0xb8, 0xa0, 0x8d, 0xed, 0x1a,
+	0xa8, 0x36, 0xef, 0xcc, 0x8b, 0x77, 0x0d, 0xc7,
+	0xda, 0x41, 0x59, 0x7c, 0x51, 0x57, 0x48, 0x8d,
+	0x77, 0x24, 0xe0, 0x3f, 0xb8, 0xd8, 0x4a, 0x37,
+	0x6a, 0x43, 0xb8, 0xf4, 0x15, 0x18, 0xa1, 0x1c,
+	0xc3, 0x87, 0xb6, 0x69, 0xb2, 0xee, 0x65, 0x86
+      };
+    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
+			expected, sizeof expected);
+  }
+
+  /* Test vector #2.  */
+  {
+    struct chacha20_state state;
+
+    uint8_t key[CHACHA20_KEY_SIZE] =
+      {
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1,
+      };
+    uint8_t iv[CHACHA20_IV_SIZE] =
+      {
+	0x1, 0x0, 0x0, 0x0,  /* Block counter is a LE uint32_t  */
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
+      };
+    const uint8_t plaintext[] =
+      {
+	0x41, 0x6e, 0x79, 0x20, 0x73, 0x75, 0x62, 0x6d, 0x69, 0x73, 0x73,
+	0x69, 0x6f, 0x6e, 0x20, 0x74, 0x6f, 0x20, 0x74, 0x68, 0x65, 0x20,
+	0x49, 0x45, 0x54, 0x46, 0x20, 0x69, 0x6e, 0x74, 0x65, 0x6e, 0x64,
+	0x65, 0x64, 0x20, 0x62, 0x79, 0x20, 0x74, 0x68, 0x65, 0x20, 0x43,
+	0x6f, 0x6e, 0x74, 0x72,	0x69, 0x62, 0x75, 0x74, 0x6f, 0x72, 0x20,
+	0x66, 0x6f, 0x72, 0x20, 0x70, 0x75, 0x62, 0x6c, 0x69, 0x63, 0x61,
+	0x74, 0x69, 0x6f, 0x6e, 0x20, 0x61, 0x73, 0x20, 0x61, 0x6c, 0x6c,
+	0x20, 0x6f, 0x72, 0x20, 0x70, 0x61, 0x72, 0x74, 0x20, 0x6f, 0x66,
+	0x20, 0x61, 0x6e, 0x20, 0x49, 0x45, 0x54, 0x46, 0x20, 0x49, 0x6e,
+	0x74, 0x65, 0x72, 0x6e, 0x65, 0x74, 0x2d, 0x44, 0x72, 0x61, 0x66,
+	0x74, 0x20, 0x6f, 0x72, 0x20, 0x52, 0x46, 0x43, 0x20, 0x61, 0x6e,
+	0x64, 0x20, 0x61, 0x6e, 0x79, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65,
+	0x6d, 0x65, 0x6e, 0x74, 0x20, 0x6d, 0x61, 0x64, 0x65, 0x20, 0x77,
+	0x69, 0x74, 0x68, 0x69, 0x6e, 0x20, 0x74, 0x68, 0x65, 0x20, 0x63,
+	0x6f, 0x6e, 0x74, 0x65, 0x78, 0x74, 0x20, 0x6f, 0x66, 0x20, 0x61,
+	0x6e, 0x20, 0x49, 0x45, 0x54, 0x46, 0x20, 0x61, 0x63, 0x74, 0x69,
+	0x76, 0x69, 0x74, 0x79, 0x20, 0x69, 0x73, 0x20, 0x63, 0x6f, 0x6e,
+	0x73, 0x69, 0x64, 0x65, 0x72, 0x65, 0x64, 0x20, 0x61, 0x6e, 0x20,
+	0x22, 0x49, 0x45, 0x54, 0x46, 0x20, 0x43, 0x6f, 0x6e, 0x74, 0x72,
+	0x69, 0x62, 0x75, 0x74, 0x69, 0x6f, 0x6e, 0x22, 0x2e, 0x20, 0x53,
+	0x75, 0x63, 0x68, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65, 0x6d, 0x65,
+	0x6e, 0x74, 0x73, 0x20, 0x69, 0x6e, 0x63, 0x6c, 0x75, 0x64, 0x65,
+	0x20, 0x6f, 0x72, 0x61, 0x6c, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65,
+	0x6d, 0x65, 0x6e, 0x74, 0x73, 0x20, 0x69, 0x6e, 0x20, 0x49, 0x45,
+	0x54, 0x46, 0x20, 0x73, 0x65, 0x73, 0x73, 0x69,	0x6f, 0x6e, 0x73,
+	0x2c, 0x20, 0x61, 0x73, 0x20, 0x77, 0x65, 0x6c, 0x6c, 0x20, 0x61,
+	0x73, 0x20, 0x77, 0x72, 0x69, 0x74, 0x74, 0x65, 0x6e, 0x20, 0x61,
+	0x6e, 0x64, 0x20, 0x65, 0x6c, 0x65, 0x63, 0x74, 0x72, 0x6f, 0x6e,
+	0x69, 0x63, 0x20, 0x63, 0x6f, 0x6d, 0x6d, 0x75, 0x6e, 0x69, 0x63,
+	0x61, 0x74, 0x69, 0x6f, 0x6e, 0x73, 0x20, 0x6d, 0x61, 0x64, 0x65,
+	0x20, 0x61, 0x74, 0x20, 0x61, 0x6e, 0x79, 0x20, 0x74, 0x69, 0x6d,
+	0x65, 0x20, 0x6f, 0x72, 0x20, 0x70, 0x6c, 0x61, 0x63, 0x65, 0x2c,
+	0x20, 0x77, 0x68, 0x69, 0x63, 0x68, 0x20, 0x61, 0x72, 0x65, 0x20,
+	0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x65, 0x64, 0x20, 0x74,
+	0x6f,
+      };
+    uint8_t ciphertext[sizeof plaintext];
+
+    chacha20_init (&state, key, iv);
+    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
+
+    const uint8_t expected[] =
+      {
+	0xa3, 0xfb, 0xf0, 0x7d, 0xf3, 0xfa, 0x2f, 0xde, 0x4f, 0x37, 0x6c,
+	0xa2, 0x3e, 0x82, 0x73, 0x70, 0x41, 0x60, 0x5d, 0x9f, 0x4f, 0x4f,
+	0x57, 0xbd, 0x8c, 0xff, 0x2c, 0x1d, 0x4b, 0x79, 0x55, 0xec, 0x2a,
+	0x97, 0x94, 0x8b, 0xd3, 0x72, 0x29, 0x15, 0xc8, 0xf3, 0xd3, 0x37,
+	0xf7, 0xd3, 0x70, 0x05,	0x0e, 0x9e, 0x96, 0xd6, 0x47, 0xb7, 0xc3,
+	0x9f, 0x56, 0xe0, 0x31, 0xca, 0x5e, 0xb6, 0x25, 0x0d, 0x40, 0x42,
+	0xe0, 0x27, 0x85, 0xec, 0xec, 0xfa, 0x4b, 0x4b, 0xb5, 0xe8, 0xea,
+	0xd0, 0x44, 0x0e, 0x20, 0xb6, 0xe8, 0xdb, 0x09, 0xd8, 0x81, 0xa7,
+	0xc6, 0x13, 0x2f, 0x42, 0x0e, 0x52, 0x79, 0x50,	0x42, 0xbd, 0xfa,
+	0x77, 0x73, 0xd8, 0xa9, 0x05, 0x14, 0x47, 0xb3, 0x29, 0x1c, 0xe1,
+	0x41, 0x1c, 0x68, 0x04, 0x65, 0x55, 0x2a, 0xa6, 0xc4, 0x05, 0xb7,
+	0x76, 0x4d, 0x5e, 0x87, 0xbe, 0xa8, 0x5a, 0xd0, 0x0f, 0x84, 0x49,
+	0xed, 0x8f, 0x72, 0xd0, 0xd6, 0x62, 0xab, 0x05, 0x26, 0x91, 0xca,
+	0x66, 0x42, 0x4b, 0xc8, 0x6d, 0x2d, 0xf8, 0x0e, 0xa4, 0x1f, 0x43,
+	0xab, 0xf9, 0x37, 0xd3, 0x25, 0x9d, 0xc4, 0xb2, 0xd0, 0xdf, 0xb4,
+	0x8a, 0x6c, 0x91, 0x39, 0xdd, 0xd7, 0xf7, 0x69, 0x66, 0xe9, 0x28,
+	0xe6, 0x35, 0x55, 0x3b, 0xa7, 0x6c, 0x5c, 0x87, 0x9d, 0x7b, 0x35,
+	0xd4, 0x9e, 0xb2, 0xe6, 0x2b, 0x08, 0x71, 0xcd, 0xac, 0x63, 0x89,
+	0x39, 0xe2, 0x5e, 0x8a, 0x1e, 0x0e, 0xf9, 0xd5, 0x28, 0x0f, 0xa8,
+	0xca, 0x32, 0x8b, 0x35, 0x1c, 0x3c, 0x76, 0x59, 0x89, 0xcb, 0xcf,
+	0x3d, 0xaa, 0x8b, 0x6c,	0xcc, 0x3a, 0xaf, 0x9f, 0x39, 0x79, 0xc9,
+	0x2b, 0x37, 0x20, 0xfc, 0x88, 0xdc, 0x95, 0xed, 0x84, 0xa1, 0xbe,
+	0x05, 0x9c, 0x64, 0x99, 0xb9, 0xfd, 0xa2, 0x36, 0xe7, 0xe8, 0x18,
+	0xb0, 0x4b, 0x0b, 0xc3, 0x9c, 0x1e, 0x87, 0x6b, 0x19, 0x3b, 0xfe,
+	0x55, 0x69, 0x75, 0x3f, 0x88, 0x12, 0x8c, 0xc0,	0x8a, 0xaa, 0x9b,
+	0x63, 0xd1, 0xa1, 0x6f, 0x80, 0xef, 0x25, 0x54, 0xd7, 0x18, 0x9c,
+	0x41, 0x1f, 0x58, 0x69, 0xca, 0x52, 0xc5, 0xb8, 0x3f, 0xa3, 0x6f,
+	0xf2, 0x16, 0xb9, 0xc1, 0xd3, 0x00, 0x62, 0xbe, 0xbc, 0xfd, 0x2d,
+	0xc5, 0xbc, 0xe0, 0x91, 0x19, 0x34, 0xfd, 0xa7, 0x9a, 0x86, 0xf6,
+	0xe6, 0x98, 0xce, 0xd7, 0x59, 0xc3, 0xff, 0x9b, 0x64, 0x77, 0x33,
+	0x8f, 0x3d, 0xa4, 0xf9, 0xcd, 0x85, 0x14, 0xea, 0x99, 0x82, 0xcc,
+	0xaf, 0xb3, 0x41, 0xb2, 0x38, 0x4d, 0xd9, 0x02, 0xf3, 0xd1, 0xab,
+	0x7a, 0xc6, 0x1d, 0xd2, 0x9c, 0x6f, 0x21, 0xba, 0x5b, 0x86, 0x2f,
+	0x37, 0x30, 0xe3, 0x7c, 0xfd, 0xc4, 0xfd, 0x80, 0x6c, 0x22, 0xf2,
+	0x21,
+      };
+    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
+			expected, sizeof expected);
+    }
+
+  /* Test vector #3.  */
+  {
+    struct chacha20_state state;
+
+    uint8_t key[CHACHA20_KEY_SIZE] =
+      {
+	0x1c, 0x92, 0x40, 0xa5, 0xeb, 0x55, 0xd3, 0x8a,
+	0xf3, 0x33, 0x88, 0x86, 0x04, 0xf6, 0xb5, 0xf0,
+	0x47, 0x39, 0x17, 0xc1, 0x40, 0x2b, 0x80, 0x09,
+	0x9d, 0xca, 0x5c, 0xbc, 0x20, 0x70, 0x75, 0xc0
+      };
+    uint8_t iv[CHACHA20_IV_SIZE] =
+      {
+	0x2a, 0x0, 0x0, 0x0,  /* Block counter is a LE uint32_t  */
+	0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
+      };
+
+    uint8_t plaintext[] =
+      {
+	0x27, 0x54, 0x77, 0x61, 0x73, 0x20, 0x62, 0x72, 0x69, 0x6c, 0x6c,
+	0x69, 0x67, 0x2c, 0x20, 0x61, 0x6e, 0x64, 0x20, 0x74, 0x68, 0x65,
+	0x20, 0x73, 0x6c, 0x69, 0x74, 0x68, 0x79, 0x20, 0x74, 0x6f, 0x76,
+	0x65, 0x73, 0x0a, 0x44, 0x69, 0x64, 0x20, 0x67, 0x79, 0x72, 0x65,
+	0x20, 0x61, 0x6e, 0x64, 0x20, 0x67, 0x69, 0x6d, 0x62, 0x6c, 0x65,
+	0x20, 0x69, 0x6e, 0x20, 0x74, 0x68, 0x65, 0x20, 0x77, 0x61, 0x62,
+	0x65, 0x3a, 0x0a, 0x41, 0x6c, 0x6c, 0x20, 0x6d, 0x69, 0x6d, 0x73,
+	0x79, 0x20, 0x77, 0x65, 0x72, 0x65, 0x20, 0x74, 0x68, 0x65, 0x20,
+	0x62, 0x6f, 0x72, 0x6f, 0x67, 0x6f, 0x76, 0x65, 0x73, 0x2c, 0x0a,
+	0x41, 0x6e, 0x64, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6d, 0x6f, 0x6d,
+	0x65, 0x20, 0x72, 0x61, 0x74, 0x68, 0x73, 0x20, 0x6f, 0x75, 0x74,
+	0x67, 0x72, 0x61, 0x62, 0x65, 0x2e,
+      };
+    uint8_t ciphertext[sizeof plaintext];
+
+    chacha20_init (&state, key, iv);
+    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
+
+    const uint8_t expected[] =
+      {
+	0x62, 0xe6, 0x34, 0x7f, 0x95, 0xed, 0x87, 0xa4, 0x5f, 0xfa, 0xe7,
+	0x42, 0x6f, 0x27, 0xa1, 0xdf, 0x5f, 0xb6, 0x91, 0x10, 0x04, 0x4c,
+	0x0d, 0x73, 0x11, 0x8e, 0xff, 0xa9, 0x5b, 0x01, 0xe5, 0xcf, 0x16,
+	0x6d, 0x3d, 0xf2, 0xd7, 0x21, 0xca, 0xf9, 0xb2, 0x1e, 0x5f, 0xb1,
+	0x4c, 0x61, 0x68, 0x71, 0xfd, 0x84, 0xc5, 0x4f, 0x9d, 0x65, 0xb2,
+	0x83, 0x19, 0x6c, 0x7f, 0xe4, 0xf6, 0x05, 0x53, 0xeb, 0xf3, 0x9c,
+	0x64, 0x02, 0xc4, 0x22, 0x34, 0xe3, 0x2a, 0x35, 0x6b, 0x3e, 0x76,
+	0x43, 0x12, 0xa6, 0x1a, 0x55, 0x32, 0x05, 0x57, 0x16, 0xea, 0xd6,
+	0x96, 0x25, 0x68, 0xf8, 0x7d, 0x3f, 0x3f, 0x77, 0x04, 0xc6, 0xa8,
+	0xd1, 0xbc, 0xd1, 0xbf, 0x4d, 0x50, 0xd6, 0x15, 0x4b, 0x6d, 0xa7,
+	0x31, 0xb1, 0x87, 0xb5, 0x8d, 0xfd, 0x72, 0x8a, 0xfa, 0x36, 0x75,
+	0x7a, 0x79, 0x7a, 0xc1, 0x88, 0xd1,
+      };
+
+    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
+			expected, sizeof expected);
+  }
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/stdlib/tst-arc4random-fork.c b/stdlib/tst-arc4random-fork.c
new file mode 100644
index 0000000000..cd8852c8d3
--- /dev/null
+++ b/stdlib/tst-arc4random-fork.c
@@ -0,0 +1,174 @@
+/* Test that subprocesses generate distinct streams of randomness.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Collect random data from subprocesses and check that all the
+   results are unique.  */
+
+#include <array_length.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <support/check.h>
+#include <support/support.h>
+#include <support/xthread.h>
+#include <support/xunistd.h>
+#include <unistd.h>
+
+/* Perform multiple runs.  The subsequent runs start with an
+   already-initialized random number generator.  (The number 1500 was
+   seen to reproduce failures reliably in case of a race condition in
+   the fork detection code.)  */
+enum { runs = 1500 };
+
+/* Fifty processes in total (one parent and 49 subprocesses).  This
+   should be high enough to expose any issues, but low enough not to
+   tax the overall system too much.  */
+enum { subprocesses = 49 };
+
+/* The total number of processes.  */
+enum { processes = subprocesses + 1 };
+
+/* Number of bytes of randomness to generate per process.  Large
+   enough to make false positive duplicates extremely unlikely.  */
+enum { random_size = 16 };
+
+/* Generated bytes of randomness.  */
+struct result
+{
+  unsigned char bytes[random_size];
+};
+
+/* Shared across all processes.  */
+static struct shared_data
+{
+  pthread_barrier_t barrier;
+  struct result results[runs][processes];
+} *shared_data;
+
+/* Invoked to collect data from a subprocess.  */
+static void
+subprocess (int run, int process_index)
+{
+  xpthread_barrier_wait (&shared_data->barrier);
+  arc4random_buf (shared_data->results[run][process_index].bytes, random_size);
+}
+
+/* Used to sort the results.  */
+struct index
+{
+  int run;
+  int process_index;
+};
+
+/* Used to sort an array of struct index values.  */
+static int
+index_compare (const void *left1, const void *right1)
+{
+  const struct index *left = left1;
+  const struct index *right = right1;
+
+  return memcmp (shared_data->results[left->run][left->process_index].bytes,
+                 shared_data->results[right->run][right->process_index].bytes,
+                 random_size);
+}
+
+static int
+do_test (void)
+{
+  shared_data = support_shared_allocate (sizeof (*shared_data));
+  {
+    pthread_barrierattr_t attr;
+    xpthread_barrierattr_init (&attr);
+    xpthread_barrierattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
+    xpthread_barrier_init (&shared_data->barrier, &attr, processes);
+    xpthread_barrierattr_destroy (&attr);
+  }
+
+  /* Collect random data.  */
+  for (int run = 0; run < runs; ++run)
+    {
+#if 0
+      if (run == runs / 2)
+        {
+          /* In the middle, desynchronize the block cache by consuming
+             an odd number of bytes.  */
+          char buf;
+          arc4random_buf (&buf, 1);
+        }
+#endif
+
+      pid_t pids[subprocesses];
+      for (int process_index = 0; process_index < subprocesses;
+           ++process_index)
+        {
+          pids[process_index] = xfork ();
+          if (pids[process_index] == 0)
+            {
+              subprocess (run, process_index);
+              _exit (0);
+            }
+        }
+
+      /* Trigger all subprocesses.  Also add data from the parent
+         process.  */
+      subprocess (run, subprocesses);
+
+      for (int process_index = 0; process_index < subprocesses;
+           ++process_index)
+        {
+          int status;
+          xwaitpid (pids[process_index], &status, 0);
+          if (status != 0)
+            FAIL_EXIT1 ("subprocess index %d (PID %d) exit status %d\n",
+                        process_index, (int) pids[process_index], status);
+        }
+    }
+
+  /* Check for duplicates.  */
+  struct index indexes[runs * processes];
+  for (int run = 0; run < runs; ++run)
+    for (int process_index = 0; process_index < processes; ++process_index)
+      indexes[run * processes + process_index]
+        = (struct index) { .run = run, .process_index = process_index };
+  qsort (indexes, array_length (indexes), sizeof (indexes[0]), index_compare);
+  for (size_t i = 1; i < array_length (indexes); ++i)
+    {
+      if (index_compare (indexes + i - 1, indexes + i) == 0)
+        {
+          support_record_failure ();
+          unsigned char *bytes
+            = shared_data->results[indexes[i].run]
+                [indexes[i].process_index].bytes;
+          char *quoted = support_quote_blob (bytes, random_size);
+          printf ("error: duplicate randomness data: \"%s\"\n"
+                  "  run %d, subprocess %d\n"
+                  "  run %d, subprocess %d\n",
+                  quoted, indexes[i - 1].run, indexes[i - 1].process_index,
+                  indexes[i].run, indexes[i].process_index);
+          free (quoted);
+        }
+    }
+
+  xpthread_barrier_destroy (&shared_data->barrier);
+  support_shared_free (shared_data);
+  shared_data = NULL;
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/stdlib/tst-arc4random-stats.c b/stdlib/tst-arc4random-stats.c
new file mode 100644
index 0000000000..9747180c99
--- /dev/null
+++ b/stdlib/tst-arc4random-stats.c
@@ -0,0 +1,146 @@
+/* Statistical tests for arc4random-related functions.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <array_length.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <support/check.h>
+
+enum
+{
+  arc4random_key_size = 32
+};
+
+struct key
+{
+  unsigned char data[arc4random_key_size];
+};
+
+/* With 12,000 keys, the probability that a byte in a predetermined
+   position does not have a predetermined value in all generated keys
+   is about 4e-21.  The probability that this happens with any of the
+   32 * 256 possible byte position/values is about 3.3e-17.  This
+   results in an acceptably low false-positive rate.  */
+enum { key_count = 12000 };
+
+static struct key keys[key_count];
+
+/* Used to perform the distribution check.  */
+static int byte_counts[arc4random_key_size][256];
+
+/* Bail out after this many failures.  */
+enum { failure_limit = 100 };
+
+static void
+find_stuck_bytes (bool (*func) (unsigned char *key))
+{
+  memset (&keys, 0xcc, sizeof (keys));
+
+  int failures = 0;
+  for (int key = 0; key < key_count; ++key)
+    {
+      while (true)
+        {
+          if (func (keys[key].data))
+            break;
+          ++failures;
+          if (failures >= failure_limit)
+            {
+              printf ("warning: bailing out after %d failures\n", failures);
+              return;
+            }
+        }
+    }
+  printf ("info: key generation finished with %d failures\n", failures);
+
+  memset (&byte_counts, 0, sizeof (byte_counts));
+  for (int key = 0; key < key_count; ++key)
+    for (int pos = 0; pos < arc4random_key_size; ++pos)
+      ++byte_counts[pos][keys[key].data[pos]];
+
+  for (int pos = 0; pos < arc4random_key_size; ++pos)
+    for (int byte = 0; byte < 256; ++byte)
+      if (byte_counts[pos][byte] == 0)
+        {
+          support_record_failure ();
+          printf ("error: byte %d never appeared at position %d\n", byte, pos);
+        }
+}
+
+/* Test adapter for arc4random.  */
+static bool
+generate_arc4random (unsigned char *key)
+{
+  uint32_t words[arc4random_key_size / 4];
+  _Static_assert (sizeof (words) == arc4random_key_size, "sizeof (words)");
+
+  for (int i = 0; i < array_length (words); ++i)
+    words[i] = arc4random ();
+  memcpy (key, &words, arc4random_key_size);
+  return true;
+}
+
+/* Test adapter for arc4random_buf.  */
+static bool
+generate_arc4random_buf (unsigned char *key)
+{
+  arc4random_buf (key, arc4random_key_size);
+  return true;
+}
+
+/* Test adapter for arc4random_uniform.  */
+static bool
+generate_arc4random_uniform (unsigned char *key)
+{
+  for (int i = 0; i < arc4random_key_size; ++i)
+    key[i] = arc4random_uniform (256);
+  return true;
+}
+
+/* Test adapter for arc4random_uniform with argument 257.  This means
+   that byte 0 happens more often, but we do not perform such a
+   statistical check, so the test will still pass.  */
+static bool
+generate_arc4random_uniform_257 (unsigned char *key)
+{
+  for (int i = 0; i < arc4random_key_size; ++i)
+    key[i] = arc4random_uniform (257);
+  return true;
+}
+
+static int
+do_test (void)
+{
+  puts ("info: arc4random implementation test");
+  find_stuck_bytes (generate_arc4random);
+
+  puts ("info: arc4random_buf implementation test");
+  find_stuck_bytes (generate_arc4random_buf);
+
+  puts ("info: arc4random_uniform implementation test");
+  find_stuck_bytes (generate_arc4random_uniform);
+
+  puts ("info: arc4random_uniform implementation test (257 variant)");
+  find_stuck_bytes (generate_arc4random_uniform_257);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/stdlib/tst-arc4random-thread.c b/stdlib/tst-arc4random-thread.c
new file mode 100644
index 0000000000..b122eaa826
--- /dev/null
+++ b/stdlib/tst-arc4random-thread.c
@@ -0,0 +1,278 @@
+/* Test that threads generate distinct streams of randomness.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <array_length.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <support/check.h>
+#include <support/namespace.h>
+#include <support/support.h>
+#include <support/xthread.h>
+
+/* Number of arc4random_buf calls per thread.  */
+enum { count_per_thread = 5000 };
+
+/* Number of threads computing randomness.  */
+enum { inner_threads = 5 };
+
+/* Number of threads launching other threads.  Chosen so as not to
+   overload the system.  */
+enum { outer_threads = 7 };
+
+/* Number of launching rounds performed by the outer threads.  */
+enum { outer_rounds = 10 };
+
+/* Maximum number of bytes generated in an arc4random call.  */
+enum { max_size = 32 };
+
+/* Sizes generated by threads.  Must be long enough to be unique with
+   high probability.  */
+static const int sizes[] = { 12, 15, 16, 17, 24, 31, max_size };
+
+/* Data structure to capture randomness results.  */
+struct blob
+{
+  unsigned int size;
+  int thread_id;
+  unsigned int index;
+  unsigned char bytes[max_size];
+};
+
+#define DYNARRAY_STRUCT dynarray_blob
+#define DYNARRAY_ELEMENT struct blob
+#define DYNARRAY_PREFIX dynarray_blob_
+#include <malloc/dynarray-skeleton.c>
+
+/* Sort blob elements by length first, then by comparing the data
+   member.  */
+static int
+compare_blob (const void *left1, const void *right1)
+{
+  const struct blob *left = left1;
+  const struct blob *right = right1;
+
+  if (left->size != right->size)
+    /* No overflow due to limited range.  */
+    return left->size - right->size;
+  return memcmp (left->bytes, right->bytes, left->size);
+}
+
+/* Used to store the global result.  */
+static pthread_mutex_t global_result_lock = PTHREAD_MUTEX_INITIALIZER;
+static struct dynarray_blob global_result;
+
+/* Copy data to the global result, with locking.  */
+static void
+copy_result_to_global (struct dynarray_blob *result)
+{
+  xpthread_mutex_lock (&global_result_lock);
+  size_t old_size = dynarray_blob_size (&global_result);
+  TEST_VERIFY_EXIT
+    (dynarray_blob_resize (&global_result,
+                           old_size + dynarray_blob_size (result)));
+  memcpy (dynarray_blob_begin (&global_result) + old_size,
+          dynarray_blob_begin (result),
+          dynarray_blob_size (result) * sizeof (struct blob));
+  xpthread_mutex_unlock (&global_result_lock);
+}
+
+/* Used to assign unique thread IDs.  Accessed atomically.  */
+static int next_thread_id;
+
+static void *
+inner_thread (void *unused)
+{
+  /* Use local result to avoid global lock contention while generating
+     randomness.  */
+  struct dynarray_blob result;
+  dynarray_blob_init (&result);
+
+  int thread_id = __atomic_fetch_add (&next_thread_id, 1, __ATOMIC_RELAXED);
+
+  /* Determine the sizes to be used by this thread.  */
+  int size_slot = thread_id % (array_length (sizes) + 1);
+  bool switch_sizes = size_slot == array_length (sizes);
+  if (switch_sizes)
+    size_slot = 0;
+
+  /* Compute the random blobs.  */
+  for (int i = 0; i < count_per_thread; ++i)
+    {
+      struct blob *place = dynarray_blob_emplace (&result);
+      TEST_VERIFY_EXIT (place != NULL);
+      place->size = sizes[size_slot];
+      place->thread_id = thread_id;
+      place->index = i;
+      arc4random_buf (place->bytes, place->size);
+
+      if (switch_sizes)
+        size_slot = (size_slot + 1) % array_length (sizes);
+    }
+
+  /* Store the blobs in the global result structure.  */
+  copy_result_to_global (&result);
+
+  dynarray_blob_free (&result);
+
+  return NULL;
+}
+
+/* Launch the inner threads and wait for their termination.  */
+static void *
+outer_thread (void *unused)
+{
+  for (int round = 0; round < outer_rounds; ++round)
+    {
+      pthread_t threads[inner_threads];
+
+      for (int i = 0; i < inner_threads; ++i)
+        threads[i] = xpthread_create (NULL, inner_thread, NULL);
+
+      for (int i = 0; i < inner_threads; ++i)
+        xpthread_join (threads[i]);
+    }
+
+  return NULL;
+}
+
+static bool termination_requested;
+
+/* Call arc4random_buf to fill one blob with 16 bytes.  */
+static void *
+get_one_blob_thread (void *closure)
+{
+  struct blob *result = closure;
+  result->size = 16;
+  arc4random_buf (result->bytes, result->size);
+  return NULL;
+}
+
+/* Invoked from fork_thread to actually obtain randomness data.  */
+static void
+fork_thread_subprocess (void *closure)
+{
+  struct blob *shared_result = closure;
+
+  pthread_t thr1 = xpthread_create
+    (NULL, get_one_blob_thread, shared_result + 1);
+  pthread_t thr2 = xpthread_create
+    (NULL, get_one_blob_thread, shared_result + 2);
+  get_one_blob_thread (shared_result);
+  xpthread_join (thr1);
+  xpthread_join (thr2);
+}
+
+/* Continuously fork subprocesses to obtain a little bit of
+   randomness.  */
+static void *
+fork_thread (void *unused)
+{
+  struct dynarray_blob result;
+  dynarray_blob_init (&result);
+
+  /* Three blobs from each subprocess.  */
+  struct blob *shared_result
+    = support_shared_allocate (3 * sizeof (*shared_result));
+
+  while (!__atomic_load_n (&termination_requested, __ATOMIC_RELAXED))
+    {
+      /* Obtain the results from a subprocess.  */
+      support_isolate_in_subprocess (fork_thread_subprocess, shared_result);
+
+      for (int i = 0; i < 3; ++i)
+        {
+          struct blob *place = dynarray_blob_emplace (&result);
+          TEST_VERIFY_EXIT (place != NULL);
+          place->size = shared_result[i].size;
+          place->thread_id = -1;
+          place->index = i;
+          memcpy (place->bytes, shared_result[i].bytes, place->size);
+        }
+    }
+
+  support_shared_free (shared_result);
+
+  copy_result_to_global (&result);
+  dynarray_blob_free (&result);
+
+  return NULL;
+}
+
+/* Launch the outer threads and wait for their termination.  */
+static void
+run_outer_threads (void)
+{
+  /* Special thread that continuously calls fork.  */
+  pthread_t fork_thread_id = xpthread_create (NULL, fork_thread, NULL);
+
+  pthread_t threads[outer_threads];
+  for (int i = 0; i < outer_threads; ++i)
+    threads[i] = xpthread_create (NULL, outer_thread, NULL);
+
+  for (int i = 0; i < outer_threads; ++i)
+    xpthread_join (threads[i]);
+
+  __atomic_store_n (&termination_requested, true, __ATOMIC_RELAXED);
+  xpthread_join (fork_thread_id);
+}
+
+static int
+do_test (void)
+{
+  dynarray_blob_init (&global_result);
+  int expected_blobs
+    = count_per_thread * inner_threads * outer_threads * outer_rounds;
+  printf ("info: minimum of %d blob results expected\n", expected_blobs);
+
+  run_outer_threads ();
+
+  /* The forking thread delivers a non-deterministic number of
+     results, which is why expected_blobs is only a minimum number of
+     results.  */
+  printf ("info: %zu blob results observed\n",
+          dynarray_blob_size (&global_result));
+  TEST_VERIFY (dynarray_blob_size (&global_result) >= expected_blobs);
+
+  /* Verify that there are no duplicates.  */
+  qsort (dynarray_blob_begin (&global_result),
+         dynarray_blob_size (&global_result),
+         sizeof (struct blob), compare_blob);
+  struct blob *end = dynarray_blob_end (&global_result);
+  for (struct blob *p = dynarray_blob_begin (&global_result) + 1;
+       p < end; ++p)
+    {
+      if (compare_blob (p - 1, p) == 0)
+        {
+          support_record_failure ();
+          char *quoted = support_quote_blob (p->bytes, p->size);
+          printf ("error: duplicate blob: \"%s\" (%d bytes)\n",
+                  quoted, (int) p->size);
+          printf ("  first source: thread %d, index %u\n",
+                  p[-1].thread_id, p[-1].index);
+          printf ("  second source: thread %d, index %u\n",
+                  p[0].thread_id, p[0].index);
+          free (quoted);
+        }
+    }
+
+  dynarray_blob_free (&global_result);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
-- 
2.32.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/7] benchtests: Add arc4random benchtest
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
  2022-04-13 20:23 ` [PATCH 1/7] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417) Adhemerval Zanella
  2022-04-13 20:23 ` [PATCH 2/7] stdlib: Add arc4random tests Adhemerval Zanella
@ 2022-04-13 20:23 ` Adhemerval Zanella
  2022-04-14 19:17   ` Noah Goldstein
  2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha

It shows both throughput (total bytes obtained in the test duration)
and latency for both arc4random and arc4random_buf with different
sizes.

Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.
---
 benchtests/Makefile           |   6 +-
 benchtests/bench-arc4random.c | 243 ++++++++++++++++++++++++++++++++++
 2 files changed, 248 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-arc4random.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 8dfca592fd..50b96dd71f 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -111,8 +111,12 @@ bench-string := \
   ffsll \
 # bench-string
 
+bench-stdlib := \
+  arc4random \
+# bench-stdlib
+
 ifeq (${BENCHSET},)
-bench := $(bench-math) $(bench-pthread) $(bench-string)
+bench := $(bench-math) $(bench-pthread) $(bench-string) $(bench-stdlib)
 else
 bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
 endif
diff --git a/benchtests/bench-arc4random.c b/benchtests/bench-arc4random.c
new file mode 100644
index 0000000000..9e2ba9ba34
--- /dev/null
+++ b/benchtests/bench-arc4random.c
@@ -0,0 +1,243 @@
+/* arc4random benchmarks.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include "bench-timing.h"
+#include "json-lib.h"
+#include <array_length.h>
+#include <intprops.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <support/support.h>
+#include <support/xthread.h>
+
+static volatile uint32_t r;
+static volatile sig_atomic_t timer_finished;
+
+static void timer_callback (int unused)
+{
+  timer_finished = 1;
+}
+
+static const uint32_t sizes[] = { 0, 16, 32, 64, 128 };
+
+static double
+bench_arc4random_throughput (void)
+{
+  /* Run for approximately DURATION seconds; it does not matter which
+     thread receives the signal (so there is no need to mask it on the
+     main thread).  */
+  timer_finished = 0;
+  timer_t timer = support_create_timer (DURATION, 0, false, timer_callback);
+
+  uint64_t n = 0;
+
+  while (1)
+    {
+      r = arc4random ();
+      n++;
+
+      if (timer_finished == 1)
+	break;
+    }
+
+  support_delete_timer (timer);
+
+  return (double) (n * sizeof (r)) / (double) DURATION;
+}
+
+static double
+bench_arc4random_latency (void)
+{
+  timing_t start, stop, cur;
+  const size_t iters = 1024;
+
+  TIMING_NOW (start);
+  for (size_t i = 0; i < iters; i++)
+    r = arc4random ();
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (cur, start, stop);
+
+  return (double) (cur) / (double) iters;
+}
+
+static double
+bench_arc4random_buf_throughput (size_t len)
+{
+  timer_finished = 0;
+  timer_t timer = support_create_timer (DURATION, 0, false, timer_callback);
+
+  uint8_t buf[len];
+
+  uint64_t n = 0;
+
+  while (1)
+    {
+      arc4random_buf (buf, len);
+      n++;
+
+      if (timer_finished == 1)
+	break;
+    }
+
+  support_delete_timer (timer);
+
+  uint64_t total = (n * len);
+  return (double) (total) / (double) DURATION;
+}
+
+static double
+bench_arc4random_buf_latency (size_t len)
+{
+  timing_t start, stop, cur;
+  const size_t iters = 1024;
+
+  uint8_t buf[len];
+
+  TIMING_NOW (start);
+  for (size_t i = 0; i < iters; i++)
+    arc4random_buf (buf, len);
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (cur, start, stop);
+
+  return (double) (cur) / (double) iters;
+}
+
+static void
+bench_singlethread (json_ctx_t *json_ctx)
+{
+  json_element_object_begin (json_ctx);
+
+  json_array_begin (json_ctx, "throughput");
+  for (int i = 0; i < array_length (sizes); i++)
+    if (sizes[i] == 0)
+      json_element_double (json_ctx, bench_arc4random_throughput ());
+    else
+      json_element_double (json_ctx, bench_arc4random_buf_throughput (sizes[i]));
+  json_array_end (json_ctx);
+
+  json_array_begin (json_ctx, "latency");
+  for (int i = 0; i < array_length (sizes); i++)
+    if (sizes[i] == 0)
+      json_element_double (json_ctx, bench_arc4random_latency ());
+    else
+      json_element_double (json_ctx, bench_arc4random_buf_latency (sizes[i]));
+  json_array_end (json_ctx);
+
+  json_element_object_end (json_ctx);
+}
+
+struct thr_arc4random_arg
+{
+  double ret;
+  uint32_t val;
+};
+
+static void *
+thr_arc4random_throughput (void *closure)
+{
+  struct thr_arc4random_arg *arg = closure;
+  arg->ret = arg->val == 0 ? bench_arc4random_throughput ()
+			   : bench_arc4random_buf_throughput (arg->val);
+  return NULL;
+}
+
+static void *
+thr_arc4random_latency (void *closure)
+{
+  struct thr_arc4random_arg *arg = closure;
+  arg->ret = arg->val == 0 ? bench_arc4random_latency ()
+			   : bench_arc4random_buf_latency (arg->val);
+  return NULL;
+}
+
+static void
+bench_threaded (json_ctx_t *json_ctx)
+{
+  json_element_object_begin (json_ctx);
+
+  json_array_begin (json_ctx, "throughput");
+  for (int i = 0; i < array_length (sizes); i++)
+    {
+      struct thr_arc4random_arg arg = { .val = sizes[i] };
+      pthread_t thr = xpthread_create (NULL, thr_arc4random_throughput, &arg);
+      xpthread_join (thr);
+      json_element_double (json_ctx, arg.ret);
+    }
+  json_array_end (json_ctx);
+
+  json_array_begin (json_ctx, "latency");
+  for (int i = 0; i < array_length (sizes); i++)
+    {
+      struct thr_arc4random_arg arg = { .val = sizes[i] };
+      pthread_t thr = xpthread_create (NULL, thr_arc4random_latency, &arg);
+      xpthread_join (thr);
+      json_element_double (json_ctx, arg.ret);
+    }
+  json_array_end (json_ctx);
+
+  json_element_object_end (json_ctx);
+}
+
+static void
+run_bench (json_ctx_t *json_ctx, const char *name,
+	   char *const*fnames, size_t fnameslen,
+	   void (*bench)(json_ctx_t *ctx))
+{
+  json_attr_object_begin (json_ctx, name);
+  json_array_begin (json_ctx, "functions");
+  for (int i = 0; i < fnameslen; i++)
+    json_element_string (json_ctx, fnames[i]);
+  json_array_end (json_ctx);
+
+  json_array_begin (json_ctx, "results");
+  bench (json_ctx);
+  json_array_end (json_ctx);
+  json_attr_object_end (json_ctx);
+}
+
+static int
+do_test (void)
+{
+  char *fnames[array_length (sizes) + 1];
+  fnames[0] = (char *) "arc4random";
+  for (int i = 0; i < array_length (sizes); i++)
+    fnames[i+1] = xasprintf ("arc4random_buf(%u)", sizes[i]);
+
+  json_ctx_t json_ctx;
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  run_bench (&json_ctx, "single-thread", fnames, array_length (fnames),
+	     bench_singlethread);
+  run_bench (&json_ctx, "multi-thread", fnames, array_length (fnames),
+	     bench_threaded);
+
+  json_document_end (&json_ctx);
+
+  for (int i = 0; i < array_length (sizes); i++)
+    free (fnames[i+1]);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
-- 
2.32.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2022-04-13 20:23 ` [PATCH 3/7] benchtests: Add arc4random benchtest Adhemerval Zanella
@ 2022-04-13 20:23 ` Adhemerval Zanella
  2022-04-13 23:12   ` Noah Goldstein
                     ` (2 more replies)
  2022-04-13 20:23 ` [PATCH 5/7] x86: Add AVX2 " Adhemerval Zanella
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha

It adds a vectorized ChaCha20 implementation based on libgcrypt's
cipher/chacha20-amd64-ssse3.S.  It is used only if SSSE3 is supported
and enabled by the architecture.

On a Ryzen 9 5900X it shows the following improvements (using
formatted bench-arc4random data):

GENERIC
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               375.06
arc4random_buf(0) [single-thread]        498.50
arc4random_buf(16) [single-thread]       576.86
arc4random_buf(32) [single-thread]       615.76
arc4random_buf(64) [single-thread]       633.97
--------------------------------------------------
arc4random [multi-thread]                359.86
arc4random_buf(0) [multi-thread]         479.27
arc4random_buf(16) [multi-thread]        543.65
arc4random_buf(32) [multi-thread]        581.98
arc4random_buf(64) [multi-thread]        603.01
--------------------------------------------------

SSSE3:
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               576.55
arc4random_buf(0) [single-thread]        961.77
arc4random_buf(16) [single-thread]       1309.38
arc4random_buf(32) [single-thread]       1558.69
arc4random_buf(64) [single-thread]       1728.54
--------------------------------------------------
arc4random [multi-thread]                589.52
arc4random_buf(0) [multi-thread]         967.39
arc4random_buf(16) [multi-thread]        1319.27
arc4random_buf(32) [multi-thread]        1552.96
arc4random_buf(64) [multi-thread]        1734.27
--------------------------------------------------

Checked on x86_64-linux-gnu.
---
 LICENSES                        |  20 ++
 sysdeps/generic/chacha20_arch.h |  24 +++
 sysdeps/x86_64/Makefile         |   6 +
 sysdeps/x86_64/chacha20-ssse3.S | 330 ++++++++++++++++++++++++++++++++
 sysdeps/x86_64/chacha20_arch.h  |  42 ++++
 5 files changed, 422 insertions(+)
 create mode 100644 sysdeps/generic/chacha20_arch.h
 create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
 create mode 100644 sysdeps/x86_64/chacha20_arch.h

diff --git a/LICENSES b/LICENSES
index 530893b1dc..2563abd9e2 100644
--- a/LICENSES
+++ b/LICENSES
@@ -389,3 +389,23 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
  You should have received a copy of the GNU Lesser General Public
  License along with this library; if not, see
  <https://www.gnu.org/licenses/>.  */
+\f
+sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
+following notices:
+
+Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+
+This file is part of Libgcrypt.
+
+Libgcrypt is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as
+published by the Free Software Foundation; either version 2.1 of
+the License, or (at your option) any later version.
+
+Libgcrypt is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public
+License along with this program; if not, see <http://www.gnu.org/licenses/>.
diff --git a/sysdeps/generic/chacha20_arch.h b/sysdeps/generic/chacha20_arch.h
new file mode 100644
index 0000000000..d7200ac583
--- /dev/null
+++ b/sysdeps/generic/chacha20_arch.h
@@ -0,0 +1,24 @@
+/* Chacha20 implementation, generic interface.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static inline void
+chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
+		const uint8_t *src, size_t bytes)
+{
+  chacha20_crypt_generic (state, dst, src, bytes);
+}
diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
index 79365aff2a..f43b6a1180 100644
--- a/sysdeps/x86_64/Makefile
+++ b/sysdeps/x86_64/Makefile
@@ -5,6 +5,12 @@ ifeq ($(subdir),csu)
 gen-as-const-headers += link-defines.sym
 endif
 
+ifeq ($(subdir),stdlib)
+sysdep_routines += \
+  chacha20-ssse3 \
+  # sysdep_routines
+endif
+
 ifeq ($(subdir),gmon)
 sysdep_routines += _mcount
 # We cannot compile _mcount.S with -pg because that would create
diff --git a/sysdeps/x86_64/chacha20-ssse3.S b/sysdeps/x86_64/chacha20-ssse3.S
new file mode 100644
index 0000000000..f221daf634
--- /dev/null
+++ b/sysdeps/x86_64/chacha20-ssse3.S
@@ -0,0 +1,330 @@
+/* Optimized SSSE3 implementation of ChaCha20 cipher.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Based on D. J. Bernstein reference implementation at
+   http://cr.yp.to/chacha.html:
+
+   chacha-regs.c version 20080118
+   D. J. Bernstein
+   Public domain.  */
+
+#include <sysdep.h>
+
+#ifdef PIC
+#  define rRIP (%rip)
+#else
+#  define rRIP
+#endif
+
+/* register macros */
+#define INPUT %rdi
+#define DST   %rsi
+#define SRC   %rdx
+#define NBLKS %rcx
+#define ROUND %eax
+
+/* stack structure */
+#define STACK_VEC_X12 (16)
+#define STACK_VEC_X13 (16 + STACK_VEC_X12)
+#define STACK_TMP     (16 + STACK_VEC_X13)
+#define STACK_TMP1    (16 + STACK_TMP)
+#define STACK_TMP2    (16 + STACK_TMP1)
+
+#define STACK_MAX     (16 + STACK_TMP2)
+
+/* vector registers */
+#define X0 %xmm0
+#define X1 %xmm1
+#define X2 %xmm2
+#define X3 %xmm3
+#define X4 %xmm4
+#define X5 %xmm5
+#define X6 %xmm6
+#define X7 %xmm7
+#define X8 %xmm8
+#define X9 %xmm9
+#define X10 %xmm10
+#define X11 %xmm11
+#define X12 %xmm12
+#define X13 %xmm13
+#define X14 %xmm14
+#define X15 %xmm15
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
+	movdqa    x0, t2; \
+	punpckhdq x1, t2; \
+	punpckldq x1, x0; \
+	\
+	movdqa    x2, t1; \
+	punpckldq x3, t1; \
+	punpckhdq x3, x2; \
+	\
+	movdqa     x0, x1; \
+	punpckhqdq t1, x1; \
+	punpcklqdq t1, x0; \
+	\
+	movdqa     t2, x3; \
+	punpckhqdq x2, x3; \
+	punpcklqdq x2, t2; \
+	movdqa     t2, x2;
+
+/* fill xmm register with 32-bit value from memory */
+#define pbroadcastd(mem32, xreg) \
+	movd mem32, xreg; \
+	pshufd $0, xreg, xreg;
+
+/* xor with unaligned memory operand */
+#define pxor_u(umem128, xreg, t) \
+	movdqu umem128, t; \
+	pxor t, xreg;
+
+/* xor register with unaligned src and save to unaligned dst */
+#define xor_src_dst(dst, src, offset, xreg, t) \
+	pxor_u(offset(src), xreg, t); \
+	movdqu xreg, offset(dst);
+
+#define clear(x) pxor x,x;
+
+/**********************************************************************
+  4-way chacha20
+ **********************************************************************/
+
+#define ROTATE2(v1,v2,c,tmp1,tmp2)	\
+	movdqa v1, tmp1; 		\
+	movdqa v2, tmp2; 		\
+	psrld $(32 - (c)), v1;		\
+	pslld $(c), tmp1;		\
+	paddb tmp1, v1;			\
+	psrld $(32 - (c)), v2;		\
+	pslld $(c), tmp2;		\
+	paddb tmp2, v2;
+
+#define ROTATE_SHUF_2(v1,v2,shuf)	\
+	pshufb shuf, v1;		\
+	pshufb shuf, v2;
+
+#define XOR(ds,s) \
+	pxor s, ds;
+
+#define PLUS(ds,s) \
+	paddd s, ds;
+
+#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2,\
+		      interleave_op1,interleave_op2)		\
+	movdqa L(shuf_rol16) rRIP, tmp1;			\
+		interleave_op1;					\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE_SHUF_2(d1, d2, tmp1);			\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE2(b1, b2, 12, tmp1, tmp2);			\
+	movdqa L(shuf_rol8) rRIP, tmp1;				\
+		interleave_op2;					\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE_SHUF_2(d1, d2, tmp1);			\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE2(b1, b2,  7, tmp1, tmp2);
+
+	.text
+
+chacha20_data:
+	.align 16
+L(shuf_rol16):
+	.byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
+L(shuf_rol8):
+	.byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
+L(counter1):
+	.long 1,0,0,0
+L(inc_counter):
+	.long 0,1,2,3
+L(unsigned_cmp):
+	.long 0x80000000,0x80000000,0x80000000,0x80000000
+
+ENTRY (__chacha20_ssse3_blocks8)
+	/* input:
+	 *	%rdi: input
+	 *	%rsi: dst
+	 *	%rdx: src
+	 *	%rcx: nblks (multiple of 4)
+	 */
+
+	pushq %rbp;
+	cfi_adjust_cfa_offset(8);
+	cfi_rel_offset(rbp, 0)
+	movq %rsp, %rbp;
+	cfi_def_cfa_register(%rbp);
+
+	subq $STACK_MAX, %rsp;
+	andq $~15, %rsp;
+
+L(loop4):
+	mov $20, ROUND;
+
+	/* Construct counter vectors X12 and X13 */
+	movdqa L(inc_counter) rRIP, X0;
+	movdqa L(unsigned_cmp) rRIP, X2;
+	pbroadcastd((12 * 4)(INPUT), X12);
+	pbroadcastd((13 * 4)(INPUT), X13);
+	paddd X0, X12;
+	movdqa X12, X1;
+	pxor X2, X0;
+	pxor X2, X1;
+	pcmpgtd X1, X0;
+	psubd X0, X13;
+	movdqa X12, (STACK_VEC_X12)(%rsp);
+	movdqa X13, (STACK_VEC_X13)(%rsp);
+
+	/* Load vectors */
+	pbroadcastd((0 * 4)(INPUT), X0);
+	pbroadcastd((1 * 4)(INPUT), X1);
+	pbroadcastd((2 * 4)(INPUT), X2);
+	pbroadcastd((3 * 4)(INPUT), X3);
+	pbroadcastd((4 * 4)(INPUT), X4);
+	pbroadcastd((5 * 4)(INPUT), X5);
+	pbroadcastd((6 * 4)(INPUT), X6);
+	pbroadcastd((7 * 4)(INPUT), X7);
+	pbroadcastd((8 * 4)(INPUT), X8);
+	pbroadcastd((9 * 4)(INPUT), X9);
+	pbroadcastd((10 * 4)(INPUT), X10);
+	pbroadcastd((11 * 4)(INPUT), X11);
+	pbroadcastd((14 * 4)(INPUT), X14);
+	pbroadcastd((15 * 4)(INPUT), X15);
+	movdqa X11, (STACK_TMP)(%rsp);
+	movdqa X15, (STACK_TMP1)(%rsp);
+
+L(round2_4):
+	QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X11,X15,,)
+	movdqa (STACK_TMP)(%rsp), X11;
+	movdqa (STACK_TMP1)(%rsp), X15;
+	movdqa X8, (STACK_TMP)(%rsp);
+	movdqa X9, (STACK_TMP1)(%rsp);
+	QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,X9,,)
+	QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,X9,,)
+	movdqa (STACK_TMP)(%rsp), X8;
+	movdqa (STACK_TMP1)(%rsp), X9;
+	movdqa X11, (STACK_TMP)(%rsp);
+	movdqa X15, (STACK_TMP1)(%rsp);
+	QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X11,X15,,)
+	sub $2, ROUND;
+	jnz L(round2_4);
+
+	/* tmp := X15 */
+	movdqa (STACK_TMP)(%rsp), X11;
+	pbroadcastd((0 * 4)(INPUT), X15);
+	PLUS(X0, X15);
+	pbroadcastd((1 * 4)(INPUT), X15);
+	PLUS(X1, X15);
+	pbroadcastd((2 * 4)(INPUT), X15);
+	PLUS(X2, X15);
+	pbroadcastd((3 * 4)(INPUT), X15);
+	PLUS(X3, X15);
+	pbroadcastd((4 * 4)(INPUT), X15);
+	PLUS(X4, X15);
+	pbroadcastd((5 * 4)(INPUT), X15);
+	PLUS(X5, X15);
+	pbroadcastd((6 * 4)(INPUT), X15);
+	PLUS(X6, X15);
+	pbroadcastd((7 * 4)(INPUT), X15);
+	PLUS(X7, X15);
+	pbroadcastd((8 * 4)(INPUT), X15);
+	PLUS(X8, X15);
+	pbroadcastd((9 * 4)(INPUT), X15);
+	PLUS(X9, X15);
+	pbroadcastd((10 * 4)(INPUT), X15);
+	PLUS(X10, X15);
+	pbroadcastd((11 * 4)(INPUT), X15);
+	PLUS(X11, X15);
+	movdqa (STACK_VEC_X12)(%rsp), X15;
+	PLUS(X12, X15);
+	movdqa (STACK_VEC_X13)(%rsp), X15;
+	PLUS(X13, X15);
+	movdqa X13, (STACK_TMP)(%rsp);
+	pbroadcastd((14 * 4)(INPUT), X15);
+	PLUS(X14, X15);
+	movdqa (STACK_TMP1)(%rsp), X15;
+	movdqa X14, (STACK_TMP1)(%rsp);
+	pbroadcastd((15 * 4)(INPUT), X13);
+	PLUS(X15, X13);
+	movdqa X15, (STACK_TMP2)(%rsp);
+
+	/* Update counter */
+	addq $4, (12 * 4)(INPUT);
+
+	transpose_4x4(X0, X1, X2, X3, X13, X14, X15);
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0, X15);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1, X15);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2, X15);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3, X15);
+	transpose_4x4(X4, X5, X6, X7, X0, X1, X2);
+	movdqa (STACK_TMP)(%rsp), X13;
+	movdqa (STACK_TMP1)(%rsp), X14;
+	movdqa (STACK_TMP2)(%rsp), X15;
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 1), X4, X0);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 1), X5, X0);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 1), X6, X0);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 1), X7, X0);
+	transpose_4x4(X8, X9, X10, X11, X0, X1, X2);
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8, X0);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9, X0);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10, X0);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11, X0);
+	transpose_4x4(X12, X13, X14, X15, X0, X1, X2);
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 3), X12, X0);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 3), X13, X0);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 3), X14, X0);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 3), X15, X0);
+
+	sub $4, NBLKS;
+	lea (4 * 64)(DST), DST;
+	lea (4 * 64)(SRC), SRC;
+	jnz L(loop4);
+
+	/* clear the used vector registers and stack */
+	clear(X0);
+	movdqa X0, (STACK_VEC_X12)(%rsp);
+	movdqa X0, (STACK_VEC_X13)(%rsp);
+	movdqa X0, (STACK_TMP)(%rsp);
+	movdqa X0, (STACK_TMP1)(%rsp);
+	movdqa X0, (STACK_TMP2)(%rsp);
+	clear(X1);
+	clear(X2);
+	clear(X3);
+	clear(X4);
+	clear(X5);
+	clear(X6);
+	clear(X7);
+	clear(X8);
+	clear(X9);
+	clear(X10);
+	clear(X11);
+	clear(X12);
+	clear(X13);
+	clear(X14);
+	clear(X15);
+
+	/* eax zeroed by round loop. */
+	leave;
+	cfi_adjust_cfa_offset(-8)
+	cfi_def_cfa_register(%rsp);
+	ret;
+	int3;
+END (__chacha20_ssse3_blocks8)
diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
new file mode 100644
index 0000000000..37a4fdfb1f
--- /dev/null
+++ b/sysdeps/x86_64/chacha20_arch.h
@@ -0,0 +1,42 @@
+/* Chacha20 implementation, used on arc4random.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <ldsodefs.h>
+#include <cpu-features.h>
+#include <sys/param.h>
+
+unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
+				       const uint8_t *src, size_t nblks);
+
+static inline void
+chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
+		size_t bytes)
+{
+  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
+    {
+      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
+      nblocks -= nblocks % 4;
+      __chacha20_ssse3_blocks8 (state->ctx, dst, src, nblocks);
+      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
+      dst += nblocks * CHACHA20_BLOCK_SIZE;
+      src += nblocks * CHACHA20_BLOCK_SIZE;
+    }
+
+  if (bytes > 0)
+    chacha20_crypt_generic (state, dst, src, bytes);
+}
-- 
2.32.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 5/7] x86: Add AVX2 optimized chacha20
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (3 preceding siblings ...)
  2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
@ 2022-04-13 20:23 ` Adhemerval Zanella
  2022-04-13 23:04   ` Noah Goldstein
  2022-04-13 20:24 ` [PATCH 6/7] aarch64: Add " Adhemerval Zanella
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:23 UTC (permalink / raw)
  To: libc-alpha

This patch adds a vectorized ChaCha20 implementation based on the
libgcrypt cipher/chacha20-amd64-avx2.S.  It is used only if AVX2 is
supported and enabled by the architecture.

On a Ryzen 9 5900X it shows the following improvements (using
formatted bench-arc4random data):

SSSE3:
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               576.55
arc4random_buf(0) [single-thread]        961.77
arc4random_buf(16) [single-thread]       1309.38
arc4random_buf(32) [single-thread]       1558.69
arc4random_buf(64) [single-thread]       1728.54
--------------------------------------------------
arc4random [multi-thread]                589.52
arc4random_buf(0) [multi-thread]         967.39
arc4random_buf(16) [multi-thread]        1319.27
arc4random_buf(32) [multi-thread]        1552.96
arc4random_buf(64) [multi-thread]        1734.27
--------------------------------------------------

AVX2:
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               672.49
arc4random_buf(0) [single-thread]        1234.85
arc4random_buf(16) [single-thread]       1892.67
arc4random_buf(32) [single-thread]       2491.10
arc4random_buf(64) [single-thread]       2696.27
--------------------------------------------------
arc4random [multi-thread]                661.25
arc4random_buf(0) [multi-thread]         1214.65
arc4random_buf(16) [multi-thread]        1867.98
arc4random_buf(32) [multi-thread]        2474.70
arc4random_buf(64) [multi-thread]        2893.21
--------------------------------------------------

Checked on x86_64-linux-gnu.
---
 LICENSES                       |   4 +-
 stdlib/chacha20.c              |   7 +-
 sysdeps/x86_64/Makefile        |   1 +
 sysdeps/x86_64/chacha20-avx2.S | 317 +++++++++++++++++++++++++++++++++
 sysdeps/x86_64/chacha20_arch.h |  14 ++
 5 files changed, 339 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/x86_64/chacha20-avx2.S

diff --git a/LICENSES b/LICENSES
index 2563abd9e2..8ef0f023d7 100644
--- a/LICENSES
+++ b/LICENSES
@@ -390,8 +390,8 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
  License along with this library; if not, see
  <https://www.gnu.org/licenses/>.  */
 \f
-sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
-following notices:
+sysdeps/x86_64/chacha20-ssse3.S and sysdeps/x86_64/chacha20-avx2.S
+import code from libgcrypt, with the following notices:
 
 Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
 
diff --git a/stdlib/chacha20.c b/stdlib/chacha20.c
index dbd87bd942..8569e1e78d 100644
--- a/stdlib/chacha20.c
+++ b/stdlib/chacha20.c
@@ -190,8 +190,8 @@ memxorcpy (uint8_t *dst, const uint8_t *src1, const uint8_t *src2, size_t len)
 }
 
 static void
-chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
-		const uint8_t *src, size_t bytes)
+chacha20_crypt_generic (struct chacha20_state *state, uint8_t *dst,
+			const uint8_t *src, size_t bytes)
 {
   uint8_t stream[CHACHA20_BLOCK_SIZE];
 
@@ -209,3 +209,6 @@ chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
       memxorcpy (dst, src, stream, bytes);
     }
 }
+
+/* Get the arch-optimized implementation, if any.  */
+#include <chacha20_arch.h>
diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
index f43b6a1180..afb4d173e8 100644
--- a/sysdeps/x86_64/Makefile
+++ b/sysdeps/x86_64/Makefile
@@ -7,6 +7,7 @@ endif
 
 ifeq ($(subdir),stdlib)
 sysdep_routines += \
+  chacha20-avx2 \
   chacha20-ssse3 \
   # sysdep_routines
 endif
diff --git a/sysdeps/x86_64/chacha20-avx2.S b/sysdeps/x86_64/chacha20-avx2.S
new file mode 100644
index 0000000000..96174c0e40
--- /dev/null
+++ b/sysdeps/x86_64/chacha20-avx2.S
@@ -0,0 +1,317 @@
+/* Optimized AVX2 implementation of ChaCha20 cipher.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Based on D. J. Bernstein reference implementation at
+   http://cr.yp.to/chacha.html:
+
+   chacha-regs.c version 20080118
+   D. J. Bernstein
+   Public domain.  */
+
+#ifdef PIC
+#  define rRIP (%rip)
+#else
+#  define rRIP
+#endif
+
+/* register macros */
+#define INPUT %rdi
+#define DST   %rsi
+#define SRC   %rdx
+#define NBLKS %rcx
+#define ROUND %eax
+
+/* stack structure */
+#define STACK_VEC_X12 (32)
+#define STACK_VEC_X13 (32 + STACK_VEC_X12)
+#define STACK_TMP     (32 + STACK_VEC_X13)
+#define STACK_TMP1    (32 + STACK_TMP)
+
+#define STACK_MAX     (32 + STACK_TMP1)
+
+/* vector registers */
+#define X0 %ymm0
+#define X1 %ymm1
+#define X2 %ymm2
+#define X3 %ymm3
+#define X4 %ymm4
+#define X5 %ymm5
+#define X6 %ymm6
+#define X7 %ymm7
+#define X8 %ymm8
+#define X9 %ymm9
+#define X10 %ymm10
+#define X11 %ymm11
+#define X12 %ymm12
+#define X13 %ymm13
+#define X14 %ymm14
+#define X15 %ymm15
+
+#define X0h %xmm0
+#define X1h %xmm1
+#define X2h %xmm2
+#define X3h %xmm3
+#define X4h %xmm4
+#define X5h %xmm5
+#define X6h %xmm6
+#define X7h %xmm7
+#define X8h %xmm8
+#define X9h %xmm9
+#define X10h %xmm10
+#define X11h %xmm11
+#define X12h %xmm12
+#define X13h %xmm13
+#define X14h %xmm14
+#define X15h %xmm15
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0,x1,x2,x3,t1,t2) \
+	vpunpckhdq x1, x0, t2; \
+	vpunpckldq x1, x0, x0; \
+	\
+	vpunpckldq x3, x2, t1; \
+	vpunpckhdq x3, x2, x2; \
+	\
+	vpunpckhqdq t1, x0, x1; \
+	vpunpcklqdq t1, x0, x0; \
+	\
+	vpunpckhqdq x2, t2, x3; \
+	vpunpcklqdq x2, t2, x2;
+
+/* 2x2 128-bit matrix transpose */
+#define transpose_16byte_2x2(x0,x1,t1) \
+	vmovdqa    x0, t1; \
+	vperm2i128 $0x20, x1, x0, x0; \
+	vperm2i128 $0x31, x1, t1, x1;
+
+/* xor register with unaligned src and save to unaligned dst */
+#define xor_src_dst(dst, src, offset, xreg) \
+	vpxor offset(src), xreg, xreg; \
+	vmovdqu xreg, offset(dst);
+
+/**********************************************************************
+  8-way chacha20
+ **********************************************************************/
+
+#define ROTATE2(v1,v2,c,tmp)	\
+	vpsrld $(32 - (c)), v1, tmp;	\
+	vpslld $(c), v1, v1;		\
+	vpaddb tmp, v1, v1;		\
+	vpsrld $(32 - (c)), v2, tmp;	\
+	vpslld $(c), v2, v2;		\
+	vpaddb tmp, v2, v2;
+
+#define ROTATE_SHUF_2(v1,v2,shuf)	\
+	vpshufb shuf, v1, v1;		\
+	vpshufb shuf, v2, v2;
+
+#define XOR(ds,s) \
+	vpxor s, ds, ds;
+
+#define PLUS(ds,s) \
+	vpaddd s, ds, ds;
+
+#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,\
+		      interleave_op1,interleave_op2,\
+		      interleave_op3,interleave_op4)		\
+	vbroadcasti128 .Lshuf_rol16 rRIP, tmp1;			\
+		interleave_op1;					\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE_SHUF_2(d1, d2, tmp1);			\
+		interleave_op2;					\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE2(b1, b2, 12, tmp1);				\
+	vbroadcasti128 .Lshuf_rol8 rRIP, tmp1;			\
+		interleave_op3;					\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE_SHUF_2(d1, d2, tmp1);			\
+		interleave_op4;					\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE2(b1, b2,  7, tmp1);
+
+	.text
+	.align 32
+chacha20_data:
+L(shuf_rol16):
+	.byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
+L(shuf_rol8):
+	.byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
+L(inc_counter):
+	.byte 0,1,2,3,4,5,6,7
+L(unsigned_cmp):
+	.long 0x80000000
+
+ENTRY (__chacha20_avx2_blocks8)
+	/* input:
+	 *	%rdi: input
+	 *	%rsi: dst
+	 *	%rdx: src
+	 *	%rcx: nblks (multiple of 8)
+	 */
+	vzeroupper;
+
+	pushq %rbp;
+	cfi_adjust_cfa_offset(8);
+	cfi_rel_offset(rbp, 0)
+	movq %rsp, %rbp;
+	cfi_def_cfa_register(rbp);
+
+	subq $STACK_MAX, %rsp;
+	andq $~31, %rsp;
+
+L(loop8):
+	mov $20, ROUND;
+
+	/* Construct counter vectors X12 and X13 */
+	vpmovzxbd L(inc_counter) rRIP, X0;
+	vpbroadcastd L(unsigned_cmp) rRIP, X2;
+	vpbroadcastd (12 * 4)(INPUT), X12;
+	vpbroadcastd (13 * 4)(INPUT), X13;
+	vpaddd X0, X12, X12;
+	vpxor X2, X0, X0;
+	vpxor X2, X12, X1;
+	vpcmpgtd X1, X0, X0;
+	vpsubd X0, X13, X13;
+	vmovdqa X12, (STACK_VEC_X12)(%rsp);
+	vmovdqa X13, (STACK_VEC_X13)(%rsp);
+
+	/* Load vectors */
+	vpbroadcastd (0 * 4)(INPUT), X0;
+	vpbroadcastd (1 * 4)(INPUT), X1;
+	vpbroadcastd (2 * 4)(INPUT), X2;
+	vpbroadcastd (3 * 4)(INPUT), X3;
+	vpbroadcastd (4 * 4)(INPUT), X4;
+	vpbroadcastd (5 * 4)(INPUT), X5;
+	vpbroadcastd (6 * 4)(INPUT), X6;
+	vpbroadcastd (7 * 4)(INPUT), X7;
+	vpbroadcastd (8 * 4)(INPUT), X8;
+	vpbroadcastd (9 * 4)(INPUT), X9;
+	vpbroadcastd (10 * 4)(INPUT), X10;
+	vpbroadcastd (11 * 4)(INPUT), X11;
+	vpbroadcastd (14 * 4)(INPUT), X14;
+	vpbroadcastd (15 * 4)(INPUT), X15;
+	vmovdqa X15, (STACK_TMP)(%rsp);
+
+L(round2):
+	QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X15,,,,)
+	vmovdqa (STACK_TMP)(%rsp), X15;
+	vmovdqa X8, (STACK_TMP)(%rsp);
+	QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,,,,)
+	QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,,,,)
+	vmovdqa (STACK_TMP)(%rsp), X8;
+	vmovdqa X15, (STACK_TMP)(%rsp);
+	QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X15,,,,)
+	sub $2, ROUND;
+	jnz L(round2);
+
+	vmovdqa X8, (STACK_TMP1)(%rsp);
+
+	/* tmp := X15 */
+	vpbroadcastd (0 * 4)(INPUT), X15;
+	PLUS(X0, X15);
+	vpbroadcastd (1 * 4)(INPUT), X15;
+	PLUS(X1, X15);
+	vpbroadcastd (2 * 4)(INPUT), X15;
+	PLUS(X2, X15);
+	vpbroadcastd (3 * 4)(INPUT), X15;
+	PLUS(X3, X15);
+	vpbroadcastd (4 * 4)(INPUT), X15;
+	PLUS(X4, X15);
+	vpbroadcastd (5 * 4)(INPUT), X15;
+	PLUS(X5, X15);
+	vpbroadcastd (6 * 4)(INPUT), X15;
+	PLUS(X6, X15);
+	vpbroadcastd (7 * 4)(INPUT), X15;
+	PLUS(X7, X15);
+	transpose_4x4(X0, X1, X2, X3, X8, X15);
+	transpose_4x4(X4, X5, X6, X7, X8, X15);
+	vmovdqa (STACK_TMP1)(%rsp), X8;
+	transpose_16byte_2x2(X0, X4, X15);
+	transpose_16byte_2x2(X1, X5, X15);
+	transpose_16byte_2x2(X2, X6, X15);
+	transpose_16byte_2x2(X3, X7, X15);
+	vmovdqa (STACK_TMP)(%rsp), X15;
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1);
+	vpbroadcastd (8 * 4)(INPUT), X0;
+	PLUS(X8, X0);
+	vpbroadcastd (9 * 4)(INPUT), X0;
+	PLUS(X9, X0);
+	vpbroadcastd (10 * 4)(INPUT), X0;
+	PLUS(X10, X0);
+	vpbroadcastd (11 * 4)(INPUT), X0;
+	PLUS(X11, X0);
+	vmovdqa (STACK_VEC_X12)(%rsp), X0;
+	PLUS(X12, X0);
+	vmovdqa (STACK_VEC_X13)(%rsp), X0;
+	PLUS(X13, X0);
+	vpbroadcastd (14 * 4)(INPUT), X0;
+	PLUS(X14, X0);
+	vpbroadcastd (15 * 4)(INPUT), X0;
+	PLUS(X15, X0);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3);
+
+	/* Update counter */
+	addq $8, (12 * 4)(INPUT);
+
+	transpose_4x4(X8, X9, X10, X11, X0, X1);
+	transpose_4x4(X12, X13, X14, X15, X0, X1);
+	xor_src_dst(DST, SRC, (64 * 4 + 16 * 0), X4);
+	xor_src_dst(DST, SRC, (64 * 5 + 16 * 0), X5);
+	transpose_16byte_2x2(X8, X12, X0);
+	transpose_16byte_2x2(X9, X13, X0);
+	transpose_16byte_2x2(X10, X14, X0);
+	transpose_16byte_2x2(X11, X15, X0);
+	xor_src_dst(DST, SRC, (64 * 6 + 16 * 0), X6);
+	xor_src_dst(DST, SRC, (64 * 7 + 16 * 0), X7);
+	xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8);
+	xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9);
+	xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10);
+	xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11);
+	xor_src_dst(DST, SRC, (64 * 4 + 16 * 2), X12);
+	xor_src_dst(DST, SRC, (64 * 5 + 16 * 2), X13);
+	xor_src_dst(DST, SRC, (64 * 6 + 16 * 2), X14);
+	xor_src_dst(DST, SRC, (64 * 7 + 16 * 2), X15);
+
+	sub $8, NBLKS;
+	lea (8 * 64)(DST), DST;
+	lea (8 * 64)(SRC), SRC;
+	jnz L(loop8);
+
+	/* clear the used vector registers and stack */
+	vpxor X0, X0, X0;
+	vmovdqa X0, (STACK_VEC_X12)(%rsp);
+	vmovdqa X0, (STACK_VEC_X13)(%rsp);
+	vmovdqa X0, (STACK_TMP)(%rsp);
+	vmovdqa X0, (STACK_TMP1)(%rsp);
+	vzeroall;
+
+	/* eax zeroed by round loop. */
+	leave;
+	cfi_adjust_cfa_offset(-8)
+	cfi_def_cfa_register(%rsp);
+	ret;
+	int3;
+END(__chacha20_avx2_blocks8)
diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
index 37a4fdfb1f..7e9e7755f3 100644
--- a/sysdeps/x86_64/chacha20_arch.h
+++ b/sysdeps/x86_64/chacha20_arch.h
@@ -22,11 +22,25 @@
 
 unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
 				       const uint8_t *src, size_t nblks);
+unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
+				      const uint8_t *src, size_t nblks);
 
 static inline void
 chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
 		size_t bytes)
 {
+  const struct cpu_features* cpu_features = __get_cpu_features ();
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
+    {
+      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
+      nblocks -= nblocks % 8;
+      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
+      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
+      dst += nblocks * CHACHA20_BLOCK_SIZE;
+      src += nblocks * CHACHA20_BLOCK_SIZE;
+    }
+
   if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
     {
       size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
-- 
2.32.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 6/7] aarch64: Add optimized chacha20
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (4 preceding siblings ...)
  2022-04-13 20:23 ` [PATCH 5/7] x86: Add AVX2 " Adhemerval Zanella
@ 2022-04-13 20:24 ` Adhemerval Zanella
  2022-04-13 20:24 ` [PATCH 7/7] powerpc64: " Adhemerval Zanella
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:24 UTC (permalink / raw)
  To: libc-alpha

This patch adds a vectorized ChaCha20 implementation based on the
libgcrypt cipher/chacha20-aarch64.S.  It is used by default; only
little-endian is supported (big-endian uses the generic fallback code).

On a Neoverse-N1 it shows the following improvements (using
formatted bench-arc4random data):

GENERIC
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               129.96
arc4random_buf(0) [single-thread]        245.83
arc4random_buf(16) [single-thread]       312.38
arc4random_buf(32) [single-thread]       353.77
arc4random_buf(64) [single-thread]       380.53
--------------------------------------------------
arc4random [multi-thread]                129.63
arc4random_buf(0) [multi-thread]         245.54
arc4random_buf(16) [multi-thread]        309.15
arc4random_buf(32) [multi-thread]        356.40
arc4random_buf(64) [multi-thread]        381.94
--------------------------------------------------

OPTIMIZED
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               153.76
arc4random_buf(0) [single-thread]        349.12
arc4random_buf(16) [single-thread]       498.68
arc4random_buf(32) [single-thread]       619.87
arc4random_buf(64) [single-thread]       706.69
--------------------------------------------------
arc4random [multi-thread]                154.25
arc4random_buf(0) [multi-thread]         349.08
arc4random_buf(16) [multi-thread]        494.77
arc4random_buf(32) [multi-thread]        623.87
arc4random_buf(64) [multi-thread]        706.63
--------------------------------------------------

Checked on aarch64-linux-gnu.
---
 LICENSES                        |   5 +-
 sysdeps/aarch64/Makefile        |   4 +
 sysdeps/aarch64/chacha20.S      | 357 ++++++++++++++++++++++++++++++++
 sysdeps/aarch64/chacha20_arch.h |  43 ++++
 4 files changed, 407 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/aarch64/chacha20.S
 create mode 100644 sysdeps/aarch64/chacha20_arch.h

diff --git a/LICENSES b/LICENSES
index 8ef0f023d7..b0c43495cb 100644
--- a/LICENSES
+++ b/LICENSES
@@ -390,8 +390,9 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
  License along with this library; if not, see
  <https://www.gnu.org/licenses/>.  */
 \f
-sysdeps/x86_64/chacha20-ssse3.S and sysdeps/x86_64/chacha20-avx2.S
-import code from libgcrypt, with the following notices:
+sysdeps/x86_64/chacha20-ssse3.S, sysdeps/x86_64/chacha20-avx2.S, and
+sysdeps/aarch64/chacha20.S import code from libgcrypt, with the
+following notices:
 
 Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
 
diff --git a/sysdeps/aarch64/Makefile b/sysdeps/aarch64/Makefile
index 7183895d04..173665e306 100644
--- a/sysdeps/aarch64/Makefile
+++ b/sysdeps/aarch64/Makefile
@@ -50,6 +50,10 @@ ifeq ($(subdir),csu)
 gen-as-const-headers += tlsdesc.sym
 endif
 
+ifeq ($(subdir),stdlib)
+sysdep_routines += chacha20
+endif
+
 ifeq ($(subdir),gmon)
 CFLAGS-mcount.c += -mgeneral-regs-only
 endif
diff --git a/sysdeps/aarch64/chacha20.S b/sysdeps/aarch64/chacha20.S
new file mode 100644
index 0000000000..730b9a14b9
--- /dev/null
+++ b/sysdeps/aarch64/chacha20.S
@@ -0,0 +1,357 @@
+/* Optimized AArch64 implementation of ChaCha20 cipher.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Only LE is supported.  */
+#ifdef __AARCH64EL__
+
+/* Based on D. J. Bernstein reference implementation at
+   http://cr.yp.to/chacha.html:
+
+   chacha-regs.c version 20080118
+   D. J. Bernstein
+   Public domain.  */
+
+#define GET_DATA_POINTER(reg, name) \
+        adrp    reg, :got:name ; \
+        ldr     reg, [reg, #:got_lo12:name] ;
+
+/* 'ret' instruction replacement for straight-line speculation mitigation */
+#define ret_spec_stop \
+        ret; dsb sy; isb;
+
+.cpu generic+simd
+
+.text
+
+/* register macros */
+#define INPUT     x0
+#define DST       x1
+#define SRC       x2
+#define NBLKS     x3
+#define ROUND     x4
+#define INPUT_CTR x5
+#define INPUT_POS x6
+#define CTR       x7
+
+/* vector registers */
+#define X0 v16
+#define X1 v17
+#define X2 v18
+#define X3 v19
+#define X4 v20
+#define X5 v21
+#define X6 v22
+#define X7 v23
+#define X8 v24
+#define X9 v25
+#define X10 v26
+#define X11 v27
+#define X12 v28
+#define X13 v29
+#define X14 v30
+#define X15 v31
+
+#define VCTR    v0
+#define VTMP0   v1
+#define VTMP1   v2
+#define VTMP2   v3
+#define VTMP3   v4
+#define X12_TMP v5
+#define X13_TMP v6
+#define ROT8    v7
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+#define _(...) __VA_ARGS__
+
+#define vpunpckldq(s1, s2, dst) \
+	zip1 dst.4s, s2.4s, s1.4s;
+
+#define vpunpckhdq(s1, s2, dst) \
+	zip2 dst.4s, s2.4s, s1.4s;
+
+#define vpunpcklqdq(s1, s2, dst) \
+	zip1 dst.2d, s2.2d, s1.2d;
+
+#define vpunpckhqdq(s1, s2, dst) \
+	zip2 dst.2d, s2.2d, s1.2d;
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
+	vpunpckhdq(x1, x0, t2); \
+	vpunpckldq(x1, x0, x0); \
+	\
+	vpunpckldq(x3, x2, t1); \
+	vpunpckhdq(x3, x2, x2); \
+	\
+	vpunpckhqdq(t1, x0, x1); \
+	vpunpcklqdq(t1, x0, x0); \
+	\
+	vpunpckhqdq(x2, t2, x3); \
+	vpunpcklqdq(x2, t2, x2);
+
+#define clear(x) \
+	movi x.16b, #0;
+
+/**********************************************************************
+  4-way chacha20
+ **********************************************************************/
+
+#define XOR(d,s1,s2) \
+	eor d.16b, s2.16b, s1.16b;
+
+#define PLUS(ds,s) \
+	add ds.4s, ds.4s, s.4s;
+
+#define ROTATE4(dst1,dst2,dst3,dst4,c,src1,src2,src3,src4,iop1,iop2,iop3) \
+	shl dst1.4s, src1.4s, #(c);		\
+	shl dst2.4s, src2.4s, #(c);		\
+	iop1;					\
+	shl dst3.4s, src3.4s, #(c);		\
+	shl dst4.4s, src4.4s, #(c);		\
+	iop2;					\
+	sri dst1.4s, src1.4s, #(32 - (c));	\
+	sri dst2.4s, src2.4s, #(32 - (c));	\
+	iop3;					\
+	sri dst3.4s, src3.4s, #(32 - (c));	\
+	sri dst4.4s, src4.4s, #(32 - (c));
+
+#define ROTATE4_8(dst1,dst2,dst3,dst4,src1,src2,src3,src4,iop1,iop2,iop3) \
+	tbl dst1.16b, {src1.16b}, ROT8.16b;     \
+	iop1;					\
+	tbl dst2.16b, {src2.16b}, ROT8.16b;	\
+	iop2;					\
+	tbl dst3.16b, {src3.16b}, ROT8.16b;	\
+	iop3;					\
+	tbl dst4.16b, {src4.16b}, ROT8.16b;
+
+#define ROTATE4_16(dst1,dst2,dst3,dst4,src1,src2,src3,src4,iop1) \
+	rev32 dst1.8h, src1.8h;			\
+	rev32 dst2.8h, src2.8h;			\
+	iop1;					\
+	rev32 dst3.8h, src3.8h;			\
+	rev32 dst4.8h, src4.8h;
+
+#define QUARTERROUND4(a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3,a4,b4,c4,d4,ign,tmp1,tmp2,tmp3,tmp4,\
+		      iop1,iop2,iop3,iop4,iop5,iop6,iop7,iop8,iop9,iop10,iop11,iop12,iop13,iop14,\
+		      iop15,iop16,iop17,iop18,iop19,iop20,iop21,iop22,iop23,iop24,iop25,iop26,\
+		      iop27,iop28,iop29) \
+	PLUS(a1,b1); PLUS(a2,b2); iop1;						\
+	PLUS(a3,b3); PLUS(a4,b4); iop2;						\
+	    XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); iop3;				\
+	    XOR(tmp3,d3,a3); XOR(tmp4,d4,a4); iop4;				\
+		ROTATE4_16(d1, d2, d3, d4, tmp1, tmp2, tmp3, tmp4, _(iop5));	\
+		iop6;								\
+	PLUS(c1,d1); PLUS(c2,d2); iop7;						\
+	PLUS(c3,d3); PLUS(c4,d4); iop8;						\
+	    XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); iop9;				\
+	    XOR(tmp3,b3,c3); XOR(tmp4,b4,c4); iop10;				\
+		ROTATE4(b1, b2, b3, b4, 12, tmp1, tmp2, tmp3, tmp4,		\
+			_(iop11), _(iop12), _(iop13)); iop14;			\
+	PLUS(a1,b1); PLUS(a2,b2); iop15;					\
+	PLUS(a3,b3); PLUS(a4,b4); iop16;					\
+	    XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); iop17;				\
+	    XOR(tmp3,d3,a3); XOR(tmp4,d4,a4); iop18;				\
+		ROTATE4_8(d1, d2, d3, d4, tmp1, tmp2, tmp3, tmp4,		\
+			  _(iop19), _(iop20), _(iop21)); iop22;			\
+	PLUS(c1,d1); PLUS(c2,d2); iop23;					\
+	PLUS(c3,d3); PLUS(c4,d4); iop24;					\
+	    XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); iop25;				\
+	    XOR(tmp3,b3,c3); XOR(tmp4,b4,c4); iop26;				\
+		ROTATE4(b1, b2, b3, b4, 7, tmp1, tmp2, tmp3, tmp4,		\
+			_(iop27), _(iop28), _(iop29));
+
+.align 4
+.hidden __chacha20_blocks4_data_inc_counter
+__chacha20_blocks4_data_inc_counter:
+	.long 0,1,2,3
+
+.align 4
+.hidden __chacha20_blocks4_data_rot8
+__chacha20_blocks4_data_rot8:
+	.byte 3,0,1,2
+	.byte 7,4,5,6
+	.byte 11,8,9,10
+	.byte 15,12,13,14
+
+ENTRY (__chacha20_neon_blocks4)
+	/* input:
+	 *	x0: input
+	 *	x1: dst
+	 *	x2: src
+	 *	x3: nblks (multiple of 4)
+	 */
+
+	GET_DATA_POINTER(CTR, __chacha20_blocks4_data_rot8);
+	add INPUT_CTR, INPUT, #(12*4);
+	ld1 {ROT8.16b}, [CTR];
+	GET_DATA_POINTER(CTR, __chacha20_blocks4_data_inc_counter);
+	mov INPUT_POS, INPUT;
+	ld1 {VCTR.16b}, [CTR];
+
+L(loop4):
+	/* Construct counter vectors X12 and X13 */
+
+	ld1 {X15.16b}, [INPUT_CTR];
+	mov ROUND, #20;
+	ld1 {VTMP1.16b-VTMP3.16b}, [INPUT_POS];
+
+	dup X12.4s, X15.s[0];
+	dup X13.4s, X15.s[1];
+	ldr CTR, [INPUT_CTR];
+	add X12.4s, X12.4s, VCTR.4s;
+	dup X0.4s, VTMP1.s[0];
+	dup X1.4s, VTMP1.s[1];
+	dup X2.4s, VTMP1.s[2];
+	dup X3.4s, VTMP1.s[3];
+	dup X14.4s, X15.s[2];
+	cmhi VTMP0.4s, VCTR.4s, X12.4s;
+	dup X15.4s, X15.s[3];
+	add CTR, CTR, #4; /* Update counter */
+	dup X4.4s, VTMP2.s[0];
+	dup X5.4s, VTMP2.s[1];
+	dup X6.4s, VTMP2.s[2];
+	dup X7.4s, VTMP2.s[3];
+	sub X13.4s, X13.4s, VTMP0.4s;
+	dup X8.4s, VTMP3.s[0];
+	dup X9.4s, VTMP3.s[1];
+	dup X10.4s, VTMP3.s[2];
+	dup X11.4s, VTMP3.s[3];
+	mov X12_TMP.16b, X12.16b;
+	mov X13_TMP.16b, X13.16b;
+	str CTR, [INPUT_CTR];
+
+L(round2):
+	subs ROUND, ROUND, #2
+	QUARTERROUND4(X0, X4,  X8, X12,   X1, X5,  X9, X13,
+		      X2, X6, X10, X14,   X3, X7, X11, X15,
+		      tmp:=,VTMP0,VTMP1,VTMP2,VTMP3,
+		      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,)
+	QUARTERROUND4(X0, X5, X10, X15,   X1, X6, X11, X12,
+		      X2, X7,  X8, X13,   X3, X4,  X9, X14,
+		      tmp:=,VTMP0,VTMP1,VTMP2,VTMP3,
+		      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,)
+	b.ne L(round2);
+
+	ld1 {VTMP0.16b, VTMP1.16b}, [INPUT_POS], #32;
+
+	PLUS(X12, X12_TMP);        /* INPUT + 12 * 4 + counter */
+	PLUS(X13, X13_TMP);        /* INPUT + 13 * 4 + counter */
+
+	dup VTMP2.4s, VTMP0.s[0]; /* INPUT + 0 * 4 */
+	dup VTMP3.4s, VTMP0.s[1]; /* INPUT + 1 * 4 */
+	dup X12_TMP.4s, VTMP0.s[2]; /* INPUT + 2 * 4 */
+	dup X13_TMP.4s, VTMP0.s[3]; /* INPUT + 3 * 4 */
+	PLUS(X0, VTMP2);
+	PLUS(X1, VTMP3);
+	PLUS(X2, X12_TMP);
+	PLUS(X3, X13_TMP);
+
+	dup VTMP2.4s, VTMP1.s[0]; /* INPUT + 4 * 4 */
+	dup VTMP3.4s, VTMP1.s[1]; /* INPUT + 5 * 4 */
+	dup X12_TMP.4s, VTMP1.s[2]; /* INPUT + 6 * 4 */
+	dup X13_TMP.4s, VTMP1.s[3]; /* INPUT + 7 * 4 */
+	ld1 {VTMP0.16b, VTMP1.16b}, [INPUT_POS];
+	mov INPUT_POS, INPUT;
+	PLUS(X4, VTMP2);
+	PLUS(X5, VTMP3);
+	PLUS(X6, X12_TMP);
+	PLUS(X7, X13_TMP);
+
+	dup VTMP2.4s, VTMP0.s[0]; /* INPUT + 8 * 4 */
+	dup VTMP3.4s, VTMP0.s[1]; /* INPUT + 9 * 4 */
+	dup X12_TMP.4s, VTMP0.s[2]; /* INPUT + 10 * 4 */
+	dup X13_TMP.4s, VTMP0.s[3]; /* INPUT + 11 * 4 */
+	dup VTMP0.4s, VTMP1.s[2]; /* INPUT + 14 * 4 */
+	dup VTMP1.4s, VTMP1.s[3]; /* INPUT + 15 * 4 */
+	PLUS(X8, VTMP2);
+	PLUS(X9, VTMP3);
+	PLUS(X10, X12_TMP);
+	PLUS(X11, X13_TMP);
+	PLUS(X14, VTMP0);
+	PLUS(X15, VTMP1);
+
+	transpose_4x4(X0, X1, X2, X3, VTMP0, VTMP1, VTMP2);
+	transpose_4x4(X4, X5, X6, X7, VTMP0, VTMP1, VTMP2);
+	transpose_4x4(X8, X9, X10, X11, VTMP0, VTMP1, VTMP2);
+	transpose_4x4(X12, X13, X14, X15, VTMP0, VTMP1, VTMP2);
+
+	subs NBLKS, NBLKS, #4;
+
+	ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64;
+	ld1 {X12_TMP.16b-X13_TMP.16b}, [SRC], #32;
+	eor VTMP0.16b, X0.16b, VTMP0.16b;
+	eor VTMP1.16b, X4.16b, VTMP1.16b;
+	eor VTMP2.16b, X8.16b, VTMP2.16b;
+	eor VTMP3.16b, X12.16b, VTMP3.16b;
+	eor X12_TMP.16b, X1.16b, X12_TMP.16b;
+	eor X13_TMP.16b, X5.16b, X13_TMP.16b;
+	st1 {VTMP0.16b-VTMP3.16b}, [DST], #64;
+	ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64;
+	st1 {X12_TMP.16b-X13_TMP.16b}, [DST], #32;
+	ld1 {X12_TMP.16b-X13_TMP.16b}, [SRC], #32;
+	eor VTMP0.16b, X9.16b, VTMP0.16b;
+	eor VTMP1.16b, X13.16b, VTMP1.16b;
+	eor VTMP2.16b, X2.16b, VTMP2.16b;
+	eor VTMP3.16b, X6.16b, VTMP3.16b;
+	eor X12_TMP.16b, X10.16b, X12_TMP.16b;
+	eor X13_TMP.16b, X14.16b, X13_TMP.16b;
+	st1 {VTMP0.16b-VTMP3.16b}, [DST], #64;
+	ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64;
+	st1 {X12_TMP.16b-X13_TMP.16b}, [DST], #32;
+	eor VTMP0.16b, X3.16b, VTMP0.16b;
+	eor VTMP1.16b, X7.16b, VTMP1.16b;
+	eor VTMP2.16b, X11.16b, VTMP2.16b;
+	eor VTMP3.16b, X15.16b, VTMP3.16b;
+	st1 {VTMP0.16b-VTMP3.16b}, [DST], #64;
+
+	b.ne L(loop4);
+
+	/* clear the used vector registers and stack */
+	clear(VTMP0);
+	clear(VTMP1);
+	clear(VTMP2);
+	clear(VTMP3);
+	clear(X12_TMP);
+	clear(X13_TMP);
+	clear(X0);
+	clear(X1);
+	clear(X2);
+	clear(X3);
+	clear(X4);
+	clear(X5);
+	clear(X6);
+	clear(X7);
+	clear(X8);
+	clear(X9);
+	clear(X10);
+	clear(X11);
+	clear(X12);
+	clear(X13);
+	clear(X14);
+	clear(X15);
+
+	eor x0, x0, x0
+	ret_spec_stop
+END (__chacha20_neon_blocks4)
+
+#endif
diff --git a/sysdeps/aarch64/chacha20_arch.h b/sysdeps/aarch64/chacha20_arch.h
new file mode 100644
index 0000000000..f7b9462793
--- /dev/null
+++ b/sysdeps/aarch64/chacha20_arch.h
@@ -0,0 +1,43 @@
+/* Chacha20 implementation, used on arc4random.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <ldsodefs.h>
+#include <stdbool.h>
+
+unsigned int __chacha20_neon_blocks4 (uint32_t *state, uint8_t *dst,
+				      const uint8_t *src, size_t nblks);
+
+static void
+chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
+		const uint8_t *src, size_t bytes)
+{
+#ifdef __AARCH64EL__
+  if (bytes >= CHACHA20_BLOCK_SIZE * 4)
+    {
+      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
+      nblocks -= nblocks % 4;
+      __chacha20_neon_blocks4 (state->ctx, dst, src, nblocks);
+      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
+      dst += nblocks * CHACHA20_BLOCK_SIZE;
+      src += nblocks * CHACHA20_BLOCK_SIZE;
+    }
+#endif
+
+  if (bytes > 0)
+    chacha20_crypt_generic (state, dst, src, bytes);
+}
-- 
2.32.0



* [PATCH 7/7] powerpc64: Add optimized chacha20
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (5 preceding siblings ...)
  2022-04-13 20:24 ` [PATCH 6/7] aarch64: Add " Adhemerval Zanella
@ 2022-04-13 20:24 ` Adhemerval Zanella
  2022-04-14  7:36 ` [PATCH 0/7] Add arc4random support Yann Droneaud
  2022-04-14 11:49 ` Cristian Rodríguez
  8 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-13 20:24 UTC (permalink / raw)
  To: libc-alpha

It adds a vectorized ChaCha20 implementation based on libgcrypt
cipher/chacha20-ppc.c.  It targets POWER8 and is used by default
for little-endian.

On a POWER8 it shows the following improvements (using
formatted bench-arc4random data):

GENERIC (powerpc64-linux-gnu)
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               70.05
arc4random_buf(0) [single-thread]        143.62
arc4random_buf(16) [single-thread]       200.85
arc4random_buf(32) [single-thread]       247.87
arc4random_buf(64) [single-thread]       277.19
--------------------------------------------------
arc4random [multi-thread]                69.99
arc4random_buf(0) [multi-thread]         143.52
arc4random_buf(16) [multi-thread]        200.31
arc4random_buf(32) [multi-thread]        248.63
arc4random_buf(64) [multi-thread]        279.66
--------------------------------------------------

POWER8
Function                                 MB/s
--------------------------------------------------
arc4random [single-thread]               86.91
arc4random_buf(0) [single-thread]        212.20
arc4random_buf(16) [single-thread]       373.42
arc4random_buf(32) [single-thread]       572.93
arc4random_buf(64) [single-thread]       772.87
--------------------------------------------------
arc4random [multi-thread]                84.43
arc4random_buf(0) [multi-thread]         211.93
arc4random_buf(16) [multi-thread]        373.58
arc4random_buf(32) [multi-thread]        573.80
arc4random_buf(64) [multi-thread]        772.96
--------------------------------------------------

Checked on powerpc64-linux-gnu and powerpc64le-linux-gnu.
---
 LICENSES                                  |   4 +-
 sysdeps/powerpc/powerpc64/Makefile        |   3 +
 sysdeps/powerpc/powerpc64/chacha-ppc.c    | 254 ++++++++++++++++++++++
 sysdeps/powerpc/powerpc64/chacha20_arch.h |  53 +++++
 4 files changed, 312 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/powerpc/powerpc64/chacha-ppc.c
 create mode 100644 sysdeps/powerpc/powerpc64/chacha20_arch.h

diff --git a/LICENSES b/LICENSES
index b0c43495cb..f7dc51c3a9 100644
--- a/LICENSES
+++ b/LICENSES
@@ -391,8 +391,8 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
  <https://www.gnu.org/licenses/>.  */
 \f
 sysdeps/x86_64/chacha20-ssse3.S, sysdeps/x86_64/chacha20-avx2.S, and
-sysdeps/aarch64/chacha20.S import code from libgcrypt, with the
-following notices:
+sysdeps/aarch64/chacha20.S, and sysdeps/powerpc/powerpc64/chacha-ppc.c
+import code from libgcrypt, with the following notices:
 
 Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
 
diff --git a/sysdeps/powerpc/powerpc64/Makefile b/sysdeps/powerpc/powerpc64/Makefile
index 679d5e49ba..d213d23dc4 100644
--- a/sysdeps/powerpc/powerpc64/Makefile
+++ b/sysdeps/powerpc/powerpc64/Makefile
@@ -66,6 +66,9 @@ tst-setjmp-bug21895-static-ENV = \
 endif
 
 ifeq ($(subdir),stdlib)
+sysdep_routines += chacha-ppc
+CFLAGS-chacha-ppc.c += -mcpu=power8
+
 CFLAGS-tst-ucontext-ppc64-vscr.c += -maltivec
 tests += tst-ucontext-ppc64-vscr
 endif
diff --git a/sysdeps/powerpc/powerpc64/chacha-ppc.c b/sysdeps/powerpc/powerpc64/chacha-ppc.c
new file mode 100644
index 0000000000..db87aa5823
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/chacha-ppc.c
@@ -0,0 +1,254 @@
+/* Optimized PowerPC implementation of ChaCha20 cipher.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <altivec.h>
+#include <stddef.h>
+#include <stdint.h>
+
+typedef vector unsigned char vector16x_u8;
+typedef vector unsigned int vector4x_u32;
+typedef vector unsigned long long vector2x_u64;
+
+#ifdef WORDS_BIGENDIAN
+static const vector16x_u8 le_bswap_const =
+  { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 };
+#endif
+
+static inline vector4x_u32
+vec_rol_elems (vector4x_u32 v, unsigned int idx)
+{
+#ifndef WORDS_BIGENDIAN
+  return vec_sld (v, v, (16 - (4 * idx)) & 15);
+#else
+  return vec_sld (v, v, (4 * idx) & 15);
+#endif
+}
+
+static inline vector4x_u32
+vec_load_le (unsigned long offset, const unsigned char *ptr)
+{
+  vector4x_u32 vec;
+  vec = vec_vsx_ld (offset, (const uint32_t *)ptr);
+#ifdef WORDS_BIGENDIAN
+  vec = (vector4x_u32) vec_perm ((vector16x_u8)vec, (vector16x_u8)vec,
+				 le_bswap_const);
+#endif
+  return vec;
+}
+
+static inline void
+vec_store_le (vector4x_u32 vec, unsigned long offset, unsigned char *ptr)
+{
+#ifdef WORDS_BIGENDIAN
+  vec = (vector4x_u32)vec_perm((vector16x_u8)vec, (vector16x_u8)vec,
+			       le_bswap_const);
+#endif
+  vec_vsx_st (vec, offset, (uint32_t *)ptr);
+}
+
+
+static inline vector4x_u32
+vec_add_ctr_u64 (vector4x_u32 v, vector4x_u32 a)
+{
+#ifdef WORDS_BIGENDIAN
+  static const vector16x_u8 swap32 =
+    { 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11 };
+  vector2x_u64 vec, add, sum;
+
+  vec = (vector2x_u64)vec_perm ((vector16x_u8)v, (vector16x_u8)v, swap32);
+  add = (vector2x_u64)vec_perm ((vector16x_u8)a, (vector16x_u8)a, swap32);
+  sum = vec + add;
+  return (vector4x_u32)vec_perm ((vector16x_u8)sum, (vector16x_u8)sum, swap32);
+#else
+  return (vector4x_u32)((vector2x_u64)(v) + (vector2x_u64)(a));
+#endif
+}
+
+/**********************************************************************
+  4-way chacha20
+ **********************************************************************/
+
+#define ROTATE(v1,rolv)			\
+	__asm__ ("vrlw %0,%1,%2\n\t" : "=v" (v1) : "v" (v1), "v" (rolv))
+
+#define PLUS(ds,s) \
+	((ds) += (s))
+
+#define XOR(ds,s) \
+	((ds) ^= (s))
+
+#define ADD_U64(v,a) \
+	(v = vec_add_ctr_u64(v, a))
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0, x1, x2, x3) ({ \
+	vector4x_u32 t1 = vec_mergeh(x0, x2); \
+	vector4x_u32 t2 = vec_mergel(x0, x2); \
+	vector4x_u32 t3 = vec_mergeh(x1, x3); \
+	x3 = vec_mergel(x1, x3); \
+	x0 = vec_mergeh(t1, t3); \
+	x1 = vec_mergel(t1, t3); \
+	x2 = vec_mergeh(t2, x3); \
+	x3 = vec_mergel(t2, x3); \
+      })
+
+#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2)			\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE(d1, rotate_16); ROTATE(d2, rotate_16);	\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE(b1, rotate_12); ROTATE(b2, rotate_12);	\
+	PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);	\
+	    ROTATE(d1, rotate_8); ROTATE(d2, rotate_8);		\
+	PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);	\
+	    ROTATE(b1, rotate_7); ROTATE(b2, rotate_7);
+
+unsigned int
+__chacha20_power8_blocks4 (uint32_t *state, uint8_t *dst, const uint8_t *src,
+			   size_t nblks)
+{
+  vector4x_u32 counters_0123 = { 0, 1, 2, 3 };
+  vector4x_u32 counter_4 = { 4, 0, 0, 0 };
+  vector4x_u32 rotate_16 = { 16, 16, 16, 16 };
+  vector4x_u32 rotate_12 = { 12, 12, 12, 12 };
+  vector4x_u32 rotate_8 = { 8, 8, 8, 8 };
+  vector4x_u32 rotate_7 = { 7, 7, 7, 7 };
+  vector4x_u32 state0, state1, state2, state3;
+  vector4x_u32 v0, v1, v2, v3, v4, v5, v6, v7;
+  vector4x_u32 v8, v9, v10, v11, v12, v13, v14, v15;
+  vector4x_u32 tmp;
+  int i;
+
+  /* Force preload of constants to vector registers.  */
+  __asm__ ("": "+v" (counters_0123) :: "memory");
+  __asm__ ("": "+v" (counter_4) :: "memory");
+  __asm__ ("": "+v" (rotate_16) :: "memory");
+  __asm__ ("": "+v" (rotate_12) :: "memory");
+  __asm__ ("": "+v" (rotate_8) :: "memory");
+  __asm__ ("": "+v" (rotate_7) :: "memory");
+
+  state0 = vec_vsx_ld (0 * 16, state);
+  state1 = vec_vsx_ld (1 * 16, state);
+  state2 = vec_vsx_ld (2 * 16, state);
+  state3 = vec_vsx_ld (3 * 16, state);
+
+  do
+    {
+      v0 = vec_splat (state0, 0);
+      v1 = vec_splat (state0, 1);
+      v2 = vec_splat (state0, 2);
+      v3 = vec_splat (state0, 3);
+      v4 = vec_splat (state1, 0);
+      v5 = vec_splat (state1, 1);
+      v6 = vec_splat (state1, 2);
+      v7 = vec_splat (state1, 3);
+      v8 = vec_splat (state2, 0);
+      v9 = vec_splat (state2, 1);
+      v10 = vec_splat (state2, 2);
+      v11 = vec_splat (state2, 3);
+      v12 = vec_splat (state3, 0);
+      v13 = vec_splat (state3, 1);
+      v14 = vec_splat (state3, 2);
+      v15 = vec_splat (state3, 3);
+
+      v12 += counters_0123;
+      v13 -= vec_cmplt (v12, counters_0123);
+
+      for (i = 20; i > 0; i -= 2)
+	{
+	  QUARTERROUND2 (v0, v4,  v8, v12,   v1, v5,  v9, v13)
+	  QUARTERROUND2 (v2, v6, v10, v14,   v3, v7, v11, v15)
+	  QUARTERROUND2 (v0, v5, v10, v15,   v1, v6, v11, v12)
+	  QUARTERROUND2 (v2, v7,  v8, v13,   v3, v4,  v9, v14)
+	}
+
+      v0 += vec_splat (state0, 0);
+      v1 += vec_splat (state0, 1);
+      v2 += vec_splat (state0, 2);
+      v3 += vec_splat (state0, 3);
+      v4 += vec_splat (state1, 0);
+      v5 += vec_splat (state1, 1);
+      v6 += vec_splat (state1, 2);
+      v7 += vec_splat (state1, 3);
+      v8 += vec_splat (state2, 0);
+      v9 += vec_splat (state2, 1);
+      v10 += vec_splat (state2, 2);
+      v11 += vec_splat (state2, 3);
+      tmp = vec_splat( state3, 0);
+      tmp += counters_0123;
+      v12 += tmp;
+      v13 += vec_splat (state3, 1) - vec_cmplt (tmp, counters_0123);
+      v14 += vec_splat (state3, 2);
+      v15 += vec_splat (state3, 3);
+      ADD_U64 (state3, counter_4);
+
+      transpose_4x4 (v0, v1, v2, v3);
+      transpose_4x4 (v4, v5, v6, v7);
+      transpose_4x4 (v8, v9, v10, v11);
+      transpose_4x4 (v12, v13, v14, v15);
+
+      v0 ^= vec_load_le ((64 * 0 + 16 * 0), src);
+      v1 ^= vec_load_le ((64 * 1 + 16 * 0), src);
+      v2 ^= vec_load_le ((64 * 2 + 16 * 0), src);
+      v3 ^= vec_load_le ((64 * 3 + 16 * 0), src);
+
+      v4 ^= vec_load_le ((64 * 0 + 16 * 1), src);
+      v5 ^= vec_load_le ((64 * 1 + 16 * 1), src);
+      v6 ^= vec_load_le ((64 * 2 + 16 * 1), src);
+      v7 ^= vec_load_le ((64 * 3 + 16 * 1), src);
+
+      v8 ^= vec_load_le ((64 * 0 + 16 * 2), src);
+      v9 ^= vec_load_le ((64 * 1 + 16 * 2), src);
+      v10 ^= vec_load_le ((64 * 2 + 16 * 2), src);
+      v11 ^= vec_load_le ((64 * 3 + 16 * 2), src);
+
+      v12 ^= vec_load_le ((64 * 0 + 16 * 3), src);
+      v13 ^= vec_load_le ((64 * 1 + 16 * 3), src);
+      v14 ^= vec_load_le ((64 * 2 + 16 * 3), src);
+      v15 ^= vec_load_le ((64 * 3 + 16 * 3), src);
+
+      vec_store_le (v0, (64 * 0 + 16 * 0), dst);
+      vec_store_le (v1, (64 * 1 + 16 * 0), dst);
+      vec_store_le (v2, (64 * 2 + 16 * 0), dst);
+      vec_store_le (v3, (64 * 3 + 16 * 0), dst);
+
+      vec_store_le (v4, (64 * 0 + 16 * 1), dst);
+      vec_store_le (v5, (64 * 1 + 16 * 1), dst);
+      vec_store_le (v6, (64 * 2 + 16 * 1), dst);
+      vec_store_le (v7, (64 * 3 + 16 * 1), dst);
+
+      vec_store_le (v8, (64 * 0 + 16 * 2), dst);
+      vec_store_le (v9, (64 * 1 + 16 * 2), dst);
+      vec_store_le (v10, (64 * 2 + 16 * 2), dst);
+      vec_store_le (v11, (64 * 3 + 16 * 2), dst);
+
+      vec_store_le (v12, (64 * 0 + 16 * 3), dst);
+      vec_store_le (v13, (64 * 1 + 16 * 3), dst);
+      vec_store_le (v14, (64 * 2 + 16 * 3), dst);
+      vec_store_le (v15, (64 * 3 + 16 * 3), dst);
+
+      src += 4*64;
+      dst += 4*64;
+
+      nblks -= 4;
+    }
+  while (nblks);
+
+  vec_vsx_st (state3, 3 * 16, state);
+
+  return 0;
+}
diff --git a/sysdeps/powerpc/powerpc64/chacha20_arch.h b/sysdeps/powerpc/powerpc64/chacha20_arch.h
new file mode 100644
index 0000000000..e958c73b3c
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/chacha20_arch.h
@@ -0,0 +1,53 @@
+/* PowerPC optimization for ChaCha20.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdbool.h>
+#include <ldsodefs.h>
+
+unsigned int __chacha20_power8_blocks4 (uint32_t *state, uint8_t *dst,
+					const uint8_t *src, size_t nblks);
+
+static inline bool
+is_power8 (void)
+{
+#ifdef __LITTLE_ENDIAN__
+  return true;
+#else
+  unsigned long int hwcap = GLRO(dl_hwcap);
+  unsigned long int hwcap2 = GLRO(dl_hwcap2);
+  return hwcap2 & PPC_FEATURE2_ARCH_2_07 && hwcap & PPC_FEATURE_HAS_ALTIVEC;
+#endif
+}
+
+static void
+chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
+		const uint8_t *src, size_t bytes)
+{
+  if (is_power8 () && bytes >= CHACHA20_BLOCK_SIZE * 4)
+    {
+      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
+      nblocks -= nblocks % 4;
+      __chacha20_power8_blocks4 (state->ctx, dst, src, nblocks);
+      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
+      dst += nblocks * CHACHA20_BLOCK_SIZE;
+      src += nblocks * CHACHA20_BLOCK_SIZE;
+    }
+
+  if (bytes > 0)
+    chacha20_crypt_generic (state, dst, src, bytes);
+}
-- 
2.32.0



* Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
  2022-04-13 20:23 ` [PATCH 5/7] x86: Add AVX2 " Adhemerval Zanella
@ 2022-04-13 23:04   ` Noah Goldstein
  2022-04-14 17:16     ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-13 23:04 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> It adds vectorized ChaCha20 implementation based on libgcrypt
> cipher/chacha20-amd64-avx2.S.  It is used only if AVX2 is supported
> and enabled by the architecture.
>
> On a Ryzen 9 5900X it shows the following improvements (using
> formatted bench-arc4random data):
>
> SSSE3:
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               576.55
> arc4random_buf(0) [single-thread]        961.77
> arc4random_buf(16) [single-thread]       1309.38
> arc4random_buf(32) [single-thread]       1558.69
> arc4random_buf(64) [single-thread]       1728.54
> --------------------------------------------------
> arc4random [multi-thread]                589.52
> arc4random_buf(0) [multi-thread]         967.39
> arc4random_buf(16) [multi-thread]        1319.27
> arc4random_buf(32) [multi-thread]        1552.96
> arc4random_buf(64) [multi-thread]        1734.27
> --------------------------------------------------
>
> AVX2:
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               672.49
> arc4random_buf(0) [single-thread]        1234.85
> arc4random_buf(16) [single-thread]       1892.67
> arc4random_buf(32) [single-thread]       2491.10
> arc4random_buf(64) [single-thread]       2696.27
> --------------------------------------------------
> arc4random [multi-thread]                661.25
> arc4random_buf(0) [multi-thread]         1214.65
> arc4random_buf(16) [multi-thread]        1867.98
> arc4random_buf(32) [multi-thread]        2474.70
> arc4random_buf(64) [multi-thread]        2893.21
> --------------------------------------------------
>
> Checked on x86_64-linux-gnu.
> ---
>  LICENSES                       |   4 +-
>  stdlib/chacha20.c              |   7 +-
>  sysdeps/x86_64/Makefile        |   1 +
>  sysdeps/x86_64/chacha20-avx2.S | 317 +++++++++++++++++++++++++++++++++
>  sysdeps/x86_64/chacha20_arch.h |  14 ++
>  5 files changed, 339 insertions(+), 4 deletions(-)
>  create mode 100644 sysdeps/x86_64/chacha20-avx2.S
>
> diff --git a/LICENSES b/LICENSES
> index 2563abd9e2..8ef0f023d7 100644
> --- a/LICENSES
> +++ b/LICENSES
> @@ -390,8 +390,8 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
>   License along with this library; if not, see
>   <https://www.gnu.org/licenses/>.  */
>
> -sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
> -following notices:
> +sysdeps/x86_64/chacha20-ssse3.S and sysdeps/x86_64/chacha20-avx2.S
> +import code from libgcrypt, with the following notices:
>
>  Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
>
> diff --git a/stdlib/chacha20.c b/stdlib/chacha20.c
> index dbd87bd942..8569e1e78d 100644
> --- a/stdlib/chacha20.c
> +++ b/stdlib/chacha20.c
> @@ -190,8 +190,8 @@ memxorcpy (uint8_t *dst, const uint8_t *src1, const uint8_t *src2, size_t len)
>  }
>
>  static void
> -chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
> -               const uint8_t *src, size_t bytes)
> +chacha20_crypt_generic (struct chacha20_state *state, uint8_t *dst,
> +                       const uint8_t *src, size_t bytes)
>  {
>    uint8_t stream[CHACHA20_BLOCK_SIZE];
>
> @@ -209,3 +209,6 @@ chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
>        memxorcpy (dst, src, stream, bytes);
>      }
>  }
> +
> +/* Get the arch-optimized implementation, if any.  */
> +#include <chacha20_arch.h>
> diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
> index f43b6a1180..afb4d173e8 100644
> --- a/sysdeps/x86_64/Makefile
> +++ b/sysdeps/x86_64/Makefile
> @@ -7,6 +7,7 @@ endif
>
>  ifeq ($(subdir),stdlib)
>  sysdep_routines += \
> +  chacha20-avx2 \
>    chacha20-ssse3 \
>    # sysdep_routines
>  endif
> diff --git a/sysdeps/x86_64/chacha20-avx2.S b/sysdeps/x86_64/chacha20-avx2.S
> new file mode 100644
> index 0000000000..96174c0e40
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20-avx2.S
> @@ -0,0 +1,317 @@
> +/* Optimized AVX2 implementation of ChaCha20 cipher.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +/* Based on D. J. Bernstein reference implementation at
> +   http://cr.yp.to/chacha.html:
> +
> +   chacha-regs.c version 20080118
> +   D. J. Bernstein
> +   Public domain.  */
> +
> +#ifdef PIC
> +#  define rRIP (%rip)
> +#else
> +#  define rRIP
> +#endif
> +
> +/* register macros */
> +#define INPUT %rdi
> +#define DST   %rsi
> +#define SRC   %rdx
> +#define NBLKS %rcx
> +#define ROUND %eax
> +
> +/* stack structure */
> +#define STACK_VEC_X12 (32)
> +#define STACK_VEC_X13 (32 + STACK_VEC_X12)
> +#define STACK_TMP     (32 + STACK_VEC_X13)
> +#define STACK_TMP1    (32 + STACK_TMP)
> +
> +#define STACK_MAX     (32 + STACK_TMP1)
> +
> +/* vector registers */
> +#define X0 %ymm0
> +#define X1 %ymm1
> +#define X2 %ymm2
> +#define X3 %ymm3
> +#define X4 %ymm4
> +#define X5 %ymm5
> +#define X6 %ymm6
> +#define X7 %ymm7
> +#define X8 %ymm8
> +#define X9 %ymm9
> +#define X10 %ymm10
> +#define X11 %ymm11
> +#define X12 %ymm12
> +#define X13 %ymm13
> +#define X14 %ymm14
> +#define X15 %ymm15
> +
> +#define X0h %xmm0
> +#define X1h %xmm1
> +#define X2h %xmm2
> +#define X3h %xmm3
> +#define X4h %xmm4
> +#define X5h %xmm5
> +#define X6h %xmm6
> +#define X7h %xmm7
> +#define X8h %xmm8
> +#define X9h %xmm9
> +#define X10h %xmm10
> +#define X11h %xmm11
> +#define X12h %xmm12
> +#define X13h %xmm13
> +#define X14h %xmm14
> +#define X15h %xmm15
> +
> +/**********************************************************************
> +  helper macros
> + **********************************************************************/
> +
> +/* 4x4 32-bit integer matrix transpose */
> +#define transpose_4x4(x0,x1,x2,x3,t1,t2) \
> +       vpunpckhdq x1, x0, t2; \
> +       vpunpckldq x1, x0, x0; \
> +       \
> +       vpunpckldq x3, x2, t1; \
> +       vpunpckhdq x3, x2, x2; \
> +       \
> +       vpunpckhqdq t1, x0, x1; \
> +       vpunpcklqdq t1, x0, x0; \
> +       \
> +       vpunpckhqdq x2, t2, x3; \
> +       vpunpcklqdq x2, t2, x2;
> +
> +/* 2x2 128-bit matrix transpose */
> +#define transpose_16byte_2x2(x0,x1,t1) \
> +       vmovdqa    x0, t1; \
> +       vperm2i128 $0x20, x1, x0, x0; \
> +       vperm2i128 $0x31, x1, t1, x1;
> +
> +/* xor register with unaligned src and save to unaligned dst */
> +#define xor_src_dst(dst, src, offset, xreg) \
> +       vpxor offset(src), xreg, xreg; \
> +       vmovdqu xreg, offset(dst);
> +
> +/**********************************************************************
> +  8-way chacha20
> + **********************************************************************/
> +
> +#define ROTATE2(v1,v2,c,tmp)   \
> +       vpsrld $(32 - (c)), v1, tmp;    \
> +       vpslld $(c), v1, v1;            \
> +       vpaddb tmp, v1, v1;             \
> +       vpsrld $(32 - (c)), v2, tmp;    \
> +       vpslld $(c), v2, v2;            \
> +       vpaddb tmp, v2, v2;
> +
> +#define ROTATE_SHUF_2(v1,v2,shuf)      \
> +       vpshufb shuf, v1, v1;           \
> +       vpshufb shuf, v2, v2;
> +
> +#define XOR(ds,s) \
> +       vpxor s, ds, ds;
> +
> +#define PLUS(ds,s) \
> +       vpaddd s, ds, ds;
> +
> +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,\
> +                     interleave_op1,interleave_op2,\
> +                     interleave_op3,interleave_op4)            \
> +       vbroadcasti128 .Lshuf_rol16 rRIP, tmp1;                 \
> +               interleave_op1;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +               interleave_op2;                                 \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2, 12, tmp1);                          \
> +       vbroadcasti128 .Lshuf_rol8 rRIP, tmp1;                  \
> +               interleave_op3;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +               interleave_op4;                                 \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2,  7, tmp1);
> +
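As context for reviewing the macro soup, the scalar quarter round that QUARTERROUND2 computes on two column groups at once is the standard RFC 8439 operation. An illustrative C rendering (not part of the patch):

```c
#include <stdint.h>

/* Scalar ChaCha20 quarter round per RFC 8439.  QUARTERROUND2 above
   performs this on two groups of vector registers at once, with the
   16- and 8-bit rotates done via byte shuffles.  */

static inline uint32_t
rotl32 (uint32_t v, int c)
{
  return (v << c) | (v >> (32 - c));
}

static void
quarterround (uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32 (*d, 16);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 12);
  *a += *b; *d ^= *a; *d = rotl32 (*d, 8);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 7);
}
```

The RFC 8439 section 2.1.1 test vector (a=0x11111111, b=0x01020304, c=0x9b8d6f43, d=0x01234567) yields (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb), which is a quick way to sanity-check any rewrite of these macros.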
> +       .text

Should this go in a dedicated avx2 text section (.section .text.avx2) instead of plain .text?

> +       .align 32
> +chacha20_data:
> +L(shuf_rol16):
> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> +L(shuf_rol8):
> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> +L(inc_counter):
> +       .byte 0,1,2,3,4,5,6,7
> +L(unsigned_cmp):
> +       .long 0x80000000
> +
> +ENTRY (__chacha20_avx2_blocks8)
> +       /* input:
> +        *      %rdi: input
> +        *      %rsi: dst
> +        *      %rdx: src
> +        *      %rcx: nblks (multiple of 8)
> +        */
> +       vzeroupper;

vzeroupper needs to be replaced with VZEROUPPER_RETURN,
and we need a transaction-safe version unless this can never
be called during a transaction.
> +
> +       pushq %rbp;
> +       cfi_adjust_cfa_offset(8);
> +       cfi_rel_offset(rbp, 0)
> +       movq %rsp, %rbp;
> +       cfi_def_cfa_register(rbp);
> +
> +       subq $STACK_MAX, %rsp;
> +       andq $~31, %rsp;
> +
> +L(loop8):
> +       mov $20, ROUND;
> +
> +       /* Construct counter vectors X12 and X13 */
> +       vpmovzxbd L(inc_counter) rRIP, X0;
> +       vpbroadcastd L(unsigned_cmp) rRIP, X2;
> +       vpbroadcastd (12 * 4)(INPUT), X12;
> +       vpbroadcastd (13 * 4)(INPUT), X13;
> +       vpaddd X0, X12, X12;
> +       vpxor X2, X0, X0;
> +       vpxor X2, X12, X1;
> +       vpcmpgtd X1, X0, X0;
> +       vpsubd X0, X13, X13;
> +       vmovdqa X12, (STACK_VEC_X12)(%rsp);
> +       vmovdqa X13, (STACK_VEC_X13)(%rsp);
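The carry handling above may deserve a comment in the source. Since AVX2 has no unsigned 32-bit compare, the code biases both operands by 0x80000000 and uses the signed vpcmpgtd. A scalar model of one lane (illustrative, not glibc code):

```c
#include <stdint.h>

/* Scalar model of the vectorized counter construction above: detect
   whether adding the lane index to the low counter word wrapped, and
   propagate the carry into the high word.  vpcmpgtd is a signed
   compare, so both operands are first xored with 0x80000000
   (L(unsigned_cmp)), turning it into an unsigned compare.  */

static uint32_t
ctr_hi_lane (uint32_t ctr_lo, uint32_t ctr_hi, uint32_t lane)
{
  uint32_t lo = ctr_lo + lane;                          /* vpaddd  */
  int32_t biased_lane = (int32_t) (lane ^ 0x80000000u); /* vpxor   */
  int32_t biased_lo = (int32_t) (lo ^ 0x80000000u);     /* vpxor   */
  /* vpcmpgtd: all-ones (-1) in each lane where the addition wrapped,
     i.e. where lane > lo as unsigned values.  */
  int32_t mask = biased_lane > biased_lo ? -1 : 0;
  /* vpsubd: subtracting -1 adds the carry into the high word.  */
  return ctr_hi - (uint32_t) mask;
}
```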
> +
> +       /* Load vectors */
> +       vpbroadcastd (0 * 4)(INPUT), X0;
> +       vpbroadcastd (1 * 4)(INPUT), X1;
> +       vpbroadcastd (2 * 4)(INPUT), X2;
> +       vpbroadcastd (3 * 4)(INPUT), X3;
> +       vpbroadcastd (4 * 4)(INPUT), X4;
> +       vpbroadcastd (5 * 4)(INPUT), X5;
> +       vpbroadcastd (6 * 4)(INPUT), X6;
> +       vpbroadcastd (7 * 4)(INPUT), X7;
> +       vpbroadcastd (8 * 4)(INPUT), X8;
> +       vpbroadcastd (9 * 4)(INPUT), X9;
> +       vpbroadcastd (10 * 4)(INPUT), X10;
> +       vpbroadcastd (11 * 4)(INPUT), X11;
> +       vpbroadcastd (14 * 4)(INPUT), X14;
> +       vpbroadcastd (15 * 4)(INPUT), X15;
> +       vmovdqa X15, (STACK_TMP)(%rsp);
> +
> +L(round2):
> +       QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X15,,,,)
> +       vmovdqa (STACK_TMP)(%rsp), X15;
> +       vmovdqa X8, (STACK_TMP)(%rsp);
> +       QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,,,,)
> +       QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,,,,)
> +       vmovdqa (STACK_TMP)(%rsp), X8;
> +       vmovdqa X15, (STACK_TMP)(%rsp);
> +       QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X15,,,,)
> +       sub $2, ROUND;
> +       jnz L(round2);
> +
> +       vmovdqa X8, (STACK_TMP1)(%rsp);
> +
> +       /* tmp := X15 */
> +       vpbroadcastd (0 * 4)(INPUT), X15;
> +       PLUS(X0, X15);
> +       vpbroadcastd (1 * 4)(INPUT), X15;
> +       PLUS(X1, X15);
> +       vpbroadcastd (2 * 4)(INPUT), X15;
> +       PLUS(X2, X15);
> +       vpbroadcastd (3 * 4)(INPUT), X15;
> +       PLUS(X3, X15);
> +       vpbroadcastd (4 * 4)(INPUT), X15;
> +       PLUS(X4, X15);
> +       vpbroadcastd (5 * 4)(INPUT), X15;
> +       PLUS(X5, X15);
> +       vpbroadcastd (6 * 4)(INPUT), X15;
> +       PLUS(X6, X15);
> +       vpbroadcastd (7 * 4)(INPUT), X15;
> +       PLUS(X7, X15);
> +       transpose_4x4(X0, X1, X2, X3, X8, X15);
> +       transpose_4x4(X4, X5, X6, X7, X8, X15);
> +       vmovdqa (STACK_TMP1)(%rsp), X8;
> +       transpose_16byte_2x2(X0, X4, X15);
> +       transpose_16byte_2x2(X1, X5, X15);
> +       transpose_16byte_2x2(X2, X6, X15);
> +       transpose_16byte_2x2(X3, X7, X15);
> +       vmovdqa (STACK_TMP)(%rsp), X15;
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1);
> +       vpbroadcastd (8 * 4)(INPUT), X0;
> +       PLUS(X8, X0);
> +       vpbroadcastd (9 * 4)(INPUT), X0;
> +       PLUS(X9, X0);
> +       vpbroadcastd (10 * 4)(INPUT), X0;
> +       PLUS(X10, X0);
> +       vpbroadcastd (11 * 4)(INPUT), X0;
> +       PLUS(X11, X0);
> +       vmovdqa (STACK_VEC_X12)(%rsp), X0;
> +       PLUS(X12, X0);
> +       vmovdqa (STACK_VEC_X13)(%rsp), X0;
> +       PLUS(X13, X0);
> +       vpbroadcastd (14 * 4)(INPUT), X0;
> +       PLUS(X14, X0);
> +       vpbroadcastd (15 * 4)(INPUT), X0;
> +       PLUS(X15, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3);
> +
> +       /* Update counter */
> +       addq $8, (12 * 4)(INPUT);
> +
> +       transpose_4x4(X8, X9, X10, X11, X0, X1);
> +       transpose_4x4(X12, X13, X14, X15, X0, X1);
> +       xor_src_dst(DST, SRC, (64 * 4 + 16 * 0), X4);
> +       xor_src_dst(DST, SRC, (64 * 5 + 16 * 0), X5);
> +       transpose_16byte_2x2(X8, X12, X0);
> +       transpose_16byte_2x2(X9, X13, X0);
> +       transpose_16byte_2x2(X10, X14, X0);
> +       transpose_16byte_2x2(X11, X15, X0);
> +       xor_src_dst(DST, SRC, (64 * 6 + 16 * 0), X6);
> +       xor_src_dst(DST, SRC, (64 * 7 + 16 * 0), X7);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11);
> +       xor_src_dst(DST, SRC, (64 * 4 + 16 * 2), X12);
> +       xor_src_dst(DST, SRC, (64 * 5 + 16 * 2), X13);
> +       xor_src_dst(DST, SRC, (64 * 6 + 16 * 2), X14);
> +       xor_src_dst(DST, SRC, (64 * 7 + 16 * 2), X15);
> +
> +       sub $8, NBLKS;
> +       lea (8 * 64)(DST), DST;
> +       lea (8 * 64)(SRC), SRC;
> +       jnz L(loop8);
> +
> +       /* clear the used vector registers and stack */
> +       vpxor X0, X0, X0;
> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
> +       vmovdqa X0, (STACK_TMP)(%rsp);
> +       vmovdqa X0, (STACK_TMP1)(%rsp);
> +       vzeroall;

Do you need vzeroall?
Why not vzeroupper? Or is it a security concern to leave key material
in the xmm pieces?


> +
> +       /* eax zeroed by round loop. */
> +       leave;
> +       cfi_adjust_cfa_offset(-8)
> +       cfi_def_cfa_register(%rsp);
> +       ret;
> +       int3;

Why do we need int3 here?
> +END(__chacha20_avx2_blocks8)
> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> index 37a4fdfb1f..7e9e7755f3 100644
> --- a/sysdeps/x86_64/chacha20_arch.h
> +++ b/sysdeps/x86_64/chacha20_arch.h
> @@ -22,11 +22,25 @@
>
>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>                                        const uint8_t *src, size_t nblks);
> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
> +                                     const uint8_t *src, size_t nblks);
>
>  static inline void
>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>                 size_t bytes)
>  {
> +  const struct cpu_features* cpu_features = __get_cpu_features ();

Can we do this with an ifunc and take the CPU feature check off the
critical path?
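To make the suggestion concrete, here is a sketch of resolving the block function once instead of testing feature bits on every call (names are hypothetical, not the glibc API; a real IFUNC would run the resolver at relocation time rather than on first use):

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*blocks_fn) (uint32_t *state, uint8_t *dst,
                           const uint8_t *src, size_t nblks);

/* Stand-in for the scalar path; just copies so the dispatch is
   observable.  The real candidates would be chacha20_crypt_generic
   and __chacha20_avx2_blocks8.  */
static void
blocks_generic (uint32_t *state, uint8_t *dst, const uint8_t *src,
                size_t nblks)
{
  (void) state;
  for (size_t i = 0; i < nblks * 64; i++)
    dst[i] = src[i];
}

/* Stand-in for CPU_FEATURE_USABLE_P (cpu_features, AVX2).  */
static int
have_avx2 (void)
{
  return 0;
}

/* With __attribute__ ((ifunc)) this would be the resolver; both
   branches return the stand-in here since this is only a sketch.  */
static blocks_fn
resolve_blocks (void)
{
  return have_avx2 () ? blocks_generic : blocks_generic;
}

static blocks_fn blocks_impl;   /* resolved once, then reused */

static void
crypt_blocks (uint32_t *state, uint8_t *dst, const uint8_t *src,
              size_t nblks)
{
  if (blocks_impl == NULL)
    blocks_impl = resolve_blocks ();
  blocks_impl (state, dst, src, nblks);
}
```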
> +
> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
> +    {
> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> +      nblocks -= nblocks % 8;
> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> +    }
> +
>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>      {
>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> --
> 2.32.0
>

Do you want optimization comments now, or should those come later?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
@ 2022-04-13 23:12   ` Noah Goldstein
  2022-04-14 17:03     ` Adhemerval Zanella
  2022-04-14 17:17   ` Noah Goldstein
  2022-04-14 19:25   ` Noah Goldstein
  2 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-13 23:12 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> It adds a vectorized ChaCha20 implementation based on libgcrypt's
> cipher/chacha20-amd64-ssse3.S.  It is used only if SSSE3 is supported
> and enabled by the architecture.
>
> On a Ryzen 9 5900X it shows the following improvements (using
> formatted bench-arc4random data):
>
> GENERIC
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               375.06
> arc4random_buf(0) [single-thread]        498.50
> arc4random_buf(16) [single-thread]       576.86
> arc4random_buf(32) [single-thread]       615.76
> arc4random_buf(64) [single-thread]       633.97
> --------------------------------------------------
> arc4random [multi-thread]                359.86
> arc4random_buf(0) [multi-thread]         479.27
> arc4random_buf(16) [multi-thread]        543.65
> arc4random_buf(32) [multi-thread]        581.98
> arc4random_buf(64) [multi-thread]        603.01
> --------------------------------------------------
>
> SSSE3:
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               576.55
> arc4random_buf(0) [single-thread]        961.77
> arc4random_buf(16) [single-thread]       1309.38
> arc4random_buf(32) [single-thread]       1558.69
> arc4random_buf(64) [single-thread]       1728.54
> --------------------------------------------------
> arc4random [multi-thread]                589.52
> arc4random_buf(0) [multi-thread]         967.39
> arc4random_buf(16) [multi-thread]        1319.27
> arc4random_buf(32) [multi-thread]        1552.96
> arc4random_buf(64) [multi-thread]        1734.27
> --------------------------------------------------
>
> Checked on x86_64-linux-gnu.
> ---
>  LICENSES                        |  20 ++
>  sysdeps/generic/chacha20_arch.h |  24 +++
>  sysdeps/x86_64/Makefile         |   6 +
>  sysdeps/x86_64/chacha20-ssse3.S | 330 ++++++++++++++++++++++++++++++++
>  sysdeps/x86_64/chacha20_arch.h  |  42 ++++
>  5 files changed, 422 insertions(+)
>  create mode 100644 sysdeps/generic/chacha20_arch.h
>  create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
>  create mode 100644 sysdeps/x86_64/chacha20_arch.h
>
> diff --git a/LICENSES b/LICENSES
> index 530893b1dc..2563abd9e2 100644
> --- a/LICENSES
> +++ b/LICENSES
> @@ -389,3 +389,23 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
>   You should have received a copy of the GNU Lesser General Public
>   License along with this library; if not, see
>   <https://www.gnu.org/licenses/>.  */
> +
> +sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
> +following notices:
> +
> +Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
> +
> +This file is part of Libgcrypt.
> +
> +Libgcrypt is free software; you can redistribute it and/or modify
> +it under the terms of the GNU Lesser General Public License as
> +published by the Free Software Foundation; either version 2.1 of
> +the License, or (at your option) any later version.
> +
> +Libgcrypt is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU Lesser General Public License for more details.
> +
> +You should have received a copy of the GNU Lesser General Public
> +License along with this program; if not, see <http://www.gnu.org/licenses/>.
> diff --git a/sysdeps/generic/chacha20_arch.h b/sysdeps/generic/chacha20_arch.h
> new file mode 100644
> index 0000000000..d7200ac583
> --- /dev/null
> +++ b/sysdeps/generic/chacha20_arch.h
> @@ -0,0 +1,24 @@
> +/* Chacha20 implementation, generic interface.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
> +               const uint8_t *src, size_t bytes)
> +{
> +  chacha20_crypt_generic (state, dst, src, bytes);
> +}
> diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
> index 79365aff2a..f43b6a1180 100644
> --- a/sysdeps/x86_64/Makefile
> +++ b/sysdeps/x86_64/Makefile
> @@ -5,6 +5,12 @@ ifeq ($(subdir),csu)
>  gen-as-const-headers += link-defines.sym
>  endif
>
> +ifeq ($(subdir),stdlib)
> +sysdep_routines += \
> +  chacha20-ssse3 \
> +  # sysdep_routines
> +endif
> +
>  ifeq ($(subdir),gmon)
>  sysdep_routines += _mcount
>  # We cannot compile _mcount.S with -pg because that would create
> diff --git a/sysdeps/x86_64/chacha20-ssse3.S b/sysdeps/x86_64/chacha20-ssse3.S
> new file mode 100644
> index 0000000000..f221daf634
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20-ssse3.S
> @@ -0,0 +1,330 @@
> +/* Optimized SSSE3 implementation of ChaCha20 cipher.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Based on D. J. Bernstein reference implementation at
> +   http://cr.yp.to/chacha.html:
> +
> +   chacha-regs.c version 20080118
> +   D. J. Bernstein
> +   Public domain.  */
> +
> +#include <sysdep.h>
> +
> +#ifdef PIC
> +#  define rRIP (%rip)
> +#else
> +#  define rRIP
> +#endif
> +
> +/* register macros */
> +#define INPUT %rdi
> +#define DST   %rsi
> +#define SRC   %rdx
> +#define NBLKS %rcx
> +#define ROUND %eax
> +
> +/* stack structure */
> +#define STACK_VEC_X12 (16)
> +#define STACK_VEC_X13 (16 + STACK_VEC_X12)
> +#define STACK_TMP     (16 + STACK_VEC_X13)
> +#define STACK_TMP1    (16 + STACK_TMP)
> +#define STACK_TMP2    (16 + STACK_TMP1)
> +
> +#define STACK_MAX     (16 + STACK_TMP2)
> +
> +/* vector registers */
> +#define X0 %xmm0
> +#define X1 %xmm1
> +#define X2 %xmm2
> +#define X3 %xmm3
> +#define X4 %xmm4
> +#define X5 %xmm5
> +#define X6 %xmm6
> +#define X7 %xmm7
> +#define X8 %xmm8
> +#define X9 %xmm9
> +#define X10 %xmm10
> +#define X11 %xmm11
> +#define X12 %xmm12
> +#define X13 %xmm13
> +#define X14 %xmm14
> +#define X15 %xmm15
> +
> +/**********************************************************************
> +  helper macros
> + **********************************************************************/
> +
> +/* 4x4 32-bit integer matrix transpose */
> +#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
> +       movdqa    x0, t2; \
> +       punpckhdq x1, t2; \
> +       punpckldq x1, x0; \
> +       \
> +       movdqa    x2, t1; \
> +       punpckldq x3, t1; \
> +       punpckhdq x3, x2; \
> +       \
> +       movdqa     x0, x1; \
> +       punpckhqdq t1, x1; \
> +       punpcklqdq t1, x0; \
> +       \
> +       movdqa     t2, x3; \
> +       punpckhqdq x2, x3; \
> +       punpcklqdq x2, t2; \
> +       movdqa     t2, x2;
> +
> +/* fill xmm register with 32-bit value from memory */
> +#define pbroadcastd(mem32, xreg) \
> +       movd mem32, xreg; \
> +       pshufd $0, xreg, xreg;
> +
> +/* xor with unaligned memory operand */
> +#define pxor_u(umem128, xreg, t) \
> +       movdqu umem128, t; \
> +       pxor t, xreg;
> +
> +/* xor register with unaligned src and save to unaligned dst */
> +#define xor_src_dst(dst, src, offset, xreg, t) \
> +       pxor_u(offset(src), xreg, t); \
> +       movdqu xreg, offset(dst);
> +
> +#define clear(x) pxor x,x;
> +
> +/**********************************************************************
> +  4-way chacha20
> + **********************************************************************/
> +
> +#define ROTATE2(v1,v2,c,tmp1,tmp2)     \
> +       movdqa v1, tmp1;                \
> +       movdqa v2, tmp2;                \
> +       psrld $(32 - (c)), v1;          \
> +       pslld $(c), tmp1;               \
> +       paddb tmp1, v1;                 \
> +       psrld $(32 - (c)), v2;          \
> +       pslld $(c), tmp2;               \
> +       paddb tmp2, v2;
> +
> +#define ROTATE_SHUF_2(v1,v2,shuf)      \
> +       pshufb shuf, v1;                \
> +       pshufb shuf, v2;
AFAICT this pshufb is the only SSSE3 instruction in the file.

Can you replace this (possibly, but not certainly, faster) shuffle-based
rotate with ROTATE2 so the whole routine can be plain SSE2?
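To spell out why: the byte shuffles are only used to rotate each 32-bit lane by 16 or 8 bits, and both rotations are expressible with the SSE2 shift-and-combine pattern ROTATE2 already uses. A scalar sketch (illustrative, not part of the patch):

```c
#include <stdint.h>

/* ROTATE_SHUF_2 uses pshufb (SSSE3) only for rotates by 16 and 8;
   the same result comes from the SSE2-only shift pattern.  (ROTATE2
   combines the halves with paddb rather than por, which is equivalent
   here because the shifted halves have no overlapping bits.)  */

static uint32_t
rotl32_shifts (uint32_t v, int c)   /* pslld/psrld + combine */
{
  return (v << c) | (v >> (32 - c));
}

static uint32_t
rotl32_shuf16 (uint32_t v)          /* what pshufb with L(shuf_rol16) does */
{
  return (v << 16) | (v >> 16);
}

static uint32_t
rotl32_shuf8 (uint32_t v)           /* what pshufb with L(shuf_rol8) does */
{
  return (v << 8) | (v >> 24);
}
```

The trade-off is instruction count (pshufb is one instruction per register versus three for shift/shift/combine), so whether the SSE2 form costs anything in practice would need measuring.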
> +
> +#define XOR(ds,s) \
> +       pxor s, ds;
> +
> +#define PLUS(ds,s) \
> +       paddd s, ds;
> +
> +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2,\
> +                     interleave_op1,interleave_op2)            \
> +       movdqa L(shuf_rol16) rRIP, tmp1;                        \
> +               interleave_op1;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2, 12, tmp1, tmp2);                    \
> +       movdqa L(shuf_rol8) rRIP, tmp1;                         \
> +               interleave_op2;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2,  7, tmp1, tmp2);
> +
> +       .text
> +
> +chacha20_data:
> +       .align 16
> +L(shuf_rol16):
> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> +L(shuf_rol8):
> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> +L(counter1):
> +       .long 1,0,0,0
> +L(inc_counter):
> +       .long 0,1,2,3
> +L(unsigned_cmp):
> +       .long 0x80000000,0x80000000,0x80000000,0x80000000
> +
> +ENTRY (__chacha20_ssse3_blocks8)
> +       /* input:
> +        *      %rdi: input
> +        *      %rsi: dst
> +        *      %rdx: src
> +        *      %rcx: nblks (multiple of 4)
> +        */
> +
> +       pushq %rbp;
> +       cfi_adjust_cfa_offset(8);
> +       cfi_rel_offset(rbp, 0)
> +       movq %rsp, %rbp;
> +       cfi_def_cfa_register(%rbp);
> +
> +       subq $STACK_MAX, %rsp;
> +       andq $~15, %rsp;
> +
> +L(loop4):
> +       mov $20, ROUND;
> +
> +       /* Construct counter vectors X12 and X13 */
> +       movdqa L(inc_counter) rRIP, X0;
> +       movdqa L(unsigned_cmp) rRIP, X2;
> +       pbroadcastd((12 * 4)(INPUT), X12);
> +       pbroadcastd((13 * 4)(INPUT), X13);
> +       paddd X0, X12;
> +       movdqa X12, X1;
> +       pxor X2, X0;
> +       pxor X2, X1;
> +       pcmpgtd X1, X0;
> +       psubd X0, X13;
> +       movdqa X12, (STACK_VEC_X12)(%rsp);
> +       movdqa X13, (STACK_VEC_X13)(%rsp);
> +
> +       /* Load vectors */
> +       pbroadcastd((0 * 4)(INPUT), X0);
> +       pbroadcastd((1 * 4)(INPUT), X1);
> +       pbroadcastd((2 * 4)(INPUT), X2);
> +       pbroadcastd((3 * 4)(INPUT), X3);
> +       pbroadcastd((4 * 4)(INPUT), X4);
> +       pbroadcastd((5 * 4)(INPUT), X5);
> +       pbroadcastd((6 * 4)(INPUT), X6);
> +       pbroadcastd((7 * 4)(INPUT), X7);
> +       pbroadcastd((8 * 4)(INPUT), X8);
> +       pbroadcastd((9 * 4)(INPUT), X9);
> +       pbroadcastd((10 * 4)(INPUT), X10);
> +       pbroadcastd((11 * 4)(INPUT), X11);
> +       pbroadcastd((14 * 4)(INPUT), X14);
> +       pbroadcastd((15 * 4)(INPUT), X15);
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +
> +L(round2_4):
> +       QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X11,X15,,)
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X8, (STACK_TMP)(%rsp);
> +       movdqa X9, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,X9,,)
> +       QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,X9,,)
> +       movdqa (STACK_TMP)(%rsp), X8;
> +       movdqa (STACK_TMP1)(%rsp), X9;
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X11,X15,,)
> +       sub $2, ROUND;
> +       jnz .Lround2_4;
> +
> +       /* tmp := X15 */
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       pbroadcastd((0 * 4)(INPUT), X15);
> +       PLUS(X0, X15);
> +       pbroadcastd((1 * 4)(INPUT), X15);
> +       PLUS(X1, X15);
> +       pbroadcastd((2 * 4)(INPUT), X15);
> +       PLUS(X2, X15);
> +       pbroadcastd((3 * 4)(INPUT), X15);
> +       PLUS(X3, X15);
> +       pbroadcastd((4 * 4)(INPUT), X15);
> +       PLUS(X4, X15);
> +       pbroadcastd((5 * 4)(INPUT), X15);
> +       PLUS(X5, X15);
> +       pbroadcastd((6 * 4)(INPUT), X15);
> +       PLUS(X6, X15);
> +       pbroadcastd((7 * 4)(INPUT), X15);
> +       PLUS(X7, X15);
> +       pbroadcastd((8 * 4)(INPUT), X15);
> +       PLUS(X8, X15);
> +       pbroadcastd((9 * 4)(INPUT), X15);
> +       PLUS(X9, X15);
> +       pbroadcastd((10 * 4)(INPUT), X15);
> +       PLUS(X10, X15);
> +       pbroadcastd((11 * 4)(INPUT), X15);
> +       PLUS(X11, X15);
> +       movdqa (STACK_VEC_X12)(%rsp), X15;
> +       PLUS(X12, X15);
> +       movdqa (STACK_VEC_X13)(%rsp), X15;
> +       PLUS(X13, X15);
> +       movdqa X13, (STACK_TMP)(%rsp);
> +       pbroadcastd((14 * 4)(INPUT), X15);
> +       PLUS(X14, X15);
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X14, (STACK_TMP1)(%rsp);
> +       pbroadcastd((15 * 4)(INPUT), X13);
> +       PLUS(X15, X13);
> +       movdqa X15, (STACK_TMP2)(%rsp);
> +
> +       /* Update counter */
> +       addq $4, (12 * 4)(INPUT);
> +
> +       transpose_4x4(X0, X1, X2, X3, X13, X14, X15);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0, X15);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1, X15);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2, X15);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3, X15);
> +       transpose_4x4(X4, X5, X6, X7, X0, X1, X2);
> +       movdqa (STACK_TMP)(%rsp), X13;
> +       movdqa (STACK_TMP1)(%rsp), X14;
> +       movdqa (STACK_TMP2)(%rsp), X15;
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 1), X4, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 1), X5, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 1), X6, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 1), X7, X0);
> +       transpose_4x4(X8, X9, X10, X11, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11, X0);
> +       transpose_4x4(X12, X13, X14, X15, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 3), X12, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 3), X13, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 3), X14, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 3), X15, X0);
> +
> +       sub $4, NBLKS;
> +       lea (4 * 64)(DST), DST;
> +       lea (4 * 64)(SRC), SRC;
> +       jnz L(loop4);
> +
> +       /* clear the used vector registers and stack */
> +       clear(X0);
> +       movdqa X0, (STACK_VEC_X12)(%rsp);
> +       movdqa X0, (STACK_VEC_X13)(%rsp);
> +       movdqa X0, (STACK_TMP)(%rsp);
> +       movdqa X0, (STACK_TMP1)(%rsp);
> +       movdqa X0, (STACK_TMP2)(%rsp);
> +       clear(X1);
> +       clear(X2);
> +       clear(X3);
> +       clear(X4);
> +       clear(X5);
> +       clear(X6);
> +       clear(X7);
> +       clear(X8);
> +       clear(X9);
> +       clear(X10);
> +       clear(X11);
> +       clear(X12);
> +       clear(X13);
> +       clear(X14);
> +       clear(X15);
> +
> +       /* eax zeroed by round loop. */
> +       leave;
> +       cfi_adjust_cfa_offset(-8)
> +       cfi_def_cfa_register(%rsp);
> +       ret;
> +       int3;
Why int3?
> +END (__chacha20_ssse3_blocks8)
> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> new file mode 100644
> index 0000000000..37a4fdfb1f
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20_arch.h
> @@ -0,0 +1,42 @@
> +/* Chacha20 implementation, used on arc4random.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <ldsodefs.h>
> +#include <cpu-features.h>
> +#include <sys/param.h>
> +
> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> +                                      const uint8_t *src, size_t nblks);
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> +               size_t bytes)
> +{
> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)

Can we make this an ifunc?
> +    {
> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> +      nblocks -= nblocks % 4;
> +      __chacha20_ssse3_blocks8 (state->ctx, dst, src, nblocks);
> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> +    }
> +
> +  if (bytes > 0)
> +    chacha20_crypt_generic (state, dst, src, bytes);
> +}
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (6 preceding siblings ...)
  2022-04-13 20:24 ` [PATCH 7/7] powerpc64: " Adhemerval Zanella
@ 2022-04-14  7:36 ` Yann Droneaud
  2022-04-14 18:39   ` Adhemerval Zanella
  2022-04-14 11:49 ` Cristian Rodríguez
  8 siblings, 1 reply; 34+ messages in thread
From: Yann Droneaud @ 2022-04-14  7:36 UTC (permalink / raw)
  To: Adhemerval Zanella, GNU C Library

Hi,

Le 13/04/2022 à 22:23, Adhemerval Zanella via Libc-alpha a écrit :

> This patch adds the arc4random, arc4random_buf, and arc4random_uniform
> along with optimized versions for x86_64, aarch64, and powerpc64.
>
> The generic implementation is based on scalar Chacha20, with a global
> cache and locking.  It uses getrandom or /dev/urandom as fallback to
> get the initial entropy, and reseeds the internal state on every 16MB
> of consumed entropy.
>
> It maintains an internal buffer which consumes at most one page on
> most systems (assuming 4k pages).  The internal buffer optimizes the
> cipher encrypt calls by amortizing the arc4random calls (where both
> the function call and lock costs are the dominating factors).
>
> Fork detection is done by checking whether MADV_WIPEONFORK is supported.
> If not, the fork callback will reset the state on the fork call.  It does
> not handle direct clone calls, nor vfork or _Fork (arc4random is not
> async-signal-safe due to the internal lock usage, although the
> implementation does try to handle fork cases).
>
> The generic ChaCha20 implementation is based on RFC 8439 [1], and is
> a simple memcpy-with-xor implementation.

The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is 
the PRNG output.

Regards.

-- 

Yann Droneaud

OPTEYA



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
                   ` (7 preceding siblings ...)
  2022-04-14  7:36 ` [PATCH 0/7] Add arc4random support Yann Droneaud
@ 2022-04-14 11:49 ` Cristian Rodríguez
  2022-04-14 19:26   ` Adhemerval Zanella
  8 siblings, 1 reply; 34+ messages in thread
From: Cristian Rodríguez @ 2022-04-14 11:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

If this interface is going to be added, GNU extensions of arc4random
and arc4random_uniform that return uint64_t would be extremely cool.
Even cooler if there were no global state.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-13 23:12   ` Noah Goldstein
@ 2022-04-14 17:03     ` Adhemerval Zanella
  2022-04-14 17:10       ` Noah Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 17:03 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 13/04/2022 20:12, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> +
>> +       /* eax zeroed by round loop. */
>> +       leave;
>> +       cfi_adjust_cfa_offset(-8)
>> +       cfi_def_cfa_register(%rsp);
>> +       ret;
>> +       int3;
> why int3?

It was originally added to libgcrypt in commit 11ade08efbfbc36dbf3571f1026946269950bc40,
as straight-line speculation hardening.  It is what clang 14 and
gcc 12 emit with -mharden-sls=return.

I am not sure we need that kind of hardening, but I would prefer the first
version to be in sync with libgcrypt as much as possible, so that future
optimizations would be simpler to keep localized to glibc (if libgcrypt
does not want to backport them).

>> +END (__chacha20_ssse3_blocks8)
>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>> new file mode 100644
>> index 0000000000..37a4fdfb1f
>> --- /dev/null
>> +++ b/sysdeps/x86_64/chacha20_arch.h
>> @@ -0,0 +1,42 @@
>> +/* Chacha20 implementation, used on arc4random.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include <ldsodefs.h>
>> +#include <cpu-features.h>
>> +#include <sys/param.h>
>> +
>> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>> +                                      const uint8_t *src, size_t nblks);
>> +
>> +static inline void
>> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>> +               size_t bytes)
>> +{
>> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> 
> Can we make this an ifunc?

I thought about it, but if you check the arc4random implementation,
chacha20_crypt is called for the whole internal buffer once it is exhausted.
Assuming 1 cycle per byte (as indicated by libgcrypt's bench-slope on
my machine), it will take at least 1k cycles to encrypt each buffer refill.
I am not sure that setting up an internal PLT call to save a couple of
cycles on an internal function would show anything significant here
(assuming the PLT call would not in fact add more overhead).

Besides that, the code boilerplate to set up the internal ifunc is also
way more complex.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 17:03     ` Adhemerval Zanella
@ 2022-04-14 17:10       ` Noah Goldstein
  2022-04-14 17:18         ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 17:10 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Thu, Apr 14, 2022 at 12:03 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/04/2022 20:12, Noah Goldstein wrote:
> > On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> +
> >> +       /* eax zeroed by round loop. */
> >> +       leave;
> >> +       cfi_adjust_cfa_offset(-8)
> >> +       cfi_def_cfa_register(%rsp);
> >> +       ret;
> >> +       int3;
> > why int3?
>
> It was originally added to libgcrypt in commit 11ade08efbfbc36dbf3571f1026946269950bc40,
> as straight-line speculation hardening.  It is what clang 14 and
> gcc 12 emit with -mharden-sls=return.
>
> I am not sure we need that kind of hardening, but I would prefer the first
> version to be in sync with libgcrypt as much as possible, so that future
> optimizations would be simpler to keep localized to glibc (if libgcrypt
> does not want to backport them).

Okay, we can keep it for now.  Any thoughts on changing it to SSE2?


>
> >> +END (__chacha20_ssse3_blocks8)
> >> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> >> new file mode 100644
> >> index 0000000000..37a4fdfb1f
> >> --- /dev/null
> >> +++ b/sysdeps/x86_64/chacha20_arch.h
> >> @@ -0,0 +1,42 @@
> >> +/* Chacha20 implementation, used on arc4random.
> >> +   Copyright (C) 2022 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include <ldsodefs.h>
> >> +#include <cpu-features.h>
> >> +#include <sys/param.h>
> >> +
> >> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> >> +                                      const uint8_t *src, size_t nblks);
> >> +
> >> +static inline void
> >> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> >> +               size_t bytes)
> >> +{
> >> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> >
> > Can we make this an ifunc?
>
> I thought about it, but if you check the arc4random implementation,
> chacha20_crypt is called for the whole internal buffer once it is exhausted.
> Assuming 1 cycle per byte (as indicated by libgcrypt's bench-slope on
> my machine), it will take at least 1k cycles to encrypt each buffer refill.
> I am not sure that setting up an internal PLT call to save a couple of
> cycles on an internal function would show anything significant here
> (assuming the PLT call would not in fact add more overhead).
>
> Besides that, the code boilerplate to set up the internal ifunc is also
> way more complex.

Okay for now, as long as we are open to changing it later (not that we
necessarily will, but this shouldn't lock us into the decision).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
  2022-04-13 23:04   ` Noah Goldstein
@ 2022-04-14 17:16     ` Adhemerval Zanella
  2022-04-14 17:20       ` Noah Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 17:16 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 13/04/2022 20:04, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> +       .text
> 
> section avx2
> 

Ack, I changed to '.section .text.avx2, "ax", @progbits'.

>> +       .align 32
>> +chacha20_data:
>> +L(shuf_rol16):
>> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
>> +L(shuf_rol8):
>> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
>> +L(inc_counter):
>> +       .byte 0,1,2,3,4,5,6,7
>> +L(unsigned_cmp):
>> +       .long 0x80000000
>> +
>> +ENTRY (__chacha20_avx2_blocks8)
>> +       /* input:
>> +        *      %rdi: input
>> +        *      %rsi: dst
>> +        *      %rdx: src
>> +        *      %rcx: nblks (multiple of 8)
>> +        */
>> +       vzeroupper;
> 
> vzeroupper needs to be replaced with VZEROUPPER_RETURN
> and we need a transaction safe version unless this can never
> be called during a transaction.

I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
test case failures).  What do you mean by a 'transaction safe version'?
An extra __chacha20_avx2_blocks8 implementation to handle it?  Or
disabling it if RTM is enabled?

>> +
>> +       /* clear the used vector registers and stack */
>> +       vpxor X0, X0, X0;
>> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
>> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
>> +       vmovdqa X0, (STACK_TMP)(%rsp);
>> +       vmovdqa X0, (STACK_TMP1)(%rsp);
>> +       vzeroall;
> 
> Do you need vzeroall?
> Why not vzeroupper? Is it a security concern to leave info in the xmm pieces?

I would assume so, since it is in the original libgcrypt optimization.  As
with the SSSE3 version, I am not sure whether we really need that level of
hardening, but it would be good to have the initial revision as close
as possible to libgcrypt.

> 
> 
>> +
>> +       /* eax zeroed by round loop. */
>> +       leave;
>> +       cfi_adjust_cfa_offset(-8)
>> +       cfi_def_cfa_register(%rsp);
>> +       ret;
>> +       int3;
> 
> Why do we need int3 here?

I think the SSSE3 rationale applies here as well.

>> +END(__chacha20_avx2_blocks8)
>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>> index 37a4fdfb1f..7e9e7755f3 100644
>> --- a/sysdeps/x86_64/chacha20_arch.h
>> +++ b/sysdeps/x86_64/chacha20_arch.h
>> @@ -22,11 +22,25 @@
>>
>>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>>                                        const uint8_t *src, size_t nblks);
>> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
>> +                                     const uint8_t *src, size_t nblks);
>>
>>  static inline void
>>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>>                 size_t bytes)
>>  {
>> +  const struct cpu_features* cpu_features = __get_cpu_features ();
> 
> Can we do this with an ifunc and take the cpufeature check off the critical
> path?

Ditto.

>> +
>> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
>> +    {
>> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> +      nblocks -= nblocks % 8;
>> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
>> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
>> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
>> +      src += nblocks * CHACHA20_BLOCK_SIZE;
>> +    }
>> +
>>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>>      {
>>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> --
>> 2.32.0
>>
> 
> Do you want optimization comments or do that later?

Ideally I would like to check whether the proposed arc4random implementation
is what we want (with the current approach of using atfork handlers and the
key reschedule).  The cipher itself is not of the utmost importance, in the
sense that it is transparent to the user and we can eventually replace it if
there is any issue with, or attack on, ChaCha20.  Initially I was not going
to add any arch-specific optimizations, but since libgcrypt provides some
that fit the current approach I thought they would be a nice thing to have.

For optimization comments it would be good to sync with libgcrypt as well;
I think that project will be interested in any performance improvements
you might have for the ChaCha20 implementations.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
  2022-04-13 23:12   ` Noah Goldstein
@ 2022-04-14 17:17   ` Noah Goldstein
  2022-04-14 18:11     ` Adhemerval Zanella
  2022-04-14 19:25   ` Noah Goldstein
  2 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 17:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Wed, Apr 13, 2022 at 3:27 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> It adds a vectorized ChaCha20 implementation based on libgcrypt
> cipher/chacha20-amd64-ssse3.S.  It is used only if SSSE3 is supported
> and enabled by the architecture.
>
> On a Ryzen 9 5900X it shows the following improvements (using
> formatted bench-arc4random data):
>
> GENERIC
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               375.06
> arc4random_buf(0) [single-thread]        498.50
> arc4random_buf(16) [single-thread]       576.86
> arc4random_buf(32) [single-thread]       615.76
> arc4random_buf(64) [single-thread]       633.97
> --------------------------------------------------
> arc4random [multi-thread]                359.86
> arc4random_buf(0) [multi-thread]         479.27
> arc4random_buf(16) [multi-thread]        543.65
> arc4random_buf(32) [multi-thread]        581.98
> arc4random_buf(64) [multi-thread]        603.01
> --------------------------------------------------
>
> SSSE3:
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               576.55
> arc4random_buf(0) [single-thread]        961.77
> arc4random_buf(16) [single-thread]       1309.38
> arc4random_buf(32) [single-thread]       1558.69
> arc4random_buf(64) [single-thread]       1728.54
> --------------------------------------------------
> arc4random [multi-thread]                589.52
> arc4random_buf(0) [multi-thread]         967.39
> arc4random_buf(16) [multi-thread]        1319.27
> arc4random_buf(32) [multi-thread]        1552.96
> arc4random_buf(64) [multi-thread]        1734.27
> --------------------------------------------------
>
> Checked on x86_64-linux-gnu.
> ---
>  LICENSES                        |  20 ++
>  sysdeps/generic/chacha20_arch.h |  24 +++
>  sysdeps/x86_64/Makefile         |   6 +
>  sysdeps/x86_64/chacha20-ssse3.S | 330 ++++++++++++++++++++++++++++++++
>  sysdeps/x86_64/chacha20_arch.h  |  42 ++++
>  5 files changed, 422 insertions(+)
>  create mode 100644 sysdeps/generic/chacha20_arch.h
>  create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
>  create mode 100644 sysdeps/x86_64/chacha20_arch.h
>
> diff --git a/LICENSES b/LICENSES
> index 530893b1dc..2563abd9e2 100644
> --- a/LICENSES
> +++ b/LICENSES
> @@ -389,3 +389,23 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
>   You should have received a copy of the GNU Lesser General Public
>   License along with this library; if not, see
>   <https://www.gnu.org/licenses/>.  */
> +
> +sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
> +following notices:
> +
> +Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
> +
> +This file is part of Libgcrypt.
> +
> +Libgcrypt is free software; you can redistribute it and/or modify
> +it under the terms of the GNU Lesser General Public License as
> +published by the Free Software Foundation; either version 2.1 of
> +the License, or (at your option) any later version.
> +
> +Libgcrypt is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU Lesser General Public License for more details.
> +
> +You should have received a copy of the GNU Lesser General Public
> +License along with this program; if not, see <http://www.gnu.org/licenses/>.
> diff --git a/sysdeps/generic/chacha20_arch.h b/sysdeps/generic/chacha20_arch.h
> new file mode 100644
> index 0000000000..d7200ac583
> --- /dev/null
> +++ b/sysdeps/generic/chacha20_arch.h
> @@ -0,0 +1,24 @@
> +/* Chacha20 implementation, generic interface.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
> +               const uint8_t *src, size_t bytes)
> +{
> +  chacha20_crypt_generic (state, dst, src, bytes);
> +}
> diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
> index 79365aff2a..f43b6a1180 100644
> --- a/sysdeps/x86_64/Makefile
> +++ b/sysdeps/x86_64/Makefile
> @@ -5,6 +5,12 @@ ifeq ($(subdir),csu)
>  gen-as-const-headers += link-defines.sym
>  endif
>
> +ifeq ($(subdir),stdlib)
> +sysdep_routines += \
> +  chacha20-ssse3 \
> +  # sysdep_routines
> +endif
> +
>  ifeq ($(subdir),gmon)
>  sysdep_routines += _mcount
>  # We cannot compile _mcount.S with -pg because that would create
> diff --git a/sysdeps/x86_64/chacha20-ssse3.S b/sysdeps/x86_64/chacha20-ssse3.S
> new file mode 100644
> index 0000000000..f221daf634
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20-ssse3.S
> @@ -0,0 +1,330 @@
> +/* Optimized SSSE3 implementation of ChaCha20 cipher.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Based on D. J. Bernstein reference implementation at
> +   http://cr.yp.to/chacha.html:
> +
> +   chacha-regs.c version 20080118
> +   D. J. Bernstein
> +   Public domain.  */
> +
> +#include <sysdep.h>
> +
> +#ifdef PIC
> +#  define rRIP (%rip)
> +#else
> +#  define rRIP
> +#endif
> +
> +/* register macros */
> +#define INPUT %rdi
> +#define DST   %rsi
> +#define SRC   %rdx
> +#define NBLKS %rcx
> +#define ROUND %eax
> +
> +/* stack structure */
> +#define STACK_VEC_X12 (16)
> +#define STACK_VEC_X13 (16 + STACK_VEC_X12)
> +#define STACK_TMP     (16 + STACK_VEC_X13)
> +#define STACK_TMP1    (16 + STACK_TMP)
> +#define STACK_TMP2    (16 + STACK_TMP1)
> +
> +#define STACK_MAX     (16 + STACK_TMP2)
> +
> +/* vector registers */
> +#define X0 %xmm0
> +#define X1 %xmm1
> +#define X2 %xmm2
> +#define X3 %xmm3
> +#define X4 %xmm4
> +#define X5 %xmm5
> +#define X6 %xmm6
> +#define X7 %xmm7
> +#define X8 %xmm8
> +#define X9 %xmm9
> +#define X10 %xmm10
> +#define X11 %xmm11
> +#define X12 %xmm12
> +#define X13 %xmm13
> +#define X14 %xmm14
> +#define X15 %xmm15
> +
> +/**********************************************************************
> +  helper macros
> + **********************************************************************/
> +
> +/* 4x4 32-bit integer matrix transpose */
> +#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
> +       movdqa    x0, t2; \
> +       punpckhdq x1, t2; \
> +       punpckldq x1, x0; \
> +       \
> +       movdqa    x2, t1; \
> +       punpckldq x3, t1; \
> +       punpckhdq x3, x2; \
> +       \
> +       movdqa     x0, x1; \
> +       punpckhqdq t1, x1; \
> +       punpcklqdq t1, x0; \
> +       \
> +       movdqa     t2, x3; \
> +       punpckhqdq x2, x3; \
> +       punpcklqdq x2, t2; \
> +       movdqa     t2, x2;
> +
> +/* fill xmm register with 32-bit value from memory */
> +#define pbroadcastd(mem32, xreg) \
> +       movd mem32, xreg; \
> +       pshufd $0, xreg, xreg;
> +
> +/* xor with unaligned memory operand */
> +#define pxor_u(umem128, xreg, t) \
> +       movdqu umem128, t; \
> +       pxor t, xreg;
> +
> +/* xor register with unaligned src and save to unaligned dst */
> +#define xor_src_dst(dst, src, offset, xreg, t) \
> +       pxor_u(offset(src), xreg, t); \
> +       movdqu xreg, offset(dst);
> +
> +#define clear(x) pxor x,x;
> +
> +/**********************************************************************
> +  4-way chacha20
> + **********************************************************************/
> +
> +#define ROTATE2(v1,v2,c,tmp1,tmp2)     \
> +       movdqa v1, tmp1;                \
> +       movdqa v2, tmp2;                \
> +       psrld $(32 - (c)), v1;          \
> +       pslld $(c), tmp1;               \
> +       paddb tmp1, v1;                 \
> +       psrld $(32 - (c)), v2;          \
> +       pslld $(c), tmp2;               \
> +       paddb tmp2, v2;
> +
> +#define ROTATE_SHUF_2(v1,v2,shuf)      \
> +       pshufb shuf, v1;                \
> +       pshufb shuf, v2;
> +
> +#define XOR(ds,s) \
> +       pxor s, ds;
> +
> +#define PLUS(ds,s) \
> +       paddd s, ds;
> +
> +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2,\
> +                     interleave_op1,interleave_op2)            \
> +       movdqa L(shuf_rol16) rRIP, tmp1;                        \
> +               interleave_op1;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2, 12, tmp1, tmp2);                    \
> +       movdqa L(shuf_rol8) rRIP, tmp1;                         \
> +               interleave_op2;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2,  7, tmp1, tmp2);
> +
> +       .text
> +
> +chacha20_data:
> +       .align 16
> +L(shuf_rol16):
> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> +L(shuf_rol8):
> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> +L(counter1):
> +       .long 1,0,0,0
> +L(inc_counter):
> +       .long 0,1,2,3
> +L(unsigned_cmp):
> +       .long 0x80000000,0x80000000,0x80000000,0x80000000
> +
> +ENTRY (__chacha20_ssse3_blocks8)
> +       /* input:
> +        *      %rdi: input
> +        *      %rsi: dst
> +        *      %rdx: src
> +        *      %rcx: nblks (multiple of 4)
> +        */
> +
> +       pushq %rbp;
> +       cfi_adjust_cfa_offset(8);
> +       cfi_rel_offset(rbp, 0)
> +       movq %rsp, %rbp;
> +       cfi_def_cfa_register(%rbp);
> +
> +       subq $STACK_MAX, %rsp;
> +       andq $~15, %rsp;
> +
> +L(loop4):
> +       mov $20, ROUND;
> +
> +       /* Construct counter vectors X12 and X13 */
> +       movdqa L(inc_counter) rRIP, X0;
> +       movdqa L(unsigned_cmp) rRIP, X2;
> +       pbroadcastd((12 * 4)(INPUT), X12);
> +       pbroadcastd((13 * 4)(INPUT), X13);
> +       paddd X0, X12;
> +       movdqa X12, X1;
> +       pxor X2, X0;
> +       pxor X2, X1;
> +       pcmpgtd X1, X0;
> +       psubd X0, X13;
> +       movdqa X12, (STACK_VEC_X12)(%rsp);
> +       movdqa X13, (STACK_VEC_X13)(%rsp);
> +
> +       /* Load vectors */
> +       pbroadcastd((0 * 4)(INPUT), X0);
> +       pbroadcastd((1 * 4)(INPUT), X1);
> +       pbroadcastd((2 * 4)(INPUT), X2);
> +       pbroadcastd((3 * 4)(INPUT), X3);
> +       pbroadcastd((4 * 4)(INPUT), X4);
> +       pbroadcastd((5 * 4)(INPUT), X5);
> +       pbroadcastd((6 * 4)(INPUT), X6);
> +       pbroadcastd((7 * 4)(INPUT), X7);
> +       pbroadcastd((8 * 4)(INPUT), X8);
> +       pbroadcastd((9 * 4)(INPUT), X9);
> +       pbroadcastd((10 * 4)(INPUT), X10);
> +       pbroadcastd((11 * 4)(INPUT), X11);
> +       pbroadcastd((14 * 4)(INPUT), X14);
> +       pbroadcastd((15 * 4)(INPUT), X15);
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +
> +L(round2_4):
> +       QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X11,X15,,)
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X8, (STACK_TMP)(%rsp);
> +       movdqa X9, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,X9,,)
> +       QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,X9,,)
> +       movdqa (STACK_TMP)(%rsp), X8;
> +       movdqa (STACK_TMP1)(%rsp), X9;
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X11,X15,,)
> +       sub $2, ROUND;
> +       jnz .Lround2_4;
> +
> +       /* tmp := X15 */
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       pbroadcastd((0 * 4)(INPUT), X15);
> +       PLUS(X0, X15);
> +       pbroadcastd((1 * 4)(INPUT), X15);
> +       PLUS(X1, X15);
> +       pbroadcastd((2 * 4)(INPUT), X15);
> +       PLUS(X2, X15);
> +       pbroadcastd((3 * 4)(INPUT), X15);
> +       PLUS(X3, X15);
> +       pbroadcastd((4 * 4)(INPUT), X15);
> +       PLUS(X4, X15);
> +       pbroadcastd((5 * 4)(INPUT), X15);
> +       PLUS(X5, X15);
> +       pbroadcastd((6 * 4)(INPUT), X15);
> +       PLUS(X6, X15);
> +       pbroadcastd((7 * 4)(INPUT), X15);
> +       PLUS(X7, X15);
> +       pbroadcastd((8 * 4)(INPUT), X15);
> +       PLUS(X8, X15);
> +       pbroadcastd((9 * 4)(INPUT), X15);
> +       PLUS(X9, X15);
> +       pbroadcastd((10 * 4)(INPUT), X15);
> +       PLUS(X10, X15);
> +       pbroadcastd((11 * 4)(INPUT), X15);
> +       PLUS(X11, X15);
> +       movdqa (STACK_VEC_X12)(%rsp), X15;
> +       PLUS(X12, X15);
> +       movdqa (STACK_VEC_X13)(%rsp), X15;
> +       PLUS(X13, X15);
> +       movdqa X13, (STACK_TMP)(%rsp);
> +       pbroadcastd((14 * 4)(INPUT), X15);
> +       PLUS(X14, X15);
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X14, (STACK_TMP1)(%rsp);
> +       pbroadcastd((15 * 4)(INPUT), X13);
> +       PLUS(X15, X13);
> +       movdqa X15, (STACK_TMP2)(%rsp);
> +
> +       /* Update counter */
> +       addq $4, (12 * 4)(INPUT);
> +
> +       transpose_4x4(X0, X1, X2, X3, X13, X14, X15);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0, X15);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1, X15);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2, X15);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3, X15);
> +       transpose_4x4(X4, X5, X6, X7, X0, X1, X2);
> +       movdqa (STACK_TMP)(%rsp), X13;
> +       movdqa (STACK_TMP1)(%rsp), X14;
> +       movdqa (STACK_TMP2)(%rsp), X15;
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 1), X4, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 1), X5, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 1), X6, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 1), X7, X0);
> +       transpose_4x4(X8, X9, X10, X11, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11, X0);
> +       transpose_4x4(X12, X13, X14, X15, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 3), X12, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 3), X13, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 3), X14, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 3), X15, X0);
> +
> +       sub $4, NBLKS;
> +       lea (4 * 64)(DST), DST;
> +       lea (4 * 64)(SRC), SRC;
> +       jnz L(loop4);
> +
> +       /* clear the used vector registers and stack */
> +       clear(X0);
> +       movdqa X0, (STACK_VEC_X12)(%rsp);
> +       movdqa X0, (STACK_VEC_X13)(%rsp);
> +       movdqa X0, (STACK_TMP)(%rsp);
> +       movdqa X0, (STACK_TMP1)(%rsp);
> +       movdqa X0, (STACK_TMP2)(%rsp);
> +       clear(X1);
> +       clear(X2);
> +       clear(X3);
> +       clear(X4);
> +       clear(X5);
> +       clear(X6);
> +       clear(X7);
> +       clear(X8);
> +       clear(X9);
> +       clear(X10);
> +       clear(X11);
> +       clear(X12);
> +       clear(X13);
> +       clear(X14);
> +       clear(X15);

No need to change now, but out of curiosity (and possible future optimization),
do we need the clears for our purposes?
> +
> +       /* eax zeroed by round loop. */
> +       leave;
> +       cfi_adjust_cfa_offset(-8)
> +       cfi_def_cfa_register(%rsp);
> +       ret;
> +       int3;
> +END (__chacha20_ssse3_blocks8)
> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> new file mode 100644
> index 0000000000..37a4fdfb1f
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20_arch.h
> @@ -0,0 +1,42 @@
> +/* Chacha20 implementation, used on arc4random.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <ldsodefs.h>
> +#include <cpu-features.h>
> +#include <sys/param.h>
> +
> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> +                                      const uint8_t *src, size_t nblks);
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> +               size_t bytes)
> +{
> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> +    {
> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> +      nblocks -= nblocks % 4;
> +      __chacha20_ssse3_blocks8 (state->ctx, dst, src, nblocks);
> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> +    }
> +
> +  if (bytes > 0)
> +    chacha20_crypt_generic (state, dst, src, bytes);
> +}
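For reference, the hunk above peels off a multiple of 4 blocks for the optimized kernel and lets the generic code finish the tail.  A minimal scalar sketch of that dispatch shape, with hypothetical identity "kernels" standing in for the real SSSE3 and generic routines:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Hypothetical stand-in for __chacha20_ssse3_blocks8: consumes whole
   blocks only (identity "cipher" for the sketch).  */
static void
blocks_optimized (uint8_t *dst, const uint8_t *src, size_t nblks)
{
  memcpy (dst, src, nblks * BLOCK_SIZE);
}

/* Hypothetical stand-in for chacha20_crypt_generic: handles any length.  */
static void
crypt_generic (uint8_t *dst, const uint8_t *src, size_t bytes)
{
  memcpy (dst, src, bytes);
}

/* Same shape as the patch: round the block count down to a multiple of 4
   for the fast path, then hand the remainder to the generic code.  */
static void
crypt_dispatch (uint8_t *dst, const uint8_t *src, size_t bytes)
{
  if (bytes >= BLOCK_SIZE * 4)
    {
      size_t nblocks = bytes / BLOCK_SIZE;
      nblocks -= nblocks % 4;
      blocks_optimized (dst, src, nblocks);
      bytes -= nblocks * BLOCK_SIZE;
      dst += nblocks * BLOCK_SIZE;
      src += nblocks * BLOCK_SIZE;
    }
  if (bytes > 0)
    crypt_generic (dst, src, bytes);
}
```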
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 17:10       ` Noah Goldstein
@ 2022-04-14 17:18         ` Adhemerval Zanella
  2022-04-14 17:22           ` Noah Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 17:18 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 14:10, Noah Goldstein wrote:
> On Thu, Apr 14, 2022 at 12:03 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/04/2022 20:12, Noah Goldstein wrote:
>>> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> +
>>>> +       /* eax zeroed by round loop. */
>>>> +       leave;
>>>> +       cfi_adjust_cfa_offset(-8)
>>>> +       cfi_def_cfa_register(%rsp);
>>>> +       ret;
>>>> +       int3;
>>> why int3?
>>
>> It was originally added on libgcrypt by 11ade08efbfbc36dbf3571f1026946269950bc40,
>> as straight-line speculation hardening.  It is what is emitted by clang 14 and
>> gcc 12 with -mharden-sls=return.
>>
>> I am not sure if we need that kind of hardening, but I would prefer that the first
>> version be in sync with libgcrypt as much as possible so the future optimizations
>> would be simpler to keep localized to glibc (if libgcrypt does not want to
>> backport it).
> 
> Okay, can keep for now. Any thoughts on changing it to sse2?
> 

No strong feelings; I used the SSSE3 one because it is readily available from
libgcrypt.

> 
>>
>>>> +END (__chacha20_ssse3_blocks8)
>>>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>>>> new file mode 100644
>>>> index 0000000000..37a4fdfb1f
>>>> --- /dev/null
>>>> +++ b/sysdeps/x86_64/chacha20_arch.h
>>>> @@ -0,0 +1,42 @@
>>>> +/* Chacha20 implementation, used on arc4random.
>>>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>>>> +   This file is part of the GNU C Library.
>>>> +
>>>> +   The GNU C Library is free software; you can redistribute it and/or
>>>> +   modify it under the terms of the GNU Lesser General Public
>>>> +   License as published by the Free Software Foundation; either
>>>> +   version 2.1 of the License, or (at your option) any later version.
>>>> +
>>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> +   Lesser General Public License for more details.
>>>> +
>>>> +   You should have received a copy of the GNU Lesser General Public
>>>> +   License along with the GNU C Library; if not, see
>>>> +   <http://www.gnu.org/licenses/>.  */
>>>> +
>>>> +#include <ldsodefs.h>
>>>> +#include <cpu-features.h>
>>>> +#include <sys/param.h>
>>>> +
>>>> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>>>> +                                      const uint8_t *src, size_t nblks);
>>>> +
>>>> +static inline void
>>>> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>>>> +               size_t bytes)
>>>> +{
>>>> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>>>
>>> Can we make this an ifunc?
>>
>> I thought about it, but if you check the arc4random implementation, the
>> chacha20_crypt is called for the whole internal buf once it is exhausted.
>> Assuming 1 cycle per byte (as indicated by libgcrypt's bench-slope on
>> my machine), it will be at least 1k cycles to encrypt each block.  I
>> am not sure if setting up an internal PLT call to save a couple of cycles
>> on an internal function will really show anything significant here (assuming
>> that the PLT call won't add more overhead in fact).
>>
>> Besides that the code boilerplate to setup the internal ifunc is also
>> way more complex.
> 
> Okay for now as long as open to changing later (not that we will but that
> this isn't locking us into the decision).

For sure, if an ifunc does help in this case the change should be simple for
the generic code.  The boilerplate is in the x86_64 bits, in fact (to
set up the ifunc resolver, Makefile, etc.).
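For comparison, what an ifunc buys can be approximated in plain C with a once-resolved function pointer.  This is only a sketch of the dispatch idea being discussed (all names and the feature probe are hypothetical, not glibc's internals); a real ifunc moves the resolution to load time and drops the per-call NULL check:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int cpu_has_ssse3 (void) { return 0; }   /* hypothetical feature probe */

/* Placeholder kernels standing in for the generic and SSSE3 block
   functions; the bodies just echo the block count for the sketch.  */
static unsigned int
blocks_generic (uint32_t *state, uint8_t *dst, const uint8_t *src, size_t n)
{
  (void) state; (void) dst; (void) src;
  return (unsigned int) n;
}

static unsigned int
blocks_ssse3 (uint32_t *state, uint8_t *dst, const uint8_t *src, size_t n)
{
  (void) state; (void) dst; (void) src;
  return (unsigned int) n;
}

typedef unsigned int (*blocks_fn) (uint32_t *, uint8_t *, const uint8_t *,
                                   size_t);

/* What an ifunc resolver does: pick the implementation once, based on CPU
   features, instead of branching on every call.  */
static blocks_fn
resolve_blocks (void)
{
  return cpu_has_ssse3 () ? blocks_ssse3 : blocks_generic;
}

static unsigned int
chacha20_blocks (uint32_t *state, uint8_t *dst, const uint8_t *src, size_t n)
{
  static blocks_fn fn;
  if (fn == NULL)          /* lazy one-time resolution; an ifunc avoids this */
    fn = resolve_blocks ();
  return fn (state, dst, src, n);
}
```

The cost argument above still holds either way: with roughly 1k cycles spent per encrypted block, the handful of cycles saved per call is in the noise.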

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
  2022-04-14 17:16     ` Adhemerval Zanella
@ 2022-04-14 17:20       ` Noah Goldstein
  2022-04-14 18:12         ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 17:20 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Thu, Apr 14, 2022 at 12:17 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/04/2022 20:04, Noah Goldstein wrote:
> > On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> +       .text
> >
> > section avx2
> >
>
> Ack, I changed to '.section .text.avx2, "ax", @progbits'.
>
> >> +       .align 32
> >> +chacha20_data:
> >> +L(shuf_rol16):
> >> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> >> +L(shuf_rol8):
> >> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> >> +L(inc_counter):
> >> +       .byte 0,1,2,3,4,5,6,7
> >> +L(unsigned_cmp):
> >> +       .long 0x80000000
> >> +
> >> +ENTRY (__chacha20_avx2_blocks8)
> >> +       /* input:
> >> +        *      %rdi: input
> >> +        *      %rsi: dst
> >> +        *      %rdx: src
> >> +        *      %rcx: nblks (multiple of 8)
> >> +        */
> >> +       vzeroupper;
> >
> > vzeroupper needs to be replaced with VZEROUPPER_RETURN
> > and we need a transaction safe version unless this can never
> > be called during a transaction.
>
> I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
> test case failures). What do you mean by a 'transaction safe version'?
> An extra __chacha20_avx2_blocks8 implementation to handle it? Or disable
> it if RTM is enabled?

For now you can just update the cpufeature check to do ssse3 if RTM is enabled.
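That selection order can be sketched as follows (hypothetical feature flags; the real code uses glibc's cpu-features machinery): prefer AVX2 only when RTM is off, otherwise demote to SSSE3, then to the generic code:

```c
#include <assert.h>

enum impl { IMPL_GENERIC, IMPL_SSSE3, IMPL_AVX2 };

/* Hypothetical selector mirroring the suggested check: the AVX2 kernel
   touches ymm state, which interacts badly with RTM transactions, so an
   RTM-capable CPU falls back to the SSSE3 path.  */
static enum impl
select_impl (int has_avx2, int has_ssse3, int has_rtm)
{
  if (has_avx2 && !has_rtm)
    return IMPL_AVX2;
  if (has_ssse3)
    return IMPL_SSSE3;
  return IMPL_GENERIC;
}
```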

>
> >> +
> >> +       /* clear the used vector registers and stack */
> >> +       vpxor X0, X0, X0;
> >> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
> >> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
> >> +       vmovdqa X0, (STACK_TMP)(%rsp);
> >> +       vmovdqa X0, (STACK_TMP1)(%rsp);
> >> +       vzeroall;
> >
> > Do you need vzeroall?
> > Why not vzeroupper? Is it a security concern to leave info in the xmm pieces?
>
> I would assume, since it is in the original libgcrypt optimization.  As
> for the ssse3 version, I am not sure if we really need that level of
> hardening, but it would be good to have the initial revision as close
> as possible from libgcrypt.

Got it.

>
> >
> >
> >> +
> >> +       /* eax zeroed by round loop. */
> >> +       leave;
> >> +       cfi_adjust_cfa_offset(-8)
> >> +       cfi_def_cfa_register(%rsp);
> >> +       ret;
> >> +       int3;
> >
> > Why do we need int3 here?
>
> I think the ssse3 applies here as well.
>
> >> +END(__chacha20_avx2_blocks8)
> >> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> >> index 37a4fdfb1f..7e9e7755f3 100644
> >> --- a/sysdeps/x86_64/chacha20_arch.h
> >> +++ b/sysdeps/x86_64/chacha20_arch.h
> >> @@ -22,11 +22,25 @@
> >>
> >>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> >>                                        const uint8_t *src, size_t nblks);
> >> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
> >> +                                     const uint8_t *src, size_t nblks);
> >>
> >>  static inline void
> >>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> >>                 size_t bytes)
> >>  {
> >> +  const struct cpu_features* cpu_features = __get_cpu_features ();
> >
> > Can we do this with an ifunc and take the cpufeature check off the critical
> > path?
>
> Ditto.
>
> >> +
> >> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
> >> +    {
> >> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> >> +      nblocks -= nblocks % 8;
> >> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
> >> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> >> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> >> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> >> +    }
> >> +
> >>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> >>      {
> >>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> >> --
> >> 2.32.0
> >>
> >
> > Do you want optimization comments or do that later?
>
> Ideally I would like to check if the proposed arc4random implementation
> is what we want (with current approach of using atfork handlers and the
> key reschedule).  The cipher itself is not of utmost importance, in the
> sense that it is transparent to the user, and we can eventually replace it
> if there is any issue with or attack on ChaCha20.  Initially I won't add
> any arch-specific optimization, but since libgcrypt provides some that fit
> the current approach I thought it would be a nice thing to have.
>
> For optimization comments it would be good to sync with libgcrypt as well,
> I think the project will be interested in any performance improvement
> you might have for the chacha implementations.
Okay, I'll probably take a stab at this in the not too distant future.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 17:18         ` Adhemerval Zanella
@ 2022-04-14 17:22           ` Noah Goldstein
  2022-04-14 18:25             ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 17:22 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Thu, Apr 14, 2022 at 12:19 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 14/04/2022 14:10, Noah Goldstein wrote:
> > On Thu, Apr 14, 2022 at 12:03 PM Adhemerval Zanella
> > <adhemerval.zanella@linaro.org> wrote:
> >>
> >>
> >>
> >> On 13/04/2022 20:12, Noah Goldstein wrote:
> >>> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> >>> <libc-alpha@sourceware.org> wrote:
> >>>>
> >>>> +
> >>>> +       /* eax zeroed by round loop. */
> >>>> +       leave;
> >>>> +       cfi_adjust_cfa_offset(-8)
> >>>> +       cfi_def_cfa_register(%rsp);
> >>>> +       ret;
> >>>> +       int3;
> >>> why int3?
> >>
> >> It was originally added on libgcrypt by 11ade08efbfbc36dbf3571f1026946269950bc40,
> >> as straight-line speculation hardening.  It is what is emitted by clang 14 and
> >> gcc 12 with -mharden-sls=return.
> >>
> >> I am not sure if we need that kind of hardening, but I would prefer that the first
> >> version be in sync with libgcrypt as much as possible so the future optimizations
> >> would be simpler to keep localized to glibc (if libgcrypt does not want to
> >> backport it).
> >
> > Okay, can keep for now. Any thoughts on changing it to sse2?
> >
>
> No strong feelings; I used the SSSE3 one because it is readily available from
> libgcrypt.

I think the only SSSE3 instruction is `pshufb`, so you can just replace the
optimized rotates with shift-based rotates and that will make it SSE2 (unless
I'm missing an instruction).

Also, can you add the proper .text section here as well (.text.sse2 or
.text.ssse3)?
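The rotate replacement is mechanical: `pshufb` implements the byte-granular rotates (by 16 and by 8 bits) as a byte shuffle, but the same result falls out of two shifts and an OR, which SSE2 covers with `pslld`/`psrld`/`por`.  A scalar C sketch of the equivalence (names hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Generic 32-bit left rotate for 0 < n < 32: compiles to shift/shift/or,
   the SSE2-friendly form (pslld/psrld/por per lane in the vector code).  */
static inline uint32_t
rotl32 (uint32_t x, unsigned int n)
{
  return (x << n) | (x >> (32 - n));
}

/* The byte-granular cases that pshufb handles as a byte shuffle.  */
static inline uint32_t rol16 (uint32_t x) { return rotl32 (x, 16); }
static inline uint32_t rol8  (uint32_t x) { return rotl32 (x, 8); }
```

The remaining ChaCha20 rotates (by 12 and 7) are not byte-aligned, so they already use the shift form in both variants.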

>
> >
> >>
> >>>> +END (__chacha20_ssse3_blocks8)
> >>>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> >>>> new file mode 100644
> >>>> index 0000000000..37a4fdfb1f
> >>>> --- /dev/null
> >>>> +++ b/sysdeps/x86_64/chacha20_arch.h
> >>>> @@ -0,0 +1,42 @@
> >>>> +/* Chacha20 implementation, used on arc4random.
> >>>> +   Copyright (C) 2022 Free Software Foundation, Inc.
> >>>> +   This file is part of the GNU C Library.
> >>>> +
> >>>> +   The GNU C Library is free software; you can redistribute it and/or
> >>>> +   modify it under the terms of the GNU Lesser General Public
> >>>> +   License as published by the Free Software Foundation; either
> >>>> +   version 2.1 of the License, or (at your option) any later version.
> >>>> +
> >>>> +   The GNU C Library is distributed in the hope that it will be useful,
> >>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>>> +   Lesser General Public License for more details.
> >>>> +
> >>>> +   You should have received a copy of the GNU Lesser General Public
> >>>> +   License along with the GNU C Library; if not, see
> >>>> +   <http://www.gnu.org/licenses/>.  */
> >>>> +
> >>>> +#include <ldsodefs.h>
> >>>> +#include <cpu-features.h>
> >>>> +#include <sys/param.h>
> >>>> +
> >>>> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> >>>> +                                      const uint8_t *src, size_t nblks);
> >>>> +
> >>>> +static inline void
> >>>> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> >>>> +               size_t bytes)
> >>>> +{
> >>>> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> >>>
> >>> Can we make this an ifunc?
> >>
> >> I thought about it, but if you check the arc4random implementation, the
> >> chacha20_crypt is called for the whole internal buf once it is exhausted.
> >> Assuming 1 cycle per byte (as indicated by libgcrypt's bench-slope on
> >> my machine), it will be at least 1k cycles to encrypt each block.  I
> >> am not sure if setting up an internal PLT call to save a couple of cycles
> >> on an internal function will really show anything significant here (assuming
> >> that the PLT call won't add more overhead in fact).
> >>
> >> Besides that the code boilerplate to setup the internal ifunc is also
> >> way more complex.
> >
> > Okay for now as long as open to changing later (not that we will but that
> > this isn't locking us into the decision).
>
> For sure, if an ifunc does help in this case the change should be simple for
> the generic code.  The boilerplate is in the x86_64 bits, in fact (to
> set up the ifunc resolver, Makefile, etc.).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] stdlib: Add arc4random tests
  2022-04-13 20:23 ` [PATCH 2/7] stdlib: Add arc4random tests Adhemerval Zanella
@ 2022-04-14 18:01   ` Noah Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 18:01 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library, Florian Weimer

On Wed, Apr 13, 2022 at 3:25 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The basic tst-arc4random-chacha20.c checks if the output of the ChaCha20
> implementation matches the reference test vectors from RFC8439.
>
> The tst-arc4random-fork.c checks if subprocesses generate distinct
> streams of randomness (if fork handling is done correctly).
>
> The tst-arc4random-stats.c is a statistical test of the randomness of
> arc4random, arc4random_buf, and arc4random_uniform.
>
> The tst-arc4random-thread.c checks if threads generate distinct streams
> of randomness (if functions are thread-safe).
>
> Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.
>
> Co-authored-by: Florian Weimer <fweimer@redhat.com>
> ---
>  stdlib/Makefile                  |   4 +
>  stdlib/tst-arc4random-chacha20.c | 225 +++++++++++++++++++++++++
>  stdlib/tst-arc4random-fork.c     | 174 +++++++++++++++++++
>  stdlib/tst-arc4random-stats.c    | 146 ++++++++++++++++
>  stdlib/tst-arc4random-thread.c   | 278 +++++++++++++++++++++++++++++++
>  5 files changed, 827 insertions(+)
>  create mode 100644 stdlib/tst-arc4random-chacha20.c
>  create mode 100644 stdlib/tst-arc4random-fork.c
>  create mode 100644 stdlib/tst-arc4random-stats.c
>  create mode 100644 stdlib/tst-arc4random-thread.c
>
> diff --git a/stdlib/Makefile b/stdlib/Makefile
> index 9f9cc1bd7f..4862d008ab 100644
> --- a/stdlib/Makefile
> +++ b/stdlib/Makefile
> @@ -183,6 +183,9 @@ tests := \
>    testmb2 \
>    testrand \
>    testsort \
> +  tst-arc4random-fork \
> +  tst-arc4random-stats \
> +  tst-arc4random-thread \
>    tst-at_quick_exit \
>    tst-atexit \
>    tst-atof1 \
> @@ -252,6 +255,7 @@ tests-internal := \
>    # tests-internal
>
>  tests-static := \
> +  tst-arc4random-chacha20 \
>    tst-secure-getenv \
>    # tests-static
>
> diff --git a/stdlib/tst-arc4random-chacha20.c b/stdlib/tst-arc4random-chacha20.c
> new file mode 100644
> index 0000000000..c5876d3f3b
> --- /dev/null
> +++ b/stdlib/tst-arc4random-chacha20.c
> @@ -0,0 +1,225 @@
> +/* Basic tests for chacha20 cypher used in arc4random.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <chacha20.c>
> +#include <support/check.h>
> +
> +static int
> +do_test (void)
> +{
> +  /* Reference ChaCha20 encryption test vectors from RFC8439.  */
> +
> +  /* Test vector #1.  */
> +  {
> +    struct chacha20_state state;
> +
> +    uint8_t key[CHACHA20_KEY_SIZE] =
> +      {
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +      };
> +    uint8_t iv[CHACHA20_IV_SIZE] =
> +      {
> +       0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +      };
> +    const uint8_t plaintext[CHACHA20_BLOCK_SIZE] = { 0 };
> +    uint8_t ciphertext[CHACHA20_BLOCK_SIZE];
> +
Can you remove this whitespace?
> +    chacha20_init (&state, key, iv);
> +    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
> +
> +    const uint8_t expected[] =
> +      {
> +       0x76, 0xb8, 0xe0, 0xad, 0xa0, 0xf1, 0x3d, 0x90,
> +       0x40, 0x5d, 0x6a, 0xe5, 0x53, 0x86, 0xbd, 0x28,
> +       0xbd, 0xd2, 0x19, 0xb8, 0xa0, 0x8d, 0xed, 0x1a,
> +       0xa8, 0x36, 0xef, 0xcc, 0x8b, 0x77, 0x0d, 0xc7,
> +       0xda, 0x41, 0x59, 0x7c, 0x51, 0x57, 0x48, 0x8d,
> +       0x77, 0x24, 0xe0, 0x3f, 0xb8, 0xd8, 0x4a, 0x37,
> +       0x6a, 0x43, 0xb8, 0xf4, 0x15, 0x18, 0xa1, 0x1c,
> +       0xc3, 0x87, 0xb6, 0x69, 0xb2, 0xee, 0x65, 0x86
> +      };
> +    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
> +                       expected, sizeof expected);
> +  }
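Vector #1 is the all-zero key/nonce case with counter 0, so the expected bytes above are just the first ChaCha20 keystream block.  For reference, a self-contained block function written straight from the RFC 8439 pseudocode (independent of the glibc sources) reproduces them:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rotl (uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* RFC 8439 quarter round on four state words.  */
#define QR(a, b, c, d)                          \
  do {                                          \
    a += b; d ^= a; d = rotl (d, 16);           \
    c += d; b ^= c; b = rotl (b, 12);           \
    a += b; d ^= a; d = rotl (d, 8);            \
    c += d; b ^= c; b = rotl (b, 7);            \
  } while (0)

/* One 64-byte ChaCha20 block per RFC 8439 section 2.3.  in[16] is the
   initial state: constants, 8 key words, counter, 3 nonce words.  */
static void
chacha20_block (const uint32_t in[16], uint8_t out[64])
{
  uint32_t x[16];
  memcpy (x, in, sizeof x);
  for (int i = 0; i < 10; i++)        /* 10 double rounds = 20 rounds */
    {
      QR (x[0], x[4], x[8],  x[12]);  /* column rounds */
      QR (x[1], x[5], x[9],  x[13]);
      QR (x[2], x[6], x[10], x[14]);
      QR (x[3], x[7], x[11], x[15]);
      QR (x[0], x[5], x[10], x[15]);  /* diagonal rounds */
      QR (x[1], x[6], x[11], x[12]);
      QR (x[2], x[7], x[8],  x[13]);
      QR (x[3], x[4], x[9],  x[14]);
    }
  /* Add the input state and serialize little-endian.  */
  for (int i = 0; i < 16; i++)
    {
      uint32_t v = x[i] + in[i];
      out[4 * i + 0] = v & 0xff;
      out[4 * i + 1] = (v >> 8) & 0xff;
      out[4 * i + 2] = (v >> 16) & 0xff;
      out[4 * i + 3] = (v >> 24) & 0xff;
    }
}
```

Encrypting an all-zero plaintext, as the test does, XORs with zero and yields the keystream unchanged, which is why the expected array doubles as a keystream check.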
> +
> +  /* Test vector #2.  */
> +  {
> +    struct chacha20_state state;
> +
> +    uint8_t key[CHACHA20_KEY_SIZE] =
> +      {
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1,
> +      };
> +    uint8_t iv[CHACHA20_IV_SIZE] =
> +      {
> +       0x1, 0x0, 0x0, 0x0,  /* Block counter is a LE uint32_t  */
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
> +      };
> +    const uint8_t plaintext[] =
> +      {
> +       0x41, 0x6e, 0x79, 0x20, 0x73, 0x75, 0x62, 0x6d, 0x69, 0x73, 0x73,
> +       0x69, 0x6f, 0x6e, 0x20, 0x74, 0x6f, 0x20, 0x74, 0x68, 0x65, 0x20,
> +       0x49, 0x45, 0x54, 0x46, 0x20, 0x69, 0x6e, 0x74, 0x65, 0x6e, 0x64,
> +       0x65, 0x64, 0x20, 0x62, 0x79, 0x20, 0x74, 0x68, 0x65, 0x20, 0x43,
> +       0x6f, 0x6e, 0x74, 0x72, 0x69, 0x62, 0x75, 0x74, 0x6f, 0x72, 0x20,
> +       0x66, 0x6f, 0x72, 0x20, 0x70, 0x75, 0x62, 0x6c, 0x69, 0x63, 0x61,
> +       0x74, 0x69, 0x6f, 0x6e, 0x20, 0x61, 0x73, 0x20, 0x61, 0x6c, 0x6c,
> +       0x20, 0x6f, 0x72, 0x20, 0x70, 0x61, 0x72, 0x74, 0x20, 0x6f, 0x66,
> +       0x20, 0x61, 0x6e, 0x20, 0x49, 0x45, 0x54, 0x46, 0x20, 0x49, 0x6e,
> +       0x74, 0x65, 0x72, 0x6e, 0x65, 0x74, 0x2d, 0x44, 0x72, 0x61, 0x66,
> +       0x74, 0x20, 0x6f, 0x72, 0x20, 0x52, 0x46, 0x43, 0x20, 0x61, 0x6e,
> +       0x64, 0x20, 0x61, 0x6e, 0x79, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65,
> +       0x6d, 0x65, 0x6e, 0x74, 0x20, 0x6d, 0x61, 0x64, 0x65, 0x20, 0x77,
> +       0x69, 0x74, 0x68, 0x69, 0x6e, 0x20, 0x74, 0x68, 0x65, 0x20, 0x63,
> +       0x6f, 0x6e, 0x74, 0x65, 0x78, 0x74, 0x20, 0x6f, 0x66, 0x20, 0x61,
> +       0x6e, 0x20, 0x49, 0x45, 0x54, 0x46, 0x20, 0x61, 0x63, 0x74, 0x69,
> +       0x76, 0x69, 0x74, 0x79, 0x20, 0x69, 0x73, 0x20, 0x63, 0x6f, 0x6e,
> +       0x73, 0x69, 0x64, 0x65, 0x72, 0x65, 0x64, 0x20, 0x61, 0x6e, 0x20,
> +       0x22, 0x49, 0x45, 0x54, 0x46, 0x20, 0x43, 0x6f, 0x6e, 0x74, 0x72,
> +       0x69, 0x62, 0x75, 0x74, 0x69, 0x6f, 0x6e, 0x22, 0x2e, 0x20, 0x53,
> +       0x75, 0x63, 0x68, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65, 0x6d, 0x65,
> +       0x6e, 0x74, 0x73, 0x20, 0x69, 0x6e, 0x63, 0x6c, 0x75, 0x64, 0x65,
> +       0x20, 0x6f, 0x72, 0x61, 0x6c, 0x20, 0x73, 0x74, 0x61, 0x74, 0x65,
> +       0x6d, 0x65, 0x6e, 0x74, 0x73, 0x20, 0x69, 0x6e, 0x20, 0x49, 0x45,
> +       0x54, 0x46, 0x20, 0x73, 0x65, 0x73, 0x73, 0x69, 0x6f, 0x6e, 0x73,
> +       0x2c, 0x20, 0x61, 0x73, 0x20, 0x77, 0x65, 0x6c, 0x6c, 0x20, 0x61,
> +       0x73, 0x20, 0x77, 0x72, 0x69, 0x74, 0x74, 0x65, 0x6e, 0x20, 0x61,
> +       0x6e, 0x64, 0x20, 0x65, 0x6c, 0x65, 0x63, 0x74, 0x72, 0x6f, 0x6e,
> +       0x69, 0x63, 0x20, 0x63, 0x6f, 0x6d, 0x6d, 0x75, 0x6e, 0x69, 0x63,
> +       0x61, 0x74, 0x69, 0x6f, 0x6e, 0x73, 0x20, 0x6d, 0x61, 0x64, 0x65,
> +       0x20, 0x61, 0x74, 0x20, 0x61, 0x6e, 0x79, 0x20, 0x74, 0x69, 0x6d,
> +       0x65, 0x20, 0x6f, 0x72, 0x20, 0x70, 0x6c, 0x61, 0x63, 0x65, 0x2c,
> +       0x20, 0x77, 0x68, 0x69, 0x63, 0x68, 0x20, 0x61, 0x72, 0x65, 0x20,
> +       0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x65, 0x64, 0x20, 0x74,
> +       0x6f,
> +      };
> +    uint8_t ciphertext[sizeof plaintext];
> +
> +    chacha20_init (&state, key, iv);
> +    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
> +
> +    const uint8_t expected[] =
> +      {
> +       0xa3, 0xfb, 0xf0, 0x7d, 0xf3, 0xfa, 0x2f, 0xde, 0x4f, 0x37, 0x6c,
> +       0xa2, 0x3e, 0x82, 0x73, 0x70, 0x41, 0x60, 0x5d, 0x9f, 0x4f, 0x4f,
> +       0x57, 0xbd, 0x8c, 0xff, 0x2c, 0x1d, 0x4b, 0x79, 0x55, 0xec, 0x2a,
> +       0x97, 0x94, 0x8b, 0xd3, 0x72, 0x29, 0x15, 0xc8, 0xf3, 0xd3, 0x37,
> +       0xf7, 0xd3, 0x70, 0x05, 0x0e, 0x9e, 0x96, 0xd6, 0x47, 0xb7, 0xc3,
> +       0x9f, 0x56, 0xe0, 0x31, 0xca, 0x5e, 0xb6, 0x25, 0x0d, 0x40, 0x42,
> +       0xe0, 0x27, 0x85, 0xec, 0xec, 0xfa, 0x4b, 0x4b, 0xb5, 0xe8, 0xea,
> +       0xd0, 0x44, 0x0e, 0x20, 0xb6, 0xe8, 0xdb, 0x09, 0xd8, 0x81, 0xa7,
> +       0xc6, 0x13, 0x2f, 0x42, 0x0e, 0x52, 0x79, 0x50, 0x42, 0xbd, 0xfa,
> +       0x77, 0x73, 0xd8, 0xa9, 0x05, 0x14, 0x47, 0xb3, 0x29, 0x1c, 0xe1,
> +       0x41, 0x1c, 0x68, 0x04, 0x65, 0x55, 0x2a, 0xa6, 0xc4, 0x05, 0xb7,
> +       0x76, 0x4d, 0x5e, 0x87, 0xbe, 0xa8, 0x5a, 0xd0, 0x0f, 0x84, 0x49,
> +       0xed, 0x8f, 0x72, 0xd0, 0xd6, 0x62, 0xab, 0x05, 0x26, 0x91, 0xca,
> +       0x66, 0x42, 0x4b, 0xc8, 0x6d, 0x2d, 0xf8, 0x0e, 0xa4, 0x1f, 0x43,
> +       0xab, 0xf9, 0x37, 0xd3, 0x25, 0x9d, 0xc4, 0xb2, 0xd0, 0xdf, 0xb4,
> +       0x8a, 0x6c, 0x91, 0x39, 0xdd, 0xd7, 0xf7, 0x69, 0x66, 0xe9, 0x28,
> +       0xe6, 0x35, 0x55, 0x3b, 0xa7, 0x6c, 0x5c, 0x87, 0x9d, 0x7b, 0x35,
> +       0xd4, 0x9e, 0xb2, 0xe6, 0x2b, 0x08, 0x71, 0xcd, 0xac, 0x63, 0x89,
> +       0x39, 0xe2, 0x5e, 0x8a, 0x1e, 0x0e, 0xf9, 0xd5, 0x28, 0x0f, 0xa8,
> +       0xca, 0x32, 0x8b, 0x35, 0x1c, 0x3c, 0x76, 0x59, 0x89, 0xcb, 0xcf,
> +       0x3d, 0xaa, 0x8b, 0x6c, 0xcc, 0x3a, 0xaf, 0x9f, 0x39, 0x79, 0xc9,
> +       0x2b, 0x37, 0x20, 0xfc, 0x88, 0xdc, 0x95, 0xed, 0x84, 0xa1, 0xbe,
> +       0x05, 0x9c, 0x64, 0x99, 0xb9, 0xfd, 0xa2, 0x36, 0xe7, 0xe8, 0x18,
> +       0xb0, 0x4b, 0x0b, 0xc3, 0x9c, 0x1e, 0x87, 0x6b, 0x19, 0x3b, 0xfe,
> +       0x55, 0x69, 0x75, 0x3f, 0x88, 0x12, 0x8c, 0xc0, 0x8a, 0xaa, 0x9b,
> +       0x63, 0xd1, 0xa1, 0x6f, 0x80, 0xef, 0x25, 0x54, 0xd7, 0x18, 0x9c,
> +       0x41, 0x1f, 0x58, 0x69, 0xca, 0x52, 0xc5, 0xb8, 0x3f, 0xa3, 0x6f,
> +       0xf2, 0x16, 0xb9, 0xc1, 0xd3, 0x00, 0x62, 0xbe, 0xbc, 0xfd, 0x2d,
> +       0xc5, 0xbc, 0xe0, 0x91, 0x19, 0x34, 0xfd, 0xa7, 0x9a, 0x86, 0xf6,
> +       0xe6, 0x98, 0xce, 0xd7, 0x59, 0xc3, 0xff, 0x9b, 0x64, 0x77, 0x33,
> +       0x8f, 0x3d, 0xa4, 0xf9, 0xcd, 0x85, 0x14, 0xea, 0x99, 0x82, 0xcc,
> +       0xaf, 0xb3, 0x41, 0xb2, 0x38, 0x4d, 0xd9, 0x02, 0xf3, 0xd1, 0xab,
> +       0x7a, 0xc6, 0x1d, 0xd2, 0x9c, 0x6f, 0x21, 0xba, 0x5b, 0x86, 0x2f,
> +       0x37, 0x30, 0xe3, 0x7c, 0xfd, 0xc4, 0xfd, 0x80, 0x6c, 0x22, 0xf2,
> +       0x21,
> +      };
> +    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
> +                       expected, sizeof expected);
> +    }
> +
> +  /* Test vector #3.  */
> +  {
> +    struct chacha20_state state;
> +
> +    uint8_t key[CHACHA20_KEY_SIZE] =
> +      {
> +       0x1c, 0x92, 0x40, 0xa5, 0xeb, 0x55, 0xd3, 0x8a,
> +       0xf3, 0x33, 0x88, 0x86, 0x04, 0xf6, 0xb5, 0xf0,
> +       0x47, 0x39, 0x17, 0xc1, 0x40, 0x2b, 0x80, 0x09,
> +       0x9d, 0xca, 0x5c, 0xbc, 0x20, 0x70, 0x75, 0xc0
> +      };
> +    uint8_t iv[CHACHA20_IV_SIZE] =
> +      {
> +       0x2a, 0x0, 0x0, 0x0,  /* Block counter is a LE uint32_t  */
> +       0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
> +      };
> +
> +    uint8_t plaintext[] =
> +      {
> +       0x27, 0x54, 0x77, 0x61, 0x73, 0x20, 0x62, 0x72, 0x69, 0x6c, 0x6c,
> +       0x69, 0x67, 0x2c, 0x20, 0x61, 0x6e, 0x64, 0x20, 0x74, 0x68, 0x65,
> +       0x20, 0x73, 0x6c, 0x69, 0x74, 0x68, 0x79, 0x20, 0x74, 0x6f, 0x76,
> +       0x65, 0x73, 0x0a, 0x44, 0x69, 0x64, 0x20, 0x67, 0x79, 0x72, 0x65,
> +       0x20, 0x61, 0x6e, 0x64, 0x20, 0x67, 0x69, 0x6d, 0x62, 0x6c, 0x65,
> +       0x20, 0x69, 0x6e, 0x20, 0x74, 0x68, 0x65, 0x20, 0x77, 0x61, 0x62,
> +       0x65, 0x3a, 0x0a, 0x41, 0x6c, 0x6c, 0x20, 0x6d, 0x69, 0x6d, 0x73,
> +       0x79, 0x20, 0x77, 0x65, 0x72, 0x65, 0x20, 0x74, 0x68, 0x65, 0x20,
> +       0x62, 0x6f, 0x72, 0x6f, 0x67, 0x6f, 0x76, 0x65, 0x73, 0x2c, 0x0a,
> +       0x41, 0x6e, 0x64, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6d, 0x6f, 0x6d,
> +       0x65, 0x20, 0x72, 0x61, 0x74, 0x68, 0x73, 0x20, 0x6f, 0x75, 0x74,
> +       0x67, 0x72, 0x61, 0x62, 0x65, 0x2e,
> +      };
> +    uint8_t ciphertext[sizeof plaintext];
> +
> +    chacha20_init (&state, key, iv);
> +    chacha20_crypt (&state, ciphertext, plaintext, sizeof plaintext);
> +
> +    const uint8_t expected[] =
> +      {
> +       0x62, 0xe6, 0x34, 0x7f, 0x95, 0xed, 0x87, 0xa4, 0x5f, 0xfa, 0xe7,
> +       0x42, 0x6f, 0x27, 0xa1, 0xdf, 0x5f, 0xb6, 0x91, 0x10, 0x04, 0x4c,
> +       0x0d, 0x73, 0x11, 0x8e, 0xff, 0xa9, 0x5b, 0x01, 0xe5, 0xcf, 0x16,
> +       0x6d, 0x3d, 0xf2, 0xd7, 0x21, 0xca, 0xf9, 0xb2, 0x1e, 0x5f, 0xb1,
> +       0x4c, 0x61, 0x68, 0x71, 0xfd, 0x84, 0xc5, 0x4f, 0x9d, 0x65, 0xb2,
> +       0x83, 0x19, 0x6c, 0x7f, 0xe4, 0xf6, 0x05, 0x53, 0xeb, 0xf3, 0x9c,
> +       0x64, 0x02, 0xc4, 0x22, 0x34, 0xe3, 0x2a, 0x35, 0x6b, 0x3e, 0x76,
> +       0x43, 0x12, 0xa6, 0x1a, 0x55, 0x32, 0x05, 0x57, 0x16, 0xea, 0xd6,
> +       0x96, 0x25, 0x68, 0xf8, 0x7d, 0x3f, 0x3f, 0x77, 0x04, 0xc6, 0xa8,
> +       0xd1, 0xbc, 0xd1, 0xbf, 0x4d, 0x50, 0xd6, 0x15, 0x4b, 0x6d, 0xa7,
> +       0x31, 0xb1, 0x87, 0xb5, 0x8d, 0xfd, 0x72, 0x8a, 0xfa, 0x36, 0x75,
> +       0x7a, 0x79, 0x7a, 0xc1, 0x88, 0xd1,
> +      };
> +
> +    TEST_COMPARE_BLOB (ciphertext, sizeof ciphertext,
> +                       expected, sizeof expected);
> +  }
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
> diff --git a/stdlib/tst-arc4random-fork.c b/stdlib/tst-arc4random-fork.c
> new file mode 100644
> index 0000000000..cd8852c8d3
> --- /dev/null
> +++ b/stdlib/tst-arc4random-fork.c
> @@ -0,0 +1,174 @@
> +/* Test that subprocesses generate distinct streams of randomness.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Collect random data from subprocesses and check that all the
> +   results are unique.  */
> +
> +#include <array_length.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <support/check.h>
> +#include <support/support.h>
> +#include <support/xthread.h>
> +#include <support/xunistd.h>
> +#include <unistd.h>
> +
> +/* Perform multiple runs.  The subsequent runs start with an
> +   already-initialized random number generator.  (The number 1500 was
> +   seen to reproduce failures reliable in case of a race condition in
> +   the fork detection code.)  */
> +enum { runs = 1500 };
> +
> +/* One hundred processes in total.  This should be high enough to
> +   expose any issues, but low enough not to tax the overall system too
> +   much.  */
> +enum { subprocesses = 49 };
> +
> +/* The total number of processes.  */
> +enum { processes = subprocesses + 1 };
> +
> +/* Number of bytes of randomness to generate per process.  Large
> +   enough to make false positive duplicates extremely unlikely.  */
> +enum { random_size = 16 };
> +
> +/* Generated bytes of randomness.  */
> +struct result
> +{
> +  unsigned char bytes[random_size];
> +};
> +
> +/* Shared across all processes.  */
> +static struct shared_data
> +{
> +  pthread_barrier_t barrier;
> +  struct result results[runs][processes];
> +} *shared_data;
> +
> +/* Invoked to collect data from a subprocess.  */
> +static void
> +subprocess (int run, int process_index)
> +{
> +  xpthread_barrier_wait (&shared_data->barrier);
> +  arc4random_buf (shared_data->results[run][process_index].bytes, random_size);
> +}
> +
> +/* Used to sort the results.  */
> +struct index
> +{
> +  int run;
> +  int process_index;
> +};
> +
> +/* Used to sort an array of struct index values.  */
> +static int
> +index_compare (const void *left1, const void *right1)
> +{
> +  const struct index *left = left1;
> +  const struct index *right = right1;
> +
> +  return memcmp (shared_data->results[left->run][left->process_index].bytes,
> +                 shared_data->results[right->run][right->process_index].bytes,
> +                 random_size);
> +}
> +
> +static int
> +do_test (void)
> +{
> +  shared_data = support_shared_allocate (sizeof (*shared_data));
> +  {
> +    pthread_barrierattr_t attr;
> +    xpthread_barrierattr_init (&attr);
> +    xpthread_barrierattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
> +    xpthread_barrier_init (&shared_data->barrier, &attr, processes);
> +    xpthread_barrierattr_destroy (&attr);
> +  }
> +
> +  /* Collect random data.  */
> +  for (int run = 0; run < runs; ++run)
> +    {
> +#if 0
> +      if (run == runs / 2)
> +        {
> +          /* In the middle, desynchronize the block cache by consuming
> +             an odd number of bytes.  */
> +          char buf;
> +          arc4random_buf (&buf, 1);
> +        }
> +#endif
> +
> +      pid_t pids[subprocesses];
> +      for (int process_index = 0; process_index < subprocesses;
> +           ++process_index)
> +        {
> +          pids[process_index] = xfork ();
> +          if (pids[process_index] == 0)
> +            {
> +              subprocess (run, process_index);
> +              _exit (0);
> +            }
> +        }
> +
> +      /* Trigger all subprocesses.  Also add data from the parent
> +         process.  */
> +      subprocess (run, subprocesses);
> +
> +      for (int process_index = 0; process_index < subprocesses;
> +           ++process_index)
> +        {
> +          int status;
> +          xwaitpid (pids[process_index], &status, 0);
> +          if (status != 0)
> +            FAIL_EXIT1 ("subprocess index %d (PID %d) exit status %d\n",
> +                        process_index, (int) pids[process_index], status);
> +        }
> +    }
> +
> +  /* Check for duplicates.  */
> +  struct index indexes[runs * processes];
> +  for (int run = 0; run < runs; ++run)
> +    for (int process_index = 0; process_index < processes; ++process_index)
> +      indexes[run * processes + process_index]
> +        = (struct index) { .run = run, .process_index = process_index };
> +  qsort (indexes, array_length (indexes), sizeof (indexes[0]), index_compare);
> +  for (size_t i = 1; i < array_length (indexes); ++i)
> +    {
> +      if (index_compare (indexes + i - 1, indexes + i) == 0)
> +        {
> +          support_record_failure ();
> +          unsigned char *bytes
> +            = shared_data->results[indexes[i].run]
> +                [indexes[i].process_index].bytes;
> +          char *quoted = support_quote_blob (bytes, random_size);
> +          printf ("error: duplicate randomness data: \"%s\"\n"
> +                  "  run %d, subprocess %d\n"
> +                  "  run %d, subprocess %d\n",
> +                  quoted, indexes[i - 1].run, indexes[i - 1].process_index,
> +                  indexes[i].run, indexes[i].process_index);
> +          free (quoted);
> +        }
> +    }
> +
> +  xpthread_barrier_destroy (&shared_data->barrier);
> +  support_shared_free (shared_data);
> +  shared_data = NULL;
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
> diff --git a/stdlib/tst-arc4random-stats.c b/stdlib/tst-arc4random-stats.c
> new file mode 100644
> index 0000000000..9747180c99
> --- /dev/null
> +++ b/stdlib/tst-arc4random-stats.c
> @@ -0,0 +1,146 @@
> +/* Statistical tests for arc4random-related functions.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <array_length.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <support/check.h>
> +
> +enum
> +{
> +  arc4random_key_size = 32
> +};
> +
> +struct key
> +{
> +  unsigned char data[arc4random_key_size];
> +};
> +
> +/* With 12,000 keys, the probability that a byte in a predetermined
> +   position does not have a predetermined value in all generated keys
> +   is about 4e-21.  The probability that this happens with any of the
> +   32 * 256 possible byte position/values is 3.2e-17.  This results in
> +   an acceptably low false-positive rate.  */
> +enum { key_count = 12000 };
> +
> +static struct key keys[key_count];
> +
> +/* Used to perform the distribution check.  */
> +static int byte_counts[arc4random_key_size][256];
> +
> +/* Bail out after this many failures.  */
> +enum { failure_limit = 100 };
> +
> +static void
> +find_stuck_bytes (bool (*func) (unsigned char *key))
> +{
> +  memset (&keys, 0xcc, sizeof (keys));
> +
> +  int failures = 0;
> +  for (int key = 0; key < key_count; ++key)
> +    {
> +      while (true)
> +        {
> +          if (func (keys[key].data))
> +            break;
> +          ++failures;
> +          if (failures >= failure_limit)
> +            {
> +              printf ("warning: bailing out after %d failures\n", failures);
> +              return;
> +            }
> +        }
> +    }
> +  printf ("info: key generation finished with %d failures\n", failures);
> +
> +  memset (&byte_counts, 0, sizeof (byte_counts));
> +  for (int key = 0; key < key_count; ++key)
> +    for (int pos = 0; pos < arc4random_key_size; ++pos)
> +      ++byte_counts[pos][keys[key].data[pos]];
> +
> +  for (int pos = 0; pos < arc4random_key_size; ++pos)
> +    for (int byte = 0; byte < 256; ++byte)
> +      if (byte_counts[pos][byte] == 0)
> +        {
> +          support_record_failure ();
> +          printf ("error: byte %d never appeared at position %d\n", byte, pos);
> +        }
> +}
> +
> +/* Test adapter for arc4random.  */
> +static bool
> +generate_arc4random (unsigned char *key)
> +{
> +  uint32_t words[arc4random_key_size / 4];
> +  _Static_assert (sizeof (words) == arc4random_key_size, "sizeof (words)");
> +
> +  for (int i = 0; i < array_length (words); ++i)
> +    words[i] = arc4random ();
> +  memcpy (key, &words, arc4random_key_size);
> +  return true;
> +}
> +
> +/* Test adapter for arc4random_buf.  */
> +static bool
> +generate_arc4random_buf (unsigned char *key)
> +{
> +  arc4random_buf (key, arc4random_key_size);
> +  return true;
> +}
> +
> +/* Test adapter for arc4random_uniform.  */
> +static bool
> +generate_arc4random_uniform (unsigned char *key)
> +{
> +  for (int i = 0; i < arc4random_key_size; ++i)
> +    key[i] = arc4random_uniform (256);
> +  return true;
> +}
> +
> +/* Test adapter for arc4random_uniform with argument 257.  This means
> +   that byte 0 happens more often, but we do not perform such a
> +   statistical check, so the test will still pass.  */
> +static bool
> +generate_arc4random_uniform_257 (unsigned char *key)
> +{
> +  for (int i = 0; i < arc4random_key_size; ++i)
> +    key[i] = arc4random_uniform (257);
> +  return true;
> +}
> +
> +static int
> +do_test (void)
> +{
> +  puts ("info: arc4random implementation test");
> +  find_stuck_bytes (generate_arc4random);
> +
> +  puts ("info: arc4random_buf implementation test");
> +  find_stuck_bytes (generate_arc4random_buf);
> +
> +  puts ("info: arc4random_uniform implementation test");
> +  find_stuck_bytes (generate_arc4random_uniform);
> +
> +  puts ("info: arc4random_uniform implementation test (257 variant)");
> +  find_stuck_bytes (generate_arc4random_uniform_257);
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
> diff --git a/stdlib/tst-arc4random-thread.c b/stdlib/tst-arc4random-thread.c
> new file mode 100644
> index 0000000000..b122eaa826
> --- /dev/null
> +++ b/stdlib/tst-arc4random-thread.c
> @@ -0,0 +1,278 @@
> +/* Test that threads generate distinct streams of randomness.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <array_length.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <support/check.h>
> +#include <support/namespace.h>
> +#include <support/support.h>
> +#include <support/xthread.h>
> +
> +/* Number of arc4random_buf calls per thread.  */
> +enum { count_per_thread = 5000 };
> +
> +/* Number of threads computing randomness.  */
> +enum { inner_threads = 5 };
> +
> +/* Number of threads launching other threads.  Chosen so as not to
> +   overload the system.  */
> +enum { outer_threads = 7 };
> +
> +/* Number of launching rounds performed by the outer threads.  */
> +enum { outer_rounds = 10 };
> +
> +/* Maximum number of bytes generated in an arc4random call.  */
> +enum { max_size = 32 };
> +
> +/* Sizes generated by threads.  Must be long enough to be unique with
> +   high probability.  */
> +static const int sizes[] = { 12, 15, 16, 17, 24, 31, max_size };
> +
> +/* Data structure to capture randomness results.  */
> +struct blob
> +{
> +  unsigned int size;
> +  int thread_id;
> +  unsigned int index;
> +  unsigned char bytes[max_size];
> +};
> +
> +#define DYNARRAY_STRUCT dynarray_blob
> +#define DYNARRAY_ELEMENT struct blob
> +#define DYNARRAY_PREFIX dynarray_blob_
> +#include <malloc/dynarray-skeleton.c>
> +
> +/* Sort blob elements by length first, then by comparing the data
> +   member.  */
> +static int
> +compare_blob (const void *left1, const void *right1)
> +{
> +  const struct blob *left = left1;
> +  const struct blob *right = right1;
> +
> +  if (left->size != right->size)
> +    /* No overflow due to limited range.  */
> +    return left->size - right->size;
> +  return memcmp (left->bytes, right->bytes, left->size);
> +}
> +
> +/* Used to store the global result.  */
> +static pthread_mutex_t global_result_lock = PTHREAD_MUTEX_INITIALIZER;
> +static struct dynarray_blob global_result;
> +
> +/* Copy data to the global result, with locking.  */
> +static void
> +copy_result_to_global (struct dynarray_blob *result)
> +{
> +  xpthread_mutex_lock (&global_result_lock);
> +  size_t old_size = dynarray_blob_size (&global_result);
> +  TEST_VERIFY_EXIT
> +    (dynarray_blob_resize (&global_result,
> +                           old_size + dynarray_blob_size (result)));
> +  memcpy (dynarray_blob_begin (&global_result) + old_size,
> +          dynarray_blob_begin (result),
> +          dynarray_blob_size (result) * sizeof (struct blob));
> +  xpthread_mutex_unlock (&global_result_lock);
> +}
> +
> +/* Used to assign unique thread IDs.  Accessed atomically.  */
> +static int next_thread_id;
> +
> +static void *
> +inner_thread (void *unused)
> +{
> +  /* Use local result to avoid global lock contention while generating
> +     randomness.  */
> +  struct dynarray_blob result;
> +  dynarray_blob_init (&result);
> +
> +  int thread_id = __atomic_fetch_add (&next_thread_id, 1, __ATOMIC_RELAXED);
> +
> +  /* Determine the sizes to be used by this thread.  */
> +  int size_slot = thread_id % (array_length (sizes) + 1);
> +  bool switch_sizes = size_slot == array_length (sizes);
> +  if (switch_sizes)
> +    size_slot = 0;
> +
> +  /* Compute the random blobs.  */
> +  for (int i = 0; i < count_per_thread; ++i)
> +    {
> +      struct blob *place = dynarray_blob_emplace (&result);
> +      TEST_VERIFY_EXIT (place != NULL);
> +      place->size = sizes[size_slot];
> +      place->thread_id = thread_id;
> +      place->index = i;
> +      arc4random_buf (place->bytes, place->size);
> +
> +      if (switch_sizes)
> +        size_slot = (size_slot + 1) % array_length (sizes);
> +    }
> +
> +  /* Store the blobs in the global result structure.  */
> +  copy_result_to_global (&result);
> +
> +  dynarray_blob_free (&result);
> +
> +  return NULL;
> +}
> +
> +/* Launch the inner threads and wait for their termination.  */
> +static void *
> +outer_thread (void *unused)
> +{
> +  for (int round = 0; round < outer_rounds; ++round)
> +    {
> +      pthread_t threads[inner_threads];
> +
> +      for (int i = 0; i < inner_threads; ++i)
> +        threads[i] = xpthread_create (NULL, inner_thread, NULL);
> +
> +      for (int i = 0; i < inner_threads; ++i)
> +        xpthread_join (threads[i]);
> +    }
> +
> +  return NULL;
> +}
> +
> +static bool termination_requested;
> +
> +/* Call arc4random_buf to fill one blob with 16 bytes.  */
> +static void *
> +get_one_blob_thread (void *closure)
> +{
> +  struct blob *result = closure;
> +  result->size = 16;
> +  arc4random_buf (result->bytes, result->size);
> +  return NULL;
> +}
> +
> +/* Invoked from fork_thread to actually obtain randomness data.  */
> +static void
> +fork_thread_subprocess (void *closure)
> +{
> +  struct blob *shared_result = closure;
> +
> +  pthread_t thr1 = xpthread_create
> +    (NULL, get_one_blob_thread, shared_result + 1);
> +  pthread_t thr2 = xpthread_create
> +    (NULL, get_one_blob_thread, shared_result + 2);
> +  get_one_blob_thread (shared_result);
> +  xpthread_join (thr1);
> +  xpthread_join (thr2);
> +}
> +
> +/* Continuously fork subprocesses to obtain a little bit of
> +   randomness.  */
> +static void *
> +fork_thread (void *unused)
> +{
> +  struct dynarray_blob result;
> +  dynarray_blob_init (&result);
> +
> +  /* Three blobs from each subprocess.  */
> +  struct blob *shared_result
> +    = support_shared_allocate (3 * sizeof (*shared_result));
> +
> +  while (!__atomic_load_n (&termination_requested, __ATOMIC_RELAXED))
> +    {
> +      /* Obtain the results from a subprocess.  */
> +      support_isolate_in_subprocess (fork_thread_subprocess, shared_result);
> +
> +      for (int i = 0; i < 3; ++i)
> +        {
> +          struct blob *place = dynarray_blob_emplace (&result);
> +          TEST_VERIFY_EXIT (place != NULL);
> +          place->size = shared_result[i].size;
> +          place->thread_id = -1;
> +          place->index = i;
> +          memcpy (place->bytes, shared_result[i].bytes, place->size);
> +        }
> +    }
> +
> +  support_shared_free (shared_result);
> +
> +  copy_result_to_global (&result);
> +  dynarray_blob_free (&result);
> +
> +  return NULL;
> +}
> +
> +/* Launch the outer threads and wait for their termination.  */
> +static void
> +run_outer_threads (void)
> +{
> +  /* Special thread that continuously calls fork.  */
> +  pthread_t fork_thread_id = xpthread_create (NULL, fork_thread, NULL);
> +
> +  pthread_t threads[outer_threads];
> +  for (int i = 0; i < outer_threads; ++i)
> +    threads[i] = xpthread_create (NULL, outer_thread, NULL);
> +
> +  for (int i = 0; i < outer_threads; ++i)
> +    xpthread_join (threads[i]);
> +
> +  __atomic_store_n (&termination_requested, true, __ATOMIC_RELAXED);
> +  xpthread_join (fork_thread_id);
> +}
> +
> +static int
> +do_test (void)
> +{
> +  dynarray_blob_init (&global_result);
> +  int expected_blobs
> +    = count_per_thread * inner_threads * outer_threads * outer_rounds;
> +  printf ("info: minimum of %d blob results expected\n", expected_blobs);
> +
> +  run_outer_threads ();
> +
> +  /* The forking thread delivers a non-deterministic number of
> +     results, which is why expected_blobs is only a minimum number of
> +     results.  */
> +  printf ("info: %zu blob results observed\n",
> +          dynarray_blob_size (&global_result));
> +  TEST_VERIFY (dynarray_blob_size (&global_result) >= expected_blobs);
> +
> +  /* Verify that there are no duplicates.  */
> +  qsort (dynarray_blob_begin (&global_result),
> +         dynarray_blob_size (&global_result),
> +         sizeof (struct blob), compare_blob);
> +  struct blob *end = dynarray_blob_end (&global_result);
> +  for (struct blob *p = dynarray_blob_begin (&global_result) + 1;
> +       p < end; ++p)
> +    {
> +      if (compare_blob (p - 1, p) == 0)
> +        {
> +          support_record_failure ();
> +          char *quoted = support_quote_blob (p->bytes, p->size);
> +          printf ("error: duplicate blob: \"%s\" (%d bytes)\n",
> +                  quoted, (int) p->size);
> +          printf ("  first source: thread %d, index %u\n",
> +                  p[-1].thread_id, p[-1].index);
> +          printf ("  second source: thread %d, index %u\n",
> +                  p[0].thread_id, p[0].index);
> +          free (quoted);
> +        }
> +    }
> +
> +  dynarray_blob_free (&global_result);
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 17:17   ` Noah Goldstein
@ 2022-04-14 18:11     ` Adhemerval Zanella
  0 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 18:11 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 14:17, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 3:27 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> +       clear(X1);
>> +       clear(X2);
>> +       clear(X3);
>> +       clear(X4);
>> +       clear(X5);
>> +       clear(X6);
>> +       clear(X7);
>> +       clear(X8);
>> +       clear(X9);
>> +       clear(X10);
>> +       clear(X11);
>> +       clear(X12);
>> +       clear(X13);
>> +       clear(X14);
>> +       clear(X15);
> 
> No need to change now, but out of curiosity (and possible future optimization),
> do we need the clears for our purposes?

That's a good question, and I am not sure.  Distros usually build glibc
with security options (such as stack protector and stack check) and keep
adding support for newer CPU security hardening (such as ARM PAC/BTI or
Intel CET).

We also use some software-oriented hardening, such as explicit_memset
in some places.

I would expect that distros might use -mharden-sls, but I am not sure
whether we should enforce it on all assembly implementations.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
  2022-04-14 17:20       ` Noah Goldstein
@ 2022-04-14 18:12         ` Adhemerval Zanella
  0 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 18:12 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 14:20, Noah Goldstein wrote:
> On Thu, Apr 14, 2022 at 12:17 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/04/2022 20:04, Noah Goldstein wrote:
>>> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> +       .text
>>>
>>> section avx2
>>>
>>
>> Ack, I changed to '.section .text.avx2, "ax", @progbits'.
>>
>>>> +       .align 32
>>>> +chacha20_data:
>>>> +L(shuf_rol16):
>>>> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
>>>> +L(shuf_rol8):
>>>> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
>>>> +L(inc_counter):
>>>> +       .byte 0,1,2,3,4,5,6,7
>>>> +L(unsigned_cmp):
>>>> +       .long 0x80000000
>>>> +
>>>> +ENTRY (__chacha20_avx2_blocks8)
>>>> +       /* input:
>>>> +        *      %rdi: input
>>>> +        *      %rsi: dst
>>>> +        *      %rdx: src
>>>> +        *      %rcx: nblks (multiple of 8)
>>>> +        */
>>>> +       vzeroupper;
>>>
>>> vzeroupper needs to be replaced with VZEROUPPER_RETURN
>>> and we need a transaction safe version unless this can never
>>> be called during a transaction.
>>
>> I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
>> test case failures). What do you mean by a 'transaction safe version'?
An extra __chacha20_avx2_blocks8 implementation to handle it? Or disable
>> it if RTM is enabled?
> 
> For now you can just update the cpufeature check to do ssse3 if RTM is enabled.

Right, I will do it.
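
For reference, the selection rule being agreed on here is small enough to
sketch as plain logic.  This is only a hedged sketch: the booleans stand in
for glibc's CPU_FEATURE_USABLE_P checks, and select_chacha20_impl is a
made-up name, not the actual chacha20_crypt dispatch code:

```c
#include <stdbool.h>

/* Sketch of the dispatch rule discussed above: prefer the AVX2 kernel,
   but fall back to SSSE3 when RTM is enabled, because the AVX2 code is
   not transaction-safe.  The booleans stand in for the cpu_features
   bits glibc would actually consult.  */
static const char *
select_chacha20_impl (bool has_avx2, bool has_rtm, bool has_ssse3)
{
  if (has_avx2 && !has_rtm)
    return "avx2";
  if (has_ssse3)
    return "ssse3";
  return "generic";
}
```

With RTM present, the AVX2 path is skipped even when AVX2 is usable, which
matches the suggestion quoted above.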

> 
>>
>>>> +
>>>> +       /* clear the used vector registers and stack */
>>>> +       vpxor X0, X0, X0;
>>>> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
>>>> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
>>>> +       vmovdqa X0, (STACK_TMP)(%rsp);
>>>> +       vmovdqa X0, (STACK_TMP1)(%rsp);
>>>> +       vzeroall;
>>>
>>> Do you need vzeroall?
>>> Why not vzeroupper? Is it a security concern to leave info in the xmm pieces?
>>
>> I would assume, since it is in the original libgcrypt optimization.  As
>> for the ssse3 version, I am not sure if we really need that level of
>> hardening, but it would be good to have the initial revision as close
>> as possible to libgcrypt.
> 
> Got it.
> 
>>
>>>
>>>
>>>> +
>>>> +       /* eax zeroed by round loop. */
>>>> +       leave;
>>>> +       cfi_adjust_cfa_offset(-8)
>>>> +       cfi_def_cfa_register(%rsp);
>>>> +       ret;
>>>> +       int3;
>>>
>>> Why do we need int3 here?
>>
>> I think the ssse3 applies here as well.
>>
>>>> +END(__chacha20_avx2_blocks8)
>>>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>>>> index 37a4fdfb1f..7e9e7755f3 100644
>>>> --- a/sysdeps/x86_64/chacha20_arch.h
>>>> +++ b/sysdeps/x86_64/chacha20_arch.h
>>>> @@ -22,11 +22,25 @@
>>>>
>>>>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>>>>                                        const uint8_t *src, size_t nblks);
>>>> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
>>>> +                                     const uint8_t *src, size_t nblks);
>>>>
>>>>  static inline void
>>>>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>>>>                 size_t bytes)
>>>>  {
>>>> +  const struct cpu_features* cpu_features = __get_cpu_features ();
>>>
>>> Can we do this with an ifunc and take the cpufeature check off the critical
>>> path?
>>
>> Ditto.
>>
>>>> +
>>>> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
>>>> +    {
>>>> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>>>> +      nblocks -= nblocks % 8;
>>>> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
>>>> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
>>>> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
>>>> +      src += nblocks * CHACHA20_BLOCK_SIZE;
>>>> +    }
>>>> +
>>>>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>>>>      {
>>>>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>>>> --
>>>> 2.32.0
>>>>
>>>
>>> Do you want optimization comments or do that later?
>>
>> Ideally I would like to check if the proposed arc4random implementation
>> is what we want (with the current approach of using atfork handlers and
>> the key reschedule).  The cipher itself is not of utmost importance in
>> the sense that it is transparent to the user, and we can eventually
>> replace it if there is any issue with or attack on ChaCha20.  Initially
>> I won't add any arch-specific optimizations, but since libgcrypt
>> provides some that fit the current approach, I thought it would be a
>> nice thing to have.
>>
>> For optimization comments it would be good to sync with libgcrypt as well,
>> I think the project will be interested in any performance improvement
>> you might have for the chacha implementations.
> Okay, I'll probably take a stab at this in the not too distant future.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 17:22           ` Noah Goldstein
@ 2022-04-14 18:25             ` Adhemerval Zanella
  0 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 18:25 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 14:22, Noah Goldstein wrote:
> On Thu, Apr 14, 2022 at 12:19 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 14/04/2022 14:10, Noah Goldstein wrote:
>>> On Thu, Apr 14, 2022 at 12:03 PM Adhemerval Zanella
>>> <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>
>>>> On 13/04/2022 20:12, Noah Goldstein wrote:
>>>>> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
>>>>> <libc-alpha@sourceware.org> wrote:
>>>>>>
>>>>>> +
>>>>>> +       /* eax zeroed by round loop. */
>>>>>> +       leave;
>>>>>> +       cfi_adjust_cfa_offset(-8)
>>>>>> +       cfi_def_cfa_register(%rsp);
>>>>>> +       ret;
>>>>>> +       int3;
>>>>> why int3?
>>>>
>>>> It was originally added on libgcrypt by 11ade08efbfbc36dbf3571f1026946269950bc40,
>>>> as straight-line speculation hardening.  It is what is emitted by clang 14 and
>>>> gcc 12 with -mharden-sls=return.
>>>>
>>>> I am not sure if we need that kind of hardening, but I would prefer the first
>>>> version to be in sync with libgcrypt as much as possible so future optimizations
>>>> would be simpler to keep localized to glibc (if libgcrypt does not want to
>>>> backport it).
>>>
>>> Okay, can keep for now. Any thoughts on changing it to sse2?
>>>
>>
>> No strong feeling, I used the ssse3 one because it is readily available from
>> libgcrypt.
> 
> I think the only ssse3 instruction is `pshufb`, so you can just replace the optimized
> rotates with the shift rotates and that will make it sse2 (unless I'm missing
> an instruction).

Right, do you have a patch for it? I can add it to the v2 I will send.

> 
> Also can you add the proper .text section here as well (or .sse2 or .ssse3)
Ack.
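
For context on the pshufb point: in the SSSE3 kernel, pshufb only implements
rotate-left by 16 and by 8 on each 32-bit lane, and both are byte
permutations.  An SSE2 variant can compute the same values with two shifts
and an OR per lane (vector form: pslld/psrld/por).  A scalar sketch of the
equivalence, for illustration only, not the vector code:

```c
#include <stdint.h>

/* Rotate-left the way the SSE2 path would: two shifts and an OR.  */
static uint32_t
rotl32_shifts (uint32_t x, unsigned int n)
{
  return (x << n) | (x >> (32 - n));
}

/* Rotate-left by 16 expressed as the byte permutation pshufb performs:
   swap the two 16-bit halves of the lane.  */
static uint32_t
rotl16_bytes (uint32_t x)
{
  return (x >> 16) | (x << 16);
}

/* Rotate-left by 8 expressed as a byte permutation: bytes b3 b2 b1 b0
   become b2 b1 b0 b3.  */
static uint32_t
rotl8_bytes (uint32_t x)
{
  uint8_t b0 = x, b1 = x >> 8, b2 = x >> 16, b3 = x >> 24;
  return ((uint32_t) b2 << 24) | ((uint32_t) b1 << 16)
         | ((uint32_t) b0 << 8) | b3;
}
```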

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-14  7:36 ` [PATCH 0/7] Add arc4random support Yann Droneaud
@ 2022-04-14 18:39   ` Adhemerval Zanella
  2022-04-14 18:43     ` Noah Goldstein
  2022-04-15 10:22     ` Yann Droneaud
  0 siblings, 2 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 18:39 UTC (permalink / raw)
  To: Yann Droneaud, GNU C Library



On 14/04/2022 04:36, Yann Droneaud wrote:
> Hi,
> 
> On 13/04/2022 at 22:23, Adhemerval Zanella via Libc-alpha wrote:
> 
>> This patch adds the arc4random, arc4random_buf, and arc4random_uniform
>> along with optimized versions for x86_64, aarch64, and powerpc64.
>>
>> The generic implementation is based on scalar Chacha20, with a global
>> cache and locking.  It uses getrandom or /dev/urandom as fallback to
>> get the initial entropy, and reseeds the internal state on every 16MB
>> of consumed entropy.
>>
>> It maintains an internal buffer which consumes at most one page on
>> most systems (assuming 4k pages).  The internal buffer optimizes the
>> cipher encrypt calls by amortizing arc4random calls (where both the
>> function call and lock costs are the dominating factors).
>>
>> Fork detection is done by checking whether MADV_WIPEONFORK is
>> supported.  If it is not, the fork callback will reset the state on
>> the fork call.  It does not handle direct clone calls, nor vfork or
>> _Fork (arc4random is not async-signal-safe due to the internal lock
>> usage, although the implementation does try to handle fork cases).
>>
>> The generic ChaCha20 implementation is based on RFC 8439 [1], and is
>> a simple memcpy-with-xor implementation.
> 
> The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.

I don't have a strong feeling about it, although every other ChaCha20
implementation I have checked does it (libgcrypt, Linux, the BSDs).
The BSDs also do it for arc4random, although most if not all of those
come from OpenBSD, and they are usually paranoid about security
hardening.

I am no security expert, so I will keep it as is for the generic
interface (the arch optimizations also do it, so I think it is a good
idea to keep the implementations with similar semantics).
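
As a reference point for the RFC 8439 discussion in the quoted cover
letter, the cipher's core quarter round is compact enough to sketch and
check against the test vector published in RFC 8439 section 2.1.1.  This
is illustrative code only, not the glibc implementation:

```c
#include <stdint.h>

static uint32_t
rotl32 (uint32_t x, unsigned int n)
{
  return (x << n) | (x >> (32 - n));
}

/* ChaCha20 quarter round as specified in RFC 8439, section 2.1: four
   add/xor/rotate groups over the state words a, b, c, d.  */
static void
chacha20_quarter_round (uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32 (*d, 16);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 12);
  *a += *b; *d ^= *a; *d = rotl32 (*d, 8);
  *c += *d; *b ^= *c; *b = rotl32 (*b, 7);
}
```

Feeding it the section 2.1.1 inputs (a = 0x11111111, b = 0x01020304,
c = 0x9b8d6f43, d = 0x01234567) yields a = 0xea2a92f4, b = 0xcb1cf8ce,
c = 0x4581472e, d = 0x5881c4bb, matching the RFC.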

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-14 18:39   ` Adhemerval Zanella
@ 2022-04-14 18:43     ` Noah Goldstein
  2022-04-15 10:22     ` Yann Droneaud
  1 sibling, 0 replies; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 18:43 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Yann Droneaud, GNU C Library

On Thu, Apr 14, 2022 at 1:39 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
>
>
> On 14/04/2022 04:36, Yann Droneaud wrote:
> > Hi,
> >
> > On 13/04/2022 at 22:23, Adhemerval Zanella via Libc-alpha wrote:
> >
> >> This patch adds the arc4random, arc4random_buf, and arc4random_uniform
> >> along with optimized versions for x86_64, aarch64, and powerpc64.
> >>
> >> The generic implementation is based on scalar Chacha20, with a global
> >> cache and locking.  It uses getrandom or /dev/urandom as fallback to
> >> get the initial entropy, and reseeds the internal state on every 16MB
> >> of consumed entropy.
> >>
> >> It maintains an internal buffer which consumes at most one page on
> >> most systems (assuming 4k pages).  The internal buffer optimizes the
> >> cipher encrypt calls by amortizing arc4random calls (where both the
> >> function call and lock costs are the dominating factors).
> >>
> >> Fork detection is done by checking whether MADV_WIPEONFORK is
> >> supported.  If it is not, the fork callback will reset the state on
> >> the fork call.  It does not handle direct clone calls, nor vfork or
> >> _Fork (arc4random is not async-signal-safe due to the internal lock
> >> usage, although the implementation does try to handle fork cases).
> >>
> >> The generic ChaCha20 implementation is based on RFC 8439 [1], and is
> >> a simple memcpy-with-xor implementation.
> >
> > The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.
>
> I don't have a strong feeling about it, although every other ChaCha20
> implementation I have checked does it (libgcrypt, Linux, the BSDs).
> The BSDs also do it for arc4random, although most if not all of those
> come from OpenBSD, and they are usually paranoid about security
> hardening.
>
> I am no security expert, so I will keep it as is for the generic
> interface (the arch optimizations also do it, so I think it is a good
> idea to keep the implementations with similar semantics).

Does the arc4random use case require the xor zeroing though? I think
it would be a mistake to guarantee it, as it seems like a pretty
reasonable thing to optimize out if we need better performance.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] benchtests: Add arc4random benchtest
  2022-04-13 20:23 ` [PATCH 3/7] benchtests: Add arc4random benchtest Adhemerval Zanella
@ 2022-04-14 19:17   ` Noah Goldstein
  2022-04-14 19:48     ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 19:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Wed, Apr 13, 2022 at 3:26 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> It shows both throughput (total bytes obtained in the test duration)
> and latency for both arc4random and arc4random_buf with different
> sizes.
>
> Checked on x86_64-linux-gnu, aarch64-linux, and powerpc64le-linux-gnu.
> ---
>  benchtests/Makefile           |   6 +-
>  benchtests/bench-arc4random.c | 243 ++++++++++++++++++++++++++++++++++
>  2 files changed, 248 insertions(+), 1 deletion(-)
>  create mode 100644 benchtests/bench-arc4random.c
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 8dfca592fd..50b96dd71f 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -111,8 +111,12 @@ bench-string := \
>    ffsll \
>  # bench-string
>
> +bench-stdlib := \
> +  arc4random \
> +# bench-stdlib
> +
>  ifeq (${BENCHSET},)
> -bench := $(bench-math) $(bench-pthread) $(bench-string)
> +bench := $(bench-math) $(bench-pthread) $(bench-string) $(bench-stdlib)
>  else
>  bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
>  endif
> diff --git a/benchtests/bench-arc4random.c b/benchtests/bench-arc4random.c
> new file mode 100644
> index 0000000000..9e2ba9ba34
> --- /dev/null
> +++ b/benchtests/bench-arc4random.c
> @@ -0,0 +1,243 @@
> +/* arc4random benchmarks.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include "bench-timing.h"
> +#include "json-lib.h"
> +#include <array_length.h>
> +#include <intprops.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <support/support.h>
> +#include <support/xthread.h>
> +
> +static volatile uint32_t r;
> +static volatile sig_atomic_t timer_finished;
> +
> +static void timer_callback (int unused)
> +{
> +  timer_finished = 1;
> +}
> +
> +static const uint32_t sizes[] = { 0, 16, 32, 64, 128 };
> +
> +static double
> +bench_arc4random_throughput (void)
> +{
> +  /* Run for approximately DURATION seconds, and it does not matter who
> +     receive the signal (so not need to mask it on main thread).  */
> +  timer_finished = 0;
> +  timer_t timer = support_create_timer (DURATION, 0, false, timer_callback);
> +
> +  uint64_t n = 0;
> +
> +  while (1)
> +    {
> +      r = arc4random ();
> +      n++;
> +
> +      if (timer_finished == 1)
> +       break;
> +    }
> +
> +  support_delete_timer (timer);
> +
> +  return (double) (n * sizeof (r)) / (double) DURATION;
> +}
> +
> +static double
> +bench_arc4random_latency (void)
> +{
> +  timing_t start, stop, cur;
> +  const size_t iters = 1024;
> +
> +  TIMING_NOW (start);
> +  for (size_t i = 0; i < iters; i++)
> +    r = arc4random ();
> +  TIMING_NOW (stop);
> +
> +  TIMING_DIFF (cur, start, stop);
> +
> +  return (double) (cur) / (double) iters;
> +}
> +
> +static double
> +bench_arc4random_buf_throughput (size_t len)
> +{
> +  timer_finished = 0;
> +  timer_t timer = support_create_timer (DURATION, 0, false, timer_callback);
> +
> +  uint8_t buf[len];
> +
> +  uint64_t n = 0;
> +
> +  while (1)
> +    {
> +      arc4random_buf (buf, len);
> +      n++;
> +
> +      if (timer_finished == 1)
> +       break;
> +    }
> +
> +  support_delete_timer (timer);
> +
> +  uint64_t total = (n * len);
> +  return (double) (total) / (double) DURATION;
> +}
> +
> +static double
> +bench_arc4random_buf_latency (size_t len)
> +{
> +  timing_t start, stop, cur;
> +  const size_t iters = 1024;
> +
> +  uint8_t buf[len];
> +
> +  TIMING_NOW (start);
> +  for (size_t i = 0; i < iters; i++)
> +    arc4random_buf (buf, len);
> +  TIMING_NOW (stop);
> +
> +  TIMING_DIFF (cur, start, stop);
> +
> +  return (double) (cur) / (double) iters;
> +}
> +
> +static void
> +bench_singlethread (json_ctx_t *json_ctx)
> +{
> +  json_element_object_begin (json_ctx);
> +
> +  json_array_begin (json_ctx, "throughput");
> +  for (int i = 0; i < array_length (sizes); i++)
> +    if (sizes[i] == 0)
> +      json_element_double (json_ctx, bench_arc4random_throughput ());
> +    else
> +      json_element_double (json_ctx, bench_arc4random_buf_throughput (sizes[i]));
> +  json_array_end (json_ctx);
> +
> +  json_array_begin (json_ctx, "latency");
> +  for (int i = 0; i < array_length (sizes); i++)
> +    if (sizes[i] == 0)
> +      json_element_double (json_ctx, bench_arc4random_latency ());
> +    else
> +      json_element_double (json_ctx, bench_arc4random_buf_latency (sizes[i]));
> +  json_array_end (json_ctx);
> +
> +  json_element_object_end (json_ctx);
> +}
> +
> +struct thr_arc4random_arg
> +{
> +  double ret;
> +  uint32_t val;
> +};
> +
> +static void *
> +thr_arc4random_throughput (void *closure)
> +{
> +  struct thr_arc4random_arg *arg = closure;
> +  arg->ret = arg->val == 0 ? bench_arc4random_throughput ()
> +                          : bench_arc4random_buf_throughput (arg->val);
> +  return NULL;
> +}
> +
> +static void *
> +thr_arc4random_latency (void *closure)
> +{
> +  struct thr_arc4random_arg *arg = closure;
> +  arg->ret = arg->val == 0 ? bench_arc4random_latency ()
> +                          : bench_arc4random_buf_latency (arg->val);
> +  return NULL;
> +}

I think the expectation is that the chacha calls will be cold, so
maybe it is worth adding a cache flush of sorts between calls.  Some
prefetching at the start might help the code in that case, but it
would only be a regression with the hot-in-L1 benchmarks.

Can wait though; this V1 looks fine.
> +
> +static void
> +bench_threaded (json_ctx_t *json_ctx)
> +{
> +  json_element_object_begin (json_ctx);
> +
> +  json_array_begin (json_ctx, "throughput");
> +  for (int i = 0; i < array_length (sizes); i++)
> +    {
> +      struct thr_arc4random_arg arg = { .val = sizes[i] };
> +      pthread_t thr = xpthread_create (NULL, thr_arc4random_throughput, &arg);
> +      xpthread_join (thr);
> +      json_element_double (json_ctx, arg.ret);
> +    }
> +  json_array_end (json_ctx);
> +
> +  json_array_begin (json_ctx, "latency");
> +  for (int i = 0; i < array_length (sizes); i++)
> +    {
> +      struct thr_arc4random_arg arg = { .val = sizes[i] };
> +      pthread_t thr = xpthread_create (NULL, thr_arc4random_latency, &arg);
> +      xpthread_join (thr);
> +      json_element_double (json_ctx, arg.ret);
> +    }
> +  json_array_end (json_ctx);
> +
> +  json_element_object_end (json_ctx);
> +}
> +
> +static void
> +run_bench (json_ctx_t *json_ctx, const char *name,
> +          char *const*fnames, size_t fnameslen,
> +          void (*bench)(json_ctx_t *ctx))
> +{
> +  json_attr_object_begin (json_ctx, name);
> +  json_array_begin (json_ctx, "functions");
> +  for (int i = 0; i < fnameslen; i++)
> +    json_element_string (json_ctx, fnames[i]);
> +  json_array_end (json_ctx);
> +
> +  json_array_begin (json_ctx, "results");
> +  bench (json_ctx);
> +  json_array_end (json_ctx);
> +  json_attr_object_end (json_ctx);
> +}
> +
> +static int
> +do_test (void)
> +{
> +  char *fnames[array_length (sizes) + 1];
> +  fnames[0] = (char *) "arc4random";
> +  for (int i = 0; i < array_length (sizes); i++)
> +    fnames[i+1] = xasprintf ("arc4random_buf(%u)", sizes[i]);
> +
> +  json_ctx_t json_ctx;
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  run_bench (&json_ctx, "single-thread", fnames, array_length (fnames),
> +            bench_singlethread);
> +  run_bench (&json_ctx, "multi-thread", fnames, array_length (fnames),
> +            bench_threaded);
> +
> +  json_document_end (&json_ctx);
> +
> +  for (int i = 0; i < array_length (sizes); i++)
> +    free (fnames[i+1]);
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
> --
> 2.32.0
>


* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
  2022-04-13 23:12   ` Noah Goldstein
  2022-04-14 17:17   ` Noah Goldstein
@ 2022-04-14 19:25   ` Noah Goldstein
  2022-04-14 19:40     ` Adhemerval Zanella
  2 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 19:25 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Wed, Apr 13, 2022 at 3:27 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> It adds vectorized ChaCha20 implementation based on libgcrypt
> cipher/chacha20-amd64-ssse3.S.  It is used only if SSSE3 is supported
> and enable by the architecture.
>
> On a Ryzen 9 5900X it shows the following improvements (using
> formatted bench-arc4random data):
>
> GENERIC
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               375.06
> arc4random_buf(0) [single-thread]        498.50
> arc4random_buf(16) [single-thread]       576.86
> arc4random_buf(32) [single-thread]       615.76
> arc4random_buf(64) [single-thread]       633.97
> --------------------------------------------------
> arc4random [multi-thread]                359.86
> arc4random_buf(0) [multi-thread]         479.27
> arc4random_buf(16) [multi-thread]        543.65
> arc4random_buf(32) [multi-thread]        581.98
> arc4random_buf(64) [multi-thread]        603.01
> --------------------------------------------------
>
> SSSE3:
> Function                                 MB/s
> --------------------------------------------------
> arc4random [single-thread]               576.55
> arc4random_buf(0) [single-thread]        961.77
> arc4random_buf(16) [single-thread]       1309.38
> arc4random_buf(32) [single-thread]       1558.69
> arc4random_buf(64) [single-thread]       1728.54
> --------------------------------------------------
> arc4random [multi-thread]                589.52
> arc4random_buf(0) [multi-thread]         967.39
> arc4random_buf(16) [multi-thread]        1319.27
> arc4random_buf(32) [multi-thread]        1552.96
> arc4random_buf(64) [multi-thread]        1734.27
> --------------------------------------------------
>
> Checked on x86_64-linux-gnu.
> ---
>  LICENSES                        |  20 ++
>  sysdeps/generic/chacha20_arch.h |  24 +++
>  sysdeps/x86_64/Makefile         |   6 +
>  sysdeps/x86_64/chacha20-ssse3.S | 330 ++++++++++++++++++++++++++++++++
>  sysdeps/x86_64/chacha20_arch.h  |  42 ++++
>  5 files changed, 422 insertions(+)
>  create mode 100644 sysdeps/generic/chacha20_arch.h
>  create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
>  create mode 100644 sysdeps/x86_64/chacha20_arch.h
>
> diff --git a/LICENSES b/LICENSES
> index 530893b1dc..2563abd9e2 100644
> --- a/LICENSES
> +++ b/LICENSES
> @@ -389,3 +389,23 @@ Copyright 2001 by Stephen L. Moshier <moshier@na-net.ornl.gov>
>   You should have received a copy of the GNU Lesser General Public
>   License along with this library; if not, see
>   <https://www.gnu.org/licenses/>.  */
> +
> +sysdeps/x86_64/chacha20-ssse3.S import code from libgcrypt, with the
> +following notices:
> +
> +Copyright (C) 2017-2019 Jussi Kivilinna <jussi.kivilinna@iki.fi>
> +
> +This file is part of Libgcrypt.
> +
> +Libgcrypt is free software; you can redistribute it and/or modify
> +it under the terms of the GNU Lesser General Public License as
> +published by the Free Software Foundation; either version 2.1 of
> +the License, or (at your option) any later version.
> +
> +Libgcrypt is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU Lesser General Public License for more details.
> +
> +You should have received a copy of the GNU Lesser General Public
> +License along with this program; if not, see <http://www.gnu.org/licenses/>.
> diff --git a/sysdeps/generic/chacha20_arch.h b/sysdeps/generic/chacha20_arch.h
> new file mode 100644
> index 0000000000..d7200ac583
> --- /dev/null
> +++ b/sysdeps/generic/chacha20_arch.h
> @@ -0,0 +1,24 @@
> +/* Chacha20 implementation, generic interface.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst,
> +               const uint8_t *src, size_t bytes)
> +{
> +  chacha20_crypt_generic (state, dst, src, bytes);
> +}
> diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
> index 79365aff2a..f43b6a1180 100644
> --- a/sysdeps/x86_64/Makefile
> +++ b/sysdeps/x86_64/Makefile
> @@ -5,6 +5,12 @@ ifeq ($(subdir),csu)
>  gen-as-const-headers += link-defines.sym
>  endif
>
> +ifeq ($(subdir),stdlib)
> +sysdep_routines += \
> +  chacha20-ssse3 \
> +  # sysdep_routines
> +endif
> +
>  ifeq ($(subdir),gmon)
>  sysdep_routines += _mcount
>  # We cannot compile _mcount.S with -pg because that would create
> diff --git a/sysdeps/x86_64/chacha20-ssse3.S b/sysdeps/x86_64/chacha20-ssse3.S
> new file mode 100644
> index 0000000000..f221daf634
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20-ssse3.S
> @@ -0,0 +1,330 @@
> +/* Optimized SSSE3 implementation of ChaCha20 cipher.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Based on D. J. Bernstein reference implementation at
> +   http://cr.yp.to/chacha.html:
> +
> +   chacha-regs.c version 20080118
> +   D. J. Bernstein
> +   Public domain.  */
> +
> +#include <sysdep.h>
> +
> +#ifdef PIC
> +#  define rRIP (%rip)
> +#else
> +#  define rRIP
> +#endif
> +
> +/* register macros */
> +#define INPUT %rdi
> +#define DST   %rsi
> +#define SRC   %rdx
> +#define NBLKS %rcx
> +#define ROUND %eax
> +
> +/* stack structure */
> +#define STACK_VEC_X12 (16)
> +#define STACK_VEC_X13 (16 + STACK_VEC_X12)
> +#define STACK_TMP     (16 + STACK_VEC_X13)
> +#define STACK_TMP1    (16 + STACK_TMP)
> +#define STACK_TMP2    (16 + STACK_TMP1)
> +
> +#define STACK_MAX     (16 + STACK_TMP2)
> +
> +/* vector registers */
> +#define X0 %xmm0
> +#define X1 %xmm1
> +#define X2 %xmm2
> +#define X3 %xmm3
> +#define X4 %xmm4
> +#define X5 %xmm5
> +#define X6 %xmm6
> +#define X7 %xmm7
> +#define X8 %xmm8
> +#define X9 %xmm9
> +#define X10 %xmm10
> +#define X11 %xmm11
> +#define X12 %xmm12
> +#define X13 %xmm13
> +#define X14 %xmm14
> +#define X15 %xmm15
> +
> +/**********************************************************************
> +  helper macros
> + **********************************************************************/
> +
> +/* 4x4 32-bit integer matrix transpose */
> +#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
> +       movdqa    x0, t2; \
> +       punpckhdq x1, t2; \
> +       punpckldq x1, x0; \
> +       \
> +       movdqa    x2, t1; \
> +       punpckldq x3, t1; \
> +       punpckhdq x3, x2; \
> +       \
> +       movdqa     x0, x1; \
> +       punpckhqdq t1, x1; \
> +       punpcklqdq t1, x0; \
> +       \
> +       movdqa     t2, x3; \
> +       punpckhqdq x2, x3; \
> +       punpcklqdq x2, t2; \
> +       movdqa     t2, x2;
> +
> +/* fill xmm register with 32-bit value from memory */
> +#define pbroadcastd(mem32, xreg) \
> +       movd mem32, xreg; \
> +       pshufd $0, xreg, xreg;
> +
> +/* xor with unaligned memory operand */
> +#define pxor_u(umem128, xreg, t) \
> +       movdqu umem128, t; \
> +       pxor t, xreg;
> +
> +/* xor register with unaligned src and save to unaligned dst */
> +#define xor_src_dst(dst, src, offset, xreg, t) \
> +       pxor_u(offset(src), xreg, t); \
> +       movdqu xreg, offset(dst);
> +
> +#define clear(x) pxor x,x;
> +
> +/**********************************************************************
> +  4-way chacha20
> + **********************************************************************/
> +
> +#define ROTATE2(v1,v2,c,tmp1,tmp2)     \
> +       movdqa v1, tmp1;                \
> +       movdqa v2, tmp2;                \
> +       psrld $(32 - (c)), v1;          \
> +       pslld $(c), tmp1;               \
> +       paddb tmp1, v1;                 \
> +       psrld $(32 - (c)), v2;          \
> +       pslld $(c), tmp2;               \
> +       paddb tmp2, v2;
> +
> +#define ROTATE_SHUF_2(v1,v2,shuf)      \
> +       pshufb shuf, v1;                \
> +       pshufb shuf, v2;
> +
> +#define XOR(ds,s) \
> +       pxor s, ds;
> +
> +#define PLUS(ds,s) \
> +       paddd s, ds;
> +
> +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2,\
> +                     interleave_op1,interleave_op2)            \
> +       movdqa L(shuf_rol16) rRIP, tmp1;                        \
> +               interleave_op1;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2, 12, tmp1, tmp2);                    \
> +       movdqa L(shuf_rol8) rRIP, tmp1;                         \
> +               interleave_op2;                                 \
> +       PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2);       \
> +           ROTATE_SHUF_2(d1, d2, tmp1);                        \
> +       PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2);       \
> +           ROTATE2(b1, b2,  7, tmp1, tmp2);
> +
> +       .text
> +
> +chacha20_data:
> +       .align 16
> +L(shuf_rol16):
> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> +L(shuf_rol8):
> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> +L(counter1):
> +       .long 1,0,0,0
> +L(inc_counter):
> +       .long 0,1,2,3
> +L(unsigned_cmp):
> +       .long 0x80000000,0x80000000,0x80000000,0x80000000
> +
> +ENTRY (__chacha20_ssse3_blocks8)
> +       /* input:
> +        *      %rdi: input
> +        *      %rsi: dst
> +        *      %rdx: src
> +        *      %rcx: nblks (multiple of 4)
> +        */
> +
> +       pushq %rbp;
> +       cfi_adjust_cfa_offset(8);
> +       cfi_rel_offset(rbp, 0)
> +       movq %rsp, %rbp;
> +       cfi_def_cfa_register(%rbp);
> +
> +       subq $STACK_MAX, %rsp;
> +       andq $~15, %rsp;
> +
> +L(loop4):
> +       mov $20, ROUND;
> +
> +       /* Construct counter vectors X12 and X13 */
> +       movdqa L(inc_counter) rRIP, X0;
> +       movdqa L(unsigned_cmp) rRIP, X2;
> +       pbroadcastd((12 * 4)(INPUT), X12);
> +       pbroadcastd((13 * 4)(INPUT), X13);
> +       paddd X0, X12;
> +       movdqa X12, X1;
> +       pxor X2, X0;
> +       pxor X2, X1;
> +       pcmpgtd X1, X0;
> +       psubd X0, X13;
> +       movdqa X12, (STACK_VEC_X12)(%rsp);
> +       movdqa X13, (STACK_VEC_X13)(%rsp);
> +
> +       /* Load vectors */
> +       pbroadcastd((0 * 4)(INPUT), X0);
> +       pbroadcastd((1 * 4)(INPUT), X1);
> +       pbroadcastd((2 * 4)(INPUT), X2);
> +       pbroadcastd((3 * 4)(INPUT), X3);
> +       pbroadcastd((4 * 4)(INPUT), X4);
> +       pbroadcastd((5 * 4)(INPUT), X5);
> +       pbroadcastd((6 * 4)(INPUT), X6);
> +       pbroadcastd((7 * 4)(INPUT), X7);
> +       pbroadcastd((8 * 4)(INPUT), X8);
> +       pbroadcastd((9 * 4)(INPUT), X9);
> +       pbroadcastd((10 * 4)(INPUT), X10);
> +       pbroadcastd((11 * 4)(INPUT), X11);
> +       pbroadcastd((14 * 4)(INPUT), X14);
> +       pbroadcastd((15 * 4)(INPUT), X15);
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +
> +L(round2_4):
> +       QUARTERROUND2(X0, X4,  X8, X12,   X1, X5,  X9, X13, tmp:=,X11,X15,,)
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X8, (STACK_TMP)(%rsp);
> +       movdqa X9, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X6, X10, X14,   X3, X7, X11, X15, tmp:=,X8,X9,,)
> +       QUARTERROUND2(X0, X5, X10, X15,   X1, X6, X11, X12, tmp:=,X8,X9,,)
> +       movdqa (STACK_TMP)(%rsp), X8;
> +       movdqa (STACK_TMP1)(%rsp), X9;
> +       movdqa X11, (STACK_TMP)(%rsp);
> +       movdqa X15, (STACK_TMP1)(%rsp);
> +       QUARTERROUND2(X2, X7,  X8, X13,   X3, X4,  X9, X14, tmp:=,X11,X15,,)
> +       sub $2, ROUND;
> +       jnz .Lround2_4;
> +
> +       /* tmp := X15 */
> +       movdqa (STACK_TMP)(%rsp), X11;
> +       pbroadcastd((0 * 4)(INPUT), X15);
> +       PLUS(X0, X15);
> +       pbroadcastd((1 * 4)(INPUT), X15);
> +       PLUS(X1, X15);
> +       pbroadcastd((2 * 4)(INPUT), X15);
> +       PLUS(X2, X15);
> +       pbroadcastd((3 * 4)(INPUT), X15);
> +       PLUS(X3, X15);
> +       pbroadcastd((4 * 4)(INPUT), X15);
> +       PLUS(X4, X15);
> +       pbroadcastd((5 * 4)(INPUT), X15);
> +       PLUS(X5, X15);
> +       pbroadcastd((6 * 4)(INPUT), X15);
> +       PLUS(X6, X15);
> +       pbroadcastd((7 * 4)(INPUT), X15);
> +       PLUS(X7, X15);
> +       pbroadcastd((8 * 4)(INPUT), X15);
> +       PLUS(X8, X15);
> +       pbroadcastd((9 * 4)(INPUT), X15);
> +       PLUS(X9, X15);
> +       pbroadcastd((10 * 4)(INPUT), X15);
> +       PLUS(X10, X15);
> +       pbroadcastd((11 * 4)(INPUT), X15);
> +       PLUS(X11, X15);
> +       movdqa (STACK_VEC_X12)(%rsp), X15;
> +       PLUS(X12, X15);
> +       movdqa (STACK_VEC_X13)(%rsp), X15;
> +       PLUS(X13, X15);
> +       movdqa X13, (STACK_TMP)(%rsp);
> +       pbroadcastd((14 * 4)(INPUT), X15);
> +       PLUS(X14, X15);
> +       movdqa (STACK_TMP1)(%rsp), X15;
> +       movdqa X14, (STACK_TMP1)(%rsp);
> +       pbroadcastd((15 * 4)(INPUT), X13);
> +       PLUS(X15, X13);
> +       movdqa X15, (STACK_TMP2)(%rsp);
> +
> +       /* Update counter */
> +       addq $4, (12 * 4)(INPUT);
> +
> +       transpose_4x4(X0, X1, X2, X3, X13, X14, X15);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 0), X0, X15);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 0), X1, X15);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 0), X2, X15);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 0), X3, X15);
> +       transpose_4x4(X4, X5, X6, X7, X0, X1, X2);
> +       movdqa (STACK_TMP)(%rsp), X13;
> +       movdqa (STACK_TMP1)(%rsp), X14;
> +       movdqa (STACK_TMP2)(%rsp), X15;
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 1), X4, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 1), X5, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 1), X6, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 1), X7, X0);
> +       transpose_4x4(X8, X9, X10, X11, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 2), X8, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 2), X9, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 2), X10, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 2), X11, X0);
> +       transpose_4x4(X12, X13, X14, X15, X0, X1, X2);
> +       xor_src_dst(DST, SRC, (64 * 0 + 16 * 3), X12, X0);
> +       xor_src_dst(DST, SRC, (64 * 1 + 16 * 3), X13, X0);
> +       xor_src_dst(DST, SRC, (64 * 2 + 16 * 3), X14, X0);
> +       xor_src_dst(DST, SRC, (64 * 3 + 16 * 3), X15, X0);
> +
> +       sub $4, NBLKS;
> +       lea (4 * 64)(DST), DST;
> +       lea (4 * 64)(SRC), SRC;
> +       jnz L(loop4);
> +
> +       /* clear the used vector registers and stack */
> +       clear(X0);
> +       movdqa X0, (STACK_VEC_X12)(%rsp);
> +       movdqa X0, (STACK_VEC_X13)(%rsp);
> +       movdqa X0, (STACK_TMP)(%rsp);
> +       movdqa X0, (STACK_TMP1)(%rsp);
> +       movdqa X0, (STACK_TMP2)(%rsp);
> +       clear(X1);
> +       clear(X2);
> +       clear(X3);
> +       clear(X4);
> +       clear(X5);
> +       clear(X6);
> +       clear(X7);
> +       clear(X8);
> +       clear(X9);
> +       clear(X10);
> +       clear(X11);
> +       clear(X12);
> +       clear(X13);
> +       clear(X14);
> +       clear(X15);
> +
> +       /* eax zeroed by round loop. */
> +       leave;
> +       cfi_adjust_cfa_offset(-8)
> +       cfi_def_cfa_register(%rsp);
> +       ret;
> +       int3;
> +END (__chacha20_ssse3_blocks8)
> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> new file mode 100644
> index 0000000000..37a4fdfb1f
> --- /dev/null
> +++ b/sysdeps/x86_64/chacha20_arch.h
> @@ -0,0 +1,42 @@
> +/* Chacha20 implementation, used on arc4random.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <ldsodefs.h>
> +#include <cpu-features.h>
> +#include <sys/param.h>
> +
> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> +                                      const uint8_t *src, size_t nblks);
> +
> +static inline void
> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> +               size_t bytes)
> +{
> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> +    {
> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> +      nblocks -= nblocks % 4;

Are we locking ourselves into the API of __chacha_* expecting
this precomputation?  I imagine we might want to move this to
assembly unless `nblocks` is a compile-time constant.

> +      __chacha20_ssse3_blocks8 (state->ctx, dst, src, nblocks);
> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> +    }
> +
> +  if (bytes > 0)
> +    chacha20_crypt_generic (state, dst, src, bytes);
> +}
> --
> 2.32.0
>


* Re: [PATCH 0/7] Add arc4random support
  2022-04-14 11:49 ` Cristian Rodríguez
@ 2022-04-14 19:26   ` Adhemerval Zanella
  2022-04-14 20:36     ` Noah Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 19:26 UTC (permalink / raw)
  To: Cristian Rodríguez; +Cc: libc-alpha



On 14/04/2022 08:49, Cristian Rodríguez wrote:
> If this interface is going to be added, GNU extensions that return
> uint64_t from arc4random and arc4random_uniform would be extremely
> cool.  Even cooler if there is no global state.

I don't think adding a uint64_t interface for arc4random would improve
much, especially since a simple wrapper using arc4random_buf should
suffice.  It would also require portable code to handle another GNU
extension on top of a BSD-defined interface that is present on
multiple systems.  Also, performance-wise, I don't think it would be
much different from arc4random_buf.  It makes some sense for
arc4random_uniform, but I don't have a strong opinion.

The global state adds some hardening by 'slicing up the stream': with
multiple consumers getting different pieces of it, it adds
backtracking and prediction resistance.  Theo de Raadt explains a bit
why OpenBSD has added this concept to its arc4random implementation
[1] (around minute 26).  As he puts it, there is no formal proof, but
I agree that the ideas are reasonable.

Also, not using a global state means we would need to add a per-thread
or per-CPU state, which is at least one page (due to MADV_WIPEONFORK).
The per-CPU state is only actually possible on newer Linux kernels
that support rseq.  We might just not care about MADV_WIPEONFORK and
use a malloc'ed buffer that would be reset by the internal atfork
handler.

[1] https://www.youtube.com/watch?v=gp_90-3R0pE


* Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20
  2022-04-14 19:25   ` Noah Goldstein
@ 2022-04-14 19:40     ` Adhemerval Zanella
  0 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 19:40 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 16:25, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 3:27 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> +
>> +static inline void
>> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>> +               size_t bytes)
>> +{
>> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>> +    {
>> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> +      nblocks -= nblocks % 4;
> 
> Are we locking ourselves into the api of __chacha_* expecting
> this precomputation? I imagine we might want to move this to
> assembly unless `nblock` is a compile time constant.
> 
>> +      __chacha20_ssse3_blocks8 (state->ctx, dst, src, nblocks);
>> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
>> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
>> +      src += nblocks * CHACHA20_BLOCK_SIZE;
>> +    }
>> +
>> +  if (bytes > 0)
>> +    chacha20_crypt_generic (state, dst, src, bytes);
>> +}
>> --
>> 2.32.0
>>

I think it should be ok to _Static_assert that CHACHA20_BUFSIZE is a multiple
of the expected nblocks used by the optimized version and just call it without
the need to handle nblocks.

I will change it for v2.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] benchtests: Add arc4random benchtest
  2022-04-14 19:17   ` Noah Goldstein
@ 2022-04-14 19:48     ` Adhemerval Zanella
  2022-04-14 20:33       ` Noah Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 19:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 16:17, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 3:26 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> It shows both throughput (total bytes obtained in the test duration)
>> and latency for both arc4random and arc4random_buf with different
>> sizes.
>>
>> +
>> +static void *
>> +thr_arc4random_latency (void *closure)
>> +{
>> +  struct thr_arc4random_arg *arg = closure;
>> +  arg->ret = arg->val == 0 ? bench_arc4random_latency ()
>> +                          : bench_arc4random_buf_latency (arg->val);
>> +  return NULL;
>> +}
> 
> I think the expectation is that the chacha calls will be cold,
> so maybe it is worth adding a cache flush of sorts between
> calls.  It may be that some prefetching at the start will help the code
> in that case, but it would only be a regression with the hot-in-L1
> benchmarks.
> 
> Can wait though; this V1 looks fine.

In fact I think just checking the call within a thread does not add
much, especially since we don't have any single-thread lock optimization
for internal locks.  I will remove it on v2 and maybe revise it in the
future.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] benchtests: Add arc4random benchtest
  2022-04-14 19:48     ` Adhemerval Zanella
@ 2022-04-14 20:33       ` Noah Goldstein
  2022-04-14 20:48         ` Adhemerval Zanella
  0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 20:33 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library

On Thu, Apr 14, 2022 at 2:48 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 14/04/2022 16:17, Noah Goldstein wrote:
> > On Wed, Apr 13, 2022 at 3:26 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> It shows both throughput (total bytes obtained in the test duration)
> >> and latency for both arc4random and arc4random_buf with different
> >> sizes.
> >>
> >> +
> >> +static void *
> >> +thr_arc4random_latency (void *closure)
> >> +{
> >> +  struct thr_arc4random_arg *arg = closure;
> >> +  arg->ret = arg->val == 0 ? bench_arc4random_latency ()
> >> +                          : bench_arc4random_buf_latency (arg->val);
> >> +  return NULL;
> >> +}
> >
> > I think the expectation is that the chacha calls will be cold,
> > so maybe it is worth adding a cache flush of sorts between
> > calls.  It may be that some prefetching at the start will help the code
> > in that case, but it would only be a regression with the hot-in-L1
> > benchmarks.
> >
> > Can wait though; this V1 looks fine.
>
> In fact I think just checking the call within a thread does not add
> much, especially since we don't have any single-thread lock optimization
> for internal locks.  I will remove it on v2 and maybe revise it in the
> future.

What do you mean by single-thread lock optimization?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-14 19:26   ` Adhemerval Zanella
@ 2022-04-14 20:36     ` Noah Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Noah Goldstein @ 2022-04-14 20:36 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Cristian Rodríguez, GNU C Library

On Thu, Apr 14, 2022 at 2:26 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
>
>
> On 14/04/2022 08:49, Cristian Rodríguez wrote:
> > If this interface is going to be added, GNU extensions of arc4random
> > and arc4random_uniform that return uint64_t would be extremely cool.
> > Even cooler if there is no global state.
>
> I don't think adding a uint64_t interface for arc4random would improve
> much, especially because a simple wrapper using arc4random_buf should
> suffice.  It would also require portable code to handle another
> GNU extension on top of a BSD-defined interface that is present on
> multiple systems.  Also, performance-wise I don't think it would be much
> different from arc4random_buf.  It makes some sense for
> arc4random_uniform, but I don't have a strong opinion.
>
> The global state adds some hardening by 'slicing up the stream' since
> multiple consumers getting different pieces add backtracking and prediction
> resistance. Theo de Raadt explains a bit why OpenBSD has added this
> concept [1] (check about minute 26) on its arc4random implementation.
> As he puts it, there is no formal proof, but I agree that the ideas are
> reasonable.
>
> Also, not using a global state means we would need to add a per-thread or
> per-cpu state, which is at least one page (due to MADV_WIPEONFORK).  The
> per-cpu state is only actually possible on newer Linux kernels that
> support rseq.  We might just not care about MADV_WIPEONFORK and use
> a malloc'ed buffer that would be reset by the internal atfork handler.

We could do best-effort per-cpu without rseq (select an arena based on the
current cpu) and have a truly optimized version if rseq is supported.
Either way it's likely to be an improvement if this function is hot.

>
> [1] https://www.youtube.com/watch?v=gp_90-3R0pE

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] benchtests: Add arc4random benchtest
  2022-04-14 20:33       ` Noah Goldstein
@ 2022-04-14 20:48         ` Adhemerval Zanella
  0 siblings, 0 replies; 34+ messages in thread
From: Adhemerval Zanella @ 2022-04-14 20:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library



On 14/04/2022 17:33, Noah Goldstein wrote:
> On Thu, Apr 14, 2022 at 2:48 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 14/04/2022 16:17, Noah Goldstein wrote:
>>> On Wed, Apr 13, 2022 at 3:26 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> It shows both throughput (total bytes obtained in the test duration)
>>>> and latency for both arc4random and arc4random_buf with different
>>>> sizes.
>>>>
>>>> +
>>>> +static void *
>>>> +thr_arc4random_latency (void *closure)
>>>> +{
>>>> +  struct thr_arc4random_arg *arg = closure;
>>>> +  arg->ret = arg->val == 0 ? bench_arc4random_latency ()
>>>> +                          : bench_arc4random_buf_latency (arg->val);
>>>> +  return NULL;
>>>> +}
>>>
>>> I think the expectation is that the chacha calls will be cold,
>>> so maybe it is worth adding a cache flush of sorts between
>>> calls.  It may be that some prefetching at the start will help the code
>>> in that case, but it would only be a regression with the hot-in-L1
>>> benchmarks.
>>>
>>> Can wait though; this V1 looks fine.
>>
>> In fact I think just checking the call within a thread does not add
>> much, especially since we don't have any single-thread lock optimization
>> for internal locks.  I will remove it on v2 and maybe revise it in the
>> future.
> 
> What do you mean single-thread lock optimization?

Not taking the lock if the process is single-threaded, as we do on some
fast paths in the malloc code.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/7] Add arc4random support
  2022-04-14 18:39   ` Adhemerval Zanella
  2022-04-14 18:43     ` Noah Goldstein
@ 2022-04-15 10:22     ` Yann Droneaud
  1 sibling, 0 replies; 34+ messages in thread
From: Yann Droneaud @ 2022-04-15 10:22 UTC (permalink / raw)
  To: Adhemerval Zanella, GNU C Library

Hi,

On 14/04/2022 at 20:39, Adhemerval Zanella wrote:
> On 14/04/2022 04:36, Yann Droneaud wrote:
>
> On 13/04/2022 at 22:23, Adhemerval Zanella via Libc-alpha wrote:
>
>>> This patch adds the arc4random, arc4random_buf, and arc4random_uniform
>>> along with optimized versions for x86_64, aarch64, and powerpc64.
>>>
>>> The generic implementation is based on scalar Chacha20, with a global
>>> cache and locking.  It uses getrandom or /dev/urandom as fallback to
>>> get the initial entropy, and reseeds the internal state on every 16MB
>>> of consumed entropy.
>>>
>>> It maintains an internal buffer which consumes at maximum one page on
>>> most systems (assuming 4k pages).  The internal buffer optimizes the
>>> cipher encrypt calls by amortizing arc4random calls (where both the
>>> function call and lock costs are the dominating factors).
>>>
>>> Fork detection is done by checking if MADV_WIPEONFORK is supported.  If
>>> not, the fork callback will reset the state on the fork call.  It does
>>> not handle direct clone calls, nor vfork or _Fork (arc4random is not
>>> async-signal-safe due to the internal lock usage, although the
>>> implementation does try to handle fork cases).
>>>
>>> The generic ChaCha20 implementation is based on RFC 8439 [1], and is
>>> a simple memcpy-with-xor implementation.
>> The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.
> I don't have a strong feeling about it, although it seems that every other
> ChaCha20 implementation I have checked does it (libgcrypt, Linux,
> BSD).  The BSDs also do it for arc4random, although most if not
> all come from OpenBSD, and they are usually paranoid about security
> hardening.

Check #define KEYSTREAM_ONLY

https://github.com/openbsd/src/blob/master/lib/libc/crypt/arc4random.c#L36

https://github.com/openbsd/src/blob/master/lib/libc/crypt/chacha_private.h#L166

Regards.

-- 

Yann Droneaud

OPTEYA



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-04-15 10:22 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-13 20:23 [PATCH 0/7] Add arc4random support Adhemerval Zanella
2022-04-13 20:23 ` [PATCH 1/7] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417) Adhemerval Zanella
2022-04-13 20:23 ` [PATCH 2/7] stdlib: Add arc4random tests Adhemerval Zanella
2022-04-14 18:01   ` Noah Goldstein
2022-04-13 20:23 ` [PATCH 3/7] benchtests: Add arc4random benchtest Adhemerval Zanella
2022-04-14 19:17   ` Noah Goldstein
2022-04-14 19:48     ` Adhemerval Zanella
2022-04-14 20:33       ` Noah Goldstein
2022-04-14 20:48         ` Adhemerval Zanella
2022-04-13 20:23 ` [PATCH 4/7] x86: Add SSSE3 optimized chacha20 Adhemerval Zanella
2022-04-13 23:12   ` Noah Goldstein
2022-04-14 17:03     ` Adhemerval Zanella
2022-04-14 17:10       ` Noah Goldstein
2022-04-14 17:18         ` Adhemerval Zanella
2022-04-14 17:22           ` Noah Goldstein
2022-04-14 18:25             ` Adhemerval Zanella
2022-04-14 17:17   ` Noah Goldstein
2022-04-14 18:11     ` Adhemerval Zanella
2022-04-14 19:25   ` Noah Goldstein
2022-04-14 19:40     ` Adhemerval Zanella
2022-04-13 20:23 ` [PATCH 5/7] x86: Add AVX2 " Adhemerval Zanella
2022-04-13 23:04   ` Noah Goldstein
2022-04-14 17:16     ` Adhemerval Zanella
2022-04-14 17:20       ` Noah Goldstein
2022-04-14 18:12         ` Adhemerval Zanella
2022-04-13 20:24 ` [PATCH 6/7] aarch64: Add " Adhemerval Zanella
2022-04-13 20:24 ` [PATCH 7/7] powerpc64: " Adhemerval Zanella
2022-04-14  7:36 ` [PATCH 0/7] Add arc4random support Yann Droneaud
2022-04-14 18:39   ` Adhemerval Zanella
2022-04-14 18:43     ` Noah Goldstein
2022-04-15 10:22     ` Yann Droneaud
2022-04-14 11:49 ` Cristian Rodríguez
2022-04-14 19:26   ` Adhemerval Zanella
2022-04-14 20:36     ` Noah Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).