From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oa1-x34.google.com (mail-oa1-x34.google.com [IPv6:2001:4860:4864:20::34]) by sourceware.org (Postfix) with ESMTPS id 927D23858430 for ; Mon, 25 Apr 2022 13:02:23 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 927D23858430 Received: by mail-oa1-x34.google.com with SMTP id 586e51a60fabf-e2442907a1so15925372fac.8 for ; Mon, 25 Apr 2022 06:02:23 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=gUGUrzDGe7xOOl99Ctyc+BkiGpy9XE6imCWSbJVzMrc=; b=kX6Q36AkecWv3RAQNjPJh+FenJID7f3OkVwizDeemjxKx7c4XDC3e3GkmECKzx9zkq ctUtNsyVmegBG9jVnrmyfyinxVsFlrWCHACmLEAqxDTePrX6CYjrhtiRKOu+cOSEw09J yo6OxLAmdipQvWfl8khbWDOx+d722t1/DL2K6lI5nOXHOTeChaEOd6gNtB3W3Zq7hJqH Tri8DT0Tq81hEEh2pZ2glQkhTQKFUAqLGXxzIeefotsQizgTf4uNdyrPea6Ww2GmsMbe ghUUp9PCwNz+zCvPTXEsgUqy9xHdq48womrd+wh4QRftmD8vf+iWKp3muy9+pxXJB+Ws FFHg== X-Gm-Message-State: AOAM533G9Ja2yAH/izox/0i90E62dFRnxkFYpyHun9Gq2IR/IwksjnZr seFjETPYrw+qsJ2u5LV0eDWfE+DIGg6hHw== X-Google-Smtp-Source: ABdhPJyuJiTgl+a3tLpS4wCspYmCUdZuY081AZS2r0ztlV5kSyPxRY5PI9x6/sr2ibLbtPZ/jb6XSQ== X-Received: by 2002:a05:6871:20c:b0:e6:8475:947e with SMTP id t12-20020a056871020c00b000e68475947emr7533544oad.239.1650891742059; Mon, 25 Apr 2022 06:02:22 -0700 (PDT) Received: from birita.. ([2804:431:c7ca:4214:b4dd:3339:98d6:1ec0]) by smtp.gmail.com with ESMTPSA id e4-20020a0568301e4400b006054dca5442sm3804278otj.28.2022.04.25.06.02.20 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 25 Apr 2022 06:02:21 -0700 (PDT) From: Adhemerval Zanella To: libc-alpha@sourceware.org Subject: [PATCH v4 9/9] stdlib: Add TLS optimization to arc4random Date: Mon, 25 Apr 2022 10:01:56 -0300 Message-Id: <20220425130156.1062525-10-adhemerval.zanella@linaro.org> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20220425130156.1062525-1-adhemerval.zanella@linaro.org> References: <20220425130156.1062525-1-adhemerval.zanella@linaro.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-12.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_ASCII_DIVIDERS, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2022 13:02:26 -0000 The arc4random state is moved to TCB, so there is no allocation failure. It adds about 592 bytes struct pthread. Now that the state is thread private within a shared struct, the MADV_WIPEONFORK usage is removed. The cipher state reset is done solely by the atfork internal handler. The state is also cleared on thread exit iff it was initialized (so if arc4random is not called it is not touched). Although it is lock-free, arc4random is still not async-signal-safe (the per thread state is not updated On x86_64 using AVX2 it shows a slight better performance: From -------------------------------------------------- arc4random [single-thread] 809.53 arc4random_buf(16) [single-thread] 1242.56 arc4random_buf(32) [single-thread] 1915.90 arc4random_buf(48) [single-thread] 2230.03 arc4random_buf(64) [single-thread] 2429.68 arc4random_buf(80) [single-thread] 2489.70 arc4random_buf(96) [single-thread] 2598.88 arc4random_buf(112) [single-thread] 2699.93 arc4random_buf(128) [single-thread] 2747.31 To MB/s -------------------------------------------------- arc4random [single-thread] 941.54 arc4random_buf(16) [single-thread] 1409.39 arc4random_buf(32) [single-thread] 2056.17 arc4random_buf(48) [single-thread] 2367.13 arc4random_buf(64) [single-thread] 2551.44 arc4random_buf(80) [single-thread] 2601.38 arc4random_buf(96) [single-thread] 2710.21 arc4random_buf(112) [single-thread] 2797.86 arc4random_buf(128) [single-thread] 2846.12 -------------------------------------------------- However it shows a large speed up specially on architecture with most costly atomics. For instance, on a aarch64 Neoverse N1: >From MB/s -------------------------------------------------- arc4random [single-thread] 154.98 arc4random_buf(16) [single-thread] 342.63 arc4random_buf(32) [single-thread] 485.91 arc4random_buf(48) [single-thread] 539.95 arc4random_buf(64) [single-thread] 593.38 arc4random_buf(80) [single-thread] 629.45 arc4random_buf(96) [single-thread] 655.78 arc4random_buf(112) [single-thread] 670.54 arc4random_buf(128) [single-thread] 681.65 -------------------------------------------------- To MB/s -------------------------------------------------- arc4random [single-thread] 335.94 arc4random_buf(16) [single-thread] 498.69 arc4random_buf(32) [single-thread] 612.24 arc4random_buf(48) [single-thread] 655.77 arc4random_buf(64) [single-thread] 691.97 arc4random_buf(80) [single-thread] 701.68 arc4random_buf(96) [single-thread] 710.35 arc4random_buf(112) [single-thread] 714.23 arc4random_buf(128) [single-thread] 722.13 -------------------------------------------------- Checked on x86_64-linux-gnu. --- nptl/allocatestack.c | 5 +- stdlib/arc4random.c | 137 +++++++------------------ stdlib/arc4random.h | 45 ++++++++ stdlib/arc4random_uniform.c | 18 +--- stdlib/chacha20.c | 1 - stdlib/tst-arc4random-chacha20.c | 2 +- sysdeps/generic/tls-internal-struct.h | 3 + sysdeps/unix/sysv/linux/tls-internal.h | 22 +++- 8 files changed, 111 insertions(+), 122 deletions(-) create mode 100644 stdlib/arc4random.h diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c index 01a282f3f6..ada65d40c2 100644 --- a/nptl/allocatestack.c +++ b/nptl/allocatestack.c @@ -32,6 +32,7 @@ #include #include #include +#include /* Default alignment of stack. */ #ifndef STACK_ALIGN @@ -127,7 +128,7 @@ get_cached_stack (size_t *sizep, void **memp) result->exiting = false; __libc_lock_init (result->exit_lock); - result->tls_state = (struct tls_internal_t) { 0 }; + __glibc_tls_internal_init (&result->tls_state); /* Clear the DTV. */ dtv_t *dtv = GET_DTV (TLS_TPADJ (result)); @@ -559,6 +560,8 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp, #endif pd->robust_head.list = &pd->robust_head; + __glibc_tls_internal_init (&pd->tls_state); + /* We place the thread descriptor at the end of the stack. */ *pdp = pd; diff --git a/stdlib/arc4random.c b/stdlib/arc4random.c index 19a486405e..f14b597c5e 100644 --- a/stdlib/arc4random.c +++ b/stdlib/arc4random.c @@ -16,14 +16,15 @@ License along with the GNU C Library; if not, see . */ +#include #include -#include #include #include #include #include #include #include +#include /* arc4random keeps two counters: 'have' is the current valid bytes not yet consumed in 'buf' while 'count' is the maximum number of bytes until a @@ -36,42 +37,16 @@ optimized implementations (where it usually operates on multiple blocks). */ -/* Maximum number bytes until reseed (16 MB). */ -#define CHACHA20_RESEED_SIZE (16 * 1024 * 1024) -/* Internal buffer size in bytes (512B). */ -#define CHACHA20_BUFSIZE (8 * CHACHA20_BLOCK_SIZE) - #include -static struct arc4random_state -{ - uint32_t ctx[CHACHA20_STATE_LEN]; - size_t have; - size_t count; - uint8_t buf[CHACHA20_BUFSIZE]; -} *state; - -/* Indicate that MADV_WIPEONFORK is supported by the kernel and thus - it does not require to clear the internal state. */ -static bool __arc4random_wipeonfork = false; - -__libc_lock_define_initialized (, __arc4random_lock); - -/* Called from the fork function to reset the state if MADV_WIPEONFORK is - not supported and to reinit the internal lock. */ +/* Called from the fork function to reset the state. */ void __arc4random_fork_subprocess (void) { - if (!__arc4random_wipeonfork && state != NULL) - memset (state, 0, sizeof (struct arc4random_state)); - - __libc_lock_init (__arc4random_lock); -} - -static void -arc4random_allocate_failure (void) -{ - __libc_fatal ("Fatal glibc error: Cannot allocate memory for arc4random\n"); + struct arc4random_state *state = &__glibc_tls_internal()->rnd_state; + memset (state, 0, sizeof (struct arc4random_state)); + /* Force key init. */ + state->count = -1; } static void @@ -80,33 +55,10 @@ arc4random_getrandom_failure (void) __libc_fatal ("Fatal glibc error: Cannot get entropy for arc4random\n"); } -/* If the kernel supports MADV_WIPEONFORK it is used to provide fork - detection. Otherwise, the state is resetied with an atfork handler. The - latter does not handle direct clone calls, nor vfork or _Fork (arc4random - is not async-signal-safe due to the internal lock usage). */ -static void -arc4random_init (uint8_t *buf) -{ - state = __mmap (NULL, sizeof (struct arc4random_state), - PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); - if (state == MAP_FAILED) - arc4random_allocate_failure (); - -#ifdef MADV_WIPEONFORK - int r = __madvise (state, sizeof (struct arc4random_state), MADV_WIPEONFORK); - if (r == 0) - __arc4random_wipeonfork = true; - else if (errno != EINVAL) - arc4random_allocate_failure (); -#endif - - chacha20_init (state->ctx, buf, buf + CHACHA20_KEY_SIZE); -} - #define min(x,y) (((x) > (y)) ? (y) : (x)) static void -arc4random_rekey (uint8_t *rnd, size_t rndlen) +arc4random_rekey (struct arc4random_state *state, uint8_t *rnd, size_t rndlen) { chacha20_crypt (state->ctx, state->buf, state->buf, sizeof state->buf); @@ -150,41 +102,41 @@ arc4random_getentropy (uint8_t *rnd, size_t len) arc4random_getrandom_failure (); } -/* Either allocates the state buffer or reinit it by reseeding the cipher - state with kernel entropy. */ -static void -arc4random_stir (void) +/* Reinit the thread context by reseeding the cipher state with kernel + entropy. */ +static struct arc4random_state * +arc4random_check_stir (size_t len) { - uint8_t rnd[CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE]; - arc4random_getentropy (rnd, sizeof rnd); + struct arc4random_state *state = &__glibc_tls_internal()->rnd_state; - if (state == NULL) - arc4random_init (rnd); - else - arc4random_rekey (rnd, sizeof rnd); + if (state->count < len || state->count == -1) + { + uint8_t rnd[CHACHA20_KEY_SIZE + CHACHA20_IV_SIZE]; + arc4random_getentropy (rnd, sizeof rnd); - explicit_bzero (rnd, sizeof rnd); + if (state->count == -1) + chacha20_init (state->ctx, rnd, rnd + CHACHA20_KEY_SIZE); + else + arc4random_rekey (state, rnd, sizeof rnd); - state->have = 0; - memset (state->buf, 0, sizeof state->buf); - state->count = CHACHA20_RESEED_SIZE; -} + explicit_bzero (rnd, sizeof rnd); -static void -arc4random_check_stir (size_t len) -{ - if (state == NULL || state->count < len) - arc4random_stir (); + state->have = 0; + memset (state->buf, 0, sizeof state->buf); + state->count = CHACHA20_RESEED_SIZE; + } if (state->count <= len) state->count = 0; else state->count -= len; + + return state; } void -__arc4random_buf_internal (void *buffer, size_t len) +__arc4random_buf (void *buffer, size_t len) { - arc4random_check_stir (len); + struct arc4random_state *state = arc4random_check_stir (len); while (len > 0) { @@ -199,29 +151,20 @@ __arc4random_buf_internal (void *buffer, size_t len) state->have -= m; } if (state->have == 0) - arc4random_rekey (NULL, 0); + arc4random_rekey (state, NULL, 0); } } - -void -__arc4random_buf (void *buffer, size_t len) -{ - __libc_lock_lock (__arc4random_lock); - __arc4random_buf_internal (buffer, len); - __libc_lock_unlock (__arc4random_lock); -} libc_hidden_def (__arc4random_buf) weak_alias (__arc4random_buf, arc4random_buf) - -static uint32_t -__arc4random_internal (void) +uint32_t +__arc4random (void) { uint32_t r; - arc4random_check_stir (sizeof (uint32_t)); + struct arc4random_state *state = arc4random_check_stir (sizeof (uint32_t)); if (state->have < sizeof (uint32_t)) - arc4random_rekey (NULL, 0); + arc4random_rekey (state, NULL, 0); uint8_t *ks = state->buf + sizeof (state->buf) - state->have; memcpy (&r, ks, sizeof (uint32_t)); memset (ks, 0, sizeof (uint32_t)); @@ -229,15 +172,5 @@ __arc4random_internal (void) return r; } - -uint32_t -__arc4random (void) -{ - uint32_t r; - __libc_lock_lock (__arc4random_lock); - r = __arc4random_internal (); - __libc_lock_unlock (__arc4random_lock); - return r; -} libc_hidden_def (__arc4random) weak_alias (__arc4random, arc4random) diff --git a/stdlib/arc4random.h b/stdlib/arc4random.h new file mode 100644 index 0000000000..a1c97bdb84 --- /dev/null +++ b/stdlib/arc4random.h @@ -0,0 +1,45 @@ +/* Arc4random definition used on TLS. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef _CHACHA20_H +#define _CHACHA20_H + +#include +#include + +/* Internal ChaCha20 state. */ +#define CHACHA20_STATE_LEN 16 +#define CHACHA20_BLOCK_SIZE 64 + +/* Maximum number bytes until reseed (16 MB). */ +#define CHACHA20_RESEED_SIZE (16 * 1024 * 1024) + +/* Internal arc4random buffer, used on each feedback step so offer some + backtracking protection and to allow better used of vectorized + chacha20 implementations. */ +#define CHACHA20_BUFSIZE (8 * CHACHA20_BLOCK_SIZE) + +struct arc4random_state +{ + uint32_t ctx[CHACHA20_STATE_LEN]; + size_t have; + size_t count; + uint8_t buf[CHACHA20_BUFSIZE]; +}; + +#endif diff --git a/stdlib/arc4random_uniform.c b/stdlib/arc4random_uniform.c index 67ef5a4218..643ef856d7 100644 --- a/stdlib/arc4random_uniform.c +++ b/stdlib/arc4random_uniform.c @@ -46,11 +46,11 @@ random_bytes (uint32_t *result, uint32_t byte_count) unsigned char *ptr = (unsigned char *) result; if (__BYTE_ORDER == __BIG_ENDIAN) ptr += 4 - byte_count; - __arc4random_buf_internal (ptr, byte_count); + __arc4random_buf (ptr, byte_count); } -static uint32_t -compute_uniform (uint32_t n) +uint32_t +__arc4random_uniform (uint32_t n) { if (n <= 1) /* There is no valid return value for a zero limit, and 0 is the @@ -136,17 +136,5 @@ compute_uniform (uint32_t n) and v < 2n < 2**33. */ } } - -__libc_lock_define (extern , __arc4random_lock attribute_hidden) - -uint32_t -__arc4random_uniform (uint32_t upper_bound) -{ - uint32_t r; - __libc_lock_lock (__arc4random_lock); - r = compute_uniform (upper_bound); - __libc_lock_unlock (__arc4random_lock); - return r; -} libc_hidden_def (__arc4random_uniform) weak_alias (__arc4random_uniform, arc4random_uniform) diff --git a/stdlib/chacha20.c b/stdlib/chacha20.c index e09c782c7b..87b2b0f423 100644 --- a/stdlib/chacha20.c +++ b/stdlib/chacha20.c @@ -26,7 +26,6 @@ #define CHACHA20_IV_SIZE 16 #define CHACHA20_KEY_SIZE 32 -#define CHACHA20_BLOCK_SIZE 64 #define CHACHA20_BLOCK_WORDS (CHACHA20_BLOCK_SIZE / sizeof (uint32_t)) #define CHACHA20_STATE_LEN 16 diff --git a/stdlib/tst-arc4random-chacha20.c b/stdlib/tst-arc4random-chacha20.c index dd0ef6d8ba..614e6e0736 100644 --- a/stdlib/tst-arc4random-chacha20.c +++ b/stdlib/tst-arc4random-chacha20.c @@ -16,11 +16,11 @@ License along with the GNU C Library; if not, see . */ +#include #include #include /* It does not define CHACHA20_XOR_FINAL to check what glibc actual uses. */ -#define CHACHA20_BUFSIZE (8 * CHACHA20_BLOCK_SIZE) #include static int diff --git a/sysdeps/generic/tls-internal-struct.h b/sysdeps/generic/tls-internal-struct.h index d76c715a96..5d0e2fba53 100644 --- a/sysdeps/generic/tls-internal-struct.h +++ b/sysdeps/generic/tls-internal-struct.h @@ -19,10 +19,13 @@ #ifndef _TLS_INTERNAL_STRUCT_H #define _TLS_INTERNAL_STRUCT_H 1 +#include + struct tls_internal_t { char *strsignal_buf; char *strerror_l_buf; + struct arc4random_state rnd_state; }; #endif diff --git a/sysdeps/unix/sysv/linux/tls-internal.h b/sysdeps/unix/sysv/linux/tls-internal.h index f7a1a62135..8e90384089 100644 --- a/sysdeps/unix/sysv/linux/tls-internal.h +++ b/sysdeps/unix/sysv/linux/tls-internal.h @@ -22,6 +22,19 @@ #include #include +static inline void +__glibc_tls_internal_init (struct tls_internal_t *tls_state) +{ + tls_state->strsignal_buf = NULL; + tls_state->strerror_l_buf = NULL; + + /* Force key init on created threads. There is no need to clear the + initial state since it will be done either by allocation a new + stack (through mmap with MAP_ANONYMOUS) or by the free function + below). */ + tls_state->rnd_state.count = -1; +} + static inline struct tls_internal_t * __glibc_tls_internal (void) { @@ -31,8 +44,13 @@ __glibc_tls_internal (void) static inline void __glibc_tls_internal_free (void) { - free (THREAD_SELF->tls_state.strsignal_buf); - free (THREAD_SELF->tls_state.strerror_l_buf); + struct pthread *self = THREAD_SELF; + free (self->tls_state.strsignal_buf); + free (self->tls_state.strerror_l_buf); + if (self->tls_state.rnd_state.count != -1) + /* Clear any lingering random state prior so if the thread stack is + cached it won't leak any data. */ + memset (&self->tls_state.rnd_state, 0, sizeof self->tls_state.rnd_state); } #endif -- 2.34.1