From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw1-x1133.google.com (mail-yw1-x1133.google.com [IPv6:2607:f8b0:4864:20::1133]) by sourceware.org (Postfix) with ESMTPS id 2E2EF3858C53 for ; Thu, 14 Apr 2022 17:22:52 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2E2EF3858C53 Received: by mail-yw1-x1133.google.com with SMTP id 00721157ae682-2ec42eae76bso62327417b3.10 for ; Thu, 14 Apr 2022 10:22:52 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=nlKusynx9HYpR5eO4uDtdYqAo4NJtysyObr34SH08A0=; b=gdJjCKZCC3HfJg8DecScvn3DWCgD3H/WkKomup+VnN0VqcPus9/Y3PRIc0uVa3X9/v FOcAMzxxtpDWvdP9ghtQxFoIdjveTzLLLha/FR3C0xKNGC9vnmhF9SdHMJwKkXqN0vUV +AxnHIkpQ0M11jcgrk9zJOBj/x6MTfOOyrk7CfrRNaJk7bKNkcDWDFhBMZAT+KL1z7BN iDApaNZsyrd6G9YCTfXwqZkvCt7VsK1TH+wfSGaMn2R3AXs5G9GI3f9/fsfI5qNx/tTX p6HbshdlMdAM/tIASzSLR+AIhFNxHm+Tieg81WBAdIzECMf9rk/QpZTLb7ypkqR/zQa8 3mKQ== X-Gm-Message-State: AOAM532d6umZFOtvtDkDuldhPuNUWSQpCdWEzRY9iExuqltZRwN1JVO9 uEioV3jVvQiSOwgoyqppQRpAztHQ5To/q/mZMlJ84qmU X-Google-Smtp-Source: ABdhPJwYBOnfmc/V9qRB9YDKg6nyy8v3B8j0dltfUA7SH/aOpJ/AdEizupm4uYVtCeCB3k0W9E+1Xl2thc/7uDegXcI= X-Received: by 2002:a81:ad47:0:b0:2ee:927d:ff39 with SMTP id l7-20020a81ad47000000b002ee927dff39mr2911915ywk.249.1649956971672; Thu, 14 Apr 2022 10:22:51 -0700 (PDT) MIME-Version: 1.0 References: <20220413202401.408267-1-adhemerval.zanella@linaro.org> <20220413202401.408267-5-adhemerval.zanella@linaro.org> In-Reply-To: From: Noah Goldstein Date: Thu, 14 Apr 2022 12:22:41 -0500 Message-ID: Subject: Re: [PATCH 4/7] x86: Add SSSE3 optimized chacha20 To: Adhemerval Zanella Cc: GNU C Library Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-8.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_NUMSUBJECT, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Apr 2022 17:22:54 -0000 On Thu, Apr 14, 2022 at 12:19 PM Adhemerval Zanella wrote: > > > > On 14/04/2022 14:10, Noah Goldstein wrote: > > On Thu, Apr 14, 2022 at 12:03 PM Adhemerval Zanella > > wrote: > >> > >> > >> > >> On 13/04/2022 20:12, Noah Goldstein wrote: > >>> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha > >>> wrote: > >>>> > >>>> + > >>>> + /* eax zeroed by round loop. */ > >>>> + leave; > >>>> + cfi_adjust_cfa_offset(-8) > >>>> + cfi_def_cfa_register(%rsp); > >>>> + ret; > >>>> + int3; > >>> why int3? > >> > >> It was originally added on libgcrypt by 11ade08efbfbc36dbf3571f1026946269950bc40, > >> as a straight-line speculation hardening. It is was is emitted by clang 14 and > >> gcc 12 with -mharden-sls=return. > >> > >> I am not sure if we need that kind of hardening, but I would prefer to the first > >> version be in sync with libgcrypt as much as possible so the future optimizations > >> would be simpler to keep localized to glibc (if libgcrypt does not want to > >> backport it). > > > > Okay, can keep for now. Any thoughts on changing it to sse2? > > > > No strong feeling, I used the ssse3 one because it is readily available from > libgcrypt. I think the only ssse3 is `pshufb` so you can just replace the optimized rotates with the shift rotates and that will make it sse2 (unless I'm missing an instruction). Also can you add the proper .text section here as well (or .sse2 or .ssse3) > > > > >> > >>>> +END (__chacha20_ssse3_blocks8) > >>>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h > >>>> new file mode 100644 > >>>> index 0000000000..37a4fdfb1f > >>>> --- /dev/null > >>>> +++ b/sysdeps/x86_64/chacha20_arch.h > >>>> @@ -0,0 +1,42 @@ > >>>> +/* Chacha20 implementation, used on arc4random. > >>>> + Copyright (C) 2022 Free Software Foundation, Inc. > >>>> + This file is part of the GNU C Library. > >>>> + > >>>> + The GNU C Library is free software; you can redistribute it and/or > >>>> + modify it under the terms of the GNU Lesser General Public > >>>> + License as published by the Free Software Foundation; either > >>>> + version 2.1 of the License, or (at your option) any later version. > >>>> + > >>>> + The GNU C Library is distributed in the hope that it will be useful, > >>>> + but WITHOUT ANY WARRANTY; without even the implied warranty of > >>>> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > >>>> + Lesser General Public License for more details. > >>>> + > >>>> + You should have received a copy of the GNU Lesser General Public > >>>> + License along with the GNU C Library; if not, see > >>>> + . */ > >>>> + > >>>> +#include > >>>> +#include > >>>> +#include > >>>> + > >>>> +unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst, > >>>> + const uint8_t *src, size_t nblks); > >>>> + > >>>> +static inline void > >>>> +chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src, > >>>> + size_t bytes) > >>>> +{ > >>>> + if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4) > >>> > >>> Can we make this an ifunc? > >> > >> I though about it, but if you check on arc4random implementation the > >> chacha20_crypt is called for the whole internal buf once it is exhausted. > >> Assuming a 1 cycle per byte (as indicated by bench-slope libgrcypt on > >> my machine), it will be at least 1k cycles to encrypt each block. I > >> am not sure if setting up an internal PLT call to save a couple of cycles > >> on a internal function will really show anything significant here (assuming > >> that the PLT call won't add more overhead in fact). > >> > >> Besides that the code boilerplate to setup the internal ifunc is also > >> way more complex. > > > > Okay for now as long as open to changing later (not that we will but that > > this isn't locking us into the decision). > > For sure, if iFUNC does help on this case the change should be simple for > the generic code. The boilerplate is for the x86_64 bits in facts (to > setup the iFUNC resolver, Makefile, etc.).