From: Noah Goldstein
Date: Thu, 14 Apr 2022 12:20:20 -0500
Subject: Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
To: Adhemerval Zanella
Cc: GNU C Library <libc-alpha@sourceware.org>

On Thu, Apr 14, 2022 at 12:17 PM Adhemerval Zanella wrote:
>
>
> On 13/04/2022 20:04, Noah Goldstein wrote:
> > On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> > wrote:
> >>
> >> +        .text
> >
> > section avx2
>
> Ack, I changed to '.section .text.avx2, "ax", @progbits'.
>
> >> +        .align 32
> >> +chacha20_data:
> >> +L(shuf_rol16):
> >> +        .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
> >> +L(shuf_rol8):
> >> +        .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
> >> +L(inc_counter):
> >> +        .byte 0,1,2,3,4,5,6,7
> >> +L(unsigned_cmp):
> >> +        .long 0x80000000
> >> +
> >> +ENTRY (__chacha20_avx2_blocks8)
> >> +        /* input:
> >> +         *      %rdi: input
> >> +         *      %rsi: dst
> >> +         *      %rdx: src
> >> +         *      %rcx: nblks (multiple of 8)
> >> +         */
> >> +        vzeroupper;
> >
> > vzeroupper needs to be replaced with VZEROUPPER_RETURN
> > and we need a transaction safe version unless this can never
> > be called during a transaction.
>
> I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
> test case failures). What do you mean by a 'transaction safe version'?
> An extra __chacha20_avx2_blocks8 implementation to handle it? Or disable
> it if RTM is enabled?

For now you can just update the cpufeature check to do ssse3 if RTM is
enabled.

>
> >> +
> >> +        /* clear the used vector registers and stack */
> >> +        vpxor X0, X0, X0;
> >> +        vmovdqa X0, (STACK_VEC_X12)(%rsp);
> >> +        vmovdqa X0, (STACK_VEC_X13)(%rsp);
> >> +        vmovdqa X0, (STACK_TMP)(%rsp);
> >> +        vmovdqa X0, (STACK_TMP1)(%rsp);
> >> +        vzeroall;
> >
> > Do you need vzeroall?
> > Why not vzeroupper?
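The fallback suggested above (treat AVX2 as unusable when RTM is enabled, so
the kernel that executes vzeroupper/vzeroall never runs inside a hardware
transaction) could be sketched roughly like this. The struct and enum names
here are made up for illustration; the real code would test
CPU_FEATURE_USABLE_P on the result of __get_cpu_features ():

```c
#include <stdbool.h>

/* Hypothetical stand-ins for glibc's CPU feature bits.  */
struct cpu_flags
{
  bool avx2;
  bool ssse3;
  bool rtm;
};

enum chacha20_impl
{
  CHACHA20_GENERIC,
  CHACHA20_SSSE3,
  CHACHA20_AVX2,
};

/* Select a ChaCha20 kernel.  When RTM is enabled the AVX2 kernel is
   skipped, because its vzeroupper/vzeroall would abort a transaction
   that happened to be active when the function was called.  */
static enum chacha20_impl
chacha20_select (const struct cpu_flags *f)
{
  if (f->avx2 && !f->rtm)
    return CHACHA20_AVX2;
  if (f->ssse3)
    return CHACHA20_SSSE3;
  return CHACHA20_GENERIC;
}
```

This mirrors how glibc's string routines ship separate -rtm variants; here the
cheaper option is simply to prefer the SSSE3 path on RTM machines.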
> > Is it a security concern to leave info in the xmm pieces?
>
> I would assume so, since it is in the original libgcrypt optimization. As
> for the ssse3 version, I am not sure if we really need that level of
> hardening, but it would be good to have the initial revision as close
> as possible to libgcrypt.

Got it.

>
> >> +
> >> +        /* eax zeroed by round loop. */
> >> +        leave;
> >> +        cfi_adjust_cfa_offset(-8)
> >> +        cfi_def_cfa_register(%rsp);
> >> +        ret;
> >> +        int3;
> >
> > Why do we need int3 here?
>
> I think the ssse3 answer applies here as well.
>
> >> +END(__chacha20_avx2_blocks8)
> >> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
> >> index 37a4fdfb1f..7e9e7755f3 100644
> >> --- a/sysdeps/x86_64/chacha20_arch.h
> >> +++ b/sysdeps/x86_64/chacha20_arch.h
> >> @@ -22,11 +22,25 @@
> >>
> >>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
> >>                                         const uint8_t *src, size_t nblks);
> >> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
> >> +                                      const uint8_t *src, size_t nblks);
> >>
> >>  static inline void
> >>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
> >>                  size_t bytes)
> >>  {
> >> +  const struct cpu_features* cpu_features = __get_cpu_features ();
> >
> > Can we do this with an ifunc and take the cpufeature check off the critical
> > path?
>
> Ditto.
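The point of the ifunc suggestion is to hoist the feature query out of the
per-call path: resolve the kernel once, then jump through the resolved pointer.
A rough, portable illustration of the same idea using a lazily cached function
pointer (not glibc's actual IFUNC machinery, and all names here are
hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Kernel signature matching the patch's __chacha20_*_blocks8 functions.  */
typedef unsigned int (*chacha20_fn) (uint32_t *state, uint8_t *dst,
                                     const uint8_t *src, size_t nblks);

/* Placeholder for the portable implementation.  */
static unsigned int
chacha20_generic (uint32_t *state, uint8_t *dst, const uint8_t *src,
                  size_t nblks)
{
  (void) state; (void) dst; (void) src; (void) nblks;
  return 0;
}

/* With a real GNU ifunc this resolver would run once at relocation time
   and inspect CPU features to return the AVX2 or SSSE3 kernel; this
   sketch always picks the generic one.  */
static chacha20_fn
chacha20_resolver (void)
{
  return chacha20_generic;
}

static chacha20_fn chacha20_impl;  /* cached after the first call */

static unsigned int
chacha20_blocks (uint32_t *state, uint8_t *dst, const uint8_t *src,
                 size_t nblks)
{
  if (chacha20_impl == NULL)
    chacha20_impl = chacha20_resolver ();
  return chacha20_impl (state, dst, src, nblks);
}
```

After the first call every subsequent call is a plain indirect jump, with no
__get_cpu_features () lookup on the critical path.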
>
> >> +
> >> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
> >> +    {
> >> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> >> +      nblocks -= nblocks % 8;
> >> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
> >> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
> >> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
> >> +      src += nblocks * CHACHA20_BLOCK_SIZE;
> >> +    }
> >> +
> >>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
> >>      {
> >>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
> >> --
> >> 2.32.0
> >>
> >
> > Do you want optimization comments or do that later?
>
> Ideally I would like to check if the proposed arc4random implementation
> is what we want (with the current approach of using atfork handlers and the
> key reschedule). The cipher itself is not of utmost importance, in the
> sense that it is transparent to the user and we can eventually replace it
> if there is any issue with or attack on ChaCha20. Initially I won't add
> any arch-specific optimization, but since libgcrypt provides some that fit
> the current approach I thought it would be a nice thing to have.
>
> For optimization comments it would be good to sync with libgcrypt as well;
> I think the project will be interested in any performance improvements
> you might have for the chacha implementations.

Okay, I'll probably take a stab at this in the not too distant future.
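For reference, the splitting arithmetic in the quoted hunk (round the block
count down to a multiple of 8, hand that to the wide kernel, and leave the
tail for the narrower paths) can be checked in isolation. A minimal sketch,
assuming the usual 64-byte ChaCha20 block size; the helper name is made up:

```c
#include <stddef.h>

#define CHACHA20_BLOCK_SIZE 64

/* Mirror of the patch's splitting logic: given a byte count, return how
   many blocks the 8-block AVX2 kernel should process, and store in
   *remaining the bytes left over for the narrower kernels.  */
static size_t
split_for_blocks8 (size_t bytes, size_t *remaining)
{
  size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
  nblocks -= nblocks % 8;   /* kernel requires a multiple of 8 blocks */
  *remaining = bytes - nblocks * CHACHA20_BLOCK_SIZE;
  return nblocks;
}
```

So for 1000 input bytes the AVX2 path consumes 8 blocks (512 bytes) and 488
bytes fall through to the SSSE3/generic code, matching the pointer and byte
adjustments in the hunk.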