From: Adhemerval Zanella
To: Noah Goldstein
Cc: GNU C Library
Subject: Re: [PATCH 5/7] x86: Add AVX2 optimized chacha20
Date: Thu, 14 Apr 2022 14:16:55 -0300
References: <20220413202401.408267-1-adhemerval.zanella@linaro.org> <20220413202401.408267-6-adhemerval.zanella@linaro.org>
List-Id: Libc-alpha mailing list

On 13/04/2022 20:04, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> wrote:
>>
>> +       .text
>
> section avx2
>

Ack, I changed to '.section .text.avx2, "ax", @progbits'.

>> +       .align 32
>> +chacha20_data:
>> +L(shuf_rol16):
>> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
>> +L(shuf_rol8):
>> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
>> +L(inc_counter):
>> +       .byte 0,1,2,3,4,5,6,7
>> +L(unsigned_cmp):
>> +       .long 0x80000000
>> +
>> +ENTRY (__chacha20_avx2_blocks8)
>> +       /* input:
>> +        *      %rdi: input
>> +        *      %rsi: dst
>> +        *      %rdx: src
>> +        *      %rcx: nblks (multiple of 8)
>> +        */
>> +       vzeroupper;
>
> vzeroupper needs to be replaced with VZEROUPPER_RETURN
> and we need a transaction safe version unless this can never
> be called during a transaction.

I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
test case failures). What do you mean by a 'transaction safe version'?
An extra __chacha20_avx2_blocks8 implementation to handle it? Or disabling
it if RTM is enabled?

>> +
>> +       /* clear the used vector registers and stack */
>> +       vpxor X0, X0, X0;
>> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
>> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
>> +       vmovdqa X0, (STACK_TMP)(%rsp);
>> +       vmovdqa X0, (STACK_TMP1)(%rsp);
>> +       vzeroall;
>
> Do you need vzeroall?
> Why not vzeroupper? Is it a security concern to leave info in the xmm pieces?

I would assume so, since it is in the original libgcrypt optimization. As
with the ssse3 version, I am not sure we really need that level of
hardening, but it would be good to have the initial revision as close as
possible to libgcrypt.

>
>
>> +
>> +       /* eax zeroed by round loop.  */
>> +       leave;
>> +       cfi_adjust_cfa_offset(-8)
>> +       cfi_def_cfa_register(%rsp);
>> +       ret;
>> +       int3;
>
> Why do we need int3 here?

I think the ssse3 discussion applies here as well.

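For readers following the quoted L(shuf_rol16) and L(shuf_rol8) tables: they
look like byte-shuffle masks that rotate each little-endian 32-bit word left
by 16 and by 8 bits. The sketch below is a scalar C model of that idea, not
the patch's code. shuffle16() is a simplified stand-in for a pshufb-style
byte shuffle, and every function name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask bytes copied from the quoted tables. */
static const uint8_t shuf_rol16[16] = {2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13};
static const uint8_t shuf_rol8[16]  = {3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14};

/* Rotate a 32-bit word left by n bits, 0 < n < 32. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
  return (v << n) | (v >> (32 - n));
}

/* Simplified model of a 16-byte shuffle: out[i] = in[mask[i]].
   Assumption: the "zero the byte if the mask's high bit is set" case
   of the real instruction is ignored, since these masks never use it. */
static void shuffle16(uint8_t out[16], const uint8_t in[16],
                      const uint8_t mask[16])
{
  for (int i = 0; i < 16; i++)
    out[i] = in[mask[i] & 15];
}

/* Apply a mask to four 32-bit words laid out little-endian in memory. */
static void shuffle_words(uint32_t out[4], const uint32_t in[4],
                          const uint8_t mask[16])
{
  uint8_t a[16], b[16];
  for (int w = 0; w < 4; w++)
    for (int k = 0; k < 4; k++)
      a[4 * w + k] = (uint8_t) (in[w] >> (8 * k));
  shuffle16(b, a, mask);
  for (int w = 0; w < 4; w++)
    {
      out[w] = 0;
      for (int k = 0; k < 4; k++)
        out[w] |= (uint32_t) b[4 * w + k] << (8 * k);
    }
}

/* Check that each mask matches the corresponding rotate on sample words. */
static void check_masks(void)
{
  const uint32_t in[4] = {0x11223344u, 0x00000001u, 0x80000000u, 0xdeadbeefu};
  uint32_t r16[4], r8[4];
  shuffle_words(r16, in, shuf_rol16);
  shuffle_words(r8, in, shuf_rol8);
  for (int w = 0; w < 4; w++)
    {
      assert(r16[w] == rotl32(in[w], 16));
      assert(r8[w] == rotl32(in[w], 8));
    }
}
```

Calling check_masks() from a small test program exercises both masks. A
byte shuffle can only express rotates by a multiple of 8 bits, so
ChaCha20's other two rotate amounts (12 and 7) would need shifts instead.
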
>> +END(__chacha20_avx2_blocks8)
>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>> index 37a4fdfb1f..7e9e7755f3 100644
>> --- a/sysdeps/x86_64/chacha20_arch.h
>> +++ b/sysdeps/x86_64/chacha20_arch.h
>> @@ -22,11 +22,25 @@
>>
>>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>>                                         const uint8_t *src, size_t nblks);
>> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
>> +                                      const uint8_t *src, size_t nblks);
>>
>>  static inline void
>>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>>                 size_t bytes)
>>  {
>> +  const struct cpu_features* cpu_features = __get_cpu_features ();
>
> Can we do this with an ifunc and take the cpufeature check off the critical
> path?

Ditto.

>> +
>> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
>> +    {
>> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> +      nblocks -= nblocks % 8;
>> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
>> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
>> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
>> +      src += nblocks * CHACHA20_BLOCK_SIZE;
>> +    }
>> +
>>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>>      {
>>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> --
>> 2.32.0
>>
>
> Do you want optimization comments or do that later?

Ideally I would like to first check whether the proposed arc4random
implementation is what we want (with the current approach of using atfork
handlers and the key reschedule). The cipher itself is not of the utmost
importance, in the sense that it is transparent to the user and we can
eventually replace it if there is any issue with or attack on ChaCha20.

Initially I was not going to add any arch-specific optimization, but since
libgcrypt provides some that fit the current approach, I thought it would
be a nice thing to have.

For optimization comments, it would be good to sync with libgcrypt as well;
I think the project will be interested in any performance improvement you
might have for the chacha implementations.
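
To make the size arithmetic in the quoted chacha20_crypt hunk easier to
follow, here is a minimal standalone sketch of how a request might be split
between the AVX2 path, the SSSE3 path, and a leftover tail. It is an
illustration under stated assumptions, not the patch itself: struct split
and split_request() are hypothetical names, the feature flags are plain
parameters instead of CPU_FEATURE_USABLE_P checks, CHACHA20_BLOCK_SIZE is
assumed to be 64 (the ChaCha20 block size), and the rounding to a multiple
of 4 for SSSE3 is assumed because the quoted hunk is cut off before that
line.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: matches the patch's macro; a ChaCha20 block is 64 bytes. */
#define CHACHA20_BLOCK_SIZE 64

/* Hypothetical result type: how many blocks each path would handle. */
struct split
{
  size_t avx2_blocks;   /* multiple of 8, handled by the AVX2 routine */
  size_t ssse3_blocks;  /* multiple of 4 (assumed), handled by SSSE3 */
  size_t tail_bytes;    /* whatever is left for the generic code */
};

static struct split split_request(size_t bytes, int have_avx2, int have_ssse3)
{
  struct split s = {0, 0, 0};

  if (have_avx2 && bytes >= CHACHA20_BLOCK_SIZE * 8)
    {
      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
      nblocks -= nblocks % 8;             /* round down to a multiple of 8 */
      s.avx2_blocks = nblocks;
      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
    }

  if (have_ssse3 && bytes >= CHACHA20_BLOCK_SIZE * 4)
    {
      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
      nblocks -= nblocks % 4;             /* assumed rounding, see lead-in */
      s.ssse3_blocks = nblocks;
      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
    }

  s.tail_bytes = bytes;
  return s;
}
```

For example, a 1000-byte request with both features available would give 8
blocks to AVX2, 4 blocks to SSSE3, and leave a 232-byte tail. Keeping the
feature checks as parameters only serves the illustration; whether the
selection should happen inline or through an ifunc is the open question in
the review above.
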