From: Noah Goldstein
Date: Fri, 21 Jan 2022 15:50:26 -0600
Subject: Re: [libc-coord] Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc
To: Joerg Sonnenberger
Cc: libc-coord@lists.openwall.com, Richard Biener via Gcc, GNU C Library
On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations as with `__memcmpeq`, but there still needs to be some
logic that tracks which comes first: the mismatch or the null
terminator. The savings aren't as large as for `memcmp` vs
`__memcmpeq`, but we can still save some work.
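To make the point above concrete, here is a minimal scalar sketch (not glibc code) of what a boolean-only string comparison still has to track. The name `strcmpeq_ref` is hypothetical, and the contract (0 if equal, nonzero otherwise, mirroring `__memcmpeq`) is an assumption about the proposed ABI. Unlike `strcmp`, it never computes the sign of the byte difference, but it must still stop at whichever comes first: a mismatch or a shared null terminator.

```c
#include <stddef.h>

/* Hypothetical reference model of a boolean-only strcmp; NOT glibc's
   __strcmpeq.  Assumed contract (mirroring __memcmpeq): returns 0 if
   the strings are equal, nonzero otherwise.  */
int strcmpeq_ref(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    for (size_t i = 0;; i++) {
        if (a[i] != b[i])
            return 1;   /* Mismatch came first: not equal.  */
        if (a[i] == '\0')
            return 0;   /* Shared terminator came first: equal.  */
    }
}
```

For example, `strcmpeq_ref("ab", "abc")` returns nonzero because the mismatch at index 2 (`'\0'` vs `'c'`) is reached before any shared terminator.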
Using the x86_64 AVX2-optimized implementation as a reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return path, which checks for equality and
for the null terminator, from:

```
    VMOVU   (%rdi), %ymm0
    VPCMPEQ (%rsi), %ymm0, %ymm1    /* ymm1: bytes where s1 == s2.  */
    VPCMPEQ %ymm0, %ymmZERO, %ymm2  /* ymm2: bytes where s1 == '\0'.  */
    vpandn  %ymm1, %ymm2, %ymm1     /* ymm1: equal and not null.  */
    vpmovmskb %ymm1, %ecx
    incl    %ecx                    /* Zero iff all 32 bytes equal, non-null.  */
    jz      L(keep_going)
    tzcntl  %ecx, %ecx              /* Index of first mismatch or null.  */
    movzbl  (%rdi, %rcx), %eax      /* Reload both bytes at that index.  */
    movzbl  (%rsi, %rcx), %ecx
    subl    %ecx, %eax
    vzeroupper
    ret
```

to:

```
    VMOVU   (%rdi), %ymm0
    VPCMPEQ (%rsi), %ymm0, %ymm1    /* ymm1: bytes where s1 == s2.  */
    VPCMPEQ %ymm0, %ymmZERO, %ymm2  /* ymm2: bytes where s1 == '\0'.  */
    vpandn  %ymm1, %ymm2, %ymm2     /* ymm2: equal and not null; ymm1 kept.  */
    vpmovmskb %ymm2, %ecx
    incl    %ecx
    jz      L(keep_going)
    vpmovmskb %ymm1, %eax           /* eax: equality mask.  */
    blsi    %ecx, %ecx              /* Isolate bit of first mismatch or null.  */
    andn    %ecx, %eax, %eax        /* eax = ecx & ~eax: nonzero iff mismatch.  */
    vzeroupper
    ret
```

Testing this on comparisons where the mismatch or the null terminator
falls within the first 32 bytes (the common case), throughput is about
the same, but latency drops by roughly 20%.

Another benefit is that we can reuse this exact return logic
throughout, since the memory offset of the mismatch is no longer
needed (the bytes are never reloaded). This simplifies the page-cross
logic a great deal and should give a significant code size reduction
for the common usage of strcmp.

I think, though, that I was a bit overoptimistic about the performance
benefits, since I was using `memcmp` vs `__memcmpeq` as the reference.
I'll put together a patch for just `__strcmpeq` and post the results
here. I expect the wide-character versions have more expensive
return-value checks, so if the single-byte versions show a benefit, we
can expect it to carry over.

>
> Joerg
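A C model of the two return paths may help check the mask logic. This is a hypothetical sketch, not glibc code: `block_masks`, `ret_strcmp` and `ret_strcmpeq` are illustrative names, the 32-byte vector compare is modeled with a plain loop instead of AVX2, `__builtin_ctz` assumes GCC or Clang, and `ret_strcmpeq` assumes the same 0-if-equal contract as `__memcmpeq`.

```c
#include <stdint.h>

/* Hypothetical scalar model of the AVX2 masks for one 32-byte block.
   Bit i of *eq is set when s1[i] == s2[i] (VPCMPEQ + vpmovmskb);
   bit i of *zero is set when s1[i] == '\0' (VPCMPEQ against zero).
   Both arrays must be at least 32 bytes long.  */
void block_masks(const unsigned char *s1, const unsigned char *s2,
                 uint32_t *eq, uint32_t *zero)
{
    *eq = 0;
    *zero = 0;
    for (int i = 0; i < 32; i++) {
        if (s1[i] == s2[i])
            *eq |= (uint32_t)1 << i;
        if (s1[i] == '\0')
            *zero |= (uint32_t)1 << i;
    }
}

/* Old path: find the index of the first mismatch or null, then reload
   both bytes and subtract, so the memory offset is required.  */
int ret_strcmp(const unsigned char *s1, const unsigned char *s2,
               uint32_t eq, uint32_t zero, int *keep_going)
{
    uint32_t c = (uint32_t)((eq & ~zero) + 1u); /* vpandn, vpmovmskb, incl */
    *keep_going = (c == 0);     /* All 32 bytes equal and non-null.  */
    if (*keep_going)
        return 0;
    int idx = __builtin_ctz(c); /* tzcntl: first mismatch or null.  */
    return s1[idx] - s2[idx];   /* movzbl, movzbl, subl.  */
}

/* New path: only the masks are needed, nothing is reloaded.  */
uint32_t ret_strcmpeq(uint32_t eq, uint32_t zero, int *keep_going)
{
    uint32_t c = (uint32_t)((eq & ~zero) + 1u);
    *keep_going = (c == 0);
    if (*keep_going)
        return 0;
    uint32_t first = c & (0u - c); /* blsi: isolate the first event.  */
    return first & ~eq;            /* andn: nonzero iff it was a mismatch.  */
}
```

For `"hello"` vs `"help"` (zero-padded to 32 bytes), both paths stop at index 3: the old one returns `'l' - 'p'`, while the new one returns the isolated bit `1u << 3`. For two identical strings, the first event is the shared terminator, whose equality bit is set, so `ret_strcmpeq` returns 0.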