References: <20210503084435.160548-1-goldstein.w.n@gmail.com> <20210503084435.160548-3-goldstein.w.n@gmail.com>
From: Noah Goldstein
Date: Mon, 3 May 2021 16:06:08 -0400
Subject: Re: [PATCH v1 3/3] x86: Optimize memchr-evex.S
To: "H.J. 
Lu"
Cc: GNU C Library, "Carlos O'Donell"

On Mon, May 3, 2021 at 2:58 PM H.J. Lu wrote:
>
> On Mon, May 03, 2021 at 04:44:38AM -0400, Noah Goldstein wrote:
> > No bug. This commit optimizes memchr-evex.S. The optimizations include
> > replacing some branches with cmovcc, avoiding some branches entirely
> > in the less_4x_vec case, making the page cross logic less strict,
> > saving some ALU in the alignment process, and most importantly
> > increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
> > test-wmemchr are all passing.
> >
> > Signed-off-by: Noah Goldstein
> > ---
> > Tests were run on the following CPUs:
> >
> > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
> >
> > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html
> >
> > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
> >
> > All times are the geometric mean of N=20. The unit of time is
> > seconds.
> >
> > "Cur" refers to the current implementation
> > "New" refers to this patch's implementation
> >
> > Note: The numbers for size = [1, 32] are highly dependent on function
> > alignment.
> > That being said, the new implementation, which uses cmovcc
> > instead of a branch for the [1, 32] case (mostly because of the high
> > variance across different alignments), is far more consistent and
> > performs about as well (and should only be a bigger improvement in
> > cases where the sizes / positions are not 100% predictable).
> >
> > For memchr-evex the numbers are a near universal improvement. The case
> > where the current implementation is better is size = 0, and for size =
> > [1, 32] with pos < size the two implementations are about the
> > same. For size = [1, 32] with pos > size, for medium range sizes, and
> > for large sizes, however, the new implementation is faster.
> >
> > Results For Tigerlake memchr-evex
> > size , algn , Pos , Cur T , New T , Win , Dif
> > 2048 , 0 , , 32 5.58 , 5.22 , New , 0.36
> > 256 , 1 , , 64 5.22 , 4.93 , New , 0.29
> > 2048 , 0 , , 64 5.22 , 4.89 , New , 0.33
> > 256 , 2 , , 64 5.14 , 4.81 , New , 0.33
> > 2048 , 0 , , 128 6.3 , 5.67 , New , 0.63
> > 256 , 3 , , 64 5.22 , 4.9 , New , 0.32
> > 2048 , 0 , , 256 11.07 , 10.92 , New , 0.15
> > 256 , 4 , , 64 5.16 , 4.86 , New , 0.3
> > 2048 , 0 , , 512 15.66 , 14.81 , New , 0.85
> > 256 , 5 , , 64 5.15 , 4.84 , New , 0.31
> > 2048 , 0 , , 1024 25.7 , 23.02 , New , 2.68
> > 256 , 6 , , 64 5.12 , 4.89 , New , 0.23
> > 2048 , 0 , , 2048 42.34 , 37.71 , New , 4.63
> > 256 , 7 , , 64 5.03 , 4.62 , New , 0.41
> > 192 , 1 , , 32 4.96 , 4.28 , New , 0.68
> > 256 , 1 , , 32 4.95 , 4.28 , New , 0.67
> > 512 , 1 , , 32 4.94 , 4.29 , New , 0.65
> > 192 , 2 , , 64 5.1 , 4.8 , New , 0.3
> > 512 , 2 , , 64 5.12 , 4.72 , New , 0.4
> > 192 , 3 , , 96 5.54 , 5.12 , New , 0.42
> > 256 , 3 , , 96 5.52 , 5.15 , New , 0.37
> > 512 , 3 , , 96 5.51 , 5.16 , New , 0.35
> > 192 , 4 , , 128 6.1 , 5.53 , New , 0.57
> > 256 , 4 , , 128 6.09 , 5.49 , New , 0.6
> > 512 , 4 , , 128 6.08 , 5.48 , New , 0.6
> > 192 , 5 , , 160 7.42 , 6.71 , New , 0.71
> > 256 , 5 , , 160 6.86 , 6.71 , New , 0.15
> > 512 , 5 , , 
160 9.28 , 8.68 , New , 0.6 > > 192 , 6 , , 192 7.94 , 7.47 , New , 0.47 > > 256 , 6 , , 192 7.62 , 7.17 , New , 0.45 > > 512 , 6 , , 192 9.2 , 9.16 , New , 0.04 > > 192 , 7 , , 224 8.02 , 7.43 , New , 0.59 > > 256 , 7 , , 224 8.34 , 7.85 , New , 0.49 > > 512 , 7 , , 224 9.89 , 9.16 , New , 0.73 > > 2 , 0 , , 1 3.0 , 3.0 , Eq , 0.0 > > 2 , 1 , , 1 3.0 , 3.0 , Eq , 0.0 > > 0 , 0 , , 1 3.01 , 3.6 , Cur , 0.59 > > 0 , 1 , , 1 3.01 , 3.6 , Cur , 0.59 > > 3 , 0 , , 2 3.0 , 3.0 , Eq , 0.0 > > 3 , 2 , , 2 3.0 , 3.0 , Eq , 0.0 > > 1 , 0 , , 2 3.6 , 3.0 , New , 0.6 > > 1 , 2 , , 2 3.6 , 3.0 , New , 0.6 > > 4 , 0 , , 3 3.01 , 3.01 , Eq , 0.0 > > 4 , 3 , , 3 3.01 , 3.01 , Eq , 0.0 > > 2 , 0 , , 3 3.62 , 3.02 , New , 0.6 > > 2 , 3 , , 3 3.62 , 3.03 , New , 0.59 > > 5 , 0 , , 4 3.02 , 3.03 , Cur , 0.01 > > 5 , 4 , , 4 3.02 , 3.02 , Eq , 0.0 > > 3 , 0 , , 4 3.63 , 3.02 , New , 0.61 > > 3 , 4 , , 4 3.63 , 3.04 , New , 0.59 > > 6 , 0 , , 5 3.05 , 3.04 , New , 0.01 > > 6 , 5 , , 5 3.02 , 3.02 , Eq , 0.0 > > 4 , 0 , , 5 3.63 , 3.02 , New , 0.61 > > 4 , 5 , , 5 3.64 , 3.03 , New , 0.61 > > 7 , 0 , , 6 3.03 , 3.03 , Eq , 0.0 > > 7 , 6 , , 6 3.02 , 3.02 , Eq , 0.0 > > 5 , 0 , , 6 3.64 , 3.01 , New , 0.63 > > 5 , 6 , , 6 3.64 , 3.03 , New , 0.61 > > 8 , 0 , , 7 3.03 , 3.04 , Cur , 0.01 > > 8 , 7 , , 7 3.04 , 3.04 , Eq , 0.0 > > 6 , 0 , , 7 3.67 , 3.04 , New , 0.63 > > 6 , 7 , , 7 3.65 , 3.05 , New , 0.6 > > 9 , 0 , , 8 3.05 , 3.05 , Eq , 0.0 > > 7 , 0 , , 8 3.67 , 3.05 , New , 0.62 > > 10 , 0 , , 9 3.06 , 3.06 , Eq , 0.0 > > 10 , 1 , , 9 3.06 , 3.06 , Eq , 0.0 > > 8 , 0 , , 9 3.67 , 3.06 , New , 0.61 > > 8 , 1 , , 9 3.67 , 3.06 , New , 0.61 > > 11 , 0 , , 10 3.06 , 3.06 , Eq , 0.0 > > 11 , 2 , , 10 3.07 , 3.06 , New , 0.01 > > 9 , 0 , , 10 3.67 , 3.05 , New , 0.62 > > 9 , 2 , , 10 3.67 , 3.06 , New , 0.61 > > 12 , 0 , , 11 3.06 , 3.06 , Eq , 0.0 > > 12 , 3 , , 11 3.06 , 3.06 , Eq , 0.0 > > 10 , 0 , , 11 3.67 , 3.06 , New , 0.61 > > 10 , 3 , , 11 3.67 , 3.06 , New , 0.61 > > 13 , 0 , , 
12 3.06 , 3.07 , Cur , 0.01 > > 13 , 4 , , 12 3.06 , 3.07 , Cur , 0.01 > > 11 , 0 , , 12 3.67 , 3.11 , New , 0.56 > > 11 , 4 , , 12 3.68 , 3.12 , New , 0.56 > > 14 , 0 , , 13 3.07 , 3.1 , Cur , 0.03 > > 14 , 5 , , 13 3.06 , 3.07 , Cur , 0.01 > > 12 , 0 , , 13 3.67 , 3.07 , New , 0.6 > > 12 , 5 , , 13 3.67 , 3.08 , New , 0.59 > > 15 , 0 , , 14 3.06 , 3.06 , Eq , 0.0 > > 15 , 6 , , 14 3.07 , 3.06 , New , 0.01 > > 13 , 0 , , 14 3.67 , 3.06 , New , 0.61 > > 13 , 6 , , 14 3.68 , 3.06 , New , 0.62 > > 16 , 0 , , 15 3.06 , 3.06 , Eq , 0.0 > > 16 , 7 , , 15 3.06 , 3.05 , New , 0.01 > > 14 , 0 , , 15 3.68 , 3.06 , New , 0.62 > > 14 , 7 , , 15 3.67 , 3.06 , New , 0.61 > > 17 , 0 , , 16 3.07 , 3.06 , New , 0.01 > > 15 , 0 , , 16 3.68 , 3.06 , New , 0.62 > > 18 , 0 , , 17 3.06 , 3.06 , Eq , 0.0 > > 18 , 1 , , 17 3.06 , 3.06 , Eq , 0.0 > > 16 , 0 , , 17 3.67 , 3.06 , New , 0.61 > > 16 , 1 , , 17 3.67 , 3.05 , New , 0.62 > > 19 , 0 , , 18 3.07 , 3.06 , New , 0.01 > > 19 , 2 , , 18 3.06 , 3.06 , Eq , 0.0 > > 17 , 0 , , 18 3.68 , 3.08 , New , 0.6 > > 17 , 2 , , 18 3.68 , 3.06 , New , 0.62 > > 20 , 0 , , 19 3.06 , 3.06 , Eq , 0.0 > > 20 , 3 , , 19 3.06 , 3.06 , Eq , 0.0 > > 18 , 0 , , 19 3.68 , 3.06 , New , 0.62 > > 18 , 3 , , 19 3.68 , 3.06 , New , 0.62 > > 21 , 0 , , 20 3.06 , 3.06 , Eq , 0.0 > > 21 , 4 , , 20 3.06 , 3.06 , Eq , 0.0 > > 19 , 0 , , 20 3.67 , 3.06 , New , 0.61 > > 19 , 4 , , 20 3.67 , 3.06 , New , 0.61 > > 22 , 0 , , 21 3.06 , 3.06 , Eq , 0.0 > > 22 , 5 , , 21 3.06 , 3.06 , Eq , 0.0 > > 20 , 0 , , 21 3.67 , 3.05 , New , 0.62 > > 20 , 5 , , 21 3.68 , 3.06 , New , 0.62 > > 23 , 0 , , 22 3.07 , 3.06 , New , 0.01 > > 23 , 6 , , 22 3.06 , 3.06 , Eq , 0.0 > > 21 , 0 , , 22 3.68 , 3.07 , New , 0.61 > > 21 , 6 , , 22 3.67 , 3.06 , New , 0.61 > > 24 , 0 , , 23 3.19 , 3.06 , New , 0.13 > > 24 , 7 , , 23 3.08 , 3.06 , New , 0.02 > > 22 , 0 , , 23 3.69 , 3.06 , New , 0.63 > > 22 , 7 , , 23 3.68 , 3.06 , New , 0.62 > > 25 , 0 , , 24 3.07 , 3.06 , New , 0.01 > > 23 , 0 , , 24 
3.68 , 3.06 , New , 0.62 > > 26 , 0 , , 25 3.06 , 3.05 , New , 0.01 > > 26 , 1 , , 25 3.07 , 3.06 , New , 0.01 > > 24 , 0 , , 25 3.67 , 3.05 , New , 0.62 > > 24 , 1 , , 25 3.68 , 3.06 , New , 0.62 > > 27 , 0 , , 26 3.12 , 3.06 , New , 0.06 > > 27 , 2 , , 26 3.08 , 3.06 , New , 0.02 > > 25 , 0 , , 26 3.69 , 3.06 , New , 0.63 > > 25 , 2 , , 26 3.67 , 3.06 , New , 0.61 > > 28 , 0 , , 27 3.06 , 3.06 , Eq , 0.0 > > 28 , 3 , , 27 3.06 , 3.06 , Eq , 0.0 > > 26 , 0 , , 27 3.67 , 3.06 , New , 0.61 > > 26 , 3 , , 27 3.67 , 3.06 , New , 0.61 > > 29 , 0 , , 28 3.06 , 3.06 , Eq , 0.0 > > 29 , 4 , , 28 3.06 , 3.06 , Eq , 0.0 > > 27 , 0 , , 28 3.68 , 3.05 , New , 0.63 > > 27 , 4 , , 28 3.67 , 3.06 , New , 0.61 > > 30 , 0 , , 29 3.06 , 3.06 , Eq , 0.0 > > 30 , 5 , , 29 3.06 , 3.06 , Eq , 0.0 > > 28 , 0 , , 29 3.67 , 3.06 , New , 0.61 > > 28 , 5 , , 29 3.68 , 3.06 , New , 0.62 > > 31 , 0 , , 30 3.06 , 3.06 , Eq , 0.0 > > 31 , 6 , , 30 3.06 , 3.06 , Eq , 0.0 > > 29 , 0 , , 30 3.68 , 3.06 , New , 0.62 > > 29 , 6 , , 30 3.7 , 3.06 , New , 0.64 > > 32 , 0 , , 31 3.17 , 3.06 , New , 0.11 > > 32 , 7 , , 31 3.12 , 3.06 , New , 0.06 > > 30 , 0 , , 31 3.68 , 3.06 , New , 0.62 > > 30 , 7 , , 31 3.68 , 3.06 , New , 0.62 > > > > Results For Icelake memchr-evex > > size , algn , Pos , Cur T , New T , Win , Dif > > 2048 , 0 , , 32 4.94 , 4.26 , New , 0.68 > > 256 , 1 , , 64 4.5 , 4.13 , New , 0.37 > > 2048 , 0 , , 64 4.19 , 3.9 , New , 0.29 > > 256 , 2 , , 64 4.19 , 3.87 , New , 0.32 > > 2048 , 0 , , 128 4.96 , 4.53 , New , 0.43 > > 256 , 3 , , 64 4.07 , 3.86 , New , 0.21 > > 2048 , 0 , , 256 8.77 , 8.61 , New , 0.16 > > 256 , 4 , , 64 4.08 , 3.87 , New , 0.21 > > 2048 , 0 , , 512 12.22 , 11.67 , New , 0.55 > > 256 , 5 , , 64 4.12 , 3.83 , New , 0.29 > > 2048 , 0 , , 1024 20.06 , 18.09 , New , 1.97 > > 256 , 6 , , 64 4.2 , 3.95 , New , 0.25 > > 2048 , 0 , , 2048 33.83 , 30.62 , New , 3.21 > > 256 , 7 , , 64 4.3 , 4.04 , New , 0.26 > > 192 , 1 , , 32 4.2 , 3.71 , New , 0.49 > > 256 , 1 , , 32 
4.24 , 3.76 , New , 0.48 > > 512 , 1 , , 32 4.29 , 3.74 , New , 0.55 > > 192 , 2 , , 64 4.42 , 4.0 , New , 0.42 > > 512 , 2 , , 64 4.17 , 3.83 , New , 0.34 > > 192 , 3 , , 96 4.44 , 4.26 , New , 0.18 > > 256 , 3 , , 96 4.45 , 4.14 , New , 0.31 > > 512 , 3 , , 96 4.42 , 4.15 , New , 0.27 > > 192 , 4 , , 128 4.93 , 4.45 , New , 0.48 > > 256 , 4 , , 128 4.93 , 4.47 , New , 0.46 > > 512 , 4 , , 128 4.95 , 4.47 , New , 0.48 > > 192 , 5 , , 160 5.95 , 5.44 , New , 0.51 > > 256 , 5 , , 160 5.59 , 5.47 , New , 0.12 > > 512 , 5 , , 160 7.59 , 7.34 , New , 0.25 > > 192 , 6 , , 192 6.53 , 6.08 , New , 0.45 > > 256 , 6 , , 192 6.2 , 5.88 , New , 0.32 > > 512 , 6 , , 192 7.53 , 7.62 , Cur , 0.09 > > 192 , 7 , , 224 6.62 , 6.12 , New , 0.5 > > 256 , 7 , , 224 6.79 , 6.51 , New , 0.28 > > 512 , 7 , , 224 8.12 , 7.61 , New , 0.51 > > 2 , 0 , , 1 2.5 , 2.54 , Cur , 0.04 > > 2 , 1 , , 1 2.56 , 2.55 , New , 0.01 > > 0 , 0 , , 1 2.57 , 3.12 , Cur , 0.55 > > 0 , 1 , , 1 2.59 , 3.14 , Cur , 0.55 > > 3 , 0 , , 2 2.62 , 2.63 , Cur , 0.01 > > 3 , 2 , , 2 2.66 , 2.67 , Cur , 0.01 > > 1 , 0 , , 2 3.24 , 2.72 , New , 0.52 > > 1 , 2 , , 2 3.28 , 2.75 , New , 0.53 > > 4 , 0 , , 3 2.78 , 2.8 , Cur , 0.02 > > 4 , 3 , , 3 2.8 , 2.82 , Cur , 0.02 > > 2 , 0 , , 3 3.38 , 2.86 , New , 0.52 > > 2 , 3 , , 3 3.41 , 2.89 , New , 0.52 > > 5 , 0 , , 4 2.88 , 2.91 , Cur , 0.03 > > 5 , 4 , , 4 2.88 , 2.92 , Cur , 0.04 > > 3 , 0 , , 4 3.48 , 2.93 , New , 0.55 > > 3 , 4 , , 4 3.47 , 2.93 , New , 0.54 > > 6 , 0 , , 5 2.95 , 2.94 , New , 0.01 > > 6 , 5 , , 5 2.91 , 2.92 , Cur , 0.01 > > 4 , 0 , , 5 3.47 , 2.9 , New , 0.57 > > 4 , 5 , , 5 3.43 , 2.91 , New , 0.52 > > 7 , 0 , , 6 2.87 , 2.9 , Cur , 0.03 > > 7 , 6 , , 6 2.87 , 2.89 , Cur , 0.02 > > 5 , 0 , , 6 3.44 , 2.88 , New , 0.56 > > 5 , 6 , , 6 3.41 , 2.87 , New , 0.54 > > 8 , 0 , , 7 2.86 , 2.87 , Cur , 0.01 > > 8 , 7 , , 7 2.86 , 2.87 , Cur , 0.01 > > 6 , 0 , , 7 3.43 , 2.87 , New , 0.56 > > 6 , 7 , , 7 3.44 , 2.87 , New , 0.57 > > 9 , 0 , , 8 2.86 , 2.88 , 
Cur , 0.02 > > 7 , 0 , , 8 3.41 , 2.89 , New , 0.52 > > 10 , 0 , , 9 2.83 , 2.87 , Cur , 0.04 > > 10 , 1 , , 9 2.82 , 2.87 , Cur , 0.05 > > 8 , 0 , , 9 3.4 , 2.89 , New , 0.51 > > 8 , 1 , , 9 3.41 , 2.87 , New , 0.54 > > 11 , 0 , , 10 2.83 , 2.88 , Cur , 0.05 > > 11 , 2 , , 10 2.84 , 2.88 , Cur , 0.04 > > 9 , 0 , , 10 3.41 , 2.87 , New , 0.54 > > 9 , 2 , , 10 3.41 , 2.88 , New , 0.53 > > 12 , 0 , , 11 2.83 , 2.89 , Cur , 0.06 > > 12 , 3 , , 11 2.85 , 2.87 , Cur , 0.02 > > 10 , 0 , , 11 3.41 , 2.87 , New , 0.54 > > 10 , 3 , , 11 3.42 , 2.88 , New , 0.54 > > 13 , 0 , , 12 2.86 , 2.87 , Cur , 0.01 > > 13 , 4 , , 12 2.84 , 2.88 , Cur , 0.04 > > 11 , 0 , , 12 3.43 , 2.87 , New , 0.56 > > 11 , 4 , , 12 3.49 , 2.87 , New , 0.62 > > 14 , 0 , , 13 2.85 , 2.86 , Cur , 0.01 > > 14 , 5 , , 13 2.85 , 2.86 , Cur , 0.01 > > 12 , 0 , , 13 3.41 , 2.86 , New , 0.55 > > 12 , 5 , , 13 3.44 , 2.85 , New , 0.59 > > 15 , 0 , , 14 2.83 , 2.87 , Cur , 0.04 > > 15 , 6 , , 14 2.82 , 2.86 , Cur , 0.04 > > 13 , 0 , , 14 3.41 , 2.86 , New , 0.55 > > 13 , 6 , , 14 3.4 , 2.86 , New , 0.54 > > 16 , 0 , , 15 2.84 , 2.86 , Cur , 0.02 > > 16 , 7 , , 15 2.83 , 2.85 , Cur , 0.02 > > 14 , 0 , , 15 3.41 , 2.85 , New , 0.56 > > 14 , 7 , , 15 3.39 , 2.87 , New , 0.52 > > 17 , 0 , , 16 2.83 , 2.87 , Cur , 0.04 > > 15 , 0 , , 16 3.4 , 2.85 , New , 0.55 > > 18 , 0 , , 17 2.83 , 2.86 , Cur , 0.03 > > 18 , 1 , , 17 2.85 , 2.84 , New , 0.01 > > 16 , 0 , , 17 3.41 , 2.85 , New , 0.56 > > 16 , 1 , , 17 3.4 , 2.86 , New , 0.54 > > 19 , 0 , , 18 2.8 , 2.84 , Cur , 0.04 > > 19 , 2 , , 18 2.82 , 2.83 , Cur , 0.01 > > 17 , 0 , , 18 3.39 , 2.86 , New , 0.53 > > 17 , 2 , , 18 3.39 , 2.84 , New , 0.55 > > 20 , 0 , , 19 2.85 , 2.87 , Cur , 0.02 > > 20 , 3 , , 19 2.88 , 2.87 , New , 0.01 > > 18 , 0 , , 19 3.38 , 2.85 , New , 0.53 > > 18 , 3 , , 19 3.4 , 2.85 , New , 0.55 > > 21 , 0 , , 20 2.83 , 2.85 , Cur , 0.02 > > 21 , 4 , , 20 2.88 , 2.85 , New , 0.03 > > 19 , 0 , , 20 3.39 , 2.84 , New , 0.55 > > 19 , 4 , , 20 3.39 , 
2.96 , New , 0.43 > > 22 , 0 , , 21 2.84 , 2.9 , Cur , 0.06 > > 22 , 5 , , 21 2.81 , 2.84 , Cur , 0.03 > > 20 , 0 , , 21 3.41 , 2.81 , New , 0.6 > > 20 , 5 , , 21 3.38 , 2.83 , New , 0.55 > > 23 , 0 , , 22 2.8 , 2.82 , Cur , 0.02 > > 23 , 6 , , 22 2.81 , 2.83 , Cur , 0.02 > > 21 , 0 , , 22 3.35 , 2.81 , New , 0.54 > > 21 , 6 , , 22 3.34 , 2.81 , New , 0.53 > > 24 , 0 , , 23 2.77 , 2.84 , Cur , 0.07 > > 24 , 7 , , 23 2.78 , 2.8 , Cur , 0.02 > > 22 , 0 , , 23 3.34 , 2.79 , New , 0.55 > > 22 , 7 , , 23 3.32 , 2.79 , New , 0.53 > > 25 , 0 , , 24 2.77 , 2.8 , Cur , 0.03 > > 23 , 0 , , 24 3.29 , 2.79 , New , 0.5 > > 26 , 0 , , 25 2.73 , 2.78 , Cur , 0.05 > > 26 , 1 , , 25 2.75 , 2.79 , Cur , 0.04 > > 24 , 0 , , 25 3.27 , 2.79 , New , 0.48 > > 24 , 1 , , 25 3.27 , 2.77 , New , 0.5 > > 27 , 0 , , 26 2.72 , 2.78 , Cur , 0.06 > > 27 , 2 , , 26 2.75 , 2.76 , Cur , 0.01 > > 25 , 0 , , 26 3.29 , 2.73 , New , 0.56 > > 25 , 2 , , 26 3.3 , 2.76 , New , 0.54 > > 28 , 0 , , 27 2.75 , 2.79 , Cur , 0.04 > > 28 , 3 , , 27 2.77 , 2.77 , Eq , 0.0 > > 26 , 0 , , 27 3.28 , 2.78 , New , 0.5 > > 26 , 3 , , 27 3.29 , 2.78 , New , 0.51 > > 29 , 0 , , 28 2.74 , 2.76 , Cur , 0.02 > > 29 , 4 , , 28 2.74 , 2.77 , Cur , 0.03 > > 27 , 0 , , 28 3.3 , 2.76 , New , 0.54 > > 27 , 4 , , 28 3.3 , 2.74 , New , 0.56 > > 30 , 0 , , 29 2.72 , 2.76 , Cur , 0.04 > > 30 , 5 , , 29 2.74 , 2.75 , Cur , 0.01 > > 28 , 0 , , 29 3.25 , 2.73 , New , 0.52 > > 28 , 5 , , 29 3.3 , 2.73 , New , 0.57 > > 31 , 0 , , 30 2.73 , 2.77 , Cur , 0.04 > > 31 , 6 , , 30 2.74 , 2.76 , Cur , 0.02 > > 29 , 0 , , 30 3.25 , 2.73 , New , 0.52 > > 29 , 6 , , 30 3.26 , 2.74 , New , 0.52 > > 32 , 0 , , 31 2.73 , 2.74 , Cur , 0.01 > > 32 , 7 , , 31 2.73 , 2.75 , Cur , 0.02 > > 30 , 0 , , 31 3.24 , 2.72 , New , 0.52 > > 30 , 7 , , 31 3.24 , 2.72 , New , 0.52 > > > > For memchr-avx2 the improvements are more modest though again near > > universal. The improvement is most significant for medium sizes and > > small sizes with pos > size. 
> > For small sizes with pos < size and for large
> > sizes, the two implementations perform roughly the same.
> >
> > Results For Tigerlake memchr-avx2
> > size , algn , Pos , Cur T , New T , Win , Dif
> > 2048 , 0 , , 32 6.15 , 6.27 , Cur , 0.12
> > 256 , 1 , , 64 6.21 , 6.03 , New , 0.18
> > 2048 , 0 , , 64 6.07 , 5.95 , New , 0.12
> > 256 , 2 , , 64 6.01 , 5.8 , New , 0.21
> > 2048 , 0 , , 128 7.05 , 6.55 , New , 0.5
> > 256 , 3 , , 64 6.14 , 5.83 , New , 0.31
> > 2048 , 0 , , 256 11.78 , 11.78 , Eq , 0.0
> > 256 , 4 , , 64 6.1 , 5.85 , New , 0.25
> > 2048 , 0 , , 512 16.32 , 15.96 , New , 0.36
> > 256 , 5 , , 64 6.1 , 5.77 , New , 0.33
> > 2048 , 0 , , 1024 25.38 , 25.18 , New , 0.2
> > 256 , 6 , , 64 6.08 , 5.88 , New , 0.2
> > 2048 , 0 , , 2048 38.56 , 38.32 , New , 0.24
> > 256 , 7 , , 64 5.93 , 5.68 , New , 0.25
> > 192 , 1 , , 32 5.49 , 5.3 , New , 0.19
> > 256 , 1 , , 32 5.5 , 5.28 , New , 0.22
> > 512 , 1 , , 32 5.48 , 5.32 , New , 0.16
> > 192 , 2 , , 64 6.1 , 5.73 , New , 0.37
> > 512 , 2 , , 64 5.88 , 5.72 , New , 0.16
> > 192 , 3 , , 96 6.31 , 5.93 , New , 0.38
> > 256 , 3 , , 96 6.32 , 5.93 , New , 0.39
> > 512 , 3 , , 96 6.2 , 5.94 , New , 0.26
> > 192 , 4 , , 128 6.65 , 6.4 , New , 0.25
> > 256 , 4 , , 128 6.6 , 6.37 , New , 0.23
> > 512 , 4 , , 128 6.74 , 6.33 , New , 0.41
> > 192 , 5 , , 160 7.78 , 7.4 , New , 0.38
> > 256 , 5 , , 160 7.18 , 7.4 , Cur , 0.22
> > 512 , 5 , , 160 9.81 , 9.44 , New , 0.37
> > 192 , 6 , , 192 9.12 , 7.77 , New , 1.35
> > 256 , 6 , , 192 7.97 , 7.66 , New , 0.31
> > 512 , 6 , , 192 10.14 , 9.95 , New , 0.19
> > 192 , 7 , , 224 8.96 , 7.78 , New , 1.18
> > 256 , 7 , , 224 8.52 , 8.23 , New , 0.29
> > 512 , 7 , , 224 10.33 , 9.98 , New , 0.35
> > 2 , 0 , , 1 3.61 , 3.6 , New , 0.01
> > 2 , 1 , , 1 3.6 , 3.6 , Eq , 0.0
> > 0 , 0 , , 1 3.02 , 3.0 , New , 0.02
> > 0 , 1 , , 1 3.0 , 3.0 , Eq , 0.0
> > 3 , 0 , , 2 3.6 , 3.6 , Eq , 0.0
> > 3 , 2 , , 2 3.61 , 3.6 , New , 0.01
> > 1 , 0 , , 2 4.82 , 3.6 , New , 
1.22 > > 1 , 2 , , 2 4.81 , 3.6 , New , 1.21 > > 4 , 0 , , 3 3.61 , 3.61 , Eq , 0.0 > > 4 , 3 , , 3 3.62 , 3.61 , New , 0.01 > > 2 , 0 , , 3 4.82 , 3.62 , New , 1.2 > > 2 , 3 , , 3 4.83 , 3.63 , New , 1.2 > > 5 , 0 , , 4 3.63 , 3.64 , Cur , 0.01 > > 5 , 4 , , 4 3.63 , 3.62 , New , 0.01 > > 3 , 0 , , 4 4.84 , 3.62 , New , 1.22 > > 3 , 4 , , 4 4.84 , 3.64 , New , 1.2 > > 6 , 0 , , 5 3.66 , 3.64 , New , 0.02 > > 6 , 5 , , 5 3.65 , 3.62 , New , 0.03 > > 4 , 0 , , 5 4.83 , 3.63 , New , 1.2 > > 4 , 5 , , 5 4.85 , 3.64 , New , 1.21 > > 7 , 0 , , 6 3.76 , 3.79 , Cur , 0.03 > > 7 , 6 , , 6 3.76 , 3.72 , New , 0.04 > > 5 , 0 , , 6 4.84 , 3.62 , New , 1.22 > > 5 , 6 , , 6 4.85 , 3.64 , New , 1.21 > > 8 , 0 , , 7 3.64 , 3.65 , Cur , 0.01 > > 8 , 7 , , 7 3.65 , 3.65 , Eq , 0.0 > > 6 , 0 , , 7 4.88 , 3.64 , New , 1.24 > > 6 , 7 , , 7 4.87 , 3.65 , New , 1.22 > > 9 , 0 , , 8 3.66 , 3.66 , Eq , 0.0 > > 7 , 0 , , 8 4.89 , 3.66 , New , 1.23 > > 10 , 0 , , 9 3.67 , 3.67 , Eq , 0.0 > > 10 , 1 , , 9 3.67 , 3.67 , Eq , 0.0 > > 8 , 0 , , 9 4.9 , 3.67 , New , 1.23 > > 8 , 1 , , 9 4.9 , 3.67 , New , 1.23 > > 11 , 0 , , 10 3.68 , 3.67 , New , 0.01 > > 11 , 2 , , 10 3.69 , 3.67 , New , 0.02 > > 9 , 0 , , 10 4.9 , 3.67 , New , 1.23 > > 9 , 2 , , 10 4.9 , 3.67 , New , 1.23 > > 12 , 0 , , 11 3.71 , 3.68 , New , 0.03 > > 12 , 3 , , 11 3.71 , 3.67 , New , 0.04 > > 10 , 0 , , 11 4.9 , 3.67 , New , 1.23 > > 10 , 3 , , 11 4.9 , 3.67 , New , 1.23 > > 13 , 0 , , 12 4.24 , 4.23 , New , 0.01 > > 13 , 4 , , 12 4.23 , 4.23 , Eq , 0.0 > > 11 , 0 , , 12 4.9 , 3.7 , New , 1.2 > > 11 , 4 , , 12 4.9 , 3.73 , New , 1.17 > > 14 , 0 , , 13 3.99 , 4.01 , Cur , 0.02 > > 14 , 5 , , 13 3.98 , 3.98 , Eq , 0.0 > > 12 , 0 , , 13 4.9 , 3.69 , New , 1.21 > > 12 , 5 , , 13 4.9 , 3.69 , New , 1.21 > > 15 , 0 , , 14 3.99 , 3.97 , New , 0.02 > > 15 , 6 , , 14 4.0 , 4.0 , Eq , 0.0 > > 13 , 0 , , 14 4.9 , 3.67 , New , 1.23 > > 13 , 6 , , 14 4.9 , 3.67 , New , 1.23 > > 16 , 0 , , 15 3.99 , 4.02 , Cur , 0.03 > > 16 , 7 , , 15 
4.01 , 3.96 , New , 0.05 > > 14 , 0 , , 15 4.93 , 3.67 , New , 1.26 > > 14 , 7 , , 15 4.92 , 3.67 , New , 1.25 > > 17 , 0 , , 16 4.04 , 3.99 , New , 0.05 > > 15 , 0 , , 16 5.42 , 4.22 , New , 1.2 > > 18 , 0 , , 17 4.01 , 3.97 , New , 0.04 > > 18 , 1 , , 17 3.99 , 3.98 , New , 0.01 > > 16 , 0 , , 17 5.22 , 3.98 , New , 1.24 > > 16 , 1 , , 17 5.19 , 3.98 , New , 1.21 > > 19 , 0 , , 18 4.0 , 3.99 , New , 0.01 > > 19 , 2 , , 18 4.03 , 3.97 , New , 0.06 > > 17 , 0 , , 18 5.18 , 3.99 , New , 1.19 > > 17 , 2 , , 18 5.18 , 3.98 , New , 1.2 > > 20 , 0 , , 19 4.02 , 3.98 , New , 0.04 > > 20 , 3 , , 19 4.0 , 3.98 , New , 0.02 > > 18 , 0 , , 19 5.19 , 3.97 , New , 1.22 > > 18 , 3 , , 19 5.21 , 3.98 , New , 1.23 > > 21 , 0 , , 20 3.98 , 4.0 , Cur , 0.02 > > 21 , 4 , , 20 4.0 , 4.0 , Eq , 0.0 > > 19 , 0 , , 20 5.19 , 3.99 , New , 1.2 > > 19 , 4 , , 20 5.17 , 3.99 , New , 1.18 > > 22 , 0 , , 21 4.03 , 3.98 , New , 0.05 > > 22 , 5 , , 21 4.01 , 3.95 , New , 0.06 > > 20 , 0 , , 21 5.19 , 4.0 , New , 1.19 > > 20 , 5 , , 21 5.21 , 3.99 , New , 1.22 > > 23 , 0 , , 22 4.06 , 3.97 , New , 0.09 > > 23 , 6 , , 22 4.02 , 3.98 , New , 0.04 > > 21 , 0 , , 22 5.2 , 4.02 , New , 1.18 > > 21 , 6 , , 22 5.22 , 4.0 , New , 1.22 > > 24 , 0 , , 23 4.15 , 3.98 , New , 0.17 > > 24 , 7 , , 23 4.0 , 4.01 , Cur , 0.01 > > 22 , 0 , , 23 5.28 , 4.0 , New , 1.28 > > 22 , 7 , , 23 5.22 , 3.99 , New , 1.23 > > 25 , 0 , , 24 4.1 , 4.04 , New , 0.06 > > 23 , 0 , , 24 5.23 , 4.04 , New , 1.19 > > 26 , 0 , , 25 4.1 , 4.06 , New , 0.04 > > 26 , 1 , , 25 4.07 , 3.99 , New , 0.08 > > 24 , 0 , , 25 5.26 , 4.02 , New , 1.24 > > 24 , 1 , , 25 5.21 , 4.0 , New , 1.21 > > 27 , 0 , , 26 4.17 , 4.03 , New , 0.14 > > 27 , 2 , , 26 4.09 , 4.03 , New , 0.06 > > 25 , 0 , , 26 5.29 , 4.1 , New , 1.19 > > 25 , 2 , , 26 5.25 , 4.0 , New , 1.25 > > 28 , 0 , , 27 4.06 , 4.1 , Cur , 0.04 > > 28 , 3 , , 27 4.09 , 4.04 , New , 0.05 > > 26 , 0 , , 27 5.26 , 4.04 , New , 1.22 > > 26 , 3 , , 27 5.28 , 4.01 , New , 1.27 > > 29 , 0 , , 28 
4.07 , 4.02 , New , 0.05 > > 29 , 4 , , 28 4.07 , 4.05 , New , 0.02 > > 27 , 0 , , 28 5.25 , 4.02 , New , 1.23 > > 27 , 4 , , 28 5.25 , 4.03 , New , 1.22 > > 30 , 0 , , 29 4.14 , 4.06 , New , 0.08 > > 30 , 5 , , 29 4.08 , 4.04 , New , 0.04 > > 28 , 0 , , 29 5.26 , 4.07 , New , 1.19 > > 28 , 5 , , 29 5.28 , 4.04 , New , 1.24 > > 31 , 0 , , 30 4.09 , 4.08 , New , 0.01 > > 31 , 6 , , 30 4.1 , 4.08 , New , 0.02 > > 29 , 0 , , 30 5.28 , 4.05 , New , 1.23 > > 29 , 6 , , 30 5.24 , 4.07 , New , 1.17 > > 32 , 0 , , 31 4.1 , 4.13 , Cur , 0.03 > > 32 , 7 , , 31 4.16 , 4.09 , New , 0.07 > > 30 , 0 , , 31 5.31 , 4.09 , New , 1.22 > > 30 , 7 , , 31 5.28 , 4.08 , New , 1.2 > > > > Results For Icelake memchr-avx2 > > size , algn , Pos , Cur T , New T , Win , Dif > > 2048 , 0 , , 32 5.74 , 5.08 , New , 0.66 > > 256 , 1 , , 64 5.16 , 4.93 , New , 0.23 > > 2048 , 0 , , 64 4.86 , 4.69 , New , 0.17 > > 256 , 2 , , 64 4.78 , 4.7 , New , 0.08 > > 2048 , 0 , , 128 5.64 , 5.0 , New , 0.64 > > 256 , 3 , , 64 4.64 , 4.59 , New , 0.05 > > 2048 , 0 , , 256 9.07 , 9.17 , Cur , 0.1 > > 256 , 4 , , 64 4.7 , 4.6 , New , 0.1 > > 2048 , 0 , , 512 12.56 , 12.33 , New , 0.23 > > 256 , 5 , , 64 4.72 , 4.61 , New , 0.11 > > 2048 , 0 , , 1024 19.36 , 19.49 , Cur , 0.13 > > 256 , 6 , , 64 4.82 , 4.69 , New , 0.13 > > 2048 , 0 , , 2048 29.99 , 30.53 , Cur , 0.54 > > 256 , 7 , , 64 4.9 , 4.85 , New , 0.05 > > 192 , 1 , , 32 4.89 , 4.45 , New , 0.44 > > 256 , 1 , , 32 4.93 , 4.44 , New , 0.49 > > 512 , 1 , , 32 4.97 , 4.45 , New , 0.52 > > 192 , 2 , , 64 5.04 , 4.65 , New , 0.39 > > 512 , 2 , , 64 4.75 , 4.66 , New , 0.09 > > 192 , 3 , , 96 5.14 , 4.66 , New , 0.48 > > 256 , 3 , , 96 5.12 , 4.66 , New , 0.46 > > 512 , 3 , , 96 5.13 , 4.62 , New , 0.51 > > 192 , 4 , , 128 5.65 , 4.95 , New , 0.7 > > 256 , 4 , , 128 5.63 , 4.95 , New , 0.68 > > 512 , 4 , , 128 5.68 , 4.96 , New , 0.72 > > 192 , 5 , , 160 6.1 , 5.84 , New , 0.26 > > 256 , 5 , , 160 5.58 , 5.84 , Cur , 0.26 > > 512 , 5 , , 160 7.95 , 7.74 , New 
, 0.21 > > 192 , 6 , , 192 7.07 , 6.23 , New , 0.84 > > 256 , 6 , , 192 6.34 , 6.09 , New , 0.25 > > 512 , 6 , , 192 8.17 , 8.13 , New , 0.04 > > 192 , 7 , , 224 7.06 , 6.23 , New , 0.83 > > 256 , 7 , , 224 6.76 , 6.65 , New , 0.11 > > 512 , 7 , , 224 8.29 , 8.08 , New , 0.21 > > 2 , 0 , , 1 3.0 , 3.04 , Cur , 0.04 > > 2 , 1 , , 1 3.06 , 3.07 , Cur , 0.01 > > 0 , 0 , , 1 2.57 , 2.59 , Cur , 0.02 > > 0 , 1 , , 1 2.6 , 2.61 , Cur , 0.01 > > 3 , 0 , , 2 3.15 , 3.17 , Cur , 0.02 > > 3 , 2 , , 2 3.19 , 3.21 , Cur , 0.02 > > 1 , 0 , , 2 4.32 , 3.25 , New , 1.07 > > 1 , 2 , , 2 4.36 , 3.31 , New , 1.05 > > 4 , 0 , , 3 3.5 , 3.52 , Cur , 0.02 > > 4 , 3 , , 3 3.52 , 3.54 , Cur , 0.02 > > 2 , 0 , , 3 4.51 , 3.43 , New , 1.08 > > 2 , 3 , , 3 4.56 , 3.47 , New , 1.09 > > 5 , 0 , , 4 3.61 , 3.65 , Cur , 0.04 > > 5 , 4 , , 4 3.63 , 3.67 , Cur , 0.04 > > 3 , 0 , , 4 4.64 , 3.51 , New , 1.13 > > 3 , 4 , , 4 4.7 , 3.51 , New , 1.19 > > 6 , 0 , , 5 3.66 , 3.68 , Cur , 0.02 > > 6 , 5 , , 5 3.69 , 3.65 , New , 0.04 > > 4 , 0 , , 5 4.7 , 3.49 , New , 1.21 > > 4 , 5 , , 5 4.58 , 3.48 , New , 1.1 > > 7 , 0 , , 6 3.6 , 3.65 , Cur , 0.05 > > 7 , 6 , , 6 3.59 , 3.64 , Cur , 0.05 > > 5 , 0 , , 6 4.74 , 3.65 , New , 1.09 > > 5 , 6 , , 6 4.73 , 3.64 , New , 1.09 > > 8 , 0 , , 7 3.6 , 3.61 , Cur , 0.01 > > 8 , 7 , , 7 3.6 , 3.61 , Cur , 0.01 > > 6 , 0 , , 7 4.73 , 3.6 , New , 1.13 > > 6 , 7 , , 7 4.73 , 3.62 , New , 1.11 > > 9 , 0 , , 8 3.59 , 3.62 , Cur , 0.03 > > 7 , 0 , , 8 4.72 , 3.64 , New , 1.08 > > 10 , 0 , , 9 3.57 , 3.62 , Cur , 0.05 > > 10 , 1 , , 9 3.56 , 3.61 , Cur , 0.05 > > 8 , 0 , , 9 4.69 , 3.63 , New , 1.06 > > 8 , 1 , , 9 4.71 , 3.61 , New , 1.1 > > 11 , 0 , , 10 3.58 , 3.62 , Cur , 0.04 > > 11 , 2 , , 10 3.59 , 3.63 , Cur , 0.04 > > 9 , 0 , , 10 4.72 , 3.61 , New , 1.11 > > 9 , 2 , , 10 4.7 , 3.61 , New , 1.09 > > 12 , 0 , , 11 3.58 , 3.63 , Cur , 0.05 > > 12 , 3 , , 11 3.58 , 3.62 , Cur , 0.04 > > 10 , 0 , , 11 4.7 , 3.6 , New , 1.1 > > 10 , 3 , , 11 4.73 , 3.64 , New , 1.09 
> > 13 , 0 , , 12 3.6 , 3.6 , Eq , 0.0 > > 13 , 4 , , 12 3.57 , 3.62 , Cur , 0.05 > > 11 , 0 , , 12 4.73 , 3.62 , New , 1.11 > > 11 , 4 , , 12 4.79 , 3.61 , New , 1.18 > > 14 , 0 , , 13 3.61 , 3.62 , Cur , 0.01 > > 14 , 5 , , 13 3.59 , 3.59 , Eq , 0.0 > > 12 , 0 , , 13 4.7 , 3.61 , New , 1.09 > > 12 , 5 , , 13 4.75 , 3.58 , New , 1.17 > > 15 , 0 , , 14 3.58 , 3.62 , Cur , 0.04 > > 15 , 6 , , 14 3.59 , 3.62 , Cur , 0.03 > > 13 , 0 , , 14 4.68 , 3.6 , New , 1.08 > > 13 , 6 , , 14 4.68 , 3.63 , New , 1.05 > > 16 , 0 , , 15 3.57 , 3.6 , Cur , 0.03 > > 16 , 7 , , 15 3.55 , 3.59 , Cur , 0.04 > > 14 , 0 , , 15 4.69 , 3.61 , New , 1.08 > > 14 , 7 , , 15 4.69 , 3.61 , New , 1.08 > > 17 , 0 , , 16 3.56 , 3.61 , Cur , 0.05 > > 15 , 0 , , 16 4.71 , 3.58 , New , 1.13 > > 18 , 0 , , 17 3.57 , 3.65 , Cur , 0.08 > > 18 , 1 , , 17 3.58 , 3.59 , Cur , 0.01 > > 16 , 0 , , 17 4.7 , 3.58 , New , 1.12 > > 16 , 1 , , 17 4.68 , 3.59 , New , 1.09 > > 19 , 0 , , 18 3.51 , 3.58 , Cur , 0.07 > > 19 , 2 , , 18 3.55 , 3.58 , Cur , 0.03 > > 17 , 0 , , 18 4.69 , 3.61 , New , 1.08 > > 17 , 2 , , 18 4.68 , 3.61 , New , 1.07 > > 20 , 0 , , 19 3.57 , 3.6 , Cur , 0.03 > > 20 , 3 , , 19 3.59 , 3.59 , Eq , 0.0 > > 18 , 0 , , 19 4.68 , 3.59 , New , 1.09 > > 18 , 3 , , 19 4.67 , 3.57 , New , 1.1 > > 21 , 0 , , 20 3.61 , 3.58 , New , 0.03 > > 21 , 4 , , 20 3.62 , 3.6 , New , 0.02 > > 19 , 0 , , 20 4.74 , 3.57 , New , 1.17 > > 19 , 4 , , 20 4.69 , 3.7 , New , 0.99 > > 22 , 0 , , 21 3.57 , 3.64 , Cur , 0.07 > > 22 , 5 , , 21 3.55 , 3.6 , Cur , 0.05 > > 20 , 0 , , 21 4.72 , 3.55 , New , 1.17 > > 20 , 5 , , 21 4.66 , 3.55 , New , 1.11 > > 23 , 0 , , 22 3.56 , 3.56 , Eq , 0.0 > > 23 , 6 , , 22 3.54 , 3.56 , Cur , 0.02 > > 21 , 0 , , 22 4.65 , 3.53 , New , 1.12 > > 21 , 6 , , 22 4.62 , 3.56 , New , 1.06 > > 24 , 0 , , 23 3.5 , 3.54 , Cur , 0.04 > > 24 , 7 , , 23 3.52 , 3.53 , Cur , 0.01 > > 22 , 0 , , 23 4.61 , 3.51 , New , 1.1 > > 22 , 7 , , 23 4.6 , 3.51 , New , 1.09 > > 25 , 0 , , 24 3.5 , 3.53 , Cur , 0.03 > 
> 23 , 0 , , 24 4.54 , 3.5 , New , 1.04 > > 26 , 0 , , 25 3.47 , 3.49 , Cur , 0.02 > > 26 , 1 , , 25 3.46 , 3.51 , Cur , 0.05 > > 24 , 0 , , 25 4.53 , 3.51 , New , 1.02 > > 24 , 1 , , 25 4.51 , 3.51 , New , 1.0 > > 27 , 0 , , 26 3.44 , 3.51 , Cur , 0.07 > > 27 , 2 , , 26 3.51 , 3.52 , Cur , 0.01 > > 25 , 0 , , 26 4.56 , 3.46 , New , 1.1 > > 25 , 2 , , 26 4.55 , 3.47 , New , 1.08 > > 28 , 0 , , 27 3.47 , 3.5 , Cur , 0.03 > > 28 , 3 , , 27 3.48 , 3.47 , New , 0.01 > > 26 , 0 , , 27 4.52 , 3.44 , New , 1.08 > > 26 , 3 , , 27 4.55 , 3.46 , New , 1.09 > > 29 , 0 , , 28 3.45 , 3.49 , Cur , 0.04 > > 29 , 4 , , 28 3.5 , 3.5 , Eq , 0.0 > > 27 , 0 , , 28 4.56 , 3.49 , New , 1.07 > > 27 , 4 , , 28 4.5 , 3.49 , New , 1.01 > > 30 , 0 , , 29 3.44 , 3.48 , Cur , 0.04 > > 30 , 5 , , 29 3.46 , 3.47 , Cur , 0.01 > > 28 , 0 , , 29 4.49 , 3.43 , New , 1.06 > > 28 , 5 , , 29 4.57 , 3.45 , New , 1.12 > > 31 , 0 , , 30 3.48 , 3.48 , Eq , 0.0 > > 31 , 6 , , 30 3.46 , 3.49 , Cur , 0.03 > > 29 , 0 , , 30 4.49 , 3.44 , New , 1.05 > > 29 , 6 , , 30 4.53 , 3.44 , New , 1.09 > > 32 , 0 , , 31 3.44 , 3.45 , Cur , 0.01 > > 32 , 7 , , 31 3.46 , 3.51 , Cur , 0.05 > > 30 , 0 , , 31 4.48 , 3.42 , New , 1.06 > > 30 , 7 , , 31 4.48 , 3.44 , New , 1.04 > > > > > > Results For Skylake memchr-avx2 > > size , algn , Pos , Cur T , New T , Win , Dif > > 2048 , 0 , , 32 6.61 , 5.4 , New , 1.21 > > 256 , 1 , , 64 6.52 , 5.68 , New , 0.84 > > 2048 , 0 , , 64 6.03 , 5.47 , New , 0.56 > > 256 , 2 , , 64 6.07 , 5.42 , New , 0.65 > > 2048 , 0 , , 128 7.01 , 5.83 , New , 1.18 > > 256 , 3 , , 64 6.24 , 5.68 , New , 0.56 > > 2048 , 0 , , 256 11.03 , 9.86 , New , 1.17 > > 256 , 4 , , 64 6.17 , 5.49 , New , 0.68 > > 2048 , 0 , , 512 14.11 , 13.41 , New , 0.7 > > 256 , 5 , , 64 6.03 , 5.45 , New , 0.58 > > 2048 , 0 , , 1024 19.82 , 19.92 , Cur , 0.1 > > 256 , 6 , , 64 6.14 , 5.7 , New , 0.44 > > 2048 , 0 , , 2048 30.9 , 30.59 , New , 0.31 > > 256 , 7 , , 64 6.05 , 5.64 , New , 0.41 > > 192 , 1 , , 32 5.6 , 4.89 , New , 
0.71 > > 256 , 1 , , 32 5.59 , 5.07 , New , 0.52 > > 512 , 1 , , 32 5.58 , 4.93 , New , 0.65 > > 192 , 2 , , 64 6.14 , 5.46 , New , 0.68 > > 512 , 2 , , 64 5.95 , 5.38 , New , 0.57 > > 192 , 3 , , 96 6.6 , 5.74 , New , 0.86 > > 256 , 3 , , 96 6.48 , 5.37 , New , 1.11 > > 512 , 3 , , 96 6.56 , 5.44 , New , 1.12 > > 192 , 4 , , 128 7.04 , 6.02 , New , 1.02 > > 256 , 4 , , 128 6.96 , 5.89 , New , 1.07 > > 512 , 4 , , 128 6.97 , 5.99 , New , 0.98 > > 192 , 5 , , 160 8.49 , 7.07 , New , 1.42 > > 256 , 5 , , 160 8.1 , 6.96 , New , 1.14 > > 512 , 5 , , 160 10.48 , 9.14 , New , 1.34 > > 192 , 6 , , 192 8.46 , 8.52 , Cur , 0.06 > > 256 , 6 , , 192 8.53 , 7.58 , New , 0.95 > > 512 , 6 , , 192 10.88 , 9.06 , New , 1.82 > > 192 , 7 , , 224 8.59 , 8.35 , New , 0.24 > > 256 , 7 , , 224 8.86 , 7.91 , New , 0.95 > > 512 , 7 , , 224 10.89 , 8.98 , New , 1.91 > > 2 , 0 , , 1 4.28 , 3.62 , New , 0.66 > > 2 , 1 , , 1 4.32 , 3.75 , New , 0.57 > > 0 , 0 , , 1 3.76 , 3.24 , New , 0.52 > > 0 , 1 , , 1 3.7 , 3.19 , New , 0.51 > > 3 , 0 , , 2 4.16 , 3.67 , New , 0.49 > > 3 , 2 , , 2 4.21 , 3.68 , New , 0.53 > > 1 , 0 , , 2 4.25 , 3.74 , New , 0.51 > > 1 , 2 , , 2 4.4 , 3.82 , New , 0.58 > > 4 , 0 , , 3 4.43 , 3.88 , New , 0.55 > > 4 , 3 , , 3 4.34 , 3.8 , New , 0.54 > > 2 , 0 , , 3 4.33 , 3.79 , New , 0.54 > > 2 , 3 , , 3 4.37 , 3.84 , New , 0.53 > > 5 , 0 , , 4 4.45 , 3.87 , New , 0.58 > > 5 , 4 , , 4 4.41 , 3.84 , New , 0.57 > > 3 , 0 , , 4 4.34 , 3.83 , New , 0.51 > > 3 , 4 , , 4 4.35 , 3.82 , New , 0.53 > > 6 , 0 , , 5 4.41 , 3.88 , New , 0.53 > > 6 , 5 , , 5 4.41 , 3.88 , New , 0.53 > > 4 , 0 , , 5 4.35 , 3.84 , New , 0.51 > > 4 , 5 , , 5 4.37 , 3.85 , New , 0.52 > > 7 , 0 , , 6 4.4 , 3.84 , New , 0.56 > > 7 , 6 , , 6 4.39 , 3.83 , New , 0.56 > > 5 , 0 , , 6 4.37 , 3.85 , New , 0.52 > > 5 , 6 , , 6 4.4 , 3.86 , New , 0.54 > > 8 , 0 , , 7 4.39 , 3.88 , New , 0.51 > > 8 , 7 , , 7 4.4 , 3.83 , New , 0.57 > > 6 , 0 , , 7 4.39 , 3.85 , New , 0.54 > > 6 , 7 , , 7 4.38 , 3.87 , New , 0.51 > > 
9 , 0 , , 8 4.47 , 3.96 , New , 0.51 > > 7 , 0 , , 8 4.37 , 3.85 , New , 0.52 > > 10 , 0 , , 9 4.61 , 4.08 , New , 0.53 > > 10 , 1 , , 9 4.61 , 4.09 , New , 0.52 > > 8 , 0 , , 9 4.37 , 3.85 , New , 0.52 > > 8 , 1 , , 9 4.37 , 3.85 , New , 0.52 > > 11 , 0 , , 10 4.68 , 4.06 , New , 0.62 > > 11 , 2 , , 10 4.56 , 4.1 , New , 0.46 > > 9 , 0 , , 10 4.36 , 3.83 , New , 0.53 > > 9 , 2 , , 10 4.37 , 3.83 , New , 0.54 > > 12 , 0 , , 11 4.62 , 4.05 , New , 0.57 > > 12 , 3 , , 11 4.63 , 4.06 , New , 0.57 > > 10 , 0 , , 11 4.38 , 3.86 , New , 0.52 > > 10 , 3 , , 11 4.41 , 3.86 , New , 0.55 > > 13 , 0 , , 12 4.57 , 4.08 , New , 0.49 > > 13 , 4 , , 12 4.59 , 4.12 , New , 0.47 > > 11 , 0 , , 12 4.45 , 4.0 , New , 0.45 > > 11 , 4 , , 12 4.51 , 4.04 , New , 0.47 > > 14 , 0 , , 13 4.64 , 4.16 , New , 0.48 > > 14 , 5 , , 13 4.67 , 4.1 , New , 0.57 > > 12 , 0 , , 13 4.58 , 4.08 , New , 0.5 > > 12 , 5 , , 13 4.6 , 4.1 , New , 0.5 > > 15 , 0 , , 14 4.61 , 4.05 , New , 0.56 > > 15 , 6 , , 14 4.59 , 4.06 , New , 0.53 > > 13 , 0 , , 14 4.57 , 4.06 , New , 0.51 > > 13 , 6 , , 14 4.57 , 4.05 , New , 0.52 > > 16 , 0 , , 15 4.62 , 4.05 , New , 0.57 > > 16 , 7 , , 15 4.63 , 4.06 , New , 0.57 > > 14 , 0 , , 15 4.61 , 4.06 , New , 0.55 > > 14 , 7 , , 15 4.59 , 4.05 , New , 0.54 > > 17 , 0 , , 16 4.58 , 4.08 , New , 0.5 > > 15 , 0 , , 16 4.64 , 4.06 , New , 0.58 > > 18 , 0 , , 17 4.56 , 4.17 , New , 0.39 > > 18 , 1 , , 17 4.59 , 4.09 , New , 0.5 > > 16 , 0 , , 17 4.59 , 4.07 , New , 0.52 > > 16 , 1 , , 17 4.58 , 4.04 , New , 0.54 > > 19 , 0 , , 18 4.61 , 4.05 , New , 0.56 > > 19 , 2 , , 18 4.6 , 4.08 , New , 0.52 > > 17 , 0 , , 18 4.64 , 4.11 , New , 0.53 > > 17 , 2 , , 18 4.56 , 4.13 , New , 0.43 > > 20 , 0 , , 19 4.77 , 4.3 , New , 0.47 > > 20 , 3 , , 19 4.6 , 4.14 , New , 0.46 > > 18 , 0 , , 19 4.72 , 4.02 , New , 0.7 > > 18 , 3 , , 19 4.53 , 4.01 , New , 0.52 > > 21 , 0 , , 20 4.66 , 4.26 , New , 0.4 > > 21 , 4 , , 20 4.74 , 4.07 , New , 0.67 > > 19 , 0 , , 20 4.62 , 4.12 , New , 0.5 > > 19 , 
4 , , 20 4.57 , 4.04 , New , 0.53 > > 22 , 0 , , 21 4.61 , 4.13 , New , 0.48 > > 22 , 5 , , 21 4.64 , 4.08 , New , 0.56 > > 20 , 0 , , 21 4.49 , 4.01 , New , 0.48 > > 20 , 5 , , 21 4.58 , 4.06 , New , 0.52 > > 23 , 0 , , 22 4.62 , 4.13 , New , 0.49 > > 23 , 6 , , 22 4.72 , 4.27 , New , 0.45 > > 21 , 0 , , 22 4.65 , 3.97 , New , 0.68 > > 21 , 6 , , 22 4.5 , 4.02 , New , 0.48 > > 24 , 0 , , 23 4.78 , 4.07 , New , 0.71 > > 24 , 7 , , 23 4.67 , 4.23 , New , 0.44 > > 22 , 0 , , 23 4.49 , 3.99 , New , 0.5 > > 22 , 7 , , 23 4.56 , 4.03 , New , 0.53 > > 25 , 0 , , 24 4.6 , 4.15 , New , 0.45 > > 23 , 0 , , 24 4.57 , 4.06 , New , 0.51 > > 26 , 0 , , 25 4.54 , 4.14 , New , 0.4 > > 26 , 1 , , 25 4.72 , 4.1 , New , 0.62 > > 24 , 0 , , 25 4.52 , 4.13 , New , 0.39 > > 24 , 1 , , 25 4.55 , 4.0 , New , 0.55 > > 27 , 0 , , 26 4.51 , 4.06 , New , 0.45 > > 27 , 2 , , 26 4.53 , 4.16 , New , 0.37 > > 25 , 0 , , 26 4.59 , 4.09 , New , 0.5 > > 25 , 2 , , 26 4.55 , 4.01 , New , 0.54 > > 28 , 0 , , 27 4.59 , 3.99 , New , 0.6 > > 28 , 3 , , 27 4.57 , 3.95 , New , 0.62 > > 26 , 0 , , 27 4.55 , 4.15 , New , 0.4 > > 26 , 3 , , 27 4.57 , 3.99 , New , 0.58 > > 29 , 0 , , 28 4.41 , 4.03 , New , 0.38 > > 29 , 4 , , 28 4.59 , 4.02 , New , 0.57 > > 27 , 0 , , 28 4.63 , 4.08 , New , 0.55 > > 27 , 4 , , 28 4.44 , 4.02 , New , 0.42 > > 30 , 0 , , 29 4.53 , 3.93 , New , 0.6 > > 30 , 5 , , 29 4.55 , 3.88 , New , 0.67 > > 28 , 0 , , 29 4.49 , 3.9 , New , 0.59 > > 28 , 5 , , 29 4.44 , 3.94 , New , 0.5 > > 31 , 0 , , 30 4.41 , 3.85 , New , 0.56 > > 31 , 6 , , 30 4.48 , 3.86 , New , 0.62 > > 29 , 0 , , 30 4.55 , 3.94 , New , 0.61 > > 29 , 6 , , 30 4.32 , 3.95 , New , 0.37 > > 32 , 0 , , 31 4.36 , 3.91 , New , 0.45 > > 32 , 7 , , 31 4.37 , 3.89 , New , 0.48 > > 30 , 0 , , 31 4.65 , 3.9 , New , 0.75 > > 30 , 7 , , 31 4.42 , 3.93 , New , 0.49 > > > > sysdeps/x86_64/multiarch/memchr-evex.S | 580 +++++++++++++++---------- > > 1 file changed, 349 insertions(+), 231 deletions(-) > > > > diff --git 
a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S > > index 6dd5d67b90..65c16ef8a4 100644 > > --- a/sysdeps/x86_64/multiarch/memchr-evex.S > > +++ b/sysdeps/x86_64/multiarch/memchr-evex.S > > @@ -26,14 +26,28 @@ > > > > # ifdef USE_AS_WMEMCHR > > # define VPBROADCAST vpbroadcastd > > -# define VPCMP vpcmpd > > -# define SHIFT_REG r8d > > +# define VPMINU vpminud > > +# define VPCMP vpcmpd > > +# define VPCMPEQ vpcmpeqd > > +# define CHAR_SIZE 4 > > # else > > # define VPBROADCAST vpbroadcastb > > -# define VPCMP vpcmpb > > -# define SHIFT_REG ecx > > +# define VPMINU vpminub > > +# define VPCMP vpcmpb > > +# define VPCMPEQ vpcmpeqb > > +# define CHAR_SIZE 1 > > # endif > > > > +# ifdef USE_AS_RAWMEMCHR > > +# define RAW_PTR_REG rcx > > +# define ALGN_PTR_REG rdi > > +# else > > +# define RAW_PTR_REG rdi > > +# define ALGN_PTR_REG rcx > > +# endif > > + > > +#define XZERO xmm23 > > Add a space before define. Rename XZERO to XMMZERO. Done. > > > +#define YZERO ymm23 > > Add a space before define. Rename YZERO to YMMZERO. Done. > > > # define XMMMATCH xmm16 > > # define YMMMATCH ymm16 > > # define YMM1 ymm17 > > @@ -44,18 +58,16 @@ > > # define YMM6 ymm22 > > > > # define VEC_SIZE 32 > > +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) > > +# define PAGE_SIZE 4096 > > > > .section .text.evex,"ax",@progbits > > -ENTRY (MEMCHR) > > +ENTRY(MEMCHR) > > No need for this change. Fixed. > > > # ifndef USE_AS_RAWMEMCHR > > /* Check for zero length. */ > > test %RDX_LP, %RDX_LP > > jz L(zero) > > -# endif > > - movl %edi, %ecx > > -# ifdef USE_AS_WMEMCHR > > - shl $2, %RDX_LP > > -# else > > + > > # ifdef __ILP32__ > > /* Clear the upper 32 bits. */ > > movl %edx, %edx > > @@ -63,319 +75,425 @@ ENTRY (MEMCHR) > > # endif > > /* Broadcast CHAR to YMMMATCH. */ > > VPBROADCAST %esi, %YMMMATCH > > - /* Check if we may cross page boundary with one vector load. 
*/ > > - andl $(2 * VEC_SIZE - 1), %ecx > > - cmpl $VEC_SIZE, %ecx > > - ja L(cros_page_boundary) > > + /* Check if we may cross page boundary with one > > + vector load. */ > > Fit comments to 72 columns. Fixed. > > > + movl %edi, %eax > > + andl $(PAGE_SIZE - 1), %eax > > + cmpl $(PAGE_SIZE - VEC_SIZE), %eax > > + ja L(cross_page_boundary) > > > > /* Check the first VEC_SIZE bytes. */ > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - > > + VPCMP $0, (%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > # ifndef USE_AS_RAWMEMCHR > > - jnz L(first_vec_x0_check) > > - /* Adjust length and check the end of data. */ > > - subq $VEC_SIZE, %rdx > > - jbe L(zero) > > + /* If length < CHAR_PER_VEC handle special. */ > > + cmpq $CHAR_PER_VEC, %rdx > > + jbe L(first_vec_x0) > > +# endif > > + testl %eax, %eax > > + jz L(aligned_more) > > + tzcntl %eax, %eax > > +# ifdef USE_AS_WMEMCHR > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > Fit comments to 72 columns. Fixed. > > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > # else > > - jnz L(first_vec_x0) > > + addq %rdi, %rax > > # endif > > - > > - /* Align data for aligned loads in the loop. */ > > - addq $VEC_SIZE, %rdi > > - andl $(VEC_SIZE - 1), %ecx > > - andq $-VEC_SIZE, %rdi > > + ret > > > > # ifndef USE_AS_RAWMEMCHR > > - /* Adjust length. */ > > - addq %rcx, %rdx > > - > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > -# endif > > - jmp L(more_4x_vec) > > +L(zero): > > + xorl %eax, %eax > > + ret > > > > + .p2align 5 > > +L(first_vec_x0): > > + /* Check if first match was before length. */ > > + tzcntl %eax, %eax > > + xorl %ecx, %ecx > > + cmpl %eax, %edx > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > + cmovle %rcx, %rax > > + ret > > +# else > > + /* NB: first_vec_x0 is 17 bytes which will leave > > + cross_page_boundary (which is relatively cold) close > > + enough to ideal alignment. 
So only realign > > + L(cross_page_boundary) if rawmemchr. */ > > Fit comments to 72 columns. Fixed. > > > .p2align 4 > > -L(cros_page_boundary): > > - andl $(VEC_SIZE - 1), %ecx > > +# endif > > +L(cross_page_boundary): > > + /* Save pointer before aligning as its original > > + value is necessary for computer return address if byte is > > + found or adjusting length if it is not and this is > > + memchr. */ > > Fit comments to 72 columns. Fixed. > > > + movq %rdi, %rcx > > + /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx > > + for memchr and rdi for rawmemchr. */ > > Fit comments to 72 columns. Fixed. > > > + andq $-VEC_SIZE, %ALGN_PTR_REG > > + VPCMP $0, (%ALGN_PTR_REG), %YMMMATCH, %k0 > > + kmovd %k0, %r8d > > # ifdef USE_AS_WMEMCHR > > - /* NB: Divide shift count by 4 since each bit in K1 represent 4 > > - bytes. */ > > - movl %ecx, %SHIFT_REG > > - sarl $2, %SHIFT_REG > > + /* NB: Divide shift count by 4 since each bit in > > + K0 represent 4 bytes. */ > > + sarl $2, %eax > > +# endif > > +# ifndef USE_AS_RAWMEMCHR > > + movl $(PAGE_SIZE / CHAR_SIZE), %esi > > + subl %eax, %esi > > # endif > > - andq $-VEC_SIZE, %rdi > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - /* Remove the leading bytes. */ > > - sarxl %SHIFT_REG, %eax, %eax > > - testl %eax, %eax > > - jz L(aligned_more) > > - tzcntl %eax, %eax > > # ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - sall $2, %eax > > + andl $(CHAR_PER_VEC - 1), %eax > > # endif > > + /* Remove the leading bytes. */ > > + sarxl %eax, %r8d, %eax > > # ifndef USE_AS_RAWMEMCHR > > /* Check the end of data. */ > > - cmpq %rax, %rdx > > - jbe L(zero) > > + cmpq %rsi, %rdx > > + jbe L(first_vec_x0) > > +# endif > > + testl %eax, %eax > > + jz L(cross_page_continue) > > + tzcntl %eax, %eax > > +# ifdef USE_AS_WMEMCHR > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. 
*/ > > + leaq (%RAW_PTR_REG, %rax, CHAR_SIZE), %rax > > +# else > > + addq %RAW_PTR_REG, %rax > > # endif > > - addq %rdi, %rax > > - addq %rcx, %rax > > ret > > > > .p2align 4 > > -L(aligned_more): > > -# ifndef USE_AS_RAWMEMCHR > > - /* Calculate "rdx + rcx - VEC_SIZE" with "rdx - (VEC_SIZE - rcx)" > > - instead of "(rdx + rcx) - VEC_SIZE" to void possible addition > > - overflow. */ > > - negq %rcx > > - addq $VEC_SIZE, %rcx > > +L(first_vec_x1): > > + tzcntl %eax, %eax > > + leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > > > - /* Check the end of data. */ > > - subq %rcx, %rdx > > - jbe L(zero) > > -# endif > > + .p2align 4 > > +L(first_vec_x2): > > + tzcntl %eax, %eax > > + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > > > - addq $VEC_SIZE, %rdi > > + .p2align 4 > > +L(first_vec_x3): > > + tzcntl %eax, %eax > > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > + > > + .p2align 4 > > +L(first_vec_x4): > > + tzcntl %eax, %eax > > + leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > + > > + .p2align 5 > > +L(aligned_more): > > + /* Check the first 4 * VEC_SIZE. Only one > > + VEC_SIZE at a time since data is only aligned to > > + VEC_SIZE. */ > > Fit comments to 72 columns. Fixed. > > > > > # ifndef USE_AS_RAWMEMCHR > > - subq $(VEC_SIZE * 4), %rdx > > + /* Align data to VEC_SIZE. */ > > +L(cross_page_continue): > > + xorl %ecx, %ecx > > + subl %edi, %ecx > > + andq $-VEC_SIZE, %rdi > > + /* esi is for adjusting length to see if near the > > + end. */ > > Fit comments to 72 columns. Fixed. > > > + leal (VEC_SIZE * 5)(%rdi, %rcx), %esi > > +# ifdef USE_AS_WMEMCHR > > + /* NB: Divide bytes by 4 to get the wchar_t > > + count. */ > > + sarl $2, %esi > > +# endif > > +# else > > + andq $-VEC_SIZE, %rdi > > +L(cross_page_continue): > > +# endif > > + /* Load first VEC regardless. */ > > + VPCMP $0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > +# ifndef USE_AS_RAWMEMCHR > > + /* Adjust length. 
If near end handle specially. > > + */ > > Fit comments to 72 columns. Fixed. > > > + subq %rsi, %rdx > > jbe L(last_4x_vec_or_less) > > # endif > > - > > -L(more_4x_vec): > > - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time > > - since data is only aligned to VEC_SIZE. */ > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(first_vec_x0) > > - > > - VPCMP $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > testl %eax, %eax > > jnz L(first_vec_x1) > > > > - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > testl %eax, %eax > > jnz L(first_vec_x2) > > > > - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > testl %eax, %eax > > jnz L(first_vec_x3) > > > > - addq $(VEC_SIZE * 4), %rdi > > + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + testl %eax, %eax > > + jnz L(first_vec_x4) > > + > > > > # ifndef USE_AS_RAWMEMCHR > > - subq $(VEC_SIZE * 4), %rdx > > - jbe L(last_4x_vec_or_less) > > -# endif > > + /* Check if at last CHAR_PER_VEC * 4 length. */ > > + subq $(CHAR_PER_VEC * 4), %rdx > > + jbe L(last_4x_vec_or_less_cmpeq) > > + addq $VEC_SIZE, %rdi > > > > - /* Align data to 4 * VEC_SIZE. */ > > - movq %rdi, %rcx > > - andl $(4 * VEC_SIZE - 1), %ecx > > + /* Align data to VEC_SIZE * 4 for the loop and > > + readjust length. */ > > Fit comments to 72 columns. Fixed. > > > +# ifdef USE_AS_WMEMCHR > > + movl %edi, %ecx > > andq $-(4 * VEC_SIZE), %rdi > > - > > -# ifndef USE_AS_RAWMEMCHR > > - /* Adjust length. */ > > + andl $(VEC_SIZE * 4 - 1), %ecx > > + /* NB: Divide bytes by 4 to get the wchar_t > > + count. */ > > Fit comments to 72 columns. Fixed. 
> > > + sarl $2, %ecx > > addq %rcx, %rdx > > +# else > > + addq %rdi, %rdx > > + andq $-(4 * VEC_SIZE), %rdi > > + subq %rdi, %rdx > > +# endif > > +# else > > + addq $VEC_SIZE, %rdi > > + andq $-(4 * VEC_SIZE), %rdi > > # endif > > > > + vpxorq %XZERO, %XZERO, %XZERO > > + > > + /* Compare 4 * VEC at a time forward. */ > > .p2align 4 > > L(loop_4x_vec): > > - /* Compare 4 * VEC at a time forward. */ > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > - VPCMP $0, VEC_SIZE(%rdi), %YMMMATCH, %k2 > > - kord %k1, %k2, %k5 > > - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3 > > - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4 > > - > > - kord %k3, %k4, %k6 > > - kortestd %k5, %k6 > > - jnz L(4x_vec_end) > > - > > - addq $(VEC_SIZE * 4), %rdi > > - > > + /* It would be possible to save some instructions > > + using 4x VPCMP but bottleneck on port 5 makes it not woth > > + it. */ > > Fit comments to 72 columns. Fixed. > > > + VPCMP $4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1 > > + /* xor will set bytes match esi to zero. */ > > + vpxorq (VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2 > > + vpxorq (VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3 > > + VPCMP $0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3 > > + /* Reduce VEC2 / VEC3 with min and VEC1 with zero > > + mask. */ > > Fit comments to 72 columns. Fixed. > > > + VPMINU %YMM2, %YMM3, %YMM3 {%k1} {z} > > + VPCMP $0, %YMM3, %YZERO, %k2 > > # ifdef USE_AS_RAWMEMCHR > > - jmp L(loop_4x_vec) > > + subq $-(VEC_SIZE * 4), %rdi > > + kortestd %k2, %k3 > > + jz L(loop_4x_vec) > > # else > > - subq $(VEC_SIZE * 4), %rdx > > - ja L(loop_4x_vec) > > + kortestd %k2, %k3 > > + jnz L(loop_4x_vec_end) > > > > -L(last_4x_vec_or_less): > > - /* Less than 4 * VEC and aligned to VEC_SIZE. 
*/ > > - addl $(VEC_SIZE * 2), %edx > > - jle L(last_2x_vec) > > + subq $-(VEC_SIZE * 4), %rdi > > > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(first_vec_x0) > > + subq $(CHAR_PER_VEC * 4), %rdx > > + ja L(loop_4x_vec) > > > > - VPCMP $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + /* Fall through into less than 4 remaining > > + vectors of length case. */ > > Fit comments to 72 columns. Fixed. > > > + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + addq $(VEC_SIZE * 3), %rdi > > + .p2align 4 > > +L(last_4x_vec_or_less): > > + /* Check if first VEC contained match. */ > > testl %eax, %eax > > - jnz L(first_vec_x1) > > + jnz L(first_vec_x1_check) > > > > - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > + /* If remaining length > CHAR_PER_VEC * 2. */ > > + addl $(CHAR_PER_VEC * 2), %edx > > + jg L(last_4x_vec) > > > > - jnz L(first_vec_x2_check) > > - subl $VEC_SIZE, %edx > > - jle L(zero) > > +L(last_2x_vec): > > + /* If remaining length < CHAR_PER_VEC. */ > > + addl $CHAR_PER_VEC, %edx > > + jle L(zero_end) > > + > > + /* Check VEC2 and compare any match with > > + remaining length. */ > > Fit comments to 72 columns. Fixed. > > > + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + tzcntl %eax, %eax > > + cmpl %eax, %edx > > + jbe L(set_zero_end) > > + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > +L(zero_end): > > + ret > > > > - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > - testl %eax, %eax > > > > - jnz L(first_vec_x3_check) > > + .p2align 4 > > +L(first_vec_x1_check): > > + tzcntl %eax, %eax > > + /* Adjust length. */ > > + subl $-(CHAR_PER_VEC * 4), %edx > > + /* Check if match within remaining length. */ > > + cmpl %eax, %edx > > + jbe L(set_zero_end) > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > Fit comments to 72 columns. 
Fixed. > > > + leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > +L(set_zero_end): > > xorl %eax, %eax > > ret > > > > .p2align 4 > > -L(last_2x_vec): > > - addl $(VEC_SIZE * 2), %edx > > - VPCMP $0, (%rdi), %YMMMATCH, %k1 > > +L(loop_4x_vec_end): > > +# endif > > + /* rawmemchr will fall through into this if match > > + was found in loop. */ > > Fit comments to 72 columns. Fixed. > > > + > > + /* k1 has not of matches with VEC1. */ > > kmovd %k1, %eax > > - testl %eax, %eax > > +# ifdef USE_AS_WMEMCHR > > + subl $((1 << CHAR_PER_VEC) - 1), %eax > > +# else > > + incl %eax > > +# endif > > + jnz L(last_vec_x1_return) > > > > - jnz L(first_vec_x0_check) > > - subl $VEC_SIZE, %edx > > - jle L(zero) > > + VPCMP $0, %YMM2, %YZERO, %k0 > > + kmovd %k0, %eax > > + testl %eax, %eax > > + jnz L(last_vec_x2_return) > > > > - VPCMP $0, VEC_SIZE(%rdi), %YMMMATCH, %k1 > > - kmovd %k1, %eax > > + kmovd %k2, %eax > > testl %eax, %eax > > - jnz L(first_vec_x1_check) > > - xorl %eax, %eax > > - ret > > + jnz L(last_vec_x3_return) > > > > - .p2align 4 > > -L(first_vec_x0_check): > > + kmovd %k3, %eax > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - sall $2, %eax > > +# ifdef USE_AS_RAWMEMCHR > > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + leaq (VEC_SIZE * 7)(%rdi, %rax, CHAR_SIZE), %rax > > # endif > > - /* Check the end of data. */ > > - cmpq %rax, %rdx > > - jbe L(zero) > > - addq %rdi, %rax > > ret > > > > .p2align 4 > > -L(first_vec_x1_check): > > +L(last_vec_x1_return): > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - sall $2, %eax > > -# endif > > - /* Check the end of data. */ > > - cmpq %rax, %rdx > > - jbe L(zero) > > - addq $VEC_SIZE, %rax > > +# ifdef USE_AS_RAWMEMCHR > > +# ifdef USE_AS_WMEMCHR > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. 
*/ > > Fit comments to 72 columns. Fixed. > > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > +# else > > addq %rdi, %rax > > - ret > > - > > - .p2align 4 > > -L(first_vec_x2_check): > > - tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - sall $2, %eax > > +# endif > > +# else > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > Fit comments to 72 columns. Fixed. > > > + leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > > # endif > > - /* Check the end of data. */ > > - cmpq %rax, %rdx > > - jbe L(zero) > > - addq $(VEC_SIZE * 2), %rax > > - addq %rdi, %rax > > ret > > > > .p2align 4 > > -L(first_vec_x3_check): > > +L(last_vec_x2_return): > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - sall $2, %eax > > +# ifdef USE_AS_RAWMEMCHR > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > + leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > + leaq (VEC_SIZE * 5)(%rdi, %rax, CHAR_SIZE), %rax > > # endif > > - /* Check the end of data. */ > > - cmpq %rax, %rdx > > - jbe L(zero) > > - addq $(VEC_SIZE * 3), %rax > > - addq %rdi, %rax > > ret > > > > .p2align 4 > > -L(zero): > > - xorl %eax, %eax > > - ret > > -# endif > > - > > - .p2align 4 > > -L(first_vec_x0): > > +L(last_vec_x3_return): > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - leaq (%rdi, %rax, 4), %rax > > +# ifdef USE_AS_RAWMEMCHR > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > Fit comments to 72 columns. Fixed. > > > + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > # else > > - addq %rdi, %rax > > + /* NB: Multiply bytes by CHAR_SIZE to get the > > + wchar_t count. */ > > Fit comments to 72 columns. Fixed. 
> > > + leaq (VEC_SIZE * 6)(%rdi, %rax, CHAR_SIZE), %rax > > # endif > > ret > > > > + > > +# ifndef USE_AS_RAWMEMCHR > > +L(last_4x_vec_or_less_cmpeq): > > + VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + subq $-(VEC_SIZE * 4), %rdi > > + /* Check first VEC regardless. */ > > + testl %eax, %eax > > + jnz L(first_vec_x1_check) > > + > > + /* If remaining length <= CHAR_PER_VEC * 2. */ > > + addl $(CHAR_PER_VEC * 2), %edx > > + jle L(last_2x_vec) > > + > > .p2align 4 > > -L(first_vec_x1): > > +L(last_4x_vec): > > + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + testl %eax, %eax > > + jnz L(last_vec_x2) > > + > > + > > + VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + /* Create mask for possible matches within > > + remaining length. */ > > Fit comments to 72 columns. Fixed. > > > +# ifdef USE_AS_WMEMCHR > > + movl $((1 << (CHAR_PER_VEC * 2)) - 1), %ecx > > + bzhil %edx, %ecx, %ecx > > +# else > > + movq $-1, %rcx > > + bzhiq %rdx, %rcx, %rcx > > +# endif > > + /* Test matches in data against length match. */ > > + andl %ecx, %eax > > + jnz L(last_vec_x3) > > + > > + /* if remaining length <= CHAR_PER_VEC * 3 (Note > > + this is after remaining length was found to be > > > + CHAR_PER_VEC * 2. */ > > Fit comments to 72 columns. Fixed. > > > + subl $CHAR_PER_VEC, %edx > > + jbe L(zero_end2) > > + > > + > > + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 > > + kmovd %k0, %eax > > + /* Shift remaining length mask for last VEC. */ > > +# ifdef USE_AS_WMEMCHR > > + shrl $CHAR_PER_VEC, %ecx > > +# else > > + shrq $CHAR_PER_VEC, %rcx > > +# endif > > + andl %ecx, %eax > > + jz L(zero_end2) > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. 
*/ > > - leaq VEC_SIZE(%rdi, %rax, 4), %rax > > -# else > > - addq $VEC_SIZE, %rax > > - addq %rdi, %rax > > -# endif > > + leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > > +L(zero_end2): > > ret > > > > - .p2align 4 > > -L(first_vec_x2): > > +L(last_vec_x2): > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - leaq (VEC_SIZE * 2)(%rdi, %rax, 4), %rax > > -# else > > - addq $(VEC_SIZE * 2), %rax > > - addq %rdi, %rax > > -# endif > > + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > > > .p2align 4 > > -L(4x_vec_end): > > - kmovd %k1, %eax > > - testl %eax, %eax > > - jnz L(first_vec_x0) > > - kmovd %k2, %eax > > - testl %eax, %eax > > - jnz L(first_vec_x1) > > - kmovd %k3, %eax > > - testl %eax, %eax > > - jnz L(first_vec_x2) > > - kmovd %k4, %eax > > - testl %eax, %eax > > -L(first_vec_x3): > > +L(last_vec_x3): > > tzcntl %eax, %eax > > -# ifdef USE_AS_WMEMCHR > > - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ > > - leaq (VEC_SIZE * 3)(%rdi, %rax, 4), %rax > > -# else > > - addq $(VEC_SIZE * 3), %rax > > - addq %rdi, %rax > > -# endif > > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > +# endif > > > > -END (MEMCHR) > > +END(MEMCHR) > > No need for this change. Fixed. > > > #endif > > -- > > 2.29.2 > > > > Thanks. > > H.J.
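For anyone following the "page cross logic less strict" part of the quoted diff, here is a rough C sketch of the two entry checks as I read them from the removed and added lines. It is only an illustration: the function names old_may_cross_page and new_may_cross_page are made up for this note and do not appear in the patch, and the constants simply mirror the VEC_SIZE and PAGE_SIZE defines in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE 32
#define PAGE_SIZE 4096

/* Hypothetical helper mirroring the removed check:
     andl $(2 * VEC_SIZE - 1), %ecx
     cmpl $VEC_SIZE, %ecx
     ja   L(cros_page_boundary)
   It takes the slow path for any pointer whose offset within a
   2 * VEC_SIZE window is above VEC_SIZE, whether or not a page
   boundary is actually nearby.  */
static bool
old_may_cross_page (uintptr_t p)
{
  return (p & (2 * VEC_SIZE - 1)) > VEC_SIZE;
}

/* Hypothetical helper mirroring the added check:
     andl $(PAGE_SIZE - 1), %eax
     cmpl $(PAGE_SIZE - VEC_SIZE), %eax
     ja   L(cross_page_boundary)
   It takes the slow path only when a VEC_SIZE load starting at p
   would run past the end of the current page.  */
static bool
new_may_cross_page (uintptr_t p)
{
  return (p & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE);
}
```

Under this reading, a pointer at page offset 33 takes the cross-page path with the old check but not with the new one, while a pointer at page offset 4065 takes it with both.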