From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ot1-x32c.google.com (mail-ot1-x32c.google.com [IPv6:2607:f8b0:4864:20::32c]) by sourceware.org (Postfix) with ESMTPS id BF6823858C35 for ; Mon, 18 Dec 2023 17:49:19 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BF6823858C35 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org BF6823858C35 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::32c ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1702921761; cv=none; b=uSOFejRbMUa0JgzCiDPaDoxosaBvtPwthzMpdJiQwqiMZsha6eln3qBAfP42hpCGp6flTf1BreRBxGeG4U4RYNtGdVmrbABjjjG5Jr4o/oiHIBzndzh0c0zjuYIcNbEfddLicNqAL9iJYeWuWAdnmM6K0W5Z3sRhG7N7yPEHQqY= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1702921761; c=relaxed/simple; bh=dGekzfDmurdGS+kaDv1d3AEe3JIrKPy/L08gt+4QIbw=; h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=YoiJFI43mP0jx0whDupSajYRJMtTcbY7yYACwOD0dMC8bbPkgZ4meWnc10eJPdv1nxY+y0183SrcK1c7Hr1+hCDOTuh9OFfT5FO5ALe90CJAbGCB+sOTVbg3OLT/o/YgHP1f1EEIwPivPNn3Q/8Q/5TSr7GBYfyau5LHYaIHmFc= ARC-Authentication-Results: i=1; server2.sourceware.org Received: by mail-ot1-x32c.google.com with SMTP id 46e09a7af769-6d9f7b3de20so2690243a34.2 for ; Mon, 18 Dec 2023 09:49:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702921759; x=1703526559; darn=sourceware.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=sDkNZOijhG+NfeL+BVbbdzzIPSNFA464lC7sPCKk0+k=; b=DnWbzLCW3IuxQjb17b8l1Q3K1NjbMw9g6HDwiAko4YQtM4gs6QIKjuFDhTlVgV43W4 tIlZ4s2GvIZ18qDfH1sBw4MIX+f9EsBGqDlbPRryT+7qMO/N4ma0TUfDyuKa71DwXKtj crBbOm6ob5G7t3yeCcF4BH5LecsmRGWqa89ZoiGqBYBRP3wxe7uQENyyVnBlDkAaLewg uCmHuVIH8FBd0INDuaqqvmCw/frE3iaa/Orw7DQL4EMmi5sL/1F/Ux+L284aALYET064 WhDRwCm4DQrOoG0R+FMNbvyhy4cuvwRpDkD4jL6ivWgwDiRWA68uc6Urc+6Bb+GhNEtt hKGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702921759; x=1703526559; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=sDkNZOijhG+NfeL+BVbbdzzIPSNFA464lC7sPCKk0+k=; b=p+RnZancOwgsdaXP8xwsX1GbGHkfhV+j4NA8MAprE9ZYi7jiXyRN7jlCTr/IDA90Bz gjQDCJTGMnHhV811TPoYSkEW2KW7TU2m976AmE9m2Qo9mSfW1q4mBWBhWZI6ICSeW9jE xbzfS0yWBH8rrHh5k4jP/yjAMEOpWVpwxTagnP8oTJ4XktVgFfFxyiCTZe/B8ja0Lpmo 6AMjCBJFx3FcRMGYJNHaKrU245OoBeWUdKoIYa1s9wHKDScN/7yUzEVqSweSSxgZTNer PzLGyGI2KHn/s8KimAEO9ySVuHxWfr5qE6zN8nY30/0Lf6IjCq2b+OutF0sDYxTz4DZi feOw== X-Gm-Message-State: AOJu0Yw2xx5yMeZx+41u7hywNga+NGPYFBhw0F2mLdTUXHXQSk4CoTqJ v92duDm1msIdfH9rWscSLjzSYZCycmMTgmZ4poMuyHhahRo= X-Google-Smtp-Source: AGHT+IFnMw2rJ9aPUcf5Xf7SVSw8BwLl99JYSOBHnOC28HF3jXc17w4ZV4GN9S/5OaO0XeGxNFfa21aFqfAs8rtB6fI= X-Received: by 2002:a05:6870:3281:b0:203:e0c5:4b9e with SMTP id q1-20020a056870328100b00203e0c54b9emr702097oac.80.1702921759030; Mon, 18 Dec 2023 09:49:19 -0800 (PST) MIME-Version: 1.0 References: <20231216043334.72176-1-tirtajames45@gmail.com> In-Reply-To: <20231216043334.72176-1-tirtajames45@gmail.com> From: Noah Goldstein Date: Mon, 18 Dec 2023 11:48:41 -0600 Message-ID: Subject: Re: [PATCH] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c To: James Tirta Halim Cc: carlos@redhat.com, libc-alpha@sourceware.org, skpgkp2@gmail.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-7.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SCC_10_SHORT_WORD_LINES,SCC_20_SHORT_WORD_LINES,SCC_5_SHORT_WORD_LINES,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, Dec 15, 2023 at 10:37=E2=80=AFPM James Tirta Halim wrote: > > Find the rarest byte in NE. Do a naive loop until HS is aligned. Once ali= gned, find > the parts of HS that matches the rare byte and the byte after it, shift > back to the position of HS that should match NE and do a memcmp. > > Average timings (Core i5 8400): > __memmem_avx2 basic_memmem twoway_memmem memmem > 1342.942864 19100.87074 3335.335377 2745.971856 can you attach the .out result file? > > --- > sysdeps/x86_64/multiarch/memmem-avx2.c | 72 ++++++++++++++++++++++++++ > 1 file changed, 72 insertions(+) > create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c > > diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/mult= iarch/memmem-avx2.c > new file mode 100644 > index 0000000000..524d0fe45f > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/memmem-avx2.c > @@ -0,0 +1,72 @@ > +#include > +#include > +#include > +#include > + > +static inline void * > +__find_rarest_byte (const void *ne, > + size_t n) > +{ > + static const unsigned char rarebyte_table[256] =3D { 0, 1, 13, 56, 59,= 60, 61, 62, 63, 232, 248, 2, 158, 4, 5, 6, 7, 8, 9, 10, 14, 20, 26, 29, 37= , 46, 52, 53, 54, 55, 57, 58, 255, 172, 242, 193, 162, 174, 178, 182, 218, = 219, 212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223, = 222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216, 196, = 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204, 214, 198, = 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246, 233, 238, 240, = 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243, 228, 170, 247, 244, = 251, 235, 199, 200, 252, 207, 177, 191, 171, 190, 166, 3, 140, 134, 124, 12= 6, 86, 128, 95, 117, 114, 93, 81, 87, 132, 96, 112, 97, 103, 82, 139, 89, 9= 8, 88, 119, 74, 156, 115, 104, 75, 120, 106, 76, 155, 90, 122, 107, 125, 15= 2, 145, 136, 137, 101, 116, 102, 108, 99, 141, 77, 78, 118, 79, 109, 100, 1= 50, 73, 94, 72, 121, 151, 113, 135, 110, 105, 83, 91, 11, 12, 64, 149, 146,= 111, 65, 69, 66, 15, 16, 17, 18, 19, 130, 92, 144, 123, 21, 22, 23, 24, 13= 1, 133, 127, 142, 25, 70, 129, 27, 28, 67, 153, 84, 143, 138, 147, 157, 148= , 68, 71, 30, 31, 32, 33, 34, 35, 36, 154, 38, 39, 40, 41, 42, 80, 43, 44, = 45, 47, 48, 85, 49, 50, 51 }; can you add a coment explaining how this table was generated / what it is? > + const unsigned char *rare =3D (const unsigned char *) ne; > + const unsigned char *p =3D (const unsigned char *) ne; > + int c_rare =3D rarebyte_table[*rare]; > + int c; > + for (; n--; ++p) > + { > + c =3D rarebyte_table[*p]; > + if (c < c_rare) { > + rare =3D p; > + c_rare =3D c; > + } > + } > + return (void *) rare; > +} > + > +void * > +__memmem_avx2 (const void *hs, > + size_t hs_len, > + const void *ne, > + size_t ne_len) > +{ > + if (ne_len =3D=3D 1) > + return (void *) memchr (hs, *(unsigned char *) ne, hs_len); > + if (__glibc_unlikely (ne_len =3D=3D 0)) > + return (void *) hs; > + if (__glibc_unlikely (hs_len < ne_len)) > + return NULL; > + const unsigned char *h =3D (const unsigned char *) hs; > + const unsigned char *const end =3D h + hs_len - ne_len; > + size_t shift =3D PTR_DIFF (__find_rarest_byte (ne, ne_len), ne); > + if (shift =3D=3D ne_len - 1) > + --shift; > + h +=3D shift; > + for (; !PTR_IS_ALIGNED (h, sizeof (__m256i)); ++h) > + { > + if (__glibc_unlikely (h - shift > end)) > + return NULL; > + if (*h =3D=3D *((unsigned char *) ne + shift) && !memcmp (h - shift,= ne, ne_len)) should be `__memcmp` or you could directly use `__memcmpeq_avx2` (probably the fastest here). likewise below. > + return (void *) (h - shift); > + } > + const __m256i nv =3D _mm256_set1_epi8 (*((char *) ne + shift)); > + const __m256i nv1 =3D _mm256_set1_epi8 (*((char *) ne + shift + 1)); > + __m256i hv, hv1; > + uint32_t i, hm0, hm1, m; > + for (; h - shift <=3D end; h +=3D sizeof (__m256i)) { > + hv =3D _mm256_load_si256 ((const __m256i *) h); > + hv1 =3D _mm256_loadu_si256 ((const __m256i *) (h + 1)); > + hm0 =3D (uint32_t) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (hv, nv))= ; > + hm1 =3D (uint32_t) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (hv1, nv1= )); think faster is: m =3D _mm256_movemask_epi8(_mm256_and_si256(_mm256_cmpeq_epi8(hv, nv), _mm256_cmpeq_epi8(hv1, nv))); > + m =3D hm0 & hm1; > + while (m) > + { > + i =3D _tzcnt_u32 (m); > + m =3D _blsr_u32 (m); > + if (__glibc_unlikely (h + i - shift > end)) > + return NULL; > + if (!memcmp (h + i - shift, ne, ne_len)) > + return (char *) h + i - shift; > + } > + } > + return NULL; > +} > -- > 2.43.0 >