public inbox for libc-alpha@sourceware.org
* Re: [PATCH] optimize memmem for short needles
@ 2024-03-06 19:10 Wilco Dijkstra
From: Wilco Dijkstra @ 2024-03-06 19:10 UTC
  To: tirtajames45; +Cc: 'GNU C Library'

Hi James,

> The current implementation does not check for an early match with memchr
> before initializing the shift table because large shifts may be faster
> than memchr. However, for short needles, where shifts are small, memchr
> may be faster. find_edge_in_needle (taken from strstr_avx512) is used to
> pick a rarer character for memchr to search for.

It looks like it is faster on this particular benchmark - however I'm not convinced
doing this is faster in general. I tried the change on the SMART benchmark [1][2],
and there it is generally slower. A few cases show quite large differences:

rand2, len 2: 58% slower (len 3-16 also slower but by less)
rand4, len 2: 43% slower (len 3-16 also slower but by less)
rand8, len 2: 16% slower (len 3-16 slower but close)
chineseTexts, len 2: 8% faster
genome, len 2: 44% slower (len 3 slower but close)

Overall there is only one clear gain (on the Chinese text), but there are
large slowdowns for short needles in texts with a small number of symbols.
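
The rand2/4/8 inputs are random texts over small alphabets (2, 4 and 8
symbols), which is presumably why the early memchr pass hurts there: the
chosen byte occurs at almost every position, so the extra memchr/memcmp
round rarely skips anything before the real search starts. A minimal
standalone harness along these lines (not the SMART driver; buffer size,
needle and iteration count are arbitrary illustrative choices) shows the
kind of case that regresses most:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  enum { HS_LEN = 1 << 20, ITERS = 100 };
  unsigned char *hs = malloc (HS_LEN);
  srand (1);
  for (size_t i = 0; i < HS_LEN; i++)
    hs[i] = 'a' + (rand () & 1);   /* 2-symbol alphabet, like rand2.  */

  const unsigned char ne[2] = { 'a', 'b' };   /* Short needle, length 2.  */
  struct timespec t0, t1;
  size_t matches = 0;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int it = 0; it < ITERS; it++)
    {
      const unsigned char *p = hs;
      size_t left = HS_LEN;
      void *m;
      while ((m = memmem (p, left, ne, sizeof ne)) != NULL)
        {
          matches++;
          left -= (const unsigned char *) m + 1 - p;
          p = (const unsigned char *) m + 1;
        }
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  printf ("%zu matches, %.3f s\n", matches,
          (double) (t1.tv_sec - t0.tv_sec)
          + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  free (hs);
  return 0;
}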

Cheers,
Wilco

[1] https://www.dmi.unict.it/faro/smart/howto.php
[2] https://github.com/smart-tool/smart.git


* [PATCH] optimize memmem for short needles
@ 2024-02-29  8:00 James Tirta Halim
From: James Tirta Halim @ 2024-02-29  8:00 UTC
  To: libc-alpha; +Cc: James Tirta Halim

The current implementation does not check for an early match with memchr
before initializing the shift table because large shifts may be faster
than memchr. However, for short needles, where shifts are small, memchr
may be faster. find_edge_in_needle (taken from strstr_avx512) is used to
pick a rarer character for memchr to search for.
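
As a reading aid, the check described above amounts to roughly the
following standalone sketch (edge_offset and early_check are illustrative
names, not the actual helpers in the patch; the real change is the hunk
further below). It assumes 0 < ne_len <= hs_len, which __memmem has
already established at that point:

#include <string.h>

/* Offset of the first byte that differs from its predecessor (the
   "edge"), or 0 for a needle consisting of a single repeated byte.  */
static size_t
edge_offset (const unsigned char *ne, size_t n)
{
  for (size_t i = 1; i < n; i++)
    if (ne[i] != ne[i - 1])
      return i;
  return 0;
}

/* Early check before any shift table is built.  Assumes
   0 < ne_len <= hs_len.  Returns the match if the first candidate found
   via memchr already matches, NULL if the edge byte never occurs at a
   position where a match could start, or HS to mean "undecided, fall
   back to the normal algorithm".  */
static const void *
early_check (const unsigned char *hs, size_t hs_len,
             const unsigned char *ne, size_t ne_len)
{
  size_t e = edge_offset (ne, ne_len);
  /* Matches can only start at offsets 0 .. hs_len - ne_len, so the edge
     byte is searched in a window of hs_len - ne_len + 1 bytes.  */
  const unsigned char *p = memchr (hs + e, ne[e], hs_len - ne_len + 1);
  if (p == NULL)
    return NULL;
  p -= e;
  return memcmp (p, ne, ne_len) == 0 ? p : hs;
}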

bench-memmem timings (Core i5 8400):

             memmem_new   memmem  memmem_new_use_first_char
average:        1679.59  2502.39                    ~1940.1
total:           440053   655625                    ~508305

Passes test-memmem.

---
 string/memmem.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/string/memmem.c b/string/memmem.c
index a4117f8e1e..772c795caf 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -38,6 +38,19 @@
    in smaller shifts.  */
 #define hash2(p) (((size_t)(p)[0] - ((size_t)(p)[-1] << 3)) % sizeof (shift))
 
+/* Returns the first edge within the needle, returns original S
+   if no edge is found.  Example: 'ab' is the first edge in 'aaaaaaaaaabaarddg'.
+   N must be > 0.  */
+static inline unsigned char *__attribute__ ((always_inline))
+find_edge_in_needle (const unsigned char *s, size_t n)
+{
+  unsigned char *const ret = (unsigned char *) s;
+  for (; --n; ++s)
+    if (*(s + 1) != *s)
+      return (unsigned char *) s + 1;
+  return ret;
+}
+
 /* Fast memmem algorithm with guaranteed linear-time performance.
    Small needles up to size 2 use a dedicated linear search.  Longer needles
    up to size 256 use a novel modified Horspool algorithm.  It hashes pairs
@@ -58,8 +71,6 @@ __memmem (const void *haystack, size_t hs_len,
 
   if (ne_len == 0)
     return (void *) hs;
-  if (ne_len == 1)
-    return (void *) memchr (hs, ne[0], hs_len);
 
   /* Ensure haystack length is >= needle length.  */
   if (hs_len < ne_len)
@@ -67,17 +78,29 @@ __memmem (const void *haystack, size_t hs_len,
 
   const unsigned char *end = hs + hs_len - ne_len;
 
-  if (ne_len == 2)
+  /* Check for an early match before initializing the shift table.  Only done
+     for short needles where memchr may be faster than shifting around with the
+     table.  */
+  if (ne_len <= sizeof (unsigned long) * 4)
     {
-      uint32_t nw = ne[0] << 16 | ne[1], hw = hs[0] << 16 | hs[1];
-      for (hs++; hs <= end && hw != nw; )
-	hw = hw << 16 | *++hs;
-      return hw == nw ? (void *)hs - 1 : NULL;
+      const unsigned int edge = find_edge_in_needle (ne, ne_len) - ne;
+      hs = memchr (hs + edge, *((char *) ne + edge),
+		   (hs_len - edge) - (ne_len - edge) + 1);
+      if (hs == NULL || memcmp (hs -= edge, ne, ne_len) == 0)
+	return (void *) hs;
+      if (ne_len == 2)
+	{
+	  uint32_t nw = ne[0] << 16 | ne[1], hw = hs[0] << 16 | hs[1];
+	  for (hs++; hs <= end && hw != nw;)
+	    hw = hw << 16 | *++hs;
+	  return hw == nw ? (void *) (hs - 1) : NULL;
+	}
     }
-
   /* Use Two-Way algorithm for very long needles.  */
-  if (__builtin_expect (ne_len > 256, 0))
-    return two_way_long_needle (hs, hs_len, ne, ne_len);
+  else if (__glibc_unlikely (ne_len > 256))
+    {
+      return two_way_long_needle (hs, hs_len, ne, ne_len);
+    }
 
   uint8_t shift[256];
   size_t tmp, shift1;
-- 
2.44.0

