public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* Fastest String Search Algorithm.
@ 2021-06-09 10:31 Wilco Dijkstra
  2021-06-09 11:42 ` Amit Choudhary
  0 siblings, 1 reply; 69+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 10:31 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: 'GNU C Library'

Hi Amit,

> If my algorithm is beating strstr() for long search patterns then I think
> that my algorithm should be included in glibc for long search patterns.

I tried it in bench-strstr.c, strstr is often 10 times faster on many inputs.
I only noticed a case with a 32KB needle and 64KB text where it beats strstr by
a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic time).

The existing code that handles large needles isn't highly optimized, so it is
certainly possible to improve its performance. Large needles are rare - a size
like 4KB will never be used in practice. A key issue with large needles is quadratic
performance, and any new algorithm will need to address that somehow.

So it's not as simple as showing your algorithm is faster in one case, you have to
show it is faster on average for all large needles and avoid quadratic performance
on specially crafted worst-case inputs. As others have pointed out, this is not an
easy task...

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Fastest String Search Algorithm.
@ 2021-06-07 16:17 Wilco Dijkstra
  2021-06-07 17:33 ` Florian Weimer
  0 siblings, 1 reply; 69+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 16:17 UTC (permalink / raw)
  To: amitchoudhary0523; +Cc: 'GNU C Library'

Hi Amit,

> So, do you think that 2.28 version would be having the strstr() bug which
> fails to match very large patterns.

2.28 was the failing version indeed: https://sourceware.org/bugzilla/show_bug.cgi?id=23637

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Fastest String Search Algorithm.
@ 2021-06-07 14:46 Wilco Dijkstra
  2021-06-07 15:21 ` Amit Choudhary
  0 siblings, 1 reply; 69+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 14:46 UTC (permalink / raw)
  To: 'GNU C Library', amitchoudhary0523

Hi Amit,

> I wrote a program to test my algorithm with strstr(). The text to search is
> around 60 KB and the pattern is around 1.3 KB.
> 
> The surprise is that """""strstr() failed to find the pattern."""""

It's possible you have an old GLIBC - I remember a strstr bug years ago where
it failed to match very large patterns.

Anyway, the generic strstr in a recent GLIBC is over 1000 times faster on this test.

Basically you don't want to compare with ancient Boyer-Moore or KMP (which AFAIK
nobody uses in the real world) but with the best modern algorithms. Modern string
algorithms are fast due to not inspecting every single character. They skip most of
the input by jumping ahead by the size of the pattern after a mismatch. So perhaps
counterintuitively, they become much faster with a larger search pattern.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Fastest String Search Algorithm.
@ 2021-06-07 10:53 Wilco Dijkstra
  2021-06-07 12:33 ` Amit Choudhary
  2021-06-08  6:33 ` Amit Choudhary
  0 siblings, 2 replies; 69+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 10:53 UTC (permalink / raw)
  To: 'GNU C Library', amitchoudhary0523

Hi Amit,

Please test your algorithm against the generic one GLIBC uses [1]. This is the
fastest known implementation which outperforms all of the algorithms in SMART.

It seems your algorithm scans every character and thus does more work than
necessary - practically all modern algorithms use hashing of multiple characters
and a lookup table to skip mismatches.

> Also, in real world scenario, my algorithm will almost never hit worst case.

This is true but you still need to handle worst cases efficiently. For example an
attacker could give you inputs that hit the worst case, thus allowing a denial of
service attack.

[1] https://github.com/bminor/glibc/blob/master/string/strstr.c

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Fastest String Search Algorithm.
@ 2021-06-06 14:15 Amit Choudhary
  2021-06-06 14:30 ` H.J. Lu
  2021-06-06 17:58 ` Paul Eggert
  0 siblings, 2 replies; 69+ messages in thread
From: Amit Choudhary @ 2021-06-06 14:15 UTC (permalink / raw)
  To: libc-alpha

Hi All,

I have invented the fastest string search algorithm. It is called
"Choudhary Algorithm". I am sharing it here, in case glibc wants to include
it in for strstr().

I have compared my algorithm my other standard algorithms and below are the
results.

You can view the results and source code of the algorithm and text/pattern
here: https://sourceforge.net/projects/fastest-string-search-algo/files/

The search text is around 60 KB long and the pattern is 27 bytes.

Result:
---------

Pattern found at index 30384

Time taken by Brute Force algorithm = 57,151,900 ns.
Time taken by Boyer Moore algorithm = 25,012,100 ns.
Time taken by Rabin Karp algorithm = 613,410,200 ns.
Time taken by Knuth Morris Pratt Force algorithm = 58,461,200 ns.
Time taken by Choudhary algorithm = 17,413,000 ns. (Fastest).

Choudhary string search algorithm is the fastest.

-------------------------------------------------------------------------------

Regards,
Amit

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2021-06-13  4:06 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-09 10:31 Fastest String Search Algorithm Wilco Dijkstra
2021-06-09 11:42 ` Amit Choudhary
2021-06-09 12:05   ` Phil Blundell
2021-06-09 12:11     ` Adhemerval Zanella
2021-06-09 12:22       ` Amit Choudhary
2021-06-09 12:55         ` Adhemerval Zanella
2021-06-09 13:18           ` Amit Choudhary
2021-06-09 13:32             ` Amit Choudhary
2021-06-09 13:40             ` Adhemerval Zanella
2021-06-09 14:08               ` Amit Choudhary
2021-06-09 14:25                 ` Phil Blundell
2021-06-09 12:18     ` Amit Choudhary
2021-06-09 13:48       ` Phil Blundell
2021-06-09 21:44   ` Wilco Dijkstra
2021-06-10  2:42     ` Amit Choudhary
2021-06-10  3:23       ` Dan Raymond
2021-06-10  4:19         ` Amit Choudhary
2021-06-10  9:44       ` Wilco Dijkstra
2021-06-10 11:08         ` Amit Choudhary
2021-06-10 16:53           ` Amit Choudhary
2021-06-10 20:04             ` Amit Choudhary
2021-06-11  8:28               ` Amit Choudhary
2021-06-11 11:45           ` Wilco Dijkstra
2021-06-11 12:42             ` Amit Choudhary
2021-06-11 13:39               ` Amit Choudhary
2021-06-11 13:46               ` Wilco Dijkstra
2021-06-11 15:52                 ` Amit Choudhary
2021-06-11 19:14                   ` Amit Choudhary
2021-06-11 19:26                     ` Paul Eggert
2021-06-11 19:41                       ` Amit Choudhary
2021-06-11 20:00                         ` Dan Raymond
2021-06-12  4:00                           ` Amit Choudhary
2021-06-12  4:18                             ` Amit Choudhary
2021-06-12  7:16                               ` [Final testing with benchmark tests] " Amit Choudhary
2021-06-12  8:01                                 ` Paul Zimmermann
2021-06-12  8:41                                   ` Amit Choudhary
2021-06-12 13:29                                     ` Amit Choudhary
2021-06-12 10:56                                   ` Siddhesh Poyarekar
2021-06-12 13:47                                     ` Amit Choudhary
2021-06-13  3:32                                       ` Siddhesh Poyarekar
2021-06-13  4:06                                         ` Amit Choudhary
2021-06-11 13:57               ` Paul Zimmermann
2021-06-11 14:04                 ` Amit Choudhary
2021-06-11 14:16                   ` Paul Zimmermann
2021-06-11 14:27                     ` Amit Choudhary
  -- strict thread matches above, loose matches on Subject: below --
2021-06-07 16:17 Wilco Dijkstra
2021-06-07 17:33 ` Florian Weimer
2021-06-08  4:52   ` Siddhesh Poyarekar
2021-06-07 14:46 Wilco Dijkstra
2021-06-07 15:21 ` Amit Choudhary
2021-06-07 16:02   ` Amit Choudhary
2021-06-07 16:04   ` Wilco Dijkstra
2021-06-07 10:53 Wilco Dijkstra
2021-06-07 12:33 ` Amit Choudhary
2021-06-08  6:33 ` Amit Choudhary
2021-06-08  7:25   ` Amit Choudhary
2021-06-06 14:15 Amit Choudhary
2021-06-06 14:30 ` H.J. Lu
2021-06-06 18:38   ` Amit Choudhary
2021-06-09  5:33   ` Guo, Wangyang
2021-06-09  6:18     ` Amit Choudhary
2021-06-09  6:50       ` Siddhesh Poyarekar
2021-06-09  7:38       ` Phil Blundell
2021-06-09  7:54         ` Amit Choudhary
2021-06-09  9:26           ` Phil Blundell
2021-06-06 17:58 ` Paul Eggert
2021-06-06 18:26   ` Amit Choudhary
2021-06-07 11:48     ` Amit Choudhary
2021-06-07 20:12       ` Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).