Fastest String Search Algorithm.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Fastest String Search Algorithm.
@ 2021-06-07 10:53 Wilco Dijkstra
  2021-06-07 12:33 ` Amit Choudhary
  2021-06-08  6:33 ` Amit Choudhary
  0 siblings, 2 replies; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 10:53 UTC (permalink / raw)
  To: 'GNU C Library', amitchoudhary0523

Hi Amit,

Please test your algorithm against the generic one GLIBC uses [1]. This is the
fastest known implementation which outperforms all of the algorithms in SMART.

It seems your algorithm scans every character and thus does more work than
necessary - practically all modern algorithms use hashing of multiple characters
and a lookup table to skip mismatches.

> Also, in real world scenario, my algorithm will almost never hit worst case.

This is true but you still need to handle worst cases efficiently. For example an
attacker could give you inputs that hit the worst case, thus allowing a denial of
service attack.

[1] https://github.com/bminor/glibc/blob/master/string/strstr.c

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 10:53 Fastest String Search Algorithm Wilco Dijkstra
@ 2021-06-07 12:33 ` Amit Choudhary
  2021-06-08  6:33 ` Amit Choudhary
  1 sibling, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-07 12:33 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

> Please test your algorithm against the generic one GLIBC uses [1]. This is
> the
> fastest known implementation which outperforms all of the algorithms in
> SMART.
>

I wrote a program to test my algorithm with strstr(). The text to search is
around 60 KB and the pattern is around 1.3 KB.

The surprise is that """""strstr() failed to find the pattern."""""

My algorithm successfully found the pattern. Below are the results and a
zipped copy of my program is attached with this mail.

==========Result==========

strstr() failed. It could not find the pattern.

choudhary_string_search_algorithm() succeeded. It found the pattern at
index 30384.

Time taken by choudhary_string_search_algorithm() = 186650

=========================

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.zip --]
[-- Type: application/zip, Size: 24176 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 10:53 Fastest String Search Algorithm Wilco Dijkstra
  2021-06-07 12:33 ` Amit Choudhary
@ 2021-06-08  6:33 ` Amit Choudhary
  2021-06-08  7:25   ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-08  6:33 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

> [1] https://github.com/bminor/glibc/blob/master/string/strstr.c
>

I downloaded this file and included "str-two-way.h" from glibc-2.33 source.

I tested the new / latest strstr() with my algorithm.

The text is around 60 KB long and the pattern is around 28 KB long.

The result is that my algorithm outperforms the latest strstr().

"""""My conclusion is that: For small patterns strstr() is faster but for
very long patterns my algorithm is faster."""""

If anyone needs the text / pattern or the source code for this test, then
please let me know.

Please check the results below:

========================================================
amit@amit-zorin-os:~/string_search$ ./a.out

Time taken by strstr() = 835,836

Time taken by choudhary_string_search_algorithm() = 258,305

Choudhary algorithm is faster than strstr().

amit@amit-zorin-os:~/string_search$
========================================================

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-08  6:33 ` Amit Choudhary
@ 2021-06-08  7:25   ` Amit Choudhary
  0 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-08  7:25 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

> [1] https://github.com/bminor/glibc/blob/master/string/strstr.c
>>
>
> I downloaded this file and included "str-two-way.h" from glibc-2.33 source.
>
> I tested the new / latest strstr() with my algorithm.
>
> The text is around 60 KB long and the pattern is around 28 KB long.
>
> The result is that my algorithm outperforms the latest strstr().
>
> """""My conclusion is that: For small patterns strstr() is faster but for
> very long patterns my algorithm is faster."""""
>
> If anyone needs the text / pattern or the source code for this test, then
> please let me know.
>
> Please check the results below:
>
> ========================================================
> amit@amit-zorin-os:~/string_search$ ./a.out
>
> Time taken by strstr() = 835,836
>
> Time taken by choudhary_string_search_algorithm() = 258,305
>
> Choudhary algorithm is faster than strstr().
>
> amit@amit-zorin-os:~/string_search$
> ========================================================
>

I found out that for 60 KB text and 3.2 KB pattern, strstr() is faster than
my algorithm.

But for 60 KB text and 4.4 KB pattern, my algorithm is faster than strstr().

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-12  4:00                           ` Amit Choudhary
@ 2021-06-12  4:18                             ` Amit Choudhary
  0 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-12  4:18 UTC (permalink / raw)
  To: GNU C Library

By the way, Paul Eggert thinks that this discussion is a waste of time for
glibc people.

So, I will request people to stop this thread of discussion now.

So, please don't reply to any message in this thread.

But if someone replies, then I might reply back.

But please don't blame me for this discussion. People who replied /
responded in this thread are equally at fault for continuing this
discussion.

Paul blamed me for wasting glibc developers time and I really didn't like
that.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 20:00                         ` Dan Raymond
@ 2021-06-12  4:00                           ` Amit Choudhary
  2021-06-12  4:18                             ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-12  4:00 UTC (permalink / raw)
  To: Dan Raymond; +Cc: GNU C Library

On Sat, Jun 12, 2021, 1:32 AM Dan Raymond <draymond@foxvalley.net> wrote:
  By the way your previous argument

> that antivirus software uses strstr() to search for virus signatures is
> false.  A virus signature is not a NUL-terminated string.
>

I never said that strstr() is used in antivirus software.

I only said that antivirus software use long needles for searching and I
said this because someone had put forth the argument that long needles
don't occur in real world.

But as you said that in real world anything can happen then the current
version of strstr() is susceptible to denial of service attack because for
long needles (32 KB, etc.), strstr() is slow.

But one thing is very ironic. On one hand people say that anything can
happen in real world and on the other they contradict this by saying that
long needles don't occur in real world.

If anything can happen in real world, then strstr() should be testef for 10
MB long needle in 100 MB long haystack.

But this is not tested. Actually, strstr() has not been tested for needles
beyond 1 KB and haystack beyong 65 KB.

So, the argument that anything can happen in the real world in not
reflected in the benchmark tests for strstr().

But, please don't ask me to fix benchmark tests.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:41                       ` Amit Choudhary
@ 2021-06-11 20:00                         ` Dan Raymond
  2021-06-12  4:00                           ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Raymond @ 2021-06-11 20:00 UTC (permalink / raw)
  To: libc-alpha, amitchoudhary0523

> So, in this case, my algorithm is of no use because it better than strstr()
> in average cases but not in worst cases.
>
> Average case means real world scenarios. Worst case means hand crafted
> worst case inputs.

Real world scenarios are precisely the ones that are sensitive to 
worst-case performance.  In the real world you can't predict inputs and 
there are consequences to unexpected performance problems.  Denial of 
service attacks are just one example.  By the way your previous argument 
that antivirus software uses strstr() to search for virus signatures is 
false.  A virus signature is not a NUL-terminated string.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:26                     ` Paul Eggert
@ 2021-06-11 19:41                       ` Amit Choudhary
  2021-06-11 20:00                         ` Dan Raymond
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 19:41 UTC (permalink / raw)
  To: Paul Eggert; +Cc: GNU C Library

On Sat, Jun 12, 2021, 12:56 AM Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote:
> > If there are two algorithms and algorithm1 performs better than
> algorithm2
> > in average case. But in worst case scenario, algorithm2 is better than
> > algorithm1.
>
> For glibc, worst-case is a big deal when it's measurable via the big-O
> notation. This is because of denial-of-service attacks and such-like. So
> in many cases we'll prefer an algorithm with a better worst-case behavior.
>
> Also, it's often easier to measure worst-case (because what is "average"
> is more debatable than what is worst), so focusing on worst-case helps
> cut down the more-difficult and lengthy discussion that would be needed
> to define "average", and this focus can save everybody time.
>


So, in this case, my algorithm is of no use because it better than strstr()
in average cases but not in worst cases.

Average case means real world scenarios. Worst case means hand crafted
worst case inputs.

So, from my side, this discussion is now over.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:14                   ` Amit Choudhary
@ 2021-06-11 19:26                     ` Paul Eggert
  2021-06-11 19:41                       ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Paul Eggert @ 2021-06-11 19:26 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote:
> If there are two algorithms and algorithm1 performs better than algorithm2
> in average case. But in worst case scenario, algorithm2 is better than
> algorithm1.

For glibc, worst-case is a big deal when it's measurable via the big-O 
notation. This is because of denial-of-service attacks and such-like. So 
in many cases we'll prefer an algorithm with a better worst-case behavior.

Also, it's often easier to measure worst-case (because what is "average" 
is more debatable than what is worst), so focusing on worst-case helps 
cut down the more-difficult and lengthy discussion that would be needed 
to define "average", and this focus can save everybody time.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 15:52                 ` Amit Choudhary
@ 2021-06-11 19:14                   ` Amit Choudhary
  2021-06-11 19:26                     ` Paul Eggert
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 19:14 UTC (permalink / raw)
  To: GNU C Library

To end this whole discussion, there is one point that needs to be clarified
but no one has spoken about it.

The point is this:

If there are two algorithms and algorithm1 performs better than algorithm2
in average case. But in worst case scenario, algorithm2 is better than
algorithm1.

And you can choose only one algorithm.

"""""So, which one would you choose for glibc?"""""

I will go for algorithm1 because it has better performance in average case.

But Wilco will go for algorithm2 because its has better performance in
worst case.

If this point gets clarified then there will be no point of further
discussion.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 13:46               ` Wilco Dijkstra
@ 2021-06-11 15:52                 ` Amit Choudhary
  2021-06-11 19:14                   ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 15:52 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021 at 7:16 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Amit,
>
> For example an attacker finds an application that uses strstr and provides
> worst case inputs on purpose, causing the application to hang. That is a
> denial of service attack. GLIBC tries hard to avoid quadratic algorithms
> as a
> result.
>

This is a problem of the application that it is not limiting the input size.

Problems like these have happened in the past and that's why we have
strncpy(), etc.

If we don't have strncpy() then denial of service attack is possible
through strcpy() also.

This is not theoretical, slowdowns on rare inputs create serious issues.
> Math functions are a good example, there were cases where some math
> functions were 1000 or even 100000 times slower on some inputs.
> This caused such issues for HPC code that they used to advice not to use
> GLIBC for math code - a supercomputer taking weeks rather than hours to
> solve a problem because it hit slow cases is disastrous.
>

The worst case for a string matching function is different than for maths.

String matching is very simple - just compare characters.

But maths is different - calculating square roots, multiplying 20 floating
point numbers, calculating logarithmic values of large numbers,
factorization, etc. are very CPU intensive.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 14:16                   ` Paul Zimmermann
@ 2021-06-11 14:27                     ` Amit Choudhary
  0 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 14:27 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library

On Fri, Jun 11, 2021 at 7:46 PM Paul Zimmermann <Paul.Zimmermann@inria.fr>
wrote:

>        Dear Amit,
>
> > As I said before,  I am not desperate to get my code in glibc. So, if you
> > don't accept my algorithm, it is fine with me.
>
> the code you propose is not ready to get included in glibc. If you want
> your
> code to be considered, you must send a real patch against glibc,
> like for example:
>
> https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html
>
> See https://sourceware.org/glibc/wiki/Contribution%20checklist for more
> details
>
> Best regards,
> Paul
>


I will send a patch after couple of days.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 14:04                 ` Amit Choudhary
@ 2021-06-11 14:16                   ` Paul Zimmermann
  2021-06-11 14:27                     ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Paul Zimmermann @ 2021-06-11 14:16 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha

       Dear Amit,

> As I said before,  I am not desperate to get my code in glibc. So, if you
> don't accept my algorithm, it is fine with me.

the code you propose is not ready to get included in glibc. If you want your
code to be considered, you must send a real patch against glibc, 
like for example:

https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html

See https://sourceware.org/glibc/wiki/Contribution%20checklist for more details

Best regards,
Paul

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 13:57               ` Paul Zimmermann
@ 2021-06-11 14:04                 ` Amit Choudhary
  2021-06-11 14:16                   ` Paul Zimmermann
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 14:04 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library

On Fri, Jun 11, 2021 at 7:27 PM Paul Zimmermann <Paul.Zimmermann@inria.fr>
wrote:

>        Dear Amit,
>
> > The worst case scenario almost never occurs in real world situations. The
> > worst case scenario is almost always on paper.
>
> we can not know in advance what applications the GNU libc will be used for.
> This is very different from the case where you are the only user of your
> program, and you know for which inputs you will use it.
>
> You were told how you could convince that your algorithm is better:
>
> (1) provide a GNU libc patch with new benchmarks (if you think the current
>     ones are not sufficient)
> (2) then provide a GNU libc patch using your algorithm, and demonstrate
>     on the existing and new benchmarks that it is better than the current
> code
>
> Otherwise, all that discussion is a waste of time both for you and
> the GNU libc developers.
>


As I said before,  I am not desperate to get my code in glibc. So, if you
don't accept my algorithm, it is fine with me.

I am doing the discussion because I am getting a reply. If I don't get a
reply then I will not continue the discussion.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
  2021-06-11 13:46               ` Wilco Dijkstra
@ 2021-06-11 13:57               ` Paul Zimmermann
  2021-06-11 14:04                 ` Amit Choudhary
  2 siblings, 1 reply; 61+ messages in thread
From: Paul Zimmermann @ 2021-06-11 13:57 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha

       Dear Amit,

> The worst case scenario almost never occurs in real world situations. The
> worst case scenario is almost always on paper.

we can not know in advance what applications the GNU libc will be used for.
This is very different from the case where you are the only user of your
program, and you know for which inputs you will use it.

You were told how you could convince that your algorithm is better:

(1) provide a GNU libc patch with new benchmarks (if you think the current
    ones are not sufficient)
(2) then provide a GNU libc patch using your algorithm, and demonstrate
    on the existing and new benchmarks that it is better than the current code

Otherwise, all that discussion is a waste of time both for you and
the GNU libc developers.

Best regards,
Paul Zimmermann

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
@ 2021-06-11 13:46               ` Wilco Dijkstra
  2021-06-11 15:52                 ` Amit Choudhary
  2021-06-11 13:57               ` Paul Zimmermann
  2 siblings, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-11 13:46 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

>> No, it's always located at the end - it is even tested explicitly.
>
> For some cases it is in the beginning and for some cases it is in the end.

No it is not.

> So, after this fix, my algo's worst case scenario will be O(n).

No - that doesn't solve the fundamental quadratic problem.

> Please give me an example of how would you encounter worst case sceraio in real world.

For example an attacker finds an application that uses strstr and provides
worst case inputs on purpose, causing the application to hang. That is a
denial of service attack. GLIBC tries hard to avoid quadratic algorithms as a
result.

This is not theoretical, slowdowns on rare inputs create serious issues.
Math functions are a good example, there were cases where some math
functions were 1000 or even 100000 times slower on some inputs.
This caused such issues for HPC code that they used to advice not to use
GLIBC for math code - a supercomputer taking weeks rather than hours to
solve a problem because it hit slow cases is disastrous.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
@ 2021-06-11 13:39               ` Amit Choudhary
  2021-06-11 13:46               ` Wilco Dijkstra
  2021-06-11 13:57               ` Paul Zimmermann
  2 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 13:39 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

On Fri, Jun 11, 2021 at 6:12 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
> On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> wrote:
>
>> > The worst case scenario of my algorithm is O(mn) where m is the length
>> of
>> > the text and n is the length of the pattern.
>>
>> Exactly. And that is problematic if both m and n are large.
>>
>> > And this worst case scenario will be hit in only one case:
>>
>> There is not just one worst-case, the number is practically infinite.
>> A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.
>>
>
>
> I can easily fix this by recording at which index it failed the last time.
> Then the first comparison will be at the index of B. And after first m
> comparisons, it will fail in the first comparison itself at index of B.
>
> So, after this fix, my algo's worst case scenario will be O(n).
>
> """""But, I would really appreciate if you can give me a real world
> scenario where this worst case scenario will happen."""""
>


I fixed this. I am attaching latest fixed and changed version of my program
(bzipped2).

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.c.bz2 --]
[-- Type: application/x-bzip, Size: 25562 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 11:45           ` Wilco Dijkstra
@ 2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
                                 ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11 12:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

>
>
> > I searched for the needle in the haystack using document editor's search
> > and found that needle was located in the beginning.
>
> No, it's always located at the end - it is even tested explicitly.
>


For some cases it is in the beginning and for some cases it is in the end.



> > The worst case scenario of my algorithm is O(mn) where m is the length of
> > the text and n is the length of the pattern.
>
> Exactly. And that is problematic if both m and n are large.
>
> > And this worst case scenario will be hit in only one case:
>
> There is not just one worst-case, the number is practically infinite.
> A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.
>


I can easily fix this by recording at which index it failed the last time.
Then the first comparison will be at the index of B. And after first m
comparisons, it will fail in the first comparison itself at index of B.

So, after this fix, my algo's worst case scenario will be O(n).

"""""But, I would really appreciate if you can give me a real world
scenario where this worst case scenario will happen."""""



> > But my theory is this: We should be fastest 100% if the time. If that is
> not possible,
> > then we should try to get as close to 100% as possible.
> >
> > This means that if for average case, some algorithm is faster 99% of the
> time but
> > slower 1% of the time in the worst case scenario then we should go for
> this
> > algorithm rather than fretting over worst case scenario.
>
> For large needles and haystacks the worst case becomes more important
> because
> the slowdown becomes larger with the length. It's bad even if there is a
> 0.001%
> chance that your computer locks up for minutes.
>


The worst case scenario almost never occurs in real world situations. The
worst case scenario is almost always on paper.

Please give me an example of how would you encounter worst case sceraio in
real world.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 11:08         ` Amit Choudhary
  2021-06-10 16:53           ` Amit Choudhary
@ 2021-06-11 11:45           ` Wilco Dijkstra
  2021-06-11 12:42             ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-11 11:45 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> In my opinion, needle should be in the middle so as not to give any
> algorithm any unfair advantage - like Boyer-Moore which searches from the end.

There is no unfair advantage to any algorithm since they all start searching
from the start (note BM does not start at the end).

> I searched for the needle in the haystack using document editor's search
> and found that needle was located in the beginning.

No, it's always located at the end - it is even tested explicitly.

> The worst case scenario of my algorithm is O(mn) where m is the length of
> the text and n is the length of the pattern.

Exactly. And that is problematic if both m and n are large.

> And this worst case scenario will be hit in only one case:

There is not just one worst-case, the number is practically infinite.
A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.

> But my theory is this: We should be fastest 100% if the time. If that is not possible,
> then we should try to get as close to 100% as possible.
> 
> This means that if for average case, some algorithm is faster 99% of the time but
> slower 1% of the time in the worst case scenario then we should go for this
> algorithm rather than fretting over worst case scenario.

For large needles and haystacks the worst case becomes more important because
the slowdown becomes larger with the length. It's bad even if there is a 0.001%
chance that your computer locks up for minutes.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 20:04             ` Amit Choudhary
@ 2021-06-11  8:28               ` Amit Choudhary
  0 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-11  8:28 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021, 1:34 AM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

> I have now improved my algorithm by skipping some text based on a
> preprocessed array on pattern.
>
> Now, my algorithm is faster than strstr() for needles of 1 KB and above.
>
> But I am not sure if you guys are still interested.
>
> Anyways, I am attaching my program (bzipped2).
>
> Regards,
> Amit
>

This version of my program has a bug. If needed, I can send the fixed
version.

Also, as it was pointed out that long needles are not used in real world
but this is not the case.

Antivirus software compares long needles (may be greater than 4 KB also) to
search for virus signature in packets.

And this is done everyday, many many times, and that too on millions of
computers everyday.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 16:53           ` Amit Choudhary
@ 2021-06-10 20:04             ` Amit Choudhary
  2021-06-11  8:28               ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-10 20:04 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 287 bytes --]

I have now improved my algorithm by skipping some text based on a
preprocessed array on pattern.

Now, my algorithm is faster than strstr() for needles of 1 KB and above.

But I am not sure if you guys are still interested.

Anyways, I am attaching my program (bzipped2).

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.c.bz2 --]
[-- Type: application/bzip2, Size: 25450 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 11:08         ` Amit Choudhary
@ 2021-06-10 16:53           ` Amit Choudhary
  2021-06-10 20:04             ` Amit Choudhary
  2021-06-11 11:45           ` Wilco Dijkstra
  1 sibling, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-10 16:53 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Thu, Jun 10, 2021 at 4:38 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
> The worst case scenario of my algorithm is O(mn) where m is the length of
> the text and n is the length of the pattern.
>
> And this worst case scenario will be hit in only one case:
>
> haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA...
> needle is: ABABA
>

I calculated again and it comes out that in worst case scenario also my
algorithm is linear - O(n).

=============================================

CASE 1: Let's consider this:

haystack: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA... (length = n)
needle: ABABABABA (length = m)

So, when it is going to match needle[0] with haystack[0], it will match and
all other 4 comparisons as well. So, it will do 5 comparisons and then it
will try to match the whole needle, but it will fail only after matching
[0]th character because the [1]st character (X, B) differs. So, it will do
7 comparisons in all when it matches all 5 initial comparisons.

Now, since the needle did not match, then 'i' will be incremented by 1 and
it will immediately fail after 1 comparison (needle[0] = A) is not equal to
((haystack[1] = X). So, it will do only 1 comparison.

This will repeat itself - 7 comparisons, 1 comparison, 7 comparisons, 1
comparison and so on....

So, for n/2 characters it will do 7 comparisons and for other n/2
characters it will do only 1 comparison.

So, total comparisons = (7*n)/2 + n/2 = 8n/2 = 4n.

So, this worst case scenario is O(n).

=============================================

CASE 2: Let's consider this:

haystack: ABCDEFGABCDEFGABCDEFGABCDEFG... (length = n)
needle: ABCDEXG (length = m) {{Here, only needle[m-2] and haystack[m-2]
differs}}.

In this case it will match all initial 5 comparisons for needle[0]. Then it
will try to match whole needle and will fail after (m-2) comparisons.

Now, since the needle did not match, then 'i' will be incremented by 1 and
it will immediately fail after 1 comparison (needle[0] = A) is not equal to
((haystack[1] = B). So, it will do only 1 comparison. This 1 comparison
will continue till next m-1 characters.

So, for 1st m characters, the number of comparisons are 5 + m - 2 + m - 1 =
2m + 2 comparisons.

This 2m + 2 comparisons will be done n/m times.

So, total number of comparisons done = (2m + 2)(n/m) =~ n.

So, this worst case scenario is also O(n).

  =============================================

So, now I think that I can safely say that the worst case scenario for my
algorithm is O(n).

If someone can prove that it is not O(n) but O(m*n) or O(n^2), etc. then
please let me know.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  9:44       ` Wilco Dijkstra
@ 2021-06-10 11:08         ` Amit Choudhary
  2021-06-10 16:53           ` Amit Choudhary
  2021-06-11 11:45           ` Wilco Dijkstra
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-10 11:08 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 3049 bytes --]

On Thu, Jun 10, 2021, 3:14 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Amit,
>
> For testing correctness yes. However in a benchmark there is no difference
> if the haystack
> ends immediately after a match or continues forever.
>

In my opinion, needle should be in the middle so as not to give any
algorithm any unfair advantage - like Boyer-Moore which searches from the
end.

> It won't since the benchmark explicitly removes all early matches (and of
> course it checks that
> the result is correct as well).
>

I have not seen what you are pointing out. I can attach the file which has
all haystacks and needles, but it is 82 MB. I did bzip2 on it but still it
is 564 KB. So, I have removed all haystacks/needles which were a
combination of 'a's and 'b's. I have run bzip2 on that file and now it is
115 KB large. However, the file format is not very good but it is still
easily readable.

I searched for the needle in the haystack using document editor's search
and found that needle was located in the beginning.

> > So, if some algorithm is faster for real world scenario but not for
> worst case (and worst case
> > will not occur in real world), then you don't want to include that
> algorithm.
>
> Large needles are not real-world, so for these the worst-case matters. The
> TwoWay algorithm
> for large needles is old and relatively inefficient. Anyone who has an
> algorithm that is not
> only faster on real-world needles but also guarantees good worst-case
> performance on huge
> needles would have no trouble getting their patches accepted.
>

I can do more tests and provide the inputs and results but I will not use
benchmark tests because I don't find them useful.

I will also not fix benchmark tests.

If all this is acceptable then please let me know. In this case I will
start my testing again.

The worst case scenario of my algorithm is O(mn) where m is the length of
the text and n is the length of the pattern.

And this worst case scenario will be hit in only one case:

haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA...
needle is: ABABA

But my theory is this: We should be fastest 100% if the time. If that is
not possible, then we should try to get as close to 100% as possible.

This means that if for average case, some algorithm is faster 99% of the
time but slower 1% of the time in the worst case scenario then we should go
for this algorithm rather than fretting over worst case scenario.

So, please let me know if you are still amenable to include my algorithm
provided the results for my algorithm are better than strstr(). You can
tell me what haystack / needle to pick.

I am thinking of having 2 long haystacks copied from Internet and then a
function call in the program will generate needle (equal to length passed
to it) from the middle of the haystack (Mostly, the length of needle will
be greater than or equal to 4 KB).

But, please let me know if you have other ideas.

Please find attached the haystacks/needles file from benchmark tests of
strstr (115 KB, bzipped2).

Regards,
Amit

[-- Attachment #2: bench-strstr.out.txt.bz2 --]
[-- Type: application/x-bzip, Size: 117063 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  2:42     ` Amit Choudhary
  2021-06-10  3:23       ` Dan Raymond
@ 2021-06-10  9:44       ` Wilco Dijkstra
  2021-06-10 11:08         ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-10  9:44 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> It's located at the end of the haystack since that is the slowest and hardest case.
>
> In two way search, doesn't strstr() search from end also?

No, strstr must return the *first* match in the string, so it has to search from the start.
Hence placing the needle at the start is the easiest case, and the end is the worst case.
The TwoWay algorithm splits the needle into two parts and matches them separately -
but it still scans the haystack from the start.

> There is no point in placing it at the start or in the middle since those cases are
> already tested in the smaller haystack cases.
>
> I had printed haystack and needle for all cases but didn't see needle in the middle.
>
> Testing in the middle should be done for very long haystack also.

For testing correctness yes. However in a benchmark there is no difference if the haystack
ends immediately after a match or continues forever. A test like SMART with random needles
at random positions in a huge text would be useful to add of course.

> Also, repititive text is not very useful because the needle will be located in the beginning itself.

It won't since the benchmark explicitly removes all early matches (and of course it checks that
the result is correct as well).

> So, if some algorithm is faster for real world scenario but not for worst case (and worst case
> will not occur in real world), then you don't want to include that algorithm.

Large needles are not real-world, so for these the worst-case matters. The TwoWay algorithm
for large needles is old and relatively inefficient. Anyone who has an algorithm that is not
only faster on real-world needles but also guarantees good worst-case performance on huge
needles would have no trouble getting their patches accepted.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  3:23       ` Dan Raymond
@ 2021-06-10  4:19         ` Amit Choudhary
  0 siblings, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-10  4:19 UTC (permalink / raw)
  To: Dan Raymond; +Cc: GNU C Library

On Thu, Jun 10, 2021, 8:54 AM Dan Raymond <draymond@foxvalley.net> wrote:

> Amit, here is a collection of quotes from you in this thread:
>
>     "I have invented the fastest string search algorithm. It is called
> "Choudhary Algorithm""
>
>     "I hope that people are not of the view that Indians can't invent or
> innovate"
>
>     "for 60 KB text and 3.2 KB pattern, strstr() is faster than my
> algorithm"
>
>     "for 60 KB text and 4.4 KB pattern, my algorithm is faster than
> strstr()"
>
>     "The current tests are skewed in favor of strstr()..."
>
>     "but I don't have time to fix strstr() benchmarks"
>
>     "people who maintain open source software will try defending what
> they do even when it is obvious that what they are doing is not correct"
>
> The reason you are not getting traction here is not because the
> community is racist, defensive, and unwilling to accept their mistakes.
> It is because you have failed to substantiate your claim that your
> algorithm is the "fastest string search algorithm".  By your own
> admission the existing implementation is faster when using existing
> benchmarks.  You claim that the benchmarks are flawed but when you were
> asked to submit improvements you refused.  You claim that "real world"
> scenarios favor your algorithm but then you state that your algorithm
> requires a 4.4 KB search string to demonstrate an advantage.  How is 4.4
> KB a "real world" scenario?  I'm sure there are applications that will
> search for a string that long but I've never encountered one.
>

First of all, I can give more proofs by testing from my end with different
haystacks / needles but I am not going to fix your benchmark issues.

I am only submitting my algorithm but you are putting extra work on me by
asking me to fix your benchmark issues. This is not my issue, this is the
issue of the person who designed and implemented the benchmark tests.

If you do like this, that is, if someone submits some work and then you ask
him to fix existing issues also then no one will be interested in
submitting their work because they may not want to fix your existing issues
in benchmark tests, etc.

Also, when I thought more, I realized that no one uses strstr() in real
world.

C is now-a-days used only in embedded software / kernel, etc. and I have
programmed a lot in C but I don't ever remember using strstr().

You people think that you are the smartest people on this earth so let me
tell you something about myself.

1. I am programming since 1998.

2. I have programmed in C, C++, JAVA, PHP, PYTHON, HTML, PERL, JAVASCRIPT,
ETC.

3. I had invented a new language and written its interpreter in 2002 when I
was an intern in a company.

4. I had written and published "How to implement system call". It can be
found here:
https://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/

5. I have contributed many patches to linux kernel but left contributing
more when linux kernel developers became defensive.

6. I have worked in Cisco Systems and Juniper Networks in USA.

7. Currently, I am working for a networking company and writing code for 4G
/ 5G networks.

8. I have done B. Tech. in Computer Science from IIT - Varanasi after
getting a rank in IIT-JEE. IIT-JEE is one of the toughest exams in this
world.

9. I have done Masters in Computer Networking from North Carolina State
University, NC, USA.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  2:42     ` Amit Choudhary
@ 2021-06-10  3:23       ` Dan Raymond
  2021-06-10  4:19         ` Amit Choudhary
  2021-06-10  9:44       ` Wilco Dijkstra
  1 sibling, 1 reply; 61+ messages in thread
From: Dan Raymond @ 2021-06-10  3:23 UTC (permalink / raw)
  To: libc-alpha, amitchoudhary0523

Amit, here is a collection of quotes from you in this thread:

    "I have invented the fastest string search algorithm. It is called 
"Choudhary Algorithm""

    "I hope that people are not of the view that Indians can't invent or 
innovate"

    "for 60 KB text and 3.2 KB pattern, strstr() is faster than my 
algorithm"

    "for 60 KB text and 4.4 KB pattern, my algorithm is faster than 
strstr()"

    "The current tests are skewed in favor of strstr()..."

    "but I don't have time to fix strstr() benchmarks"

    "people who maintain open source software will try defending what 
they do even when it is obvious that what they are doing is not correct"

The reason you are not getting traction here is not because the 
community is racist, defensive, and unwilling to accept their mistakes.  
It is because you have failed to substantiate your claim that your 
algorithm is the "fastest string search algorithm".  By your own 
admission the existing implementation is faster when using existing 
benchmarks.  You claim that the benchmarks are flawed but when you were 
asked to submit improvements you refused.  You claim that "real world" 
scenarios favor your algorithm but then you state that your algorithm 
requires a 4.4 KB search string to demonstrate an advantage.  How is 4.4 
KB a "real world" scenario?  I'm sure there are applications that will 
search for a string that long but I've never encountered one.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 21:44   ` Wilco Dijkstra
@ 2021-06-10  2:42     ` Amit Choudhary
  2021-06-10  3:23       ` Dan Raymond
  2021-06-10  9:44       ` Wilco Dijkstra
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-10  2:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Thu, Jun 10, 2021, 3:14 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Amit,
>
> It's located at the end of the haystack since that is the slowest and
> hardest case.
>

In two way search, doesn't strstr() search from end also?

There is no point in placing it at the start or in the middle since those
> cases are
> already tested in the smaller haystack cases.
>

I had printed haystack and needle for all cases but didn't see needle in
the middle.

Testing in the middle should be done for very long haystack also.

Also, repititive text is not very useful because the needle will be located
in the beginning itself.

> That's on purpose to test specific worst-case inputs. The worst-case for
> your
> algorithm would be "a" repeated N times and the needle "a" repeated N/2
> times concatenated with "ba" which results in O(N * N) comparisons. And
> that
> is bad for large N.
>

So, if some algorithm is faster for real world scenario but not for worst
case (and worst case will not occur in real world), then you don't want to
include that algorithm.

This doesn't make sense.

But now this discussion is getting into useless defending mode.

I have seen this issue before also - people who maintain open source
software will try defending what they do even when it is obvious that what
they are doing is not correct.

Google / Facebook, etc. accept their mistakes / bugs easily and sincerely
but not open source software.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 11:42 ` Amit Choudhary
  2021-06-09 12:05   ` Phil Blundell
@ 2021-06-09 21:44   ` Wilco Dijkstra
  2021-06-10  2:42     ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 21:44 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> I included my algorithm in bench-strstr.c and ran the tests myself.
> I could not find strstr-inputs file so I printed the haystack and needle.
>
> I found out many things which I think are not correct:
>
>1. In pass cases (positive cases where needle is part of haystack), the needle is 
> either located at the beginning of the haystack or at the end of the haystack.
> To mimic real world scenario, the needle should be at the middle of the haystack.

It's located at the end of the haystack since that is the slowest and hardest case.
There is no point in placing it at the start or in the middle since those cases are
already tested in the smaller haystack cases.

> 2. The large text is mostly repetition of two small paragraphs. This again is not a real world scenario.

Repetition doesn't matter much as long as the sequence is long enough and typical
English text. SMART has very large texts (including protein sequences), but the results
are the same.

> 3. Some tests for 256 byte needles have haystack and needle as a combination of
> characters 'a' and 'b'. No real world text has been used.
>
> 4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have  haystack and needle as a
> combination of characters 'a' and 'b'. No real world text has been used.

That's on purpose to test specific worst-case inputs. The worst-case for your
algorithm would be "a" repeated N times and the needle "a" repeated N/2
times concatenated with "ba" which results in O(N * N) comparisons. And that
is bad for large N.

> So, in my opinion strstr() has never been tested extensively for real world text.
> I think this is a very big shortcoming.

It has been tested extensively, including in SMART. If you're planning to improve
your algorithm I'd suggest to study that. It has lots of information about the
algorithms used and the C implementations are easy to understand, so it should
be very useful.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 14:08               ` Amit Choudhary
@ 2021-06-09 14:25                 ` Phil Blundell
  0 siblings, 0 replies; 61+ messages in thread
From: Phil Blundell @ 2021-06-09 14:25 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 07:38:01PM +0530, Amit Choudhary wrote:
> I have done the same tests on my side. I have one text of 60 KB and my
> needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB.
> 
> So, my tests are similar to those of strstr() benchmark.
> 
> Also, I have already mentioned that my algorithm is faster than strstr()
> when needle size is greater than or equal to 4.4 KB.

Right, but I think you've only established that your algorithm is faster
for this one particular haystack and four different needles.  That's not
to say that it isn't faster in other cases, but we don't know that yet.
It's also not clear to me what the upper complexity bound is for your
algorithm and how that compares to the one we have at the moment.

> So, I will spend my time there to evaluate string search algorithms of
> these languages.

That's entirely fair enough.  Thanks for the suggestions anyway.

Phil

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:40             ` Adhemerval Zanella
@ 2021-06-09 14:08               ` Amit Choudhary
  2021-06-09 14:25                 ` Phil Blundell
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 14:08 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 7:10 PM Adhemerval

>
> You send only one test with a specific usage, this is hardly a way to
> fully verify the performance implication of a new implementation.  That's
> why I think we need to integrate newer tests on the bench-strstr to have
> meaningful and *replicated* data.
>

I don't think that you have seen what is the haystack and needles for
strstr test.

Basically, there is one haystack and needles with different sizes.

I have done the same tests on my side. I have one text of 60 KB and my
needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB.

So, my tests are similar to those of strstr() benchmark.

Also, I have already mentioned that my algorithm is faster than strstr()
when needle size is greater than or equal to 4.4 KB.

I haven't yet tested with 4 KB needle.

But, as Wilco pointed out, no one will use 4 KB needle so then it is
obvious that my algorithm will not be included in glibc.

So, why should I spend more time on this?

There are many other languages than C (like Java, PHP, etc.) that have
their own implementation of string search.

So, I will spend my time there to evaluate string search algorithms of
these languages.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:18     ` Amit Choudhary
@ 2021-06-09 13:48       ` Phil Blundell
  0 siblings, 0 replies; 61+ messages in thread
From: Phil Blundell @ 2021-06-09 13:48 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 05:48:17PM +0530, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote:
> 
> >
> > Do you have any representative real-world testcases or statistics that you
> > can share with us?
> 
> 
> I had shared my program earlier which has around 60 KB of text copied from
> Internet.

How representative is this of the way that programs actually use strstr() 
in practice?  Do you have any statistics on that?

> So, if needle is located at the beginning or end of haystack, then strstr()
> will outperform many algorithms.
> 
> The real test for strstr() would be if needle is located in the middle of
> the haystack.

Right.  My question really was: where do we think the needle is, in fact,
located most often?  If the real-world scenario is that the needle often
is at the beginning or the end then optimising for that is the right
thing to do.

Phil

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:18           ` Amit Choudhary
  2021-06-09 13:32             ` Amit Choudhary
@ 2021-06-09 13:40             ` Adhemerval Zanella
  2021-06-09 14:08               ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 13:40 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On 09/06/2021 10:18, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
> 
>>
>> The problem is really hard to judge any improvement if we can't readily
>> verify and replicate the results in a consistent manner.  The benchtests
>> we have is not perfect, but it is a starting point so any developer and
>> reviewer can verify and play with the proposed change.
>>
> 
> The current tests are skewed in favor of strstr().

This is hardly a justification to not improve the current benchmark.  If
you check the bench-strstr.c repository history you will see that Wilco
has actively improved it so we could evaluate his newer implementation 
against the previous one.

> 
> I can do more testing from my side with my code and submit results again
> but I don't have time to fix strstr() benchmarks.
> 
> 
> Throwing numbers and external projects (specially ones that does not
>> seem to have much sense such as java source code)
> 
> 
> I have sent C code also.

You send only one test with a specific usage, this is hardly a way to
fully verify the performance implication of a new implementation.  That's
why I think we need to integrate newer tests on the bench-strstr to have
meaningful and *replicated* data.

I am being insistent on this benchmark point because otherwise most
likely no reviewer will try to spend time to evaluate it for you. Specially
with the idea you pose that current tests are 'skewed in favor' of current
implementation.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:18           ` Amit Choudhary
@ 2021-06-09 13:32             ` Amit Choudhary
  2021-06-09 13:40             ` Adhemerval Zanella
  1 sibling, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 13:32 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 6:48 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
>
> On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
>
>>
>>
> I can do more testing from my side with my code and submit results again
> but I don't have time to fix strstr() benchmarks.
>

The main point is that it is not easy to fix strstr() benchmarks.

There is no strstr-inputs file while there are inputs files for other
functions.

If there were a strstr-inputs file then I would have changed it to include
real world text.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:55         ` Adhemerval Zanella
@ 2021-06-09 13:18           ` Amit Choudhary
  2021-06-09 13:32             ` Amit Choudhary
  2021-06-09 13:40             ` Adhemerval Zanella
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 13:18 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
adhemerval.zanella@linaro.org>

>
> The problem is really hard to judge any improvement if we can't readily
> verify and replicate the results in a consistent manner.  The benchtests
> we have is not perfect, but it is a starting point so any developer and
> reviewer can verify and play with the proposed change.
>

The current tests are skewed in favor of strstr().

I can do more testing from my side with my code and submit results again
but I don't have time to fix strstr() benchmarks.


Throwing numbers and external projects (specially ones that does not
> seem to have much sense such as java source code)


I have sent C code also.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:22       ` Amit Choudhary
@ 2021-06-09 12:55         ` Adhemerval Zanella
  2021-06-09 13:18           ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 12:55 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On 09/06/2021 09:22, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
> 
>>
>> Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
>> meaningful tests is the start point to evaluate a newer strstr
>> implementation.
>>
> 
> I invented my algorithm because I thought about it.
> 
> Getting my algorithm in glibc is not my goal.
> 
> I can defintely provide many real world haystack and needle in a file.
> 
> But, currently, I don't have much time to change strstr() benchmarks.

The problem is really hard to judge any improvement if we can't readily
verify and replicate the results in a consistent manner.  The benchtests
we have is not perfect, but it is a starting point so any developer and
reviewer can verify and play with the proposed change.

You outlined possible limitations of current benchtests, so I would
ask you if could improve the tests so it would make the review and
test easier.

Throwing numbers and external projects (specially ones that does not
seem to have much sense such as java source code) it is not a way to
discuss a potential performance improvements. Usually reviewers, including
myself, won't spend any time trying to adjust glibc code to check your
claims.

Also, I would like to ask to please work and listen to Wilco suggestions.
He was the one that code the current strstr implementation and he did a
bit fair or research on the topic before come with a newer implementation.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:11     ` Adhemerval Zanella
@ 2021-06-09 12:22       ` Amit Choudhary
  2021-06-09 12:55         ` Adhemerval Zanella
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 12:22 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library

On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella <
adhemerval.zanella@linaro.org>

>
> Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
> meaningful tests is the start point to evaluate a newer strstr
> implementation.
>

I invented my algorithm because I thought about it.

Getting my algorithm in glibc is not my goal.

I can defintely provide many real world haystack and needle in a file.

But, currently, I don't have much time to change strstr() benchmarks.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:05   ` Phil Blundell
  2021-06-09 12:11     ` Adhemerval Zanella
@ 2021-06-09 12:18     ` Amit Choudhary
  2021-06-09 13:48       ` Phil Blundell
  1 sibling, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 12:18 UTC (permalink / raw)
  To: Phil Blundell; +Cc: GNU C Library

On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote:

>
> Do you have any representative real-world testcases or statistics that you
> can share with us?

I had shared my program earlier which has around 60 KB of text copied from
Internet.

  For example, what is the reason for believing that the
> needle "should" be in the middle?
>

strstr() does a 2 way search - from the beginning of the haystack and from
the end of haystack.

So, if needle is located at the beginning or end of haystack, then strstr()
will outperform many algorithms.

The real test for strstr() would be if needle is located in the middle of
the haystack.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:05   ` Phil Blundell
@ 2021-06-09 12:11     ` Adhemerval Zanella
  2021-06-09 12:22       ` Amit Choudhary
  2021-06-09 12:18     ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 12:11 UTC (permalink / raw)
  To: Phil Blundell, Amit Choudhary; +Cc: GNU C Library



On 09/06/2021 09:05, Phil Blundell via Libc-alpha wrote:
> On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote:
>> 1. In pass cases (positive cases where needle is part of haystack), the
>> needle is either located at the beginning of the haystack or at the end of
>> the haystack. To mimic real world scenario, the needle should be at the
>> middle of the haystack.
> 
> Do you have any representative real-world testcases or statistics that you 
> can share with us?  For example, what is the reason for believing that the
> needle "should" be in the middle?
> 
>> So, in my opinion strstr() has never been tested extensively for real world
>> text. I think this is a very big shortcoming.
> 
> I'm sure improvements would be welcome.

Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
meaningful tests is the start point to evaluate a newer strstr implementation.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 11:42 ` Amit Choudhary
@ 2021-06-09 12:05   ` Phil Blundell
  2021-06-09 12:11     ` Adhemerval Zanella
  2021-06-09 12:18     ` Amit Choudhary
  2021-06-09 21:44   ` Wilco Dijkstra
  1 sibling, 2 replies; 61+ messages in thread
From: Phil Blundell @ 2021-06-09 12:05 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote:
> 1. In pass cases (positive cases where needle is part of haystack), the
> needle is either located at the beginning of the haystack or at the end of
> the haystack. To mimic real world scenario, the needle should be at the
> middle of the haystack.

Do you have any representative real-world testcases or statistics that you 
can share with us?  For example, what is the reason for believing that the
needle "should" be in the middle?

> So, in my opinion strstr() has never been tested extensively for real world
> text. I think this is a very big shortcoming.

I'm sure improvements would be welcome.

Phil


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 10:31 Wilco Dijkstra
@ 2021-06-09 11:42 ` Amit Choudhary
  2021-06-09 12:05   ` Phil Blundell
  2021-06-09 21:44   ` Wilco Dijkstra
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09 11:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Wed, Jun 9, 2021 at 4:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Amit,
>
> > If my algorithm is beating strstr() for long search patterns then I think
> > that my algorithm should be included in glibc for long search patterns.
>
> I tried it in bench-strstr.c, strstr is often 10 times faster on many
> inputs.
> I only noticed a case with a 32KB needle and 64KB text where it beats
> strstr by
> a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic
> time)

I included my algorithm in bench-strstr.c and ran the tests myself. I could
not find strstr-inputs file so I printed the haystack and needle.

I found out many things which I think are not correct:

1. In pass cases (positive cases where needle is part of haystack), the
needle is either located at the beginning of the haystack or at the end of
the haystack. To mimic real world scenario, the needle should be at the
middle of the haystack.

2. The large text is mostly repetition of two small paragraphs. This again
is not a real world scenario.

3. Some tests for 256 byte needles have haystack and needle as a
combination of characters 'a' and 'b'. No real world text has been used.

4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have  haystack and
needle as a combination of characters 'a' and 'b'. No real world text has
been used.

So, in my opinion strstr() has never been tested extensively for real world
text. I think this is a very big shortcoming.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Fastest String Search Algorithm.
@ 2021-06-09 10:31 Wilco Dijkstra
  2021-06-09 11:42 ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 10:31 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: 'GNU C Library'

Hi Amit,

> If my algorithm is beating strstr() for long search patterns then I think
> that my algorithm should be included in glibc for long search patterns.

I tried it in bench-strstr.c, strstr is often 10 times faster on many inputs.
I only noticed a case with a 32KB needle and 64KB text where it beats strstr by
a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic time).

The existing code that handles large needles isn't highly optimized, so it is
certainly possible to improve its performance. Large needles are rare - a size
like 4KB will never be used in practice. A key issue with large needles is quadratic
performance, and any new algorithm will need to address that somehow.

So it's not as simple as showing your algorithm is faster in one case, you have to
show it is faster on average for all large needles and avoid quadratic performance
on specially crafted worst-case inputs. As others have pointed out, this is not an
easy task...

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09  7:54         ` Amit Choudhary
@ 2021-06-09  9:26           ` Phil Blundell
  0 siblings, 0 replies; 61+ messages in thread
From: Phil Blundell @ 2021-06-09  9:26 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 01:24:39PM +0530, Amit Choudhary wrote:
> So, similarly, a check can be made that if the needle length is greater
> than 4 KB then my algorithm should be used.

Okay, I see.  That would add an extra comparison and a probably-not-taken
branch for strstr() of strings between 256 and 4095 bytes in length,
but I imagine that's in the noise.

The comment above two_way_needle() describes its cost as:

   Return the first location of non-empty NEEDLE within HAYSTACK, or
   NULL.  HAYSTACK_LEN is the minimum known length of HAYSTACK.  This
   method is optimized for LONG_NEEDLE_THRESHOLD <= NEEDLE_LEN.
   Performance is guaranteed to be linear, with an initialization cost
   of 3 * NEEDLE_LEN + (1 << CHAR_BIT) operations.

   If AVAILABLE does not modify HAYSTACK_LEN (as in memmem), then at
   most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching,
   and sublinear performance O(HAYSTACK_LEN / NEEDLE_LEN) is possible.
   If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 *
   HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and
   sublinear performance is not possible.

How does this compare to your algorithm?

Phil

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09  7:38       ` Phil Blundell
@ 2021-06-09  7:54         ` Amit Choudhary
  2021-06-09  9:26           ` Phil Blundell
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09  7:54 UTC (permalink / raw)
  To: Phil Blundell; +Cc: GNU C Library

On Wed, Jun 9, 2021 at 1:08 PM Phil Blundell <pb@pbcl.net> wrote:

> On Wed, Jun 09, 2021 at 11:48:11AM +0530, Amit Choudhary via Libc-alpha
> wrote:
> > If my algorithm is beating strstr() for long search patterns then I think
> > that my algorithm should be included in glibc for long search patterns.
>
> When you say "included [...] for long search patterns", what exactly are
> you proposing?  Are you suggesting that strstr() should have some kind
> of frontend wrapper which computes the length of the pattern and then
> decides which algorithm to use?  If so then we would need to include the
> cost of that length computation in the performance calculations.
>

Current / Latest strstr() algorithm already computes the length of pattern
and decides which algorithm to use. Please look at
https://github.com/bminor/glibc/blob/master/string/strstr.c

Some code from this file is:

=========================================
/* Ensure haystack length is at least as long as needle length.
Since a match may occur early on in a huge haystack, use strnlen
and read ahead a few cachelines for improved performance. */
size_t ne_len = strlen ((const char*)ne);
size_t hs_len = __strnlen ((const char*)hs, ne_len | 512);
if (hs_len < ne_len)
      return NULL;

..
..
..

/* Use Two-Way algorithm for very long needles. */
if (__glibc_unlikely (ne_len > 256))
return two_way_long_needle (hs, hs_len, ne, ne_len);
==========================================

So, there is strlen for needle already for all searches.

So, if needle length is more than 256 then it uses two_way_long_needle
algorithm else it uses other algorithm.

So, similarly, a check can be made that if the needle length is greater
than 4 KB then my algorithm should be used.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09  6:18     ` Amit Choudhary
  2021-06-09  6:50       ` Siddhesh Poyarekar
@ 2021-06-09  7:38       ` Phil Blundell
  2021-06-09  7:54         ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Phil Blundell @ 2021-06-09  7:38 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 11:48:11AM +0530, Amit Choudhary via Libc-alpha wrote:
> If my algorithm is beating strstr() for long search patterns then I think
> that my algorithm should be included in glibc for long search patterns.

I'm not sure we have enough data yet to make a rational decision on this
kind of thing.  To evaluate a new algorithm properly we'd need to
understand its behaviour on a wide range of input data, together with
the frequency with which each kind of input is likely to occur.  It
would obviously be counterproductive to replace the existing algorithm
with one that performs better for rarely-encountered input but is worse
for frequently-encountered input.

When you say "included [...] for long search patterns", what exactly are
you proposing?  Are you suggesting that strstr() should have some kind
of frontend wrapper which computes the length of the pattern and then
decides which algorithm to use?  If so then we would need to include the
cost of that length computation in the performance calculations.

Thanks

Phil

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09  6:18     ` Amit Choudhary
@ 2021-06-09  6:50       ` Siddhesh Poyarekar
  2021-06-09  7:38       ` Phil Blundell
  1 sibling, 0 replies; 61+ messages in thread
From: Siddhesh Poyarekar @ 2021-06-09  6:50 UTC (permalink / raw)
  To: Amit Choudhary, Guo, Wangyang; +Cc: GNU C Library

On 6/9/21 11:48 AM, Amit Choudhary via Libc-alpha wrote:
> Have you tried with long texts (60 KB or more) and long search patterns (28
> KB or more). My algorithm is faster than others for long search patterns.
> 
> strstr() is faster with 60 KB of text and 3.2 KB of search pattern.
> 
> But my algorithm is faster with 60 KB of text and 4.4 KB / 28 KB of search
> pattern.
> 
> Please test with long texts and long search patterns.

Maybe we need to first start with you contributing the tests you're 
using the evaluate the algorithm to benchtests.  Please see 
benchtests/README and go through the existing tests to understand how 
they work.  That should give a more structured way to discuss the new 
strstr algorithm.

Once the change is in, I'd recommend adding your strstr implementation 
as an ifunc (see sysdeps/x86_64/strstr.c and sysdeps/x86_64/strstr_* as 
examples) and use `make bench` to get a more comprehensive view of the 
performance characteristics of your function vs all other 
implementations for that architecture; there are more than one.  It is a 
lot easier to have a conversation about performance characteristics of 
the algorithm with that.

> I hope that people are not of the view that Indians can't invent or
> innovate.

No.  Please stick to technical discussions.  We are a pretty diverse set 
of contributors in the glibc project and over the course of your 
discussions (that have barely lasted 3 days; it has taken many people 
weeks or even months to get major changes included into glibc because of 
the very wide impact of such changes) you have received comments from 
people in at least 4 different countries, including India.

Siddhesh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09  5:33   ` Guo, Wangyang
@ 2021-06-09  6:18     ` Amit Choudhary
  2021-06-09  6:50       ` Siddhesh Poyarekar
  2021-06-09  7:38       ` Phil Blundell
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-09  6:18 UTC (permalink / raw)
  To: Guo, Wangyang; +Cc: H.J. Lu, GNU C Library

[-- Attachment #1: Type: text/plain, Size: 1275 bytes --]

On Wed, Jun 9, 2021 at 11:03 AM Guo, Wangyang <wangyang.guo@intel.com>
wrote:

> I have tried to convert Amit's algorithm in C and benchmark with
> bench-strstr.
> The result shows no advantage compared to existing implementations.
>

Have you tried with long texts (60 KB or more) and long search patterns (28
KB or more). My algorithm is faster than others for long search patterns.

strstr() is faster with 60 KB of text and 3.2 KB of search pattern.

But my algorithm is faster with 60 KB of text and 4.4 KB / 28 KB of search
pattern.

Please test with long texts and long search patterns.

I am attaching a zip file which contains the
compare_string_search_algorithms.c, strstr.c, and str-two-way.

I used the above to compare the performance of my algorithm vs latest
strstr().

The file compare_string_search_algorithms.c contains my implementation of
string search and 4 search patterns: 4.4 KB long, 3.2 KB long, 7 KB long,
and 28 KB long. The text is around 60 KB long.

You can test with these long patterns and long text.

If my algorithm is beating strstr() for long search patterns then I think
that my algorithm should be included in glibc for long search patterns.

I hope that people are not of the view that Indians can't invent or
innovate.

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.zip --]
[-- Type: application/zip, Size: 41097 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: Fastest String Search Algorithm.
  2021-06-06 14:30 ` H.J. Lu
  2021-06-06 18:38   ` Amit Choudhary
@ 2021-06-09  5:33   ` Guo, Wangyang
  2021-06-09  6:18     ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Guo, Wangyang @ 2021-06-09  5:33 UTC (permalink / raw)
  To: H.J. Lu, Amit Choudhary; +Cc: GNU C Library

I have tried to convert Amit's algorithm in C and benchmark with bench-strstr.
The result shows no advantage compared to existing implementations.

BR
Wangyang

-----Original Message-----
From: H.J. Lu <hjl.tools@gmail.com> 
Sent: Sunday, June 6, 2021 10:30 PM
To: Amit Choudhary <amitchoudhary0523@gmail.com>; Guo, Wangyang <wangyang.guo@intel.com>
Cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: Fastest String Search Algorithm.

On Sun, Jun 6, 2021 at 7:15 AM Amit Choudhary via Libc-alpha <libc-alpha@sourceware.org> wrote:
>
> Hi All,
>
> I have invented the fastest string search algorithm. It is called 
> "Choudhary Algorithm". I am sharing it here, in case glibc wants to 
> include it in for strstr().

We have been trying to improve strstr in glibc:

https://gitlab.com/x86-glibc/glibc/-/merge_requests/46

But we couldn't find a solution which is fast for all cases.
Wangyang, please work
with Amit to improve strstr in glibc.

> I have compared my algorithm my other standard algorithms and below 
> are the results.
>
> You can view the results and source code of the algorithm and 
> text/pattern
> here: 
> https://sourceforge.net/projects/fastest-string-search-algo/files/
>
> The search text is around 60 KB long and the pattern is 27 bytes.
>
> Result:
> ---------
>
> Pattern found at index 30384
>
> Time taken by Brute Force algorithm = 57,151,900 ns.
> Time taken by Boyer Moore algorithm = 25,012,100 ns.
> Time taken by Rabin Karp algorithm = 613,410,200 ns.
> Time taken by Knuth Morris Pratt Force algorithm = 58,461,200 ns.
> Time taken by Choudhary algorithm = 17,413,000 ns. (Fastest).
>
> Choudhary string search algorithm is the fastest.
>
> ----------------------------------------------------------------------
> ---------
>
> Regards,
> Amit

Thanks.

--
H.J.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 17:33 ` Florian Weimer
@ 2021-06-08  4:52   ` Siddhesh Poyarekar
  0 siblings, 0 replies; 61+ messages in thread
From: Siddhesh Poyarekar @ 2021-06-08  4:52 UTC (permalink / raw)
  To: Florian Weimer, Wilco Dijkstra via Libc-alpha
  Cc: amitchoudhary0523, Wilco Dijkstra

On 6/7/21 11:03 PM, Florian Weimer via Libc-alpha wrote:
> * Wilco Dijkstra via Libc-alpha:
> 
>> Hi Amit,
>>
>>> So, do you think that 2.28 version would be having the strstr() bug which
>>> fails to match very large patterns.
>>
>> 2.28 was the failing version indeed: https://sourceware.org/bugzilla/show_bug.cgi?id=23637
> 
> On the other hand, most distributions will have patched this by now.
> 
> Amit, a command like
> 
> rpm -q glibc
> dpkg -l libc6
> 
> will show you the distribution version.

... although if the comparison is with distribution strstr, it's likely 
that you're comparing with the microarchitecture optimised version of 
the function and not the generic strstr.

Siddhesh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 11:48     ` Amit Choudhary
@ 2021-06-07 20:12       ` Paul Eggert
  0 siblings, 0 replies; 61+ messages in thread
From: Paul Eggert @ 2021-06-07 20:12 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library, wangyang.guo

On 6/7/21 4:48 AM, Amit Choudhary wrote:
> SMART only tests for the correctness of the algorithm and doesn't print any
> performance numbers.

SMART is just a research toolkit; you may have to futz with it a bit to 
get it to work for you. It does do timing; look in 
smart-13.03/source/timer.h.

By the way, SMART 13.03 crashes right away for me because it relies on 
undefined behavior in calls like strcmp ("foo", NULL); you might need to 
fix that too.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 16:17 Wilco Dijkstra
@ 2021-06-07 17:33 ` Florian Weimer
  2021-06-08  4:52   ` Siddhesh Poyarekar
  0 siblings, 1 reply; 61+ messages in thread
From: Florian Weimer @ 2021-06-07 17:33 UTC (permalink / raw)
  To: Wilco Dijkstra via Libc-alpha; +Cc: amitchoudhary0523, Wilco Dijkstra

* Wilco Dijkstra via Libc-alpha:

> Hi Amit,
>
>> So, do you think that 2.28 version would be having the strstr() bug which
>> fails to match very large patterns.
>
> 2.28 was the failing version indeed: https://sourceware.org/bugzilla/show_bug.cgi?id=23637

On the other hand, most distributions will have patched this by now.

Amit, a command like

rpm -q glibc
dpkg -l libc6

will show you the distribution version.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Fastest String Search Algorithm.
@ 2021-06-07 16:17 Wilco Dijkstra
  2021-06-07 17:33 ` Florian Weimer
  0 siblings, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 16:17 UTC (permalink / raw)
  To: amitchoudhary0523; +Cc: 'GNU C Library'

Hi Amit,

> So, do you think that 2.28 version would be having the strstr() bug which
> fails to match very large patterns.

2.28 was the failing version indeed: https://sourceware.org/bugzilla/show_bug.cgi?id=23637

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 15:21 ` Amit Choudhary
  2021-06-07 16:02   ` Amit Choudhary
@ 2021-06-07 16:04   ` Wilco Dijkstra
  1 sibling, 0 replies; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 16:04 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

>> Anyway, the generic strstr in a recent GLIBC is over 1000 times faster on this test.
>
> So, does this mean that gcc may not be having latest GLIBC.

GLIBC is part of your system, not GCC. Use ldd --version to get the GLIBC version
your system has. If it is old you could upgrade or build a newer GLIBC yourself.

> Can you please let me know what are modern string algorithms and where can
> I find them so that I can compare my algorithm with them.

The generic GLIBC strstr implementation I pointed you at is the one you need to beat
to prove your algorithm is better. It's the fastest known generic strstr implementation
and beats every algorithm in SMART on most inputs.

If you build your own GLIBC, you can run the strstr benchmark (glibc/benchtests/bench-
strstr.c). You can even add your algorithm in the comparison (search for "basic_strstr" -
you'll find it's easy to add).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 15:21 ` Amit Choudhary
@ 2021-06-07 16:02   ` Amit Choudhary
  2021-06-07 16:04   ` Wilco Dijkstra
  1 sibling, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-07 16:02 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

> It's possible you have an old GLIBC - I remember a strstr bug years ago
>> where
>> it failed to match very large patterns.
>>
>
The glibc on my system is ""GNU libc version: 2.28"". Version 2.28 was
released on 2018-08-0.

So, do you think that 2.28 version would be having the strstr() bug which
fails to match very large patterns.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-07 14:46 Wilco Dijkstra
@ 2021-06-07 15:21 ` Amit Choudhary
  2021-06-07 16:02   ` Amit Choudhary
  2021-06-07 16:04   ` Wilco Dijkstra
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-07 15:21 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

> It's possible you have an old GLIBC - I remember a strstr bug years ago
> where
> it failed to match very large patterns.
>

I am using gcc on linux. I will check the version tomorrow.

Anyway, the generic strstr in a recent GLIBC is over 1000 times faster on
> this test.
>

So, does this mean that gcc may not be having latest GLIBC.

Modern string
> algorithms are fast due to not inspecting every single character. They
> skip most of
> the input by jumping ahead by the size of the pattern after a mismatch. So
> perhaps counterintuitively, they become much faster with a larger search
> pattern.
>

Can you please let me know what are modern string algorithms and where can
I find them so that I can compare my algorithm with them.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Fastest String Search Algorithm.
@ 2021-06-07 14:46 Wilco Dijkstra
  2021-06-07 15:21 ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Wilco Dijkstra @ 2021-06-07 14:46 UTC (permalink / raw)
  To: 'GNU C Library', amitchoudhary0523

Hi Amit,

> I wrote a program to test my algorithm with strstr(). The text to search is
> around 60 KB and the pattern is around 1.3 KB.
> 
> The surprise is that """""strstr() failed to find the pattern."""""

It's possible you have an old GLIBC - I remember a strstr bug years ago where
it failed to match very large patterns.

Anyway, the generic strstr in a recent GLIBC is over 1000 times faster on this test.

Basically you don't want to compare with ancient Boyer-Moore or KMP (which AFAIK
nobody uses in the real world) but with the best modern algorithms. Modern string
algorithms are fast due to not inspecting every single character. They skip most of
the input by jumping ahead by the size of the pattern after a mismatch. So perhaps
counterintuitively, they become much faster with a larger search pattern.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-06 18:26   ` Amit Choudhary
@ 2021-06-07 11:48     ` Amit Choudhary
  2021-06-07 20:12       ` Paul Eggert
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-07 11:48 UTC (permalink / raw)
  To: Paul Eggert; +Cc: GNU C Library, wangyang.guo

On Sun, Jun 6, 2021 at 11:56 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
>> If I understand things correctly, the worst-case performance of the new
>> algorithm is the same (within a constant) as brute-force, so saying it's
>> the "fastest" needs to be qualified by the set of inputs for which it's
>> the fastest.
>
>
>
> I will include my algoritm in SMART and test.
>

I included my algorithm in SMART and tested. But """""SMART looks pretty
useless""""" unless I am missing something.

SMART only tests for the correctness of the algorithm and doesn't print any
performance numbers.

And the tests are also very basic. I am listing the tests it does below.

But, please let me know if I have missed something about SMART.

===============Test Cases by SMART==================

   // 1) search for "a" in "aaaaaaaaaa"
    // 2) search for "aa" in "aaaaaaaaaa"
    // 3) search for "aaaaaaaaaa" in "aaaaaaaaaa"
    // 4) search for "b" in "aaaaaaaaaa"
    // 5) search for "abab" in "ababababab"
    // 6) search for "a" in "ababababab"
    // 7) search for "aba" in "ababababab"
    // 8) search for "abc" in "ababababab"
    // 9) search for "ba" in "ababababab"
    // 10) search for "babbbbb" in "ababababab"
    // 11) search for "bcdefg" in "bcdefghilm"
    // 12) search for rand in rand
    // 13) search for rand in rand
    // 14) search for rand in rand
    // 15) search for rand in rand
    // 16) search for rand in rand
    // 17) search for rand in rand
    // 18) search for rand in rand
    // 19) search for "babbbbb" in "abababbbbb"
    // 20) search for "bababb" in "abababbbbb"

=============================================

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-06 14:30 ` H.J. Lu
@ 2021-06-06 18:38   ` Amit Choudhary
  2021-06-09  5:33   ` Guo, Wangyang
  1 sibling, 0 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-06 18:38 UTC (permalink / raw)
  To: H.J. Lu; +Cc: wangyang.guo, GNU C Library

>
>
> We have been trying to improve strstr in glibc:
>
> https://gitlab.com/x86-glibc/glibc/-/merge_requests/46
>
> But we couldn't find a solution which is fast for all cases.
> Wangyang, please work
> with Amit to improve strstr in glibc.
>

Hi Wangyang,

Please let me know what to do and I will do it.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-06 17:58 ` Paul Eggert
@ 2021-06-06 18:26   ` Amit Choudhary
  2021-06-07 11:48     ` Amit Choudhary
  0 siblings, 1 reply; 61+ messages in thread
From: Amit Choudhary @ 2021-06-06 18:26 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha, wangyang.guo

>
>
> If I understand things correctly, the worst-case performance of the new
> algorithm is the same (within a constant) as brute-force, so saying it's
> the "fastest" needs to be qualified by the set of inputs for which it's
> the fastest.

I will include my algoritm in SMART and test.

Also, in real world scenario, my algorithm will almost never hit worst case.

This is because in our day to day search which we encounter every day
(searcing for some text in a file, web page, etc.), my algorithm will
likely out perform other algorithms because the probability of elements of
pattern matching the corresponding elements of text (at position 0, n-1,
n/2, n/4, 3n/4) and then the whole string not matching is very very less.

n is the length of the pattern.

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-06 14:15 Amit Choudhary
  2021-06-06 14:30 ` H.J. Lu
@ 2021-06-06 17:58 ` Paul Eggert
  2021-06-06 18:26   ` Amit Choudhary
  1 sibling, 1 reply; 61+ messages in thread
From: Paul Eggert @ 2021-06-06 17:58 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: libc-alpha, wangyang.guo

On 6/6/21 7:15 AM, Amit Choudhary via Libc-alpha wrote:

> https://sourceforge.net/projects/fastest-string-search-algo/files/
It appears you intend for this code to be in the public domain. If so, 
please change "There is no license for any program/software given in 
this project" to "This project's software is in the public domain". 
Legally, "There is no license" means "I give no permission to use", 
which I assume is not your intent.

If I understand things correctly, the worst-case performance of the new 
algorithm is the same (within a constant) as brute-force, so saying it's 
the "fastest" needs to be qualified by the set of inputs for which it's 
the fastest. Presumably this would involve timing the algorithm with 
more data representing these inputs.

Here is a good example of such an analysis:

Faro S, Lecroq T. The exact online string matching problem: A review of 
the most recent results. ACM Comput Surv. 2013;45(2)1-42. 
https://doi.org/10.1145/2431211.2431212

This analysis uses SMART <https://smart-tool.github.io/smart/> which 
comes with a text corpus. The analysis is related to Charras & Lecroq's 
catalog of string-searching algorithms, here:

https://www-igm.univ-mlv.fr/~lecroq/string/

For a recent survey of the more-general topic, please see:

Hakak SI, Kamsin A, Shivakumara P, Gilkar SA, Khan WZ, Imran M. Exact 
string matching algorithms: survey, issues, and future research 
directions. IEEE Access. 2019;7:69614-69637. 
https://doi.org/10.1109/ACCESS.2019.2914071

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-06 14:15 Amit Choudhary
@ 2021-06-06 14:30 ` H.J. Lu
  2021-06-06 18:38   ` Amit Choudhary
  2021-06-09  5:33   ` Guo, Wangyang
  2021-06-06 17:58 ` Paul Eggert
  1 sibling, 2 replies; 61+ messages in thread
From: H.J. Lu @ 2021-06-06 14:30 UTC (permalink / raw)
  To: Amit Choudhary, wangyang.guo; +Cc: GNU C Library

On Sun, Jun 6, 2021 at 7:15 AM Amit Choudhary via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Hi All,
>
> I have invented the fastest string search algorithm. It is called
> "Choudhary Algorithm". I am sharing it here, in case glibc wants to include
> it in for strstr().

We have been trying to improve strstr in glibc:

https://gitlab.com/x86-glibc/glibc/-/merge_requests/46

But we couldn't find a solution which is fast for all cases.
Wangyang, please work
with Amit to improve strstr in glibc.

> I have compared my algorithm my other standard algorithms and below are the
> results.
>
> You can view the results and source code of the algorithm and text/pattern
> here: https://sourceforge.net/projects/fastest-string-search-algo/files/
>
> The search text is around 60 KB long and the pattern is 27 bytes.
>
> Result:
> ---------
>
> Pattern found at index 30384
>
> Time taken by Brute Force algorithm = 57,151,900 ns.
> Time taken by Boyer Moore algorithm = 25,012,100 ns.
> Time taken by Rabin Karp algorithm = 613,410,200 ns.
> Time taken by Knuth Morris Pratt Force algorithm = 58,461,200 ns.
> Time taken by Choudhary algorithm = 17,413,000 ns. (Fastest).
>
> Choudhary string search algorithm is the fastest.
>
> -------------------------------------------------------------------------------
>
> Regards,
> Amit

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Fastest String Search Algorithm.
@ 2021-06-06 14:15 Amit Choudhary
  2021-06-06 14:30 ` H.J. Lu
  2021-06-06 17:58 ` Paul Eggert
  0 siblings, 2 replies; 61+ messages in thread
From: Amit Choudhary @ 2021-06-06 14:15 UTC (permalink / raw)
  To: libc-alpha

Hi All,

I have invented the fastest string search algorithm. It is called
"Choudhary Algorithm". I am sharing it here, in case glibc wants to include
it in for strstr().

I have compared my algorithm my other standard algorithms and below are the
results.

You can view the results and source code of the algorithm and text/pattern
here: https://sourceforge.net/projects/fastest-string-search-algo/files/

The search text is around 60 KB long and the pattern is 27 bytes.

Result:
---------

Pattern found at index 30384

Time taken by Brute Force algorithm = 57,151,900 ns.
Time taken by Boyer Moore algorithm = 25,012,100 ns.
Time taken by Rabin Karp algorithm = 613,410,200 ns.
Time taken by Knuth Morris Pratt Force algorithm = 58,461,200 ns.
Time taken by Choudhary algorithm = 17,413,000 ns. (Fastest).

Choudhary string search algorithm is the fastest.

-------------------------------------------------------------------------------

Regards,
Amit

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2021-06-12  4:18 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-07 10:53 Fastest String Search Algorithm Wilco Dijkstra
2021-06-07 12:33 ` Amit Choudhary
2021-06-08  6:33 ` Amit Choudhary
2021-06-08  7:25   ` Amit Choudhary
  -- strict thread matches above, loose matches on Subject: below --
2021-06-09 10:31 Wilco Dijkstra
2021-06-09 11:42 ` Amit Choudhary
2021-06-09 12:05   ` Phil Blundell
2021-06-09 12:11     ` Adhemerval Zanella
2021-06-09 12:22       ` Amit Choudhary
2021-06-09 12:55         ` Adhemerval Zanella
2021-06-09 13:18           ` Amit Choudhary
2021-06-09 13:32             ` Amit Choudhary
2021-06-09 13:40             ` Adhemerval Zanella
2021-06-09 14:08               ` Amit Choudhary
2021-06-09 14:25                 ` Phil Blundell
2021-06-09 12:18     ` Amit Choudhary
2021-06-09 13:48       ` Phil Blundell
2021-06-09 21:44   ` Wilco Dijkstra
2021-06-10  2:42     ` Amit Choudhary
2021-06-10  3:23       ` Dan Raymond
2021-06-10  4:19         ` Amit Choudhary
2021-06-10  9:44       ` Wilco Dijkstra
2021-06-10 11:08         ` Amit Choudhary
2021-06-10 16:53           ` Amit Choudhary
2021-06-10 20:04             ` Amit Choudhary
2021-06-11  8:28               ` Amit Choudhary
2021-06-11 11:45           ` Wilco Dijkstra
2021-06-11 12:42             ` Amit Choudhary
2021-06-11 13:39               ` Amit Choudhary
2021-06-11 13:46               ` Wilco Dijkstra
2021-06-11 15:52                 ` Amit Choudhary
2021-06-11 19:14                   ` Amit Choudhary
2021-06-11 19:26                     ` Paul Eggert
2021-06-11 19:41                       ` Amit Choudhary
2021-06-11 20:00                         ` Dan Raymond
2021-06-12  4:00                           ` Amit Choudhary
2021-06-12  4:18                             ` Amit Choudhary
2021-06-11 13:57               ` Paul Zimmermann
2021-06-11 14:04                 ` Amit Choudhary
2021-06-11 14:16                   ` Paul Zimmermann
2021-06-11 14:27                     ` Amit Choudhary
2021-06-07 16:17 Wilco Dijkstra
2021-06-07 17:33 ` Florian Weimer
2021-06-08  4:52   ` Siddhesh Poyarekar
2021-06-07 14:46 Wilco Dijkstra
2021-06-07 15:21 ` Amit Choudhary
2021-06-07 16:02   ` Amit Choudhary
2021-06-07 16:04   ` Wilco Dijkstra
2021-06-06 14:15 Amit Choudhary
2021-06-06 14:30 ` H.J. Lu
2021-06-06 18:38   ` Amit Choudhary
2021-06-09  5:33   ` Guo, Wangyang
2021-06-09  6:18     ` Amit Choudhary
2021-06-09  6:50       ` Siddhesh Poyarekar
2021-06-09  7:38       ` Phil Blundell
2021-06-09  7:54         ` Amit Choudhary
2021-06-09  9:26           ` Phil Blundell
2021-06-06 17:58 ` Paul Eggert
2021-06-06 18:26   ` Amit Choudhary
2021-06-07 11:48     ` Amit Choudhary
2021-06-07 20:12       ` Paul Eggert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).