Fastest String Search Algorithm.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Fastest String Search Algorithm.
@ 2021-06-09 10:31 Wilco Dijkstra
  2021-06-09 11:42 ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 10:31 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: 'GNU C Library'

Hi Amit,

> If my algorithm is beating strstr() for long search patterns then I think
> that my algorithm should be included in glibc for long search patterns.

I tried it in bench-strstr.c, strstr is often 10 times faster on many inputs.
I only noticed a case with a 32KB needle and 64KB text where it beats strstr by
a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic time).

The existing code that handles large needles isn't highly optimized, so it is
certainly possible to improve its performance. Large needles are rare - a size
like 4KB will never be used in practice. A key issue with large needles is quadratic
performance, and any new algorithm will need to address that somehow.

So it's not as simple as showing your algorithm is faster in one case, you have to
show it is faster on average for all large needles and avoid quadratic performance
on specially crafted worst-case inputs. As others have pointed out, this is not an
easy task...

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 10:31 Fastest String Search Algorithm Wilco Dijkstra
@ 2021-06-09 11:42 ` Amit Choudhary
  2021-06-09 12:05   ` Phil Blundell
  2021-06-09 21:44   ` Wilco Dijkstra
  0 siblings, 2 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 11:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Wed, Jun 9, 2021 at 4:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Amit,
>
> > If my algorithm is beating strstr() for long search patterns then I think
> > that my algorithm should be included in glibc for long search patterns.
>
> I tried it in bench-strstr.c, strstr is often 10 times faster on many
> inputs.
> I only noticed a case with a 32KB needle and 64KB text where it beats
> strstr by
> a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic
> time)

I included my algorithm in bench-strstr.c and ran the tests myself. I could
not find strstr-inputs file so I printed the haystack and needle.

I found out many things which I think are not correct:

1. In pass cases (positive cases where needle is part of haystack), the
needle is either located at the beginning of the haystack or at the end of
the haystack. To mimic real world scenario, the needle should be at the
middle of the haystack.

2. The large text is mostly repetition of two small paragraphs. This again
is not a real world scenario.

3. Some tests for 256 byte needles have haystack and needle as a
combination of characters 'a' and 'b'. No real world text has been used.

4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have  haystack and
needle as a combination of characters 'a' and 'b'. No real world text has
been used.

So, in my opinion strstr() has never been tested extensively for real world
text. I think this is a very big shortcoming.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 11:42 ` Amit Choudhary
@ 2021-06-09 12:05   ` Phil Blundell
  2021-06-09 12:11     ` Adhemerval Zanella
  2021-06-09 12:18     ` Amit Choudhary
  2021-06-09 21:44   ` Wilco Dijkstra
  1 sibling, 2 replies; 45+ messages in thread
From: Phil Blundell @ 2021-06-09 12:05 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote:
> 1. In pass cases (positive cases where needle is part of haystack), the
> needle is either located at the beginning of the haystack or at the end of
> the haystack. To mimic real world scenario, the needle should be at the
> middle of the haystack.

Do you have any representative real-world testcases or statistics that you 
can share with us?  For example, what is the reason for believing that the
needle "should" be in the middle?

> So, in my opinion strstr() has never been tested extensively for real world
> text. I think this is a very big shortcoming.

I'm sure improvements would be welcome.

Phil


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:05   ` Phil Blundell
@ 2021-06-09 12:11     ` Adhemerval Zanella
  2021-06-09 12:22       ` Amit Choudhary
  2021-06-09 12:18     ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 12:11 UTC (permalink / raw)
  To: Phil Blundell, Amit Choudhary; +Cc: GNU C Library



On 09/06/2021 09:05, Phil Blundell via Libc-alpha wrote:
> On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote:
>> 1. In pass cases (positive cases where needle is part of haystack), the
>> needle is either located at the beginning of the haystack or at the end of
>> the haystack. To mimic real world scenario, the needle should be at the
>> middle of the haystack.
> 
> Do you have any representative real-world testcases or statistics that you 
> can share with us?  For example, what is the reason for believing that the
> needle "should" be in the middle?
> 
>> So, in my opinion strstr() has never been tested extensively for real world
>> text. I think this is a very big shortcoming.
> 
> I'm sure improvements would be welcome.

Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
meaningful tests is the start point to evaluate a newer strstr implementation.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:11     ` Adhemerval Zanella
@ 2021-06-09 12:22       ` Amit Choudhary
  2021-06-09 12:55         ` Adhemerval Zanella
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 12:22 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library

On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella <
adhemerval.zanella@linaro.org>

>
> Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
> meaningful tests is the start point to evaluate a newer strstr
> implementation.
>

I invented my algorithm because I thought about it.

Getting my algorithm in glibc is not my goal.

I can defintely provide many real world haystack and needle in a file.

But, currently, I don't have much time to change strstr() benchmarks.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:22       ` Amit Choudhary
@ 2021-06-09 12:55         ` Adhemerval Zanella
  2021-06-09 13:18           ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 12:55 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On 09/06/2021 09:22, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
> 
>>
>> Indeed, as Siddhesh has noted I agree that a better bench-strstr with more
>> meaningful tests is the start point to evaluate a newer strstr
>> implementation.
>>
> 
> I invented my algorithm because I thought about it.
> 
> Getting my algorithm in glibc is not my goal.
> 
> I can defintely provide many real world haystack and needle in a file.
> 
> But, currently, I don't have much time to change strstr() benchmarks.

The problem is really hard to judge any improvement if we can't readily
verify and replicate the results in a consistent manner.  The benchtests
we have is not perfect, but it is a starting point so any developer and
reviewer can verify and play with the proposed change.

You outlined possible limitations of current benchtests, so I would
ask you if could improve the tests so it would make the review and
test easier.

Throwing numbers and external projects (specially ones that does not
seem to have much sense such as java source code) it is not a way to
discuss a potential performance improvements. Usually reviewers, including
myself, won't spend any time trying to adjust glibc code to check your
claims.

Also, I would like to ask to please work and listen to Wilco suggestions.
He was the one that code the current strstr implementation and he did a
bit fair or research on the topic before come with a newer implementation.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:55         ` Adhemerval Zanella
@ 2021-06-09 13:18           ` Amit Choudhary
  2021-06-09 13:32             ` Amit Choudhary
  2021-06-09 13:40             ` Adhemerval Zanella
  0 siblings, 2 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 13:18 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
adhemerval.zanella@linaro.org>

>
> The problem is really hard to judge any improvement if we can't readily
> verify and replicate the results in a consistent manner.  The benchtests
> we have is not perfect, but it is a starting point so any developer and
> reviewer can verify and play with the proposed change.
>

The current tests are skewed in favor of strstr().

I can do more testing from my side with my code and submit results again
but I don't have time to fix strstr() benchmarks.


Throwing numbers and external projects (specially ones that does not
> seem to have much sense such as java source code)


I have sent C code also.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:18           ` Amit Choudhary
@ 2021-06-09 13:32             ` Amit Choudhary
  2021-06-09 13:40             ` Adhemerval Zanella
  1 sibling, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 13:32 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 6:48 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
>
> On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
>
>>
>>
> I can do more testing from my side with my code and submit results again
> but I don't have time to fix strstr() benchmarks.
>

The main point is that it is not easy to fix strstr() benchmarks.

There is no strstr-inputs file while there are inputs files for other
functions.

If there were a strstr-inputs file then I would have changed it to include
real world text.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:18           ` Amit Choudhary
  2021-06-09 13:32             ` Amit Choudhary
@ 2021-06-09 13:40             ` Adhemerval Zanella
  2021-06-09 14:08               ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Adhemerval Zanella @ 2021-06-09 13:40 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On 09/06/2021 10:18, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella <
> adhemerval.zanella@linaro.org>
> 
>>
>> The problem is really hard to judge any improvement if we can't readily
>> verify and replicate the results in a consistent manner.  The benchtests
>> we have is not perfect, but it is a starting point so any developer and
>> reviewer can verify and play with the proposed change.
>>
> 
> The current tests are skewed in favor of strstr().

This is hardly a justification to not improve the current benchmark.  If
you check the bench-strstr.c repository history you will see that Wilco
has actively improved it so we could evaluate his newer implementation 
against the previous one.

> 
> I can do more testing from my side with my code and submit results again
> but I don't have time to fix strstr() benchmarks.
> 
> 
> Throwing numbers and external projects (specially ones that does not
>> seem to have much sense such as java source code)
> 
> 
> I have sent C code also.

You send only one test with a specific usage, this is hardly a way to
fully verify the performance implication of a new implementation.  That's
why I think we need to integrate newer tests on the bench-strstr to have
meaningful and *replicated* data.

I am being insistent on this benchmark point because otherwise most
likely no reviewer will try to spend time to evaluate it for you. Specially
with the idea you pose that current tests are 'skewed in favor' of current
implementation.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 13:40             ` Adhemerval Zanella
@ 2021-06-09 14:08               ` Amit Choudhary
  2021-06-09 14:25                 ` Phil Blundell
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 14:08 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra

On Wed, Jun 9, 2021, 7:10 PM Adhemerval

>
> You send only one test with a specific usage, this is hardly a way to
> fully verify the performance implication of a new implementation.  That's
> why I think we need to integrate newer tests on the bench-strstr to have
> meaningful and *replicated* data.
>

I don't think that you have seen what is the haystack and needles for
strstr test.

Basically, there is one haystack and needles with different sizes.

I have done the same tests on my side. I have one text of 60 KB and my
needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB.

So, my tests are similar to those of strstr() benchmark.

Also, I have already mentioned that my algorithm is faster than strstr()
when needle size is greater than or equal to 4.4 KB.

I haven't yet tested with 4 KB needle.

But, as Wilco pointed out, no one will use 4 KB needle so then it is
obvious that my algorithm will not be included in glibc.

So, why should I spend more time on this?

There are many other languages than C (like Java, PHP, etc.) that have
their own implementation of string search.

So, I will spend my time there to evaluate string search algorithms of
these languages.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 14:08               ` Amit Choudhary
@ 2021-06-09 14:25                 ` Phil Blundell
  0 siblings, 0 replies; 45+ messages in thread
From: Phil Blundell @ 2021-06-09 14:25 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 07:38:01PM +0530, Amit Choudhary wrote:
> I have done the same tests on my side. I have one text of 60 KB and my
> needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB.
> 
> So, my tests are similar to those of strstr() benchmark.
> 
> Also, I have already mentioned that my algorithm is faster than strstr()
> when needle size is greater than or equal to 4.4 KB.

Right, but I think you've only established that your algorithm is faster
for this one particular haystack and four different needles.  That's not
to say that it isn't faster in other cases, but we don't know that yet.
It's also not clear to me what the upper complexity bound is for your
algorithm and how that compares to the one we have at the moment.

> So, I will spend my time there to evaluate string search algorithms of
> these languages.

That's entirely fair enough.  Thanks for the suggestions anyway.

Phil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:05   ` Phil Blundell
  2021-06-09 12:11     ` Adhemerval Zanella
@ 2021-06-09 12:18     ` Amit Choudhary
  2021-06-09 13:48       ` Phil Blundell
  1 sibling, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-09 12:18 UTC (permalink / raw)
  To: Phil Blundell; +Cc: GNU C Library

On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote:

>
> Do you have any representative real-world testcases or statistics that you
> can share with us?

I had shared my program earlier which has around 60 KB of text copied from
Internet.

  For example, what is the reason for believing that the
> needle "should" be in the middle?
>

strstr() does a 2 way search - from the beginning of the haystack and from
the end of haystack.

So, if needle is located at the beginning or end of haystack, then strstr()
will outperform many algorithms.

The real test for strstr() would be if needle is located in the middle of
the haystack.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 12:18     ` Amit Choudhary
@ 2021-06-09 13:48       ` Phil Blundell
  0 siblings, 0 replies; 45+ messages in thread
From: Phil Blundell @ 2021-06-09 13:48 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On Wed, Jun 09, 2021 at 05:48:17PM +0530, Amit Choudhary wrote:
> On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote:
> 
> >
> > Do you have any representative real-world testcases or statistics that you
> > can share with us?
> 
> 
> I had shared my program earlier which has around 60 KB of text copied from
> Internet.

How representative is this of the way that programs actually use strstr() 
in practice?  Do you have any statistics on that?

> So, if needle is located at the beginning or end of haystack, then strstr()
> will outperform many algorithms.
> 
> The real test for strstr() would be if needle is located in the middle of
> the haystack.

Right.  My question really was: where do we think the needle is, in fact,
located most often?  If the real-world scenario is that the needle often
is at the beginning or the end then optimising for that is the right
thing to do.

Phil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 11:42 ` Amit Choudhary
  2021-06-09 12:05   ` Phil Blundell
@ 2021-06-09 21:44   ` Wilco Dijkstra
  2021-06-10  2:42     ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 21:44 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> I included my algorithm in bench-strstr.c and ran the tests myself.
> I could not find strstr-inputs file so I printed the haystack and needle.
>
> I found out many things which I think are not correct:
>
>1. In pass cases (positive cases where needle is part of haystack), the needle is 
> either located at the beginning of the haystack or at the end of the haystack.
> To mimic real world scenario, the needle should be at the middle of the haystack.

It's located at the end of the haystack since that is the slowest and hardest case.
There is no point in placing it at the start or in the middle since those cases are
already tested in the smaller haystack cases.

> 2. The large text is mostly repetition of two small paragraphs. This again is not a real world scenario.

Repetition doesn't matter much as long as the sequence is long enough and typical
English text. SMART has very large texts (including protein sequences), but the results
are the same.

> 3. Some tests for 256 byte needles have haystack and needle as a combination of
> characters 'a' and 'b'. No real world text has been used.
>
> 4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have  haystack and needle as a
> combination of characters 'a' and 'b'. No real world text has been used.

That's on purpose to test specific worst-case inputs. The worst-case for your
algorithm would be "a" repeated N times and the needle "a" repeated N/2
times concatenated with "ba" which results in O(N * N) comparisons. And that
is bad for large N.

> So, in my opinion strstr() has never been tested extensively for real world text.
> I think this is a very big shortcoming.

It has been tested extensively, including in SMART. If you're planning to improve
your algorithm I'd suggest to study that. It has lots of information about the
algorithms used and the C implementations are easy to understand, so it should
be very useful.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-09 21:44   ` Wilco Dijkstra
@ 2021-06-10  2:42     ` Amit Choudhary
  2021-06-10  3:23       ` Dan Raymond
  2021-06-10  9:44       ` Wilco Dijkstra
  0 siblings, 2 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-10  2:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Thu, Jun 10, 2021, 3:14 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Amit,
>
> It's located at the end of the haystack since that is the slowest and
> hardest case.
>

In two way search, doesn't strstr() search from end also?

There is no point in placing it at the start or in the middle since those
> cases are
> already tested in the smaller haystack cases.
>

I had printed haystack and needle for all cases but didn't see needle in
the middle.

Testing in the middle should be done for very long haystack also.

Also, repititive text is not very useful because the needle will be located
in the beginning itself.

> That's on purpose to test specific worst-case inputs. The worst-case for
> your
> algorithm would be "a" repeated N times and the needle "a" repeated N/2
> times concatenated with "ba" which results in O(N * N) comparisons. And
> that
> is bad for large N.
>

So, if some algorithm is faster for real world scenario but not for worst
case (and worst case will not occur in real world), then you don't want to
include that algorithm.

This doesn't make sense.

But now this discussion is getting into useless defending mode.

I have seen this issue before also - people who maintain open source
software will try defending what they do even when it is obvious that what
they are doing is not correct.

Google / Facebook, etc. accept their mistakes / bugs easily and sincerely
but not open source software.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  2:42     ` Amit Choudhary
@ 2021-06-10  3:23       ` Dan Raymond
  2021-06-10  4:19         ` Amit Choudhary
  2021-06-10  9:44       ` Wilco Dijkstra
  1 sibling, 1 reply; 45+ messages in thread
From: Dan Raymond @ 2021-06-10  3:23 UTC (permalink / raw)
  To: libc-alpha, amitchoudhary0523

Amit, here is a collection of quotes from you in this thread:

    "I have invented the fastest string search algorithm. It is called 
"Choudhary Algorithm""

    "I hope that people are not of the view that Indians can't invent or 
innovate"

    "for 60 KB text and 3.2 KB pattern, strstr() is faster than my 
algorithm"

    "for 60 KB text and 4.4 KB pattern, my algorithm is faster than 
strstr()"

    "The current tests are skewed in favor of strstr()..."

    "but I don't have time to fix strstr() benchmarks"

    "people who maintain open source software will try defending what 
they do even when it is obvious that what they are doing is not correct"

The reason you are not getting traction here is not because the 
community is racist, defensive, and unwilling to accept their mistakes.  
It is because you have failed to substantiate your claim that your 
algorithm is the "fastest string search algorithm".  By your own 
admission the existing implementation is faster when using existing 
benchmarks.  You claim that the benchmarks are flawed but when you were 
asked to submit improvements you refused.  You claim that "real world" 
scenarios favor your algorithm but then you state that your algorithm 
requires a 4.4 KB search string to demonstrate an advantage.  How is 4.4 
KB a "real world" scenario?  I'm sure there are applications that will 
search for a string that long but I've never encountered one.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  3:23       ` Dan Raymond
@ 2021-06-10  4:19         ` Amit Choudhary
  0 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-10  4:19 UTC (permalink / raw)
  To: Dan Raymond; +Cc: GNU C Library

On Thu, Jun 10, 2021, 8:54 AM Dan Raymond <draymond@foxvalley.net> wrote:

> Amit, here is a collection of quotes from you in this thread:
>
>     "I have invented the fastest string search algorithm. It is called
> "Choudhary Algorithm""
>
>     "I hope that people are not of the view that Indians can't invent or
> innovate"
>
>     "for 60 KB text and 3.2 KB pattern, strstr() is faster than my
> algorithm"
>
>     "for 60 KB text and 4.4 KB pattern, my algorithm is faster than
> strstr()"
>
>     "The current tests are skewed in favor of strstr()..."
>
>     "but I don't have time to fix strstr() benchmarks"
>
>     "people who maintain open source software will try defending what
> they do even when it is obvious that what they are doing is not correct"
>
> The reason you are not getting traction here is not because the
> community is racist, defensive, and unwilling to accept their mistakes.
> It is because you have failed to substantiate your claim that your
> algorithm is the "fastest string search algorithm".  By your own
> admission the existing implementation is faster when using existing
> benchmarks.  You claim that the benchmarks are flawed but when you were
> asked to submit improvements you refused.  You claim that "real world"
> scenarios favor your algorithm but then you state that your algorithm
> requires a 4.4 KB search string to demonstrate an advantage.  How is 4.4
> KB a "real world" scenario?  I'm sure there are applications that will
> search for a string that long but I've never encountered one.
>

First of all, I can give more proofs by testing from my end with different
haystacks / needles but I am not going to fix your benchmark issues.

I am only submitting my algorithm but you are putting extra work on me by
asking me to fix your benchmark issues. This is not my issue, this is the
issue of the person who designed and implemented the benchmark tests.

If you do like this, that is, if someone submits some work and then you ask
him to fix existing issues also then no one will be interested in
submitting their work because they may not want to fix your existing issues
in benchmark tests, etc.

Also, when I thought more, I realized that no one uses strstr() in real
world.

C is now-a-days used only in embedded software / kernel, etc. and I have
programmed a lot in C but I don't ever remember using strstr().

You people think that you are the smartest people on this earth so let me
tell you something about myself.

1. I am programming since 1998.

2. I have programmed in C, C++, JAVA, PHP, PYTHON, HTML, PERL, JAVASCRIPT,
ETC.

3. I had invented a new language and written its interpreter in 2002 when I
was an intern in a company.

4. I had written and published "How to implement system call". It can be
found here:
https://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/

5. I have contributed many patches to linux kernel but left contributing
more when linux kernel developers became defensive.

6. I have worked in Cisco Systems and Juniper Networks in USA.

7. Currently, I am working for a networking company and writing code for 4G
/ 5G networks.

8. I have done B. Tech. in Computer Science from IIT - Varanasi after
getting a rank in IIT-JEE. IIT-JEE is one of the toughest exams in this
world.

9. I have done Masters in Computer Networking from North Carolina State
University, NC, USA.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  2:42     ` Amit Choudhary
  2021-06-10  3:23       ` Dan Raymond
@ 2021-06-10  9:44       ` Wilco Dijkstra
  2021-06-10 11:08         ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-10  9:44 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> It's located at the end of the haystack since that is the slowest and hardest case.
>
> In two way search, doesn't strstr() search from end also?

No, strstr must return the *first* match in the string, so it has to search from the start.
Hence placing the needle at the start is the easiest case, and the end is the worst case.
The TwoWay algorithm splits the needle into two parts and matches them separately -
but it still scans the haystack from the start.

> There is no point in placing it at the start or in the middle since those cases are
> already tested in the smaller haystack cases.
>
> I had printed haystack and needle for all cases but didn't see needle in the middle.
>
> Testing in the middle should be done for very long haystack also.

For testing correctness yes. However in a benchmark there is no difference if the haystack
ends immediately after a match or continues forever. A test like SMART with random needles
at random positions in a huge text would be useful to add of course.

> Also, repititive text is not very useful because the needle will be located in the beginning itself.

It won't since the benchmark explicitly removes all early matches (and of course it checks that
the result is correct as well).

> So, if some algorithm is faster for real world scenario but not for worst case (and worst case
> will not occur in real world), then you don't want to include that algorithm.

Large needles are not real-world, so for these the worst-case matters. The TwoWay algorithm
for large needles is old and relatively inefficient. Anyone who has an algorithm that is not
only faster on real-world needles but also guarantees good worst-case performance on huge
needles would have no trouble getting their patches accepted.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10  9:44       ` Wilco Dijkstra
@ 2021-06-10 11:08         ` Amit Choudhary
  2021-06-10 16:53           ` Amit Choudhary
  2021-06-11 11:45           ` Wilco Dijkstra
  0 siblings, 2 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-10 11:08 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 3049 bytes --]

On Thu, Jun 10, 2021, 3:14 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> Hi Amit,
>
> For testing correctness yes. However in a benchmark there is no difference
> if the haystack
> ends immediately after a match or continues forever.
>

In my opinion, needle should be in the middle so as not to give any
algorithm any unfair advantage - like Boyer-Moore which searches from the
end.

> It won't since the benchmark explicitly removes all early matches (and of
> course it checks that
> the result is correct as well).
>

I have not seen what you are pointing out. I can attach the file which has
all haystacks and needles, but it is 82 MB. I did bzip2 on it but still it
is 564 KB. So, I have removed all haystacks/needles which were a
combination of 'a's and 'b's. I have run bzip2 on that file and now it is
115 KB large. However, the file format is not very good but it is still
easily readable.

I searched for the needle in the haystack using document editor's search
and found that needle was located in the beginning.

> > So, if some algorithm is faster for real world scenario but not for
> worst case (and worst case
> > will not occur in real world), then you don't want to include that
> algorithm.
>
> Large needles are not real-world, so for these the worst-case matters. The
> TwoWay algorithm
> for large needles is old and relatively inefficient. Anyone who has an
> algorithm that is not
> only faster on real-world needles but also guarantees good worst-case
> performance on huge
> needles would have no trouble getting their patches accepted.
>

I can do more tests and provide the inputs and results but I will not use
benchmark tests because I don't find them useful.

I will also not fix benchmark tests.

If all this is acceptable then please let me know. In this case I will
start my testing again.

The worst case scenario of my algorithm is O(mn) where m is the length of
the text and n is the length of the pattern.

And this worst case scenario will be hit in only one case:

haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA...
needle is: ABABA

But my theory is this: We should be fastest 100% if the time. If that is
not possible, then we should try to get as close to 100% as possible.

This means that if for average case, some algorithm is faster 99% of the
time but slower 1% of the time in the worst case scenario then we should go
for this algorithm rather than fretting over worst case scenario.

So, please let me know if you are still amenable to include my algorithm
provided the results for my algorithm are better than strstr(). You can
tell me what haystack / needle to pick.

I am thinking of having 2 long haystacks copied from Internet and then a
function call in the program will generate needle (equal to length passed
to it) from the middle of the haystack (Mostly, the length of needle will
be greater than or equal to 4 KB).

But, please let me know if you have other ideas.

Please find attached the haystacks/needles file from benchmark tests of
strstr (115 KB, bzipped2).

Regards,
Amit

[-- Attachment #2: bench-strstr.out.txt.bz2 --]
[-- Type: application/x-bzip, Size: 117063 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 11:08         ` Amit Choudhary
@ 2021-06-10 16:53           ` Amit Choudhary
  2021-06-10 20:04             ` Amit Choudhary
  2021-06-11 11:45           ` Wilco Dijkstra
  1 sibling, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-10 16:53 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Thu, Jun 10, 2021 at 4:38 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
> The worst case scenario of my algorithm is O(mn) where m is the length of
> the text and n is the length of the pattern.
>
> And this worst case scenario will be hit in only one case:
>
> haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA...
> needle is: ABABA
>

I calculated again and it comes out that in worst case scenario also my
algorithm is linear - O(n).

=============================================

CASE 1: Let's consider this:

haystack: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA... (length = n)
needle: ABABABABA (length = m)

So, when it is going to match needle[0] with haystack[0], it will match and
all other 4 comparisons as well. So, it will do 5 comparisons and then it
will try to match the whole needle, but it will fail only after matching
[0]th character because the [1]st character (X, B) differs. So, it will do
7 comparisons in all when it matches all 5 initial comparisons.

Now, since the needle did not match, then 'i' will be incremented by 1 and
it will immediately fail after 1 comparison (needle[0] = A) is not equal to
((haystack[1] = X). So, it will do only 1 comparison.

This will repeat itself - 7 comparisons, 1 comparison, 7 comparisons, 1
comparison and so on....

So, for n/2 characters it will do 7 comparisons and for other n/2
characters it will do only 1 comparison.

So, total comparisons = (7*n)/2 + n/2 = 8n/2 = 4n.

So, this worst case scenario is O(n).

=============================================

CASE 2: Let's consider this:

haystack: ABCDEFGABCDEFGABCDEFGABCDEFG... (length = n)
needle: ABCDEXG (length = m) {{Here, only needle[m-2] and haystack[m-2]
differs}}.

In this case it will match all initial 5 comparisons for needle[0]. Then it
will try to match whole needle and will fail after (m-2) comparisons.

Now, since the needle did not match, then 'i' will be incremented by 1 and
it will immediately fail after 1 comparison (needle[0] = A) is not equal to
((haystack[1] = B). So, it will do only 1 comparison. This 1 comparison
will continue till next m-1 characters.

So, for 1st m characters, the number of comparisons are 5 + m - 2 + m - 1 =
2m + 2 comparisons.

This 2m + 2 comparisons will be done n/m times.

So, total number of comparisons done = (2m + 2)(n/m) =~ n.

So, this worst case scenario is also O(n).

  =============================================

So, now I think that I can safely say that the worst case scenario for my
algorithm is O(n).

If someone can prove that it is not O(n) but O(m*n) or O(n^2), etc. then
please let me know.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 16:53           ` Amit Choudhary
@ 2021-06-10 20:04             ` Amit Choudhary
  2021-06-11  8:28               ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-10 20:04 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 287 bytes --]

I have now improved my algorithm by skipping some text based on a
preprocessed array on pattern.

Now, my algorithm is faster than strstr() for needles of 1 KB and above.

But I am not sure if you guys are still interested.

Anyways, I am attaching my program (bzipped2).

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.c.bz2 --]
[-- Type: application/bzip2, Size: 25450 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 20:04             ` Amit Choudhary
@ 2021-06-11  8:28               ` Amit Choudhary
  0 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11  8:28 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021, 1:34 AM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

> I have now improved my algorithm by skipping some text based on a
> preprocessed array on pattern.
>
> Now, my algorithm is faster than strstr() for needles of 1 KB and above.
>
> But I am not sure if you guys are still interested.
>
> Anyways, I am attaching my program (bzipped2).
>
> Regards,
> Amit
>

This version of my program has a bug. If needed, I can send the fixed
version.

Also, as it was pointed out that long needles are not used in real world
but this is not the case.

Antivirus software compares long needles (may be greater than 4 KB also) to
search for virus signature in packets.

And this is done everyday, many many times, and that too on millions of
computers everyday.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-10 11:08         ` Amit Choudhary
  2021-06-10 16:53           ` Amit Choudhary
@ 2021-06-11 11:45           ` Wilco Dijkstra
  2021-06-11 12:42             ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-11 11:45 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

> In my opinion, needle should be in the middle so as not to give any
> algorithm any unfair advantage - like Boyer-Moore which searches from the end.

There is no unfair advantage to any algorithm since they all start searching
from the start (note BM does not start at the end).

> I searched for the needle in the haystack using document editor's search
> and found that needle was located in the beginning.

No, it's always located at the end - it is even tested explicitly.

> The worst case scenario of my algorithm is O(mn) where m is the length of
> the text and n is the length of the pattern.

Exactly. And that is problematic if both m and n are large.

> And this worst case scenario will be hit in only one case:

There is not just one worst-case, the number is practically infinite.
A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.

> But my theory is this: We should be fastest 100% if the time. If that is not possible,
> then we should try to get as close to 100% as possible.
> 
> This means that if for average case, some algorithm is faster 99% of the time but
> slower 1% of the time in the worst case scenario then we should go for this
> algorithm rather than fretting over worst case scenario.

For large needles and haystacks the worst case becomes more important because
the slowdown becomes larger with the length. It's bad even if there is a 0.001%
chance that your computer locks up for minutes.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 11:45           ` Wilco Dijkstra
@ 2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
                                 ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 12:42 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

>
>
> > I searched for the needle in the haystack using document editor's search
> > and found that needle was located in the beginning.
>
> No, it's always located at the end - it is even tested explicitly.
>


For some cases it is in the beginning and for some cases it is in the end.



> > The worst case scenario of my algorithm is O(mn) where m is the length of
> > the text and n is the length of the pattern.
>
> Exactly. And that is problematic if both m and n are large.
>
> > And this worst case scenario will be hit in only one case:
>
> There is not just one worst-case, the number is practically infinite.
> A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.
>


I can easily fix this by recording at which index it failed the last time.
Then the first comparison will be at the index of B. And after first m
comparisons, it will fail in the first comparison itself at index of B.

So, after this fix, my algo's worst case scenario will be O(n).

"""""But, I would really appreciate if you can give me a real world
scenario where this worst case scenario will happen."""""



> > But my theory is this: We should be fastest 100% if the time. If that is
> not possible,
> > then we should try to get as close to 100% as possible.
> >
> > This means that if for average case, some algorithm is faster 99% of the
> time but
> > slower 1% of the time in the worst case scenario then we should go for
> this
> > algorithm rather than fretting over worst case scenario.
>
> For large needles and haystacks the worst case becomes more important
> because
> the slowdown becomes larger with the length. It's bad even if there is a
> 0.001%
> chance that your computer locks up for minutes.
>


The worst case scenario almost never occurs in real world situations. The
worst case scenario is almost always on paper.

Please give me an example of how would you encounter worst case sceraio in
real world.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
@ 2021-06-11 13:39               ` Amit Choudhary
  2021-06-11 13:46               ` Wilco Dijkstra
  2021-06-11 13:57               ` Paul Zimmermann
  2 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 13:39 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

On Fri, Jun 11, 2021 at 6:12 PM Amit Choudhary <amitchoudhary0523@gmail.com>
wrote:

>
> On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> wrote:
>
>> > The worst case scenario of my algorithm is O(mn) where m is the length
>> of
>> > the text and n is the length of the pattern.
>>
>> Exactly. And that is problematic if both m and n are large.
>>
>> > And this worst case scenario will be hit in only one case:
>>
>> There is not just one worst-case, the number is practically infinite.
>> A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA.
>>
>
>
> I can easily fix this by recording at which index it failed the last time.
> Then the first comparison will be at the index of B. And after first m
> comparisons, it will fail in the first comparison itself at index of B.
>
> So, after this fix, my algo's worst case scenario will be O(n).
>
> """""But, I would really appreciate if you can give me a real world
> scenario where this worst case scenario will happen."""""
>


I fixed this. I am attaching latest fixed and changed version of my program
(bzipped2).

Regards,
Amit

[-- Attachment #2: compare_string_search_algorithms.c.bz2 --]
[-- Type: application/x-bzip, Size: 25562 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
@ 2021-06-11 13:46               ` Wilco Dijkstra
  2021-06-11 15:52                 ` Amit Choudhary
  2021-06-11 13:57               ` Paul Zimmermann
  2 siblings, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-11 13:46 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

Hi Amit,

>> No, it's always located at the end - it is even tested explicitly.
>
> For some cases it is in the beginning and for some cases it is in the end.

No it is not.

> So, after this fix, my algo's worst case scenario will be O(n).

No - that doesn't solve the fundamental quadratic problem.

> Please give me an example of how would you encounter worst case sceraio in real world.

For example an attacker finds an application that uses strstr and provides
worst case inputs on purpose, causing the application to hang. That is a
denial of service attack. GLIBC tries hard to avoid quadratic algorithms as a
result.

This is not theoretical, slowdowns on rare inputs create serious issues.
Math functions are a good example, there were cases where some math
functions were 1000 or even 100000 times slower on some inputs.
This caused such issues for HPC code that they used to advice not to use
GLIBC for math code - a supercomputer taking weeks rather than hours to
solve a problem because it hit slow cases is disastrous.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 13:46               ` Wilco Dijkstra
@ 2021-06-11 15:52                 ` Amit Choudhary
  2021-06-11 19:14                   ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 15:52 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Fri, Jun 11, 2021 at 7:16 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com>
wrote:

> Hi Amit,
>
> For example an attacker finds an application that uses strstr and provides
> worst case inputs on purpose, causing the application to hang. That is a
> denial of service attack. GLIBC tries hard to avoid quadratic algorithms
> as a
> result.
>

This is a problem of the application that it is not limiting the input size.

Problems like these have happened in the past and that's why we have
strncpy(), etc.

If we don't have strncpy() then denial of service attack is possible
through strcpy() also.

This is not theoretical, slowdowns on rare inputs create serious issues.
> Math functions are a good example, there were cases where some math
> functions were 1000 or even 100000 times slower on some inputs.
> This caused such issues for HPC code that they used to advice not to use
> GLIBC for math code - a supercomputer taking weeks rather than hours to
> solve a problem because it hit slow cases is disastrous.
>

The worst case for a string matching function is different than for maths.

String matching is very simple - just compare characters.

But maths is different - calculating square roots, multiplying 20 floating
point numbers, calculating logarithmic values of large numbers,
factorization, etc. are very CPU intensive.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 15:52                 ` Amit Choudhary
@ 2021-06-11 19:14                   ` Amit Choudhary
  2021-06-11 19:26                     ` Paul Eggert
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 19:14 UTC (permalink / raw)
  To: GNU C Library

To end this whole discussion, there is one point that needs to be clarified
but no one has spoken about it.

The point is this:

If there are two algorithms and algorithm1 performs better than algorithm2
in average case. But in worst case scenario, algorithm2 is better than
algorithm1.

And you can choose only one algorithm.

"""""So, which one would you choose for glibc?"""""

I will go for algorithm1 because it has better performance in average case.

But Wilco will go for algorithm2 because its has better performance in
worst case.

If this point gets clarified then there will be no point of further
discussion.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:14                   ` Amit Choudhary
@ 2021-06-11 19:26                     ` Paul Eggert
  2021-06-11 19:41                       ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Eggert @ 2021-06-11 19:26 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: GNU C Library

On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote:
> If there are two algorithms and algorithm1 performs better than algorithm2
> in average case. But in worst case scenario, algorithm2 is better than
> algorithm1.

For glibc, worst-case is a big deal when it's measurable via the big-O 
notation. This is because of denial-of-service attacks and such-like. So 
in many cases we'll prefer an algorithm with a better worst-case behavior.

Also, it's often easier to measure worst-case (because what is "average" 
is more debatable than what is worst), so focusing on worst-case helps 
cut down the more-difficult and lengthy discussion that would be needed 
to define "average", and this focus can save everybody time.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:26                     ` Paul Eggert
@ 2021-06-11 19:41                       ` Amit Choudhary
  2021-06-11 20:00                         ` Dan Raymond
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 19:41 UTC (permalink / raw)
  To: Paul Eggert; +Cc: GNU C Library

On Sat, Jun 12, 2021, 12:56 AM Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote:
> > If there are two algorithms and algorithm1 performs better than
> algorithm2
> > in average case. But in worst case scenario, algorithm2 is better than
> > algorithm1.
>
> For glibc, worst-case is a big deal when it's measurable via the big-O
> notation. This is because of denial-of-service attacks and such-like. So
> in many cases we'll prefer an algorithm with a better worst-case behavior.
>
> Also, it's often easier to measure worst-case (because what is "average"
> is more debatable than what is worst), so focusing on worst-case helps
> cut down the more-difficult and lengthy discussion that would be needed
> to define "average", and this focus can save everybody time.
>


So, in this case, my algorithm is of no use because it better than strstr()
in average cases but not in worst cases.

Average case means real world scenarios. Worst case means hand crafted
worst case inputs.

So, from my side, this discussion is now over.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 19:41                       ` Amit Choudhary
@ 2021-06-11 20:00                         ` Dan Raymond
  2021-06-12  4:00                           ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Dan Raymond @ 2021-06-11 20:00 UTC (permalink / raw)
  To: libc-alpha, amitchoudhary0523

> So, in this case, my algorithm is of no use because it better than strstr()
> in average cases but not in worst cases.
>
> Average case means real world scenarios. Worst case means hand crafted
> worst case inputs.

Real world scenarios are precisely the ones that are sensitive to 
worst-case performance.  In the real world you can't predict inputs and 
there are consequences to unexpected performance problems.  Denial of 
service attacks are just one example.  By the way your previous argument 
that antivirus software uses strstr() to search for virus signatures is 
false.  A virus signature is not a NUL-terminated string.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 20:00                         ` Dan Raymond
@ 2021-06-12  4:00                           ` Amit Choudhary
  2021-06-12  4:18                             ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12  4:00 UTC (permalink / raw)
  To: Dan Raymond; +Cc: GNU C Library

On Sat, Jun 12, 2021, 1:32 AM Dan Raymond <draymond@foxvalley.net> wrote:
  By the way your previous argument

> that antivirus software uses strstr() to search for virus signatures is
> false.  A virus signature is not a NUL-terminated string.
>

I never said that strstr() is used in antivirus software.

I only said that antivirus software use long needles for searching and I
said this because someone had put forth the argument that long needles
don't occur in real world.

But as you said that in real world anything can happen then the current
version of strstr() is susceptible to denial of service attack because for
long needles (32 KB, etc.), strstr() is slow.

But one thing is very ironic. On one hand people say that anything can
happen in real world and on the other they contradict this by saying that
long needles don't occur in real world.

If anything can happen in real world, then strstr() should be testef for 10
MB long needle in 100 MB long haystack.

But this is not tested. Actually, strstr() has not been tested for needles
beyond 1 KB and haystack beyong 65 KB.

So, the argument that anything can happen in the real world in not
reflected in the benchmark tests for strstr().

But, please don't ask me to fix benchmark tests.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-12  4:00                           ` Amit Choudhary
@ 2021-06-12  4:18                             ` Amit Choudhary
  2021-06-12  7:16                               ` [Final testing with benchmark tests] " Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12  4:18 UTC (permalink / raw)
  To: GNU C Library

By the way, Paul Eggert thinks that this discussion is a waste of time for
glibc people.

So, I will request people to stop this thread of discussion now.

So, please don't reply to any message in this thread.

But if someone replies, then I might reply back.

But please don't blame me for this discussion. People who replied /
responded in this thread are equally at fault for continuing this
discussion.

Paul blamed me for wasting glibc developers time and I really didn't like
that.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12  4:18                             ` Amit Choudhary
@ 2021-06-12  7:16                               ` Amit Choudhary
  2021-06-12  8:01                                 ` Paul Zimmermann
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12  7:16 UTC (permalink / raw)
  To: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 458 bytes --]

Hi All,

I ran the benchmark tests for my algorithm and for strstr().

Total number of tests = 332.

My algorithm is faster 182 times.

strstr() is faster 150 times.

For worst case scenario my algorithm is not faster than strstr().

Please find the attached file "bench-strstr.out".

The input had a variation of " and ' in unicode I guess, so I changed them
to " and ' of english alphabet.

Please find the attached file "bench-strstr.out".

Regards,
Amit

[-- Attachment #2: bench-strstr.out --]
[-- Type: application/octet-stream, Size: 7601 bytes --]


c: This means that choudhary algorithm is faster in this case
s: This means that strstr() is faster in this case

   choudhary algo       strstr() 
==============================================================
c: 10453.9				19389.8
c: 1105.47				5363.28
c: 11310.9				19953.9
c: 1313.28				2221.88
c: 1392.97				1971.09
c: 1506.25				3337.5
c: 1557.81				2535.16
c: 1593.75				2062.5
c: 1663.28				2941.41
c: 16780.5				13126.6
c: 1684.38				1958.59
c: 16975				47172.7
c: 17593.8				48645.3
c: 1857.03				2639.06
c: 2064.06				4439.06
c: 20754.7				36578.9
c: 208.594				229.688
c: 20830.5				36247.7
c: 225				    839.062
c: 225.781				622.656
c: 232.812				338.281
c: 232.812				487.5
c: 234.375				333.594
c: 235.938				532.812
c: 239.844				1327.34
c: 242.969				542.969
c: 245.312				641.406
c: 248.438				878.906
c: 250.781				638.281
c: 255.469				1050.78
c: 256.25				571.875
c: 256.25				949.219
c: 259.375				766.406
c: 260.938				1163.28
c: 261.719				1049.22
c: 261.719				397.656
c: 262.5				387.5
c: 2663.28				25135.2
c: 267.188				856.25
c: 2681.25				2820.31
c: 272.656				1092.19
c: 2727.34				4802.34
c: 275				    289.062
c: 27609.4				208379
c: 279.688				357.031
c: 279.688				901.562
c: 280.469				887.5
c: 280.469				996.875
c: 294.531				1128.12
c: 294.531				1317.19
c: 295.312				631.25
c: 2960.16				3276.56
c: 298.438				361.719
c: 298.438				997.656
c: 302.344				1568.75
c: 304.688				396.094
c: 304.688				500.781
c: 3047.66				4343.75
c: 306.25				410.156
c: 309.375				509.375
c: 309.375				603.125
c: 310.938				625.781
c: 3138.28				25089.8
c: 315.625				574.219
c: 316.406				876.562
c: 317.969				564.062
c: 3181.25				4013.28
c: 319.531				1339.06
c: 320.312				1094.53
c: 323.438				698.438
c: 324.219				696.094
c: 332.031				493.75
c: 334.375				398.438
c: 335.156				573.438
c: 335.938				1003.91
c: 344.531				955.469
c: 345.312				1021.09
c: 347.656				650.781
c: 348.438				1131.25
c: 356.25				513.281
c: 358.594				1125.78
c: 359.375				1137.5
c: 359.375				479.688
c: 360.938				1371.88
c: 363.281				565.625
c: 363.281				679.688
c: 364.062				399.219
c: 364.062				403.125
c: 367.188				1046.88
c: 367.188				782.812
c: 373.438				848.438
c: 374.219				652.344
c: 374.219				899.219
c: 376.562				1054.69
c: 3761.72				4484.38
c: 380.469				418.75
c: 381.25				391.406
c: 383.594				435.156
c: 389.844				638.281
c: 392.188				626.562
c: 398.438				616.406
c: 405.469				482.812
c: 414.062				910.938
c: 4192.97				4861.72
c: 421.094				1138.28
c: 421.094				1151.56
c: 422.656				704.688
c: 428.125				436.719
c: 428.125				615.625
c: 428.906				1072.66
c: 428.906				1203.12
c: 432.812				1599.22
c: 433.594				1090.62
c: 434.375				617.969
c: 435.938				579.688
c: 437.5				870.312
c: 438.281				455.469
c: 439.062				468.75
c: 439.062				494.531
c: 439.062				582.812
c: 442.969				1020.31
c: 443.75				859.375
c: 446.094				522.656
c: 447.656				1488.28
c: 448.438				915.625
c: 456.25				1046.09
c: 457.812				660.938
c: 467.188				533.594
c: 479.688				748.438
c: 483.594				1385.94
c: 483.594				929.688
c: 48618.8				76608.6
c: 490.625				496.875
c: 493.75				1175
c: 493.75				641.406
c: 502.344				664.062
c: 503.125				1589.84
c: 503.906				1421.09
c: 511.719				1041.41
c: 517.969				525.781
c: 51726.6				75812.5
c: 521.094				1164.84
c: 527.344				666.406
c: 53248.4				71593.8
c: 53338.3				74159.4
c: 537.5				720.312
c: 542.969				1200.78
c: 550.781				1092.19
c: 551.562				589.844
c: 58021.9				54292.2
c: 586.719				1548.44
c: 598.438				654.688
c: 603.125				923.438
c: 607.031				1297.66
c: 607.812				1372.66
c: 628.125				732.812
c: 636.719				932.031
c: 661.719				1125.78
c: 665.625				1517.97
c: 681.25				778.906
c: 687.5				1145.31
c: 7032.81				9925.78
c: 710.938				2553.12
c: 712.5				734.375
c: 714.062				5446.09
c: 714.062				808.594
c: 717.188				1564.06
c: 725				    1553.12
c: 733.594				1417.19
c: 735.156				1314.84
c: 7624.22				10057.8
c: 778.906				1423.44
c: 786.719				1428.91
c: 7864.06				9322.66
c: 804.688				1451.56
c: 807.031				1608.59
c: 807.812				1328.91
c: 888230				    1.03685e+06
c: 921.094				1739.84
c: 936.719				1496.09
c: 953.125				1503.12
c: 970.312				1877.34

s: 1.02894e+08				329970
s: 1.09189e+07				44238.3
s: 1.3712e+06				1.35065e+06
s: 1.45997e+06				1.42336e+06
s: 1.51957e+06				1.47132e+06
s: 1.56013e+06				1.49793e+06
s: 1.57082e+06				1.49573e+06
s: 1.90037e+08				345765
s: 1.99525e+08				56757
s: 1002.34				57.0312
s: 1005.47				70.3125
s: 10122.7				4859.38
s: 1034.38				117.969
s: 1056.25				50				    
s: 1078.12				56.25
s: 1185.16				88.2812
s: 1194.53				77.3438
s: 1206.25				67.1875
s: 1244.53				97.6562
s: 12890.6				6563.28
s: 13193				10729.7
s: 133709				39225.8
s: 138122				40568
s: 14467.2				7554.69
s: 1471.09				83.5938
s: 1499.22				62.5
s: 17743				13910.2
s: 1817.19				1497.66
s: 1924.22				12392.2
s: 2.81242e+07				291209
s: 2.84542e+06				314376
s: 225.781				195.312
s: 22570.3				14451.6
s: 238.281				61.7188
s: 24030.5				14821.9
s: 242.188				48.4375
s: 242.969				46.875
s: 2447.66				1464.84
s: 246.875				54.6875
s: 253.906				74.2188
s: 2553.91				12784.4
s: 260.156				54.6875
s: 26240.6				13806.2
s: 275				    185.156
s: 275				    52.3438
s: 275.781				92.1875
s: 27962.5				208714
s: 281.25				227.344
s: 282.031				356.25
s: 282.031				98.4375
s: 286.719				100				    
s: 288.281				57.8125
s: 290.625				46.875
s: 296.094				284.375
s: 3.68725e+06				49706.2
s: 3.75006e+08				106175
s: 300.781				100				    
s: 302.344				246.875
s: 305.469				282.812
s: 30765.6				17076.6
s: 311.719				173.438
s: 31432.8				13260.2
s: 31812.5				17928.9
s: 325				    216.406
s: 325				    253.125
s: 332.812				51.5625
s: 335.938				277.344
s: 3385.94				3069.53
s: 340.625				310.938
s: 342.188				129.688
s: 342.188				96.0938
s: 3433.59				3389.84
s: 348.438				330.469
s: 358.594				108.594
s: 359.375				98.4375
s: 362.5				346.875
s: 363.281				359.375
s: 367.188				87.5
s: 367.969				209.375
s: 3692.97				2645.31
s: 370.312				369.531
s: 380.469				295.312
s: 387.5		    		183.594
s: 392.969				307.031
s: 39461.7				31028.1
s: 397.656				132.031
s: 398.438				92.9688
s: 39895.3				32588.3
s: 399.219				231.25
s: 4.92685e+08				526623
s: 4.99937e+07				25910.9
s: 403.125				129.688
s: 403.125				381.25
s: 407.031				47.6562
s: 416.406				92.1875
s: 421.094				131.25
s: 425.781				250.781
s: 426.562				267.969
s: 432.812				87.5
s: 443.75				353.125
s: 4471.88				3709.38
s: 450.781				375				    
s: 452.344				412.5
s: 460.156				119.531
s: 469.531				384.375
s: 471.875				242.969
s: 475.781				84.375
s: 480.469				274.219
s: 482.031				428.125
s: 485.156				246.094
s: 486.719				332.031
s: 517.188				351.562
s: 517.188				492.969
s: 525.781				50				    
s: 543.75				338.281
s: 546.094				98.4375
s: 546.875				303.906
s: 553.125				268.75
s: 56386.7				54726.6
s: 567.188				531.25
s: 568.75				62.5
s: 591.406				46.0938
s: 5931.25				2490.62
s: 608.594				285.156
s: 624.219				225				    
s: 627.344				342.969
s: 635.156				84.375
s: 641.406				111.719
s: 641.406				405.469
s: 642.188				612.5
s: 66678.1				22666.4
s: 68185.9				23725.8
s: 68239.8				43514.1
s: 68878.1				42614.1
s: 6993.75				4001.56
s: 715.625				321.875
s: 750				    389.062
s: 752.344				246.875
s: 7521.88				4145.31
s: 762.5				561.719
s: 7743.75				2208.59
s: 793.75				60.9375
s: 8.20611e+06				307977
s: 817.188				71.0938
s: 835.156				403.125
s: 8601.56				4188.28
s: 8924.22				8514.06
s: 9.82335e+08				357959
s: 912.5				628.906
s: 9723.44				8916.41
==============================================================


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12  7:16                               ` [Final testing with benchmark tests] " Amit Choudhary
@ 2021-06-12  8:01                                 ` Paul Zimmermann
  2021-06-12  8:41                                   ` Amit Choudhary
  2021-06-12 10:56                                   ` Siddhesh Poyarekar
  0 siblings, 2 replies; 45+ messages in thread
From: Paul Zimmermann @ 2021-06-12  8:01 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: libc-alpha

       Dear Amit,

> I ran the benchmark tests for my algorithm and for strstr().
> 
> Total number of tests = 332.
> 
> My algorithm is faster 182 times.
> 
> strstr() is faster 150 times.

and your algorithm can be up to 3531 times slower than strstr():

s: 3.75006e+08				106175

The largest strstr() time is 1.5e6, whereas the largest time of your
algorithm is 9.8e8, i.e., 656 times larger.

Can you improve your algorithm so that the worst case time is similar to that
of strstr()?

Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12  8:01                                 ` Paul Zimmermann
@ 2021-06-12  8:41                                   ` Amit Choudhary
  2021-06-12 13:29                                     ` Amit Choudhary
  2021-06-12 10:56                                   ` Siddhesh Poyarekar
  1 sibling, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12  8:41 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: GNU C Library

On Sat, Jun 12, 2021 at 1:31 PM Paul Zimmermann <Paul.Zimmermann@inria.fr>
wrote:

>        Dear Amit,
>
> > I ran the benchmark tests for my algorithm and for strstr().
> >
> > Total number of tests = 332.
> >
> > My algorithm is faster 182 times.
> >
> > strstr() is faster 150 times.
>
> and your algorithm can be up to 3531 times slower than strstr():
>
> s: 3.75006e+08                          106175
>
> The largest strstr() time is 1.5e6, whereas the largest time of your
> algorithm is 9.8e8, i.e., 656 times larger.
>
> Can you improve your algorithm so that the worst case time is similar to
> that
> of strstr()?
>
> Paul
>

I don't think my algorithm can be optimized more. Anyways, I had invented
my algorithm thinking about normal average case scenarios and not for worst
case scenarios.

I was talking with Dr. Moore (inventor of Boyer-Moore algorithm) and he
suggested me to demonstrate that my algorithm is better for average case by
doing exhaustive testing across a wide variety of patterns and texts.

He also told me that he had invented his algorithm for average case too and
not thinking much about worst case scenario.

==================What Dr. Moore said========================
If your algorithm is worse case O(mn) then you should
somehow argue (and demonstrate with exhausting testing
across a wide variety of patterns and texts) that its
``average'' behavior is better!  The point is that the
naive algorithm is worst-case O(mn) and why should a
person use yours rather than that simpler algorithm?
==========================================================

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12  8:41                                   ` Amit Choudhary
@ 2021-06-12 13:29                                     ` Amit Choudhary
  0 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12 13:29 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 2501 bytes --]

I optimized my algorithm and I ran the benchmark tests again for complex
needles only.

Now, because of my optimization, my algorithm is faster than strstr() in
worst case scenario also.

Please find the results below.

(I am attaching the file also.

********************************************************************************************************************************************

S: strstr() is faster (5 times)
C: choudhary algorithm is faster (25 times)

                                       twoway_strstr    chal

=========================================================================
S: Length 65536/ 64, complex needle 1: 49578.9 147700
S: Length 65536/256, complex needle 1: 47225 140834
S: Length 65536/1024, complex needle 1: 23112.5 138987
S: Length 65536/4096, complex needle 1: 47460.2 144605
S: Length 65536/8192, complex needle 1: 88410.9 153084

C: Length 65536/ 64, complex needle 2: 335094 133933
C: Length 65536/ 64, complex needle 3: 1.58005e+06 138910
C: Length 65536/256, complex needle 2: 329751 137798
C: Length 65536/256, complex needle 3: 1.5351e+06 137376
C: Length 65536/1024, complex needle 2: 295184 135700
C: Length 65536/1024, complex needle 3: 1.50964e+06    136388
C: Length 65536/4096, complex needle 2: 303698 145081
C: Length 65536/4096, complex needle 3: 1.45478e+06 135175
C: Length 65536/8192, complex needle 2: 328699 152917
C: Length 65536/8192, complex needle 3: 1.394e+06 133727
C: Length 65536/28672, complex needle 1: 275716 193667
C: Length 65536/28672, complex needle 2: 450194 196630
C: Length 65536/28672, complex needle 3: 1.02255e+06 126090
C: Length 65536/32768, complex needle 1: 316611 201831
C: Length 65536/32768, complex needle 2: 471559 204520
C: Length 65536/32768, complex needle 3: 948845 124384
C: Length 65536/40960, complex needle 1: 409391 220575
C: Length 65536/40960, complex needle 2: 516094 237880
C: Length 65536/40960, complex needle 3: 831214 121766
C: Length 65536/51200, complex needle 1: 495124 241209
C: Length 65536/51200, complex needle 2: 560630 241209
C: Length 65536/51200, complex needle 3: 653866 117998
C: Length 65536/65536, complex needle 1: 625619 269684
C: Length 65536/65536, complex needle 2: 649592 269709
C: Length 65536/65536, complex needle 3: 508453 113078
=========================================================================

********************************************************************************************************************************************

Regards,
Amit

[-- Attachment #2: bench-strstr.out --]
[-- Type: application/octet-stream, Size: 2104 bytes --]


S: strstr() is faster (5 times)
C: choudhary algorithm is faster (25 times)

                       	                twoway_strstr	    chal

=========================================================================
S: Length 65536/ 64, complex needle 1:		49578.9				147700		
S: Length 65536/256, complex needle 1:		47225				140834		
S: Length 65536/1024, complex needle 1:	23112.5				138987		
S: Length 65536/4096, complex needle 1:	47460.2				144605		
S: Length 65536/8192, complex needle 1:	88410.9				153084		

C: Length 65536/ 64, complex needle 2:		335094				133933		
C: Length 65536/ 64, complex needle 3:		1.58005e+06			138910		
C: Length 65536/256, complex needle 2:		329751				137798		
C: Length 65536/256, complex needle 3:		1.5351e+06			137376		
C: Length 65536/1024, complex needle 2:	295184				135700		
C: Length 65536/1024, complex needle 3:	1.50964e+06		    136388		
C: Length 65536/4096, complex needle 2:	303698				145081		
C: Length 65536/4096, complex needle 3:	1.45478e+06			135175		
C: Length 65536/8192, complex needle 2:	328699				152917		
C: Length 65536/8192, complex needle 3:	1.394e+06			133727		
C: Length 65536/28672, complex needle 1:	275716				193667		
C: Length 65536/28672, complex needle 2:	450194				196630		
C: Length 65536/28672, complex needle 3:	1.02255e+06			126090		
C: Length 65536/32768, complex needle 1:	316611				201831		
C: Length 65536/32768, complex needle 2:	471559				204520		
C: Length 65536/32768, complex needle 3:	948845				124384		
C: Length 65536/40960, complex needle 1:	409391				220575		
C: Length 65536/40960, complex needle 2:	516094				237880		
C: Length 65536/40960, complex needle 3:	831214				121766		
C: Length 65536/51200, complex needle 1:	495124				241209		
C: Length 65536/51200, complex needle 2:	560630				241209		
C: Length 65536/51200, complex needle 3:	653866				117998		
C: Length 65536/65536, complex needle 1:	625619				269684		
C: Length 65536/65536, complex needle 2:	649592				269709		
C: Length 65536/65536, complex needle 3:	508453				113078		
=========================================================================


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12  8:01                                 ` Paul Zimmermann
  2021-06-12  8:41                                   ` Amit Choudhary
@ 2021-06-12 10:56                                   ` Siddhesh Poyarekar
  2021-06-12 13:47                                     ` Amit Choudhary
  1 sibling, 1 reply; 45+ messages in thread
From: Siddhesh Poyarekar @ 2021-06-12 10:56 UTC (permalink / raw)
  To: Paul Zimmermann, Amit Choudhary; +Cc: libc-alpha

On 6/12/21 1:31 PM, Paul Zimmermann wrote:
>         Dear Amit,
> 
>> I ran the benchmark tests for my algorithm and for strstr().
>>
>> Total number of tests = 332.
>>
>> My algorithm is faster 182 times.
>>
>> strstr() is faster 150 times.
> 
> and your algorithm can be up to 3531 times slower than strstr():
> 
> s: 3.75006e+08				106175
> 
> The largest strstr() time is 1.5e6, whereas the largest time of your
> algorithm is 9.8e8, i.e., 656 times larger.
> 
> Can you improve your algorithm so that the worst case time is similar to that
> of strstr()?

This is going around in circles so I am going to summarize the several 
problems (and not just the worst case time) here that need to be 
addressed before this proposal can even be considered with any 
seriousness.  You may consider this my objection to the proposal 
conditional these points being resolved.

Starting with the technical issues:

1. In the benchmark results shared, the geomean performance of the 
proposed algorithm is worse by 22.62%.  This is by no means (no pun 
intended) a better result.  In general, worse cases are much worse than 
when they're better.

2. The worst case performances are thousands of times slower as PaulZ 
pointed out.  But wait, it's worse because it's probably not an issue of 
scaling with length...

3. ... The benchmark results shared don't actually indicate which length 
pairs result in those worst case performances but if I had to assume 
that it's related to glibc benchmark inputs, (there's one input less 
than the glibc benchmarks) it appears that the performance starts going 
downhill from 224 byte haystacks, especially with misaligned strings.

4. Given that the benchmark program is outside of the benchtests 
infrastructure (because this is clearly not the benchtests program), 
there is no information about how the benchmark is run and what 
conditions affect its output.

5. The baseline for comparison is not clear.  Is it the default strstr() 
in string/strstr.c?  Is it the system strstr()?  In the latter case, it 
could be anything from string/strstr.c to the various 
microarchitecture-specific asm ifuncs.

6. Amit has made handwavy claims about why a certain set of inputs 
matter over others without providing any references to prior art or 
reviewable experimental data that can back his claims.  The proposal 
needs clearer commentary here with proper references.

7. Refusal to use the glibc-benchtests - either in its current form or 
by adding more inputs that the community agrees as relevant - is IMO a 
blocker to getting any changes accepted to strstr().  A better strstr() 
must show better results on the benchmarks, without which future 
maintenance of the function will be challenging.  If Amit doesn't write 
those, someone else needs to volunteer to do it for him.

Now on to the non-technical issues:

1. First and foremost, Amit has been insinuating racial discrimination 
or crying victim when someone makes suggestions for verification or 
points out flaws in the algorithm.  He was warned once[1] but has 
refused to treat community members with respect.  Let this not be the 
low we set for acceptable behaviour in the community.

2. There is a plain text algorithm and several C code snippets, none of 
which have proper copyright notices.  There is no indication even of 
whether Amit owns the copyright for the shared content (as opposed to 
his employer since many employers claim ownership for everything the 
employee creates) or whether he has given us permission to use the content.

3. On a related point, it is unclear if Amit or his employer have signed 
a copyright assignment with the FSF.  I personally would like to see 
glibc move away from the assignment like gcc did recently, but that's an 
orthogonal point at the moment.

4. Amit has on various occasions expressed disinterest in contributing 
to glibc and put conditions on what he will or will not do even before 
posting a single line patch.  I have no confidence that if we do end up 
accepting his contribution at some point, he will make himself available 
to maintain the code he contributed, leaving the community to deal with 
the technical debt.

Siddhesh

[1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12 10:56                                   ` Siddhesh Poyarekar
@ 2021-06-12 13:47                                     ` Amit Choudhary
  2021-06-13  3:32                                       ` Siddhesh Poyarekar
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-12 13:47 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: Paul Zimmermann, GNU C Library

Siddhesh,

In my opinion, your comments are totally meaningless.

I am not going to write more because this will again get into long
discussion.

If my algorithm is better and if it is not included in glibc code then I am
not going to suffer, the world will suffer because the world will be
missing out on a better algorithm.

I will always use my string search algorithm instead of strstr().

Regards,
Amit


On Sat, Jun 12, 2021 at 4:26 PM Siddhesh Poyarekar <siddhesh@gotplt.org>
wrote:

> On 6/12/21 1:31 PM, Paul Zimmermann wrote:
> >         Dear Amit,
> >
> >> I ran the benchmark tests for my algorithm and for strstr().
> >>
> >> Total number of tests = 332.
> >>
> >> My algorithm is faster 182 times.
> >>
> >> strstr() is faster 150 times.
> >
> > and your algorithm can be up to 3531 times slower than strstr():
> >
> > s: 3.75006e+08                                106175
> >
> > The largest strstr() time is 1.5e6, whereas the largest time of your
> > algorithm is 9.8e8, i.e., 656 times larger.
> >
> > Can you improve your algorithm so that the worst case time is similar to
> that
> > of strstr()?
>
> This is going around in circles so I am going to summarize the several
> problems (and not just the worst case time) here that need to be
> addressed before this proposal can even be considered with any
> seriousness.  You may consider this my objection to the proposal
> conditional these points being resolved.
>
> Starting with the technical issues:
>
> 1. In the benchmark results shared, the geomean performance of the
> proposed algorithm is worse by 22.62%.  This is by no means (no pun
> intended) a better result.  In general, worse cases are much worse than
> when they're better.
>
> 2. The worst case performances are thousands of times slower as PaulZ
> pointed out.  But wait, it's worse because it's probably not an issue of
> scaling with length...
>
> 3. ... The benchmark results shared don't actually indicate which length
> pairs result in those worst case performances but if I had to assume
> that it's related to glibc benchmark inputs, (there's one input less
> than the glibc benchmarks) it appears that the performance starts going
> downhill from 224 byte haystacks, especially with misaligned strings.
>
> 4. Given that the benchmark program is outside of the benchtests
> infrastructure (because this is clearly not the benchtests program),
> there is no information about how the benchmark is run and what
> conditions affect its output.
>
> 5. The baseline for comparison is not clear.  Is it the default strstr()
> in string/strstr.c?  Is it the system strstr()?  In the latter case, it
> could be anything from string/strstr.c to the various
> microarchitecture-specific asm ifuncs.
>
> 6. Amit has made handwavy claims about why a certain set of inputs
> matter over others without providing any references to prior art or
> reviewable experimental data that can back his claims.  The proposal
> needs clearer commentary here with proper references.
>
> 7. Refusal to use the glibc-benchtests - either in its current form or
> by adding more inputs that the community agrees as relevant - is IMO a
> blocker to getting any changes accepted to strstr().  A better strstr()
> must show better results on the benchmarks, without which future
> maintenance of the function will be challenging.  If Amit doesn't write
> those, someone else needs to volunteer to do it for him.
>
> Now on to the non-technical issues:
>
> 1. First and foremost, Amit has been insinuating racial discrimination
> or crying victim when someone makes suggestions for verification or
> points out flaws in the algorithm.  He was warned once[1] but has
> refused to treat community members with respect.  Let this not be the
> low we set for acceptable behaviour in the community.
>
> 2. There is a plain text algorithm and several C code snippets, none of
> which have proper copyright notices.  There is no indication even of
> whether Amit owns the copyright for the shared content (as opposed to
> his employer since many employers claim ownership for everything the
> employee creates) or whether he has given us permission to use the content.
>
> 3. On a related point, it is unclear if Amit or his employer have signed
> a copyright assignment with the FSF.  I personally would like to see
> glibc move away from the assignment like gcc did recently, but that's an
> orthogonal point at the moment.
>
> 4. Amit has on various occasions expressed disinterest in contributing
> to glibc and put conditions on what he will or will not do even before
> posting a single line patch.  I have no confidence that if we do end up
> accepting his contribution at some point, he will make himself available
> to maintain the code he contributed, leaving the community to deal with
> the technical debt.
>
> Siddhesh
>
> [1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-12 13:47                                     ` Amit Choudhary
@ 2021-06-13  3:32                                       ` Siddhesh Poyarekar
  2021-06-13  4:06                                         ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Siddhesh Poyarekar @ 2021-06-13  3:32 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Paul Zimmermann, GNU C Library

On 6/12/21 7:17 PM, Amit Choudhary wrote:
> Siddhesh,
> 
> In my opinion, your comments are totally meaningless.

Therein lies the problem; you continue to not take feedback and top it 
off with being rude.

> I am not going to write more because this will again get into long 
> discussion.
> 
> If my algorithm is better and if it is not included in glibc code then I 
> am not going to suffer, the world will suffer because the world will be 
> missing out on a better algorithm.

So be it.  No code is worth the kind of toxic behaviour you've 
demonstrated in this thread.

Good bye,
Siddhesh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Final testing with benchmark tests] Fastest String Search Algorithm.
  2021-06-13  3:32                                       ` Siddhesh Poyarekar
@ 2021-06-13  4:06                                         ` Amit Choudhary
  0 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-13  4:06 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: Paul Zimmermann, GNU C Library

Toxic behaviour has been demonstrated by you the most and other glibc
people.

I made a big mistake by contacting glibc.

Parting comments: To hell with you all.

Amit


On Sun, Jun 13, 2021, 9:03 AM Siddhesh Poyarekar <siddhesh@gotplt.org>
wrote:

> On 6/12/21 7:17 PM, Amit Choudhary wrote:
> > Siddhesh,
> >
> > In my opinion, your comments are totally meaningless.
>
> Therein lies the problem; you continue to not take feedback and top it
> off with being rude.
>
> > I am not going to write more because this will again get into long
> > discussion.
> >
> > If my algorithm is better and if it is not included in glibc code then I
> > am not going to suffer, the world will suffer because the world will be
> > missing out on a better algorithm.
>
> So be it.  No code is worth the kind of toxic behaviour you've
> demonstrated in this thread.
>
> Good bye,
> Siddhesh
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 12:42             ` Amit Choudhary
  2021-06-11 13:39               ` Amit Choudhary
  2021-06-11 13:46               ` Wilco Dijkstra
@ 2021-06-11 13:57               ` Paul Zimmermann
  2021-06-11 14:04                 ` Amit Choudhary
  2 siblings, 1 reply; 45+ messages in thread
From: Paul Zimmermann @ 2021-06-11 13:57 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha

       Dear Amit,

> The worst case scenario almost never occurs in real world situations. The
> worst case scenario is almost always on paper.

we can not know in advance what applications the GNU libc will be used for.
This is very different from the case where you are the only user of your
program, and you know for which inputs you will use it.

You were told how you could convince that your algorithm is better:

(1) provide a GNU libc patch with new benchmarks (if you think the current
    ones are not sufficient)
(2) then provide a GNU libc patch using your algorithm, and demonstrate
    on the existing and new benchmarks that it is better than the current code

Otherwise, all that discussion is a waste of time both for you and
the GNU libc developers.

Best regards,
Paul Zimmermann

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 13:57               ` Paul Zimmermann
@ 2021-06-11 14:04                 ` Amit Choudhary
  2021-06-11 14:16                   ` Paul Zimmermann
  0 siblings, 1 reply; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 14:04 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library

On Fri, Jun 11, 2021 at 7:27 PM Paul Zimmermann <Paul.Zimmermann@inria.fr>
wrote:

>        Dear Amit,
>
> > The worst case scenario almost never occurs in real world situations. The
> > worst case scenario is almost always on paper.
>
> we can not know in advance what applications the GNU libc will be used for.
> This is very different from the case where you are the only user of your
> program, and you know for which inputs you will use it.
>
> You were told how you could convince that your algorithm is better:
>
> (1) provide a GNU libc patch with new benchmarks (if you think the current
>     ones are not sufficient)
> (2) then provide a GNU libc patch using your algorithm, and demonstrate
>     on the existing and new benchmarks that it is better than the current
> code
>
> Otherwise, all that discussion is a waste of time both for you and
> the GNU libc developers.
>


As I said before,  I am not desperate to get my code in glibc. So, if you
don't accept my algorithm, it is fine with me.

I am doing the discussion because I am getting a reply. If I don't get a
reply then I will not continue the discussion.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 14:04                 ` Amit Choudhary
@ 2021-06-11 14:16                   ` Paul Zimmermann
  2021-06-11 14:27                     ` Amit Choudhary
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Zimmermann @ 2021-06-11 14:16 UTC (permalink / raw)
  To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha

       Dear Amit,

> As I said before,  I am not desperate to get my code in glibc. So, if you
> don't accept my algorithm, it is fine with me.

the code you propose is not ready to get included in glibc. If you want your
code to be considered, you must send a real patch against glibc, 
like for example:

https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html

See https://sourceware.org/glibc/wiki/Contribution%20checklist for more details

Best regards,
Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Fastest String Search Algorithm.
  2021-06-11 14:16                   ` Paul Zimmermann
@ 2021-06-11 14:27                     ` Amit Choudhary
  0 siblings, 0 replies; 45+ messages in thread
From: Amit Choudhary @ 2021-06-11 14:27 UTC (permalink / raw)
  To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library

On Fri, Jun 11, 2021 at 7:46 PM Paul Zimmermann <Paul.Zimmermann@inria.fr>
wrote:

>        Dear Amit,
>
> > As I said before,  I am not desperate to get my code in glibc. So, if you
> > don't accept my algorithm, it is fine with me.
>
> the code you propose is not ready to get included in glibc. If you want
> your
> code to be considered, you must send a real patch against glibc,
> like for example:
>
> https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html
>
> See https://sourceware.org/glibc/wiki/Contribution%20checklist for more
> details
>
> Best regards,
> Paul
>


I will send a patch after couple of days.

Regards,
Amit

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-06-13  4:06 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-09 10:31 Fastest String Search Algorithm Wilco Dijkstra
2021-06-09 11:42 ` Amit Choudhary
2021-06-09 12:05   ` Phil Blundell
2021-06-09 12:11     ` Adhemerval Zanella
2021-06-09 12:22       ` Amit Choudhary
2021-06-09 12:55         ` Adhemerval Zanella
2021-06-09 13:18           ` Amit Choudhary
2021-06-09 13:32             ` Amit Choudhary
2021-06-09 13:40             ` Adhemerval Zanella
2021-06-09 14:08               ` Amit Choudhary
2021-06-09 14:25                 ` Phil Blundell
2021-06-09 12:18     ` Amit Choudhary
2021-06-09 13:48       ` Phil Blundell
2021-06-09 21:44   ` Wilco Dijkstra
2021-06-10  2:42     ` Amit Choudhary
2021-06-10  3:23       ` Dan Raymond
2021-06-10  4:19         ` Amit Choudhary
2021-06-10  9:44       ` Wilco Dijkstra
2021-06-10 11:08         ` Amit Choudhary
2021-06-10 16:53           ` Amit Choudhary
2021-06-10 20:04             ` Amit Choudhary
2021-06-11  8:28               ` Amit Choudhary
2021-06-11 11:45           ` Wilco Dijkstra
2021-06-11 12:42             ` Amit Choudhary
2021-06-11 13:39               ` Amit Choudhary
2021-06-11 13:46               ` Wilco Dijkstra
2021-06-11 15:52                 ` Amit Choudhary
2021-06-11 19:14                   ` Amit Choudhary
2021-06-11 19:26                     ` Paul Eggert
2021-06-11 19:41                       ` Amit Choudhary
2021-06-11 20:00                         ` Dan Raymond
2021-06-12  4:00                           ` Amit Choudhary
2021-06-12  4:18                             ` Amit Choudhary
2021-06-12  7:16                               ` [Final testing with benchmark tests] " Amit Choudhary
2021-06-12  8:01                                 ` Paul Zimmermann
2021-06-12  8:41                                   ` Amit Choudhary
2021-06-12 13:29                                     ` Amit Choudhary
2021-06-12 10:56                                   ` Siddhesh Poyarekar
2021-06-12 13:47                                     ` Amit Choudhary
2021-06-13  3:32                                       ` Siddhesh Poyarekar
2021-06-13  4:06                                         ` Amit Choudhary
2021-06-11 13:57               ` Paul Zimmermann
2021-06-11 14:04                 ` Amit Choudhary
2021-06-11 14:16                   ` Paul Zimmermann
2021-06-11 14:27                     ` Amit Choudhary

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).