* Fastest String Search Algorithm.
@ 2021-06-09 10:31 Wilco Dijkstra
2021-06-09 11:42 ` Amit Choudhary
0 siblings, 1 reply; 45+ messages in thread
From: Wilco Dijkstra @ 2021-06-09 10:31 UTC (permalink / raw)
To: Amit Choudhary; +Cc: 'GNU C Library'
Hi Amit,
> If my algorithm is beating strstr() for long search patterns then I think
> that my algorithm should be included in glibc for long search patterns.
I tried it in bench-strstr.c, strstr is often 10 times faster on many inputs.
I only noticed a case with a 32KB needle and 64KB text where it beats strstr by
a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic time).
The existing code that handles large needles isn't highly optimized, so it is
certainly possible to improve its performance. Large needles are rare - a size
like 4KB will never be used in practice. A key issue with large needles is quadratic
performance, and any new algorithm will need to address that somehow.
So it's not as simple as showing your algorithm is faster in one case, you have to
show it is faster on average for all large needles and avoid quadratic performance
on specially crafted worst-case inputs. As others have pointed out, this is not an
easy task...
Cheers,
Wilco
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 10:31 Fastest String Search Algorithm Wilco Dijkstra @ 2021-06-09 11:42 ` Amit Choudhary 2021-06-09 12:05 ` Phil Blundell 2021-06-09 21:44 ` Wilco Dijkstra 0 siblings, 2 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 11:42 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Wed, Jun 9, 2021 at 4:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Hi Amit, > > > If my algorithm is beating strstr() for long search patterns then I think > > that my algorithm should be included in glibc for long search patterns. > > I tried it in bench-strstr.c, strstr is often 10 times faster on many > inputs. > I only noticed a case with a 32KB needle and 64KB text where it beats > strstr by > a factor 3. On complex needles it behaves like basic_strstr (ie. quadratic > time) I included my algorithm in bench-strstr.c and ran the tests myself. I could not find strstr-inputs file so I printed the haystack and needle. I found out many things which I think are not correct: 1. In pass cases (positive cases where needle is part of haystack), the needle is either located at the beginning of the haystack or at the end of the haystack. To mimic real world scenario, the needle should be at the middle of the haystack. 2. The large text is mostly repetition of two small paragraphs. This again is not a real world scenario. 3. Some tests for 256 byte needles have haystack and needle as a combination of characters 'a' and 'b'. No real world text has been used. 4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have haystack and needle as a combination of characters 'a' and 'b'. No real world text has been used. So, in my opinion strstr() has never been tested extensively for real world text. I think this is a very big shortcoming. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 11:42 ` Amit Choudhary @ 2021-06-09 12:05 ` Phil Blundell 2021-06-09 12:11 ` Adhemerval Zanella 2021-06-09 12:18 ` Amit Choudhary 2021-06-09 21:44 ` Wilco Dijkstra 1 sibling, 2 replies; 45+ messages in thread From: Phil Blundell @ 2021-06-09 12:05 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote: > 1. In pass cases (positive cases where needle is part of haystack), the > needle is either located at the beginning of the haystack or at the end of > the haystack. To mimic real world scenario, the needle should be at the > middle of the haystack. Do you have any representative real-world testcases or statistics that you can share with us? For example, what is the reason for believing that the needle "should" be in the middle? > So, in my opinion strstr() has never been tested extensively for real world > text. I think this is a very big shortcoming. I'm sure improvements would be welcome. Phil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:05 ` Phil Blundell @ 2021-06-09 12:11 ` Adhemerval Zanella 2021-06-09 12:22 ` Amit Choudhary 2021-06-09 12:18 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Adhemerval Zanella @ 2021-06-09 12:11 UTC (permalink / raw) To: Phil Blundell, Amit Choudhary; +Cc: GNU C Library On 09/06/2021 09:05, Phil Blundell via Libc-alpha wrote: > On Wed, Jun 09, 2021 at 05:12:09PM +0530, Amit Choudhary via Libc-alpha wrote: >> 1. In pass cases (positive cases where needle is part of haystack), the >> needle is either located at the beginning of the haystack or at the end of >> the haystack. To mimic real world scenario, the needle should be at the >> middle of the haystack. > > Do you have any representative real-world testcases or statistics that you > can share with us? For example, what is the reason for believing that the > needle "should" be in the middle? > >> So, in my opinion strstr() has never been tested extensively for real world >> text. I think this is a very big shortcoming. > > I'm sure improvements would be welcome. Indeed, as Siddhesh has noted I agree that a better bench-strstr with more meaningful tests is the start point to evaluate a newer strstr implementation. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:11 ` Adhemerval Zanella @ 2021-06-09 12:22 ` Amit Choudhary 2021-06-09 12:55 ` Adhemerval Zanella 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 12:22 UTC (permalink / raw) To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella < adhemerval.zanella@linaro.org> > > Indeed, as Siddhesh has noted I agree that a better bench-strstr with more > meaningful tests is the start point to evaluate a newer strstr > implementation. > I invented my algorithm because I thought about it. Getting my algorithm in glibc is not my goal. I can defintely provide many real world haystack and needle in a file. But, currently, I don't have much time to change strstr() benchmarks. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:22 ` Amit Choudhary @ 2021-06-09 12:55 ` Adhemerval Zanella 2021-06-09 13:18 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Adhemerval Zanella @ 2021-06-09 12:55 UTC (permalink / raw) To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra On 09/06/2021 09:22, Amit Choudhary wrote: > On Wed, Jun 9, 2021, 5:41 PM Adhemerval Zanella < > adhemerval.zanella@linaro.org> > >> >> Indeed, as Siddhesh has noted I agree that a better bench-strstr with more >> meaningful tests is the start point to evaluate a newer strstr >> implementation. >> > > I invented my algorithm because I thought about it. > > Getting my algorithm in glibc is not my goal. > > I can defintely provide many real world haystack and needle in a file. > > But, currently, I don't have much time to change strstr() benchmarks. The problem is really hard to judge any improvement if we can't readily verify and replicate the results in a consistent manner. The benchtests we have is not perfect, but it is a starting point so any developer and reviewer can verify and play with the proposed change. You outlined possible limitations of current benchtests, so I would ask you if could improve the tests so it would make the review and test easier. Throwing numbers and external projects (specially ones that does not seem to have much sense such as java source code) it is not a way to discuss a potential performance improvements. Usually reviewers, including myself, won't spend any time trying to adjust glibc code to check your claims. Also, I would like to ask to please work and listen to Wilco suggestions. He was the one that code the current strstr implementation and he did a bit fair or research on the topic before come with a newer implementation. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:55 ` Adhemerval Zanella @ 2021-06-09 13:18 ` Amit Choudhary 2021-06-09 13:32 ` Amit Choudhary 2021-06-09 13:40 ` Adhemerval Zanella 0 siblings, 2 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 13:18 UTC (permalink / raw) To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella < adhemerval.zanella@linaro.org> > > The problem is really hard to judge any improvement if we can't readily > verify and replicate the results in a consistent manner. The benchtests > we have is not perfect, but it is a starting point so any developer and > reviewer can verify and play with the proposed change. > The current tests are skewed in favor of strstr(). I can do more testing from my side with my code and submit results again but I don't have time to fix strstr() benchmarks. Throwing numbers and external projects (specially ones that does not > seem to have much sense such as java source code) I have sent C code also. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 13:18 ` Amit Choudhary @ 2021-06-09 13:32 ` Amit Choudhary 2021-06-09 13:40 ` Adhemerval Zanella 1 sibling, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 13:32 UTC (permalink / raw) To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra On Wed, Jun 9, 2021, 6:48 PM Amit Choudhary <amitchoudhary0523@gmail.com> wrote: > > > On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella < > adhemerval.zanella@linaro.org> > >> >> > I can do more testing from my side with my code and submit results again > but I don't have time to fix strstr() benchmarks. > The main point is that it is not easy to fix strstr() benchmarks. There is no strstr-inputs file while there are inputs files for other functions. If there were a strstr-inputs file then I would have changed it to include real world text. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 13:18 ` Amit Choudhary 2021-06-09 13:32 ` Amit Choudhary @ 2021-06-09 13:40 ` Adhemerval Zanella 2021-06-09 14:08 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Adhemerval Zanella @ 2021-06-09 13:40 UTC (permalink / raw) To: Amit Choudhary; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra On 09/06/2021 10:18, Amit Choudhary wrote: > On Wed, Jun 9, 2021, 6:25 PM Adhemerval Zanella < > adhemerval.zanella@linaro.org> > >> >> The problem is really hard to judge any improvement if we can't readily >> verify and replicate the results in a consistent manner. The benchtests >> we have is not perfect, but it is a starting point so any developer and >> reviewer can verify and play with the proposed change. >> > > The current tests are skewed in favor of strstr(). This is hardly a justification to not improve the current benchmark. If you check the bench-strstr.c repository history you will see that Wilco has actively improved it so we could evaluate his newer implementation against the previous one. > > I can do more testing from my side with my code and submit results again > but I don't have time to fix strstr() benchmarks. > > > Throwing numbers and external projects (specially ones that does not >> seem to have much sense such as java source code) > > > I have sent C code also. You send only one test with a specific usage, this is hardly a way to fully verify the performance implication of a new implementation. That's why I think we need to integrate newer tests on the bench-strstr to have meaningful and *replicated* data. I am being insistent on this benchmark point because otherwise most likely no reviewer will try to spend time to evaluate it for you. Specially with the idea you pose that current tests are 'skewed in favor' of current implementation. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 13:40 ` Adhemerval Zanella @ 2021-06-09 14:08 ` Amit Choudhary 2021-06-09 14:25 ` Phil Blundell 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 14:08 UTC (permalink / raw) To: Adhemerval Zanella; +Cc: Phil Blundell, GNU C Library, Wilco Dijkstra On Wed, Jun 9, 2021, 7:10 PM Adhemerval > > You send only one test with a specific usage, this is hardly a way to > fully verify the performance implication of a new implementation. That's > why I think we need to integrate newer tests on the bench-strstr to have > meaningful and *replicated* data. > I don't think that you have seen what is the haystack and needles for strstr test. Basically, there is one haystack and needles with different sizes. I have done the same tests on my side. I have one text of 60 KB and my needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB. So, my tests are similar to those of strstr() benchmark. Also, I have already mentioned that my algorithm is faster than strstr() when needle size is greater than or equal to 4.4 KB. I haven't yet tested with 4 KB needle. But, as Wilco pointed out, no one will use 4 KB needle so then it is obvious that my algorithm will not be included in glibc. So, why should I spend more time on this? There are many other languages than C (like Java, PHP, etc.) that have their own implementation of string search. So, I will spend my time there to evaluate string search algorithms of these languages. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 14:08 ` Amit Choudhary @ 2021-06-09 14:25 ` Phil Blundell 0 siblings, 0 replies; 45+ messages in thread From: Phil Blundell @ 2021-06-09 14:25 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library On Wed, Jun 09, 2021 at 07:38:01PM +0530, Amit Choudhary wrote: > I have done the same tests on my side. I have one text of 60 KB and my > needle sizes are 3.2 KB, 4.4 KB, 5.9 KB, 7 KB, and 28 KB. > > So, my tests are similar to those of strstr() benchmark. > > Also, I have already mentioned that my algorithm is faster than strstr() > when needle size is greater than or equal to 4.4 KB. Right, but I think you've only established that your algorithm is faster for this one particular haystack and four different needles. That's not to say that it isn't faster in other cases, but we don't know that yet. It's also not clear to me what the upper complexity bound is for your algorithm and how that compares to the one we have at the moment. > So, I will spend my time there to evaluate string search algorithms of > these languages. That's entirely fair enough. Thanks for the suggestions anyway. Phil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:05 ` Phil Blundell 2021-06-09 12:11 ` Adhemerval Zanella @ 2021-06-09 12:18 ` Amit Choudhary 2021-06-09 13:48 ` Phil Blundell 1 sibling, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-09 12:18 UTC (permalink / raw) To: Phil Blundell; +Cc: GNU C Library On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote: > > Do you have any representative real-world testcases or statistics that you > can share with us? I had shared my program earlier which has around 60 KB of text copied from Internet. For example, what is the reason for believing that the > needle "should" be in the middle? > strstr() does a 2 way search - from the beginning of the haystack and from the end of haystack. So, if needle is located at the beginning or end of haystack, then strstr() will outperform many algorithms. The real test for strstr() would be if needle is located in the middle of the haystack. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 12:18 ` Amit Choudhary @ 2021-06-09 13:48 ` Phil Blundell 0 siblings, 0 replies; 45+ messages in thread From: Phil Blundell @ 2021-06-09 13:48 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library On Wed, Jun 09, 2021 at 05:48:17PM +0530, Amit Choudhary wrote: > On Wed, Jun 9, 2021, 5:35 PM Phil Blundell <pb@pbcl.net> wrote: > > > > > Do you have any representative real-world testcases or statistics that you > > can share with us? > > > I had shared my program earlier which has around 60 KB of text copied from > Internet. How representative is this of the way that programs actually use strstr() in practice? Do you have any statistics on that? > So, if needle is located at the beginning or end of haystack, then strstr() > will outperform many algorithms. > > The real test for strstr() would be if needle is located in the middle of > the haystack. Right. My question really was: where do we think the needle is, in fact, located most often? If the real-world scenario is that the needle often is at the beginning or the end then optimising for that is the right thing to do. Phil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 11:42 ` Amit Choudhary 2021-06-09 12:05 ` Phil Blundell @ 2021-06-09 21:44 ` Wilco Dijkstra 2021-06-10 2:42 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Wilco Dijkstra @ 2021-06-09 21:44 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library Hi Amit, > I included my algorithm in bench-strstr.c and ran the tests myself. > I could not find strstr-inputs file so I printed the haystack and needle. > > I found out many things which I think are not correct: > >1. In pass cases (positive cases where needle is part of haystack), the needle is > either located at the beginning of the haystack or at the end of the haystack. > To mimic real world scenario, the needle should be at the middle of the haystack. It's located at the end of the haystack since that is the slowest and hardest case. There is no point in placing it at the start or in the middle since those cases are already tested in the smaller haystack cases. > 2. The large text is mostly repetition of two small paragraphs. This again is not a real world scenario. Repetition doesn't matter much as long as the sequence is long enough and typical English text. SMART has very large texts (including protein sequences), but the results are the same. > 3. Some tests for 256 byte needles have haystack and needle as a combination of > characters 'a' and 'b'. No real world text has been used. > > 4. Tests for 1 KB / 4 KB / 8 KB / 28 KB needles also have haystack and needle as a > combination of characters 'a' and 'b'. No real world text has been used. That's on purpose to test specific worst-case inputs. The worst-case for your algorithm would be "a" repeated N times and the needle "a" repeated N/2 times concatenated with "ba" which results in O(N * N) comparisons. And that is bad for large N. > So, in my opinion strstr() has never been tested extensively for real world text. > I think this is a very big shortcoming. It has been tested extensively, including in SMART. If you're planning to improve your algorithm I'd suggest to study that. It has lots of information about the algorithms used and the C implementations are easy to understand, so it should be very useful. Cheers, Wilco ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-09 21:44 ` Wilco Dijkstra @ 2021-06-10 2:42 ` Amit Choudhary 2021-06-10 3:23 ` Dan Raymond 2021-06-10 9:44 ` Wilco Dijkstra 0 siblings, 2 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-10 2:42 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Thu, Jun 10, 2021, 3:14 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Hi Amit, > > It's located at the end of the haystack since that is the slowest and > hardest case. > In two way search, doesn't strstr() search from end also? There is no point in placing it at the start or in the middle since those > cases are > already tested in the smaller haystack cases. > I had printed haystack and needle for all cases but didn't see needle in the middle. Testing in the middle should be done for very long haystack also. Also, repititive text is not very useful because the needle will be located in the beginning itself. > That's on purpose to test specific worst-case inputs. The worst-case for > your > algorithm would be "a" repeated N times and the needle "a" repeated N/2 > times concatenated with "ba" which results in O(N * N) comparisons. And > that > is bad for large N. > So, if some algorithm is faster for real world scenario but not for worst case (and worst case will not occur in real world), then you don't want to include that algorithm. This doesn't make sense. But now this discussion is getting into useless defending mode. I have seen this issue before also - people who maintain open source software will try defending what they do even when it is obvious that what they are doing is not correct. Google / Facebook, etc. accept their mistakes / bugs easily and sincerely but not open source software. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 2:42 ` Amit Choudhary @ 2021-06-10 3:23 ` Dan Raymond 2021-06-10 4:19 ` Amit Choudhary 2021-06-10 9:44 ` Wilco Dijkstra 1 sibling, 1 reply; 45+ messages in thread From: Dan Raymond @ 2021-06-10 3:23 UTC (permalink / raw) To: libc-alpha, amitchoudhary0523 Amit, here is a collection of quotes from you in this thread: "I have invented the fastest string search algorithm. It is called "Choudhary Algorithm"" "I hope that people are not of the view that Indians can't invent or innovate" "for 60 KB text and 3.2 KB pattern, strstr() is faster than my algorithm" "for 60 KB text and 4.4 KB pattern, my algorithm is faster than strstr()" "The current tests are skewed in favor of strstr()..." "but I don't have time to fix strstr() benchmarks" "people who maintain open source software will try defending what they do even when it is obvious that what they are doing is not correct" The reason you are not getting traction here is not because the community is racist, defensive, and unwilling to accept their mistakes. It is because you have failed to substantiate your claim that your algorithm is the "fastest string search algorithm". By your own admission the existing implementation is faster when using existing benchmarks. You claim that the benchmarks are flawed but when you were asked to submit improvements you refused. You claim that "real world" scenarios favor your algorithm but then you state that your algorithm requires a 4.4 KB search string to demonstrate an advantage. How is 4.4 KB a "real world" scenario? I'm sure there are applications that will search for a string that long but I've never encountered one. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 3:23 ` Dan Raymond @ 2021-06-10 4:19 ` Amit Choudhary 0 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-10 4:19 UTC (permalink / raw) To: Dan Raymond; +Cc: GNU C Library On Thu, Jun 10, 2021, 8:54 AM Dan Raymond <draymond@foxvalley.net> wrote: > Amit, here is a collection of quotes from you in this thread: > > "I have invented the fastest string search algorithm. It is called > "Choudhary Algorithm"" > > "I hope that people are not of the view that Indians can't invent or > innovate" > > "for 60 KB text and 3.2 KB pattern, strstr() is faster than my > algorithm" > > "for 60 KB text and 4.4 KB pattern, my algorithm is faster than > strstr()" > > "The current tests are skewed in favor of strstr()..." > > "but I don't have time to fix strstr() benchmarks" > > "people who maintain open source software will try defending what > they do even when it is obvious that what they are doing is not correct" > > The reason you are not getting traction here is not because the > community is racist, defensive, and unwilling to accept their mistakes. > It is because you have failed to substantiate your claim that your > algorithm is the "fastest string search algorithm". By your own > admission the existing implementation is faster when using existing > benchmarks. You claim that the benchmarks are flawed but when you were > asked to submit improvements you refused. You claim that "real world" > scenarios favor your algorithm but then you state that your algorithm > requires a 4.4 KB search string to demonstrate an advantage. How is 4.4 > KB a "real world" scenario? I'm sure there are applications that will > search for a string that long but I've never encountered one. > First of all, I can give more proofs by testing from my end with different haystacks / needles but I am not going to fix your benchmark issues. I am only submitting my algorithm but you are putting extra work on me by asking me to fix your benchmark issues. This is not my issue, this is the issue of the person who designed and implemented the benchmark tests. If you do like this, that is, if someone submits some work and then you ask him to fix existing issues also then no one will be interested in submitting their work because they may not want to fix your existing issues in benchmark tests, etc. Also, when I thought more, I realized that no one uses strstr() in real world. C is now-a-days used only in embedded software / kernel, etc. and I have programmed a lot in C but I don't ever remember using strstr(). You people think that you are the smartest people on this earth so let me tell you something about myself. 1. I am programming since 1998. 2. I have programmed in C, C++, JAVA, PHP, PYTHON, HTML, PERL, JAVASCRIPT, ETC. 3. I had invented a new language and written its interpreter in 2002 when I was an intern in a company. 4. I had written and published "How to implement system call". It can be found here: https://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/ 5. I have contributed many patches to linux kernel but left contributing more when linux kernel developers became defensive. 6. I have worked in Cisco Systems and Juniper Networks in USA. 7. Currently, I am working for a networking company and writing code for 4G / 5G networks. 8. I have done B. Tech. in Computer Science from IIT - Varanasi after getting a rank in IIT-JEE. IIT-JEE is one of the toughest exams in this world. 9. I have done Masters in Computer Networking from North Carolina State University, NC, USA. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 2:42 ` Amit Choudhary 2021-06-10 3:23 ` Dan Raymond @ 2021-06-10 9:44 ` Wilco Dijkstra 2021-06-10 11:08 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Wilco Dijkstra @ 2021-06-10 9:44 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library Hi Amit, > It's located at the end of the haystack since that is the slowest and hardest case. > > In two way search, doesn't strstr() search from end also? No, strstr must return the *first* match in the string, so it has to search from the start. Hence placing the needle at the start is the easiest case, and the end is the worst case. The TwoWay algorithm splits the needle into two parts and matches them separately - but it still scans the haystack from the start. > There is no point in placing it at the start or in the middle since those cases are > already tested in the smaller haystack cases. > > I had printed haystack and needle for all cases but didn't see needle in the middle. > > Testing in the middle should be done for very long haystack also. For testing correctness yes. However in a benchmark there is no difference if the haystack ends immediately after a match or continues forever. A test like SMART with random needles at random positions in a huge text would be useful to add of course. > Also, repititive text is not very useful because the needle will be located in the beginning itself. It won't since the benchmark explicitly removes all early matches (and of course it checks that the result is correct as well). > So, if some algorithm is faster for real world scenario but not for worst case (and worst case > will not occur in real world), then you don't want to include that algorithm. Large needles are not real-world, so for these the worst-case matters. The TwoWay algorithm for large needles is old and relatively inefficient. Anyone who has an algorithm that is not only faster on real-world needles but also guarantees good worst-case performance on huge needles would have no trouble getting their patches accepted. Cheers, Wilco ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 9:44 ` Wilco Dijkstra @ 2021-06-10 11:08 ` Amit Choudhary 2021-06-10 16:53 ` Amit Choudhary 2021-06-11 11:45 ` Wilco Dijkstra 0 siblings, 2 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-10 11:08 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library [-- Attachment #1: Type: text/plain, Size: 3049 bytes --] On Thu, Jun 10, 2021, 3:14 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Hi Amit, > > For testing correctness yes. However in a benchmark there is no difference > if the haystack > ends immediately after a match or continues forever. > In my opinion, needle should be in the middle so as not to give any algorithm any unfair advantage - like Boyer-Moore which searches from the end. > It won't since the benchmark explicitly removes all early matches (and of > course it checks that > the result is correct as well). > I have not seen what you are pointing out. I can attach the file which has all haystacks and needles, but it is 82 MB. I did bzip2 on it but still it is 564 KB. So, I have removed all haystacks/needles which were a combination of 'a's and 'b's. I have run bzip2 on that file and now it is 115 KB large. However, the file format is not very good but it is still easily readable. I searched for the needle in the haystack using document editor's search and found that needle was located in the beginning. > > So, if some algorithm is faster for real world scenario but not for > worst case (and worst case > > will not occur in real world), then you don't want to include that > algorithm. > > Large needles are not real-world, so for these the worst-case matters. The > TwoWay algorithm > for large needles is old and relatively inefficient. Anyone who has an > algorithm that is not > only faster on real-world needles but also guarantees good worst-case > performance on huge > needles would have no trouble getting their patches accepted. > I can do more tests and provide the inputs and results but I will not use benchmark tests because I don't find them useful. I will also not fix benchmark tests. If all this is acceptable then please let me know. In this case I will start my testing again. The worst case scenario of my algorithm is O(mn) where m is the length of the text and n is the length of the pattern. And this worst case scenario will be hit in only one case: haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA... needle is: ABABA But my theory is this: We should be fastest 100% if the time. If that is not possible, then we should try to get as close to 100% as possible. This means that if for average case, some algorithm is faster 99% of the time but slower 1% of the time in the worst case scenario then we should go for this algorithm rather than fretting over worst case scenario. So, please let me know if you are still amenable to include my algorithm provided the results for my algorithm are better than strstr(). You can tell me what haystack / needle to pick. I am thinking of having 2 long haystacks copied from Internet and then a function call in the program will generate needle (equal to length passed to it) from the middle of the haystack (Mostly, the length of needle will be greater than or equal to 4 KB). But, please let me know if you have other ideas. Please find attached the haystacks/needles file from benchmark tests of strstr (115 KB, bzipped2). Regards, Amit [-- Attachment #2: bench-strstr.out.txt.bz2 --] [-- Type: application/x-bzip, Size: 117063 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 11:08 ` Amit Choudhary @ 2021-06-10 16:53 ` Amit Choudhary 2021-06-10 20:04 ` Amit Choudhary 2021-06-11 11:45 ` Wilco Dijkstra 1 sibling, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-10 16:53 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Thu, Jun 10, 2021 at 4:38 PM Amit Choudhary <amitchoudhary0523@gmail.com> wrote: > > The worst case scenario of my algorithm is O(mn) where m is the length of > the text and n is the length of the pattern. > > And this worst case scenario will be hit in only one case: > > haystack is: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA... > needle is: ABABA > I calculated again and it comes out that in worst case scenario also my algorithm is linear - O(n). ============================================= CASE 1: Let's consider this: haystack: AXAXAXAXAXAXAXAXAXAXAXAXAXAXAXA... (length = n) needle: ABABABABA (length = m) So, when it is going to match needle[0] with haystack[0], it will match and all other 4 comparisons as well. So, it will do 5 comparisons and then it will try to match the whole needle, but it will fail only after matching [0]th character because the [1]st character (X, B) differs. So, it will do 7 comparisons in all when it matches all 5 initial comparisons. Now, since the needle did not match, then 'i' will be incremented by 1 and it will immediately fail after 1 comparison (needle[0] = A) is not equal to ((haystack[1] = X). So, it will do only 1 comparison. This will repeat itself - 7 comparisons, 1 comparison, 7 comparisons, 1 comparison and so on.... So, for n/2 characters it will do 7 comparisons and for other n/2 characters it will do only 1 comparison. So, total comparisons = (7*n)/2 + n/2 = 8n/2 = 4n. So, this worst case scenario is O(n). ============================================= CASE 2: Let's consider this: haystack: ABCDEFGABCDEFGABCDEFGABCDEFG... (length = n) needle: ABCDEXG (length = m) {{Here, only needle[m-2] and haystack[m-2] differs}}. In this case it will match all initial 5 comparisons for needle[0]. Then it will try to match whole needle and will fail after (m-2) comparisons. Now, since the needle did not match, then 'i' will be incremented by 1 and it will immediately fail after 1 comparison (needle[0] = A) is not equal to ((haystack[1] = B). So, it will do only 1 comparison. This 1 comparison will continue till next m-1 characters. So, for 1st m characters, the number of comparisons are 5 + m - 2 + m - 1 = 2m + 2 comparisons. This 2m + 2 comparisons will be done n/m times. So, total number of comparisons done = (2m + 2)(n/m) =~ n. So, this worst case scenario is also O(n). ============================================= So, now I think that I can safely say that the worst case scenario for my algorithm is O(n). If someone can prove that it is not O(n) but O(m*n) or O(n^2), etc. then please let me know. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 16:53 ` Amit Choudhary @ 2021-06-10 20:04 ` Amit Choudhary 2021-06-11 8:28 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-10 20:04 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library [-- Attachment #1: Type: text/plain, Size: 287 bytes --] I have now improved my algorithm by skipping some text based on a preprocessed array on pattern. Now, my algorithm is faster than strstr() for needles of 1 KB and above. But I am not sure if you guys are still interested. Anyways, I am attaching my program (bzipped2). Regards, Amit [-- Attachment #2: compare_string_search_algorithms.c.bz2 --] [-- Type: application/bzip2, Size: 25450 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 20:04 ` Amit Choudhary @ 2021-06-11 8:28 ` Amit Choudhary 0 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 8:28 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Fri, Jun 11, 2021, 1:34 AM Amit Choudhary <amitchoudhary0523@gmail.com> wrote: > I have now improved my algorithm by skipping some text based on a > preprocessed array on pattern. > > Now, my algorithm is faster than strstr() for needles of 1 KB and above. > > But I am not sure if you guys are still interested. > > Anyways, I am attaching my program (bzipped2). > > Regards, > Amit > This version of my program has a bug. If needed, I can send the fixed version. Also, as it was pointed out that long needles are not used in real world but this is not the case. Antivirus software compares long needles (may be greater than 4 KB also) to search for virus signature in packets. And this is done everyday, many many times, and that too on millions of computers everyday. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-10 11:08 ` Amit Choudhary 2021-06-10 16:53 ` Amit Choudhary @ 2021-06-11 11:45 ` Wilco Dijkstra 2021-06-11 12:42 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Wilco Dijkstra @ 2021-06-11 11:45 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library Hi Amit, > In my opinion, needle should be in the middle so as not to give any > algorithm any unfair advantage - like Boyer-Moore which searches from the end. There is no unfair advantage to any algorithm since they all start searching from the start (note BM does not start at the end). > I searched for the needle in the haystack using document editor's search > and found that needle was located in the beginning. No, it's always located at the end - it is even tested explicitly. > The worst case scenario of my algorithm is O(mn) where m is the length of > the text and n is the length of the pattern. Exactly. And that is problematic if both m and n are large. > And this worst case scenario will be hit in only one case: There is not just one worst-case, the number is practically infinite. A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA. > But my theory is this: We should be fastest 100% if the time. If that is not possible, > then we should try to get as close to 100% as possible. > > This means that if for average case, some algorithm is faster 99% of the time but > slower 1% of the time in the worst case scenario then we should go for this > algorithm rather than fretting over worst case scenario. For large needles and haystacks the worst case becomes more important because the slowdown becomes larger with the length. It's bad even if there is a 0.001% chance that your computer locks up for minutes. Cheers, Wilco ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 11:45 ` Wilco Dijkstra @ 2021-06-11 12:42 ` Amit Choudhary 2021-06-11 13:39 ` Amit Choudhary ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 12:42 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > > > > I searched for the needle in the haystack using document editor's search > > and found that needle was located in the beginning. > > No, it's always located at the end - it is even tested explicitly. > For some cases it is in the beginning and for some cases it is in the end. > > The worst case scenario of my algorithm is O(mn) where m is the length of > > the text and n is the length of the pattern. > > Exactly. And that is problematic if both m and n are large. > > > And this worst case scenario will be hit in only one case: > > There is not just one worst-case, the number is practically infinite. > A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA. > I can easily fix this by recording at which index it failed the last time. Then the first comparison will be at the index of B. And after first m comparisons, it will fail in the first comparison itself at index of B. So, after this fix, my algo's worst case scenario will be O(n). """""But, I would really appreciate if you can give me a real world scenario where this worst case scenario will happen.""""" > > But my theory is this: We should be fastest 100% if the time. If that is > not possible, > > then we should try to get as close to 100% as possible. > > > > This means that if for average case, some algorithm is faster 99% of the > time but > > slower 1% of the time in the worst case scenario then we should go for > this > > algorithm rather than fretting over worst case scenario. > > For large needles and haystacks the worst case becomes more important > because > the slowdown becomes larger with the length. It's bad even if there is a > 0.001% > chance that your computer locks up for minutes. > The worst case scenario almost never occurs in real world situations. The worst case scenario is almost always on paper. Please give me an example of how would you encounter worst case sceraio in real world. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 12:42 ` Amit Choudhary @ 2021-06-11 13:39 ` Amit Choudhary 2021-06-11 13:46 ` Wilco Dijkstra 2021-06-11 13:57 ` Paul Zimmermann 2 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 13:39 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library [-- Attachment #1: Type: text/plain, Size: 1115 bytes --] On Fri, Jun 11, 2021 at 6:12 PM Amit Choudhary <amitchoudhary0523@gmail.com> wrote: > > On Fri, Jun 11, 2021, 5:15 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> > wrote: > >> > The worst case scenario of my algorithm is O(mn) where m is the length >> of >> > the text and n is the length of the pattern. >> >> Exactly. And that is problematic if both m and n are large. >> >> > And this worst case scenario will be hit in only one case: >> >> There is not just one worst-case, the number is practically infinite. >> A typical worst case is AAAAAAAAAAAAAAA with needle AAAAAAABA. >> > > > I can easily fix this by recording at which index it failed the last time. > Then the first comparison will be at the index of B. And after first m > comparisons, it will fail in the first comparison itself at index of B. > > So, after this fix, my algo's worst case scenario will be O(n). > > """""But, I would really appreciate if you can give me a real world > scenario where this worst case scenario will happen.""""" > I fixed this. I am attaching latest fixed and changed version of my program (bzipped2). Regards, Amit [-- Attachment #2: compare_string_search_algorithms.c.bz2 --] [-- Type: application/x-bzip, Size: 25562 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 12:42 ` Amit Choudhary 2021-06-11 13:39 ` Amit Choudhary @ 2021-06-11 13:46 ` Wilco Dijkstra 2021-06-11 15:52 ` Amit Choudhary 2021-06-11 13:57 ` Paul Zimmermann 2 siblings, 1 reply; 45+ messages in thread From: Wilco Dijkstra @ 2021-06-11 13:46 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library Hi Amit, >> No, it's always located at the end - it is even tested explicitly. > > For some cases it is in the beginning and for some cases it is in the end. No it is not. > So, after this fix, my algo's worst case scenario will be O(n). No - that doesn't solve the fundamental quadratic problem. > Please give me an example of how would you encounter worst case sceraio in real world. For example an attacker finds an application that uses strstr and provides worst case inputs on purpose, causing the application to hang. That is a denial of service attack. GLIBC tries hard to avoid quadratic algorithms as a result. This is not theoretical, slowdowns on rare inputs create serious issues. Math functions are a good example, there were cases where some math functions were 1000 or even 100000 times slower on some inputs. This caused such issues for HPC code that they used to advice not to use GLIBC for math code - a supercomputer taking weeks rather than hours to solve a problem because it hit slow cases is disastrous. Cheers, Wilco ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 13:46 ` Wilco Dijkstra @ 2021-06-11 15:52 ` Amit Choudhary 2021-06-11 19:14 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 15:52 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library On Fri, Jun 11, 2021 at 7:16 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Hi Amit, > > For example an attacker finds an application that uses strstr and provides > worst case inputs on purpose, causing the application to hang. That is a > denial of service attack. GLIBC tries hard to avoid quadratic algorithms > as a > result. > This is a problem of the application that it is not limiting the input size. Problems like these have happened in the past and that's why we have strncpy(), etc. If we don't have strncpy() then denial of service attack is possible through strcpy() also. This is not theoretical, slowdowns on rare inputs create serious issues. > Math functions are a good example, there were cases where some math > functions were 1000 or even 100000 times slower on some inputs. > This caused such issues for HPC code that they used to advice not to use > GLIBC for math code - a supercomputer taking weeks rather than hours to > solve a problem because it hit slow cases is disastrous. > The worst case for a string matching function is different than for maths. String matching is very simple - just compare characters. But maths is different - calculating square roots, multiplying 20 floating point numbers, calculating logarithmic values of large numbers, factorization, etc. are very CPU intensive. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 15:52 ` Amit Choudhary @ 2021-06-11 19:14 ` Amit Choudhary 2021-06-11 19:26 ` Paul Eggert 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 19:14 UTC (permalink / raw) To: GNU C Library To end this whole discussion, there is one point that needs to be clarified but no one has spoken about it. The point is this: If there are two algorithms and algorithm1 performs better than algorithm2 in average case. But in worst case scenario, algorithm2 is better than algorithm1. And you can choose only one algorithm. """""So, which one would you choose for glibc?""""" I will go for algorithm1 because it has better performance in average case. But Wilco will go for algorithm2 because its has better performance in worst case. If this point gets clarified then there will be no point of further discussion. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 19:14 ` Amit Choudhary @ 2021-06-11 19:26 ` Paul Eggert 2021-06-11 19:41 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Paul Eggert @ 2021-06-11 19:26 UTC (permalink / raw) To: Amit Choudhary; +Cc: GNU C Library On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote: > If there are two algorithms and algorithm1 performs better than algorithm2 > in average case. But in worst case scenario, algorithm2 is better than > algorithm1. For glibc, worst-case is a big deal when it's measurable via the big-O notation. This is because of denial-of-service attacks and such-like. So in many cases we'll prefer an algorithm with a better worst-case behavior. Also, it's often easier to measure worst-case (because what is "average" is more debatable than what is worst), so focusing on worst-case helps cut down the more-difficult and lengthy discussion that would be needed to define "average", and this focus can save everybody time. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 19:26 ` Paul Eggert @ 2021-06-11 19:41 ` Amit Choudhary 2021-06-11 20:00 ` Dan Raymond 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 19:41 UTC (permalink / raw) To: Paul Eggert; +Cc: GNU C Library On Sat, Jun 12, 2021, 12:56 AM Paul Eggert <eggert@cs.ucla.edu> wrote: > On 6/11/21 12:14 PM, Amit Choudhary via Libc-alpha wrote: > > If there are two algorithms and algorithm1 performs better than > algorithm2 > > in average case. But in worst case scenario, algorithm2 is better than > > algorithm1. > > For glibc, worst-case is a big deal when it's measurable via the big-O > notation. This is because of denial-of-service attacks and such-like. So > in many cases we'll prefer an algorithm with a better worst-case behavior. > > Also, it's often easier to measure worst-case (because what is "average" > is more debatable than what is worst), so focusing on worst-case helps > cut down the more-difficult and lengthy discussion that would be needed > to define "average", and this focus can save everybody time. > So, in this case, my algorithm is of no use because it better than strstr() in average cases but not in worst cases. Average case means real world scenarios. Worst case means hand crafted worst case inputs. So, from my side, this discussion is now over. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 19:41 ` Amit Choudhary @ 2021-06-11 20:00 ` Dan Raymond 2021-06-12 4:00 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Dan Raymond @ 2021-06-11 20:00 UTC (permalink / raw) To: libc-alpha, amitchoudhary0523 > So, in this case, my algorithm is of no use because it better than strstr() > in average cases but not in worst cases. > > Average case means real world scenarios. Worst case means hand crafted > worst case inputs. Real world scenarios are precisely the ones that are sensitive to worst-case performance. In the real world you can't predict inputs and there are consequences to unexpected performance problems. Denial of service attacks are just one example. By the way your previous argument that antivirus software uses strstr() to search for virus signatures is false. A virus signature is not a NUL-terminated string. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 20:00 ` Dan Raymond @ 2021-06-12 4:00 ` Amit Choudhary 2021-06-12 4:18 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 4:00 UTC (permalink / raw) To: Dan Raymond; +Cc: GNU C Library On Sat, Jun 12, 2021, 1:32 AM Dan Raymond <draymond@foxvalley.net> wrote: By the way your previous argument > that antivirus software uses strstr() to search for virus signatures is > false. A virus signature is not a NUL-terminated string. > I never said that strstr() is used in antivirus software. I only said that antivirus software use long needles for searching and I said this because someone had put forth the argument that long needles don't occur in real world. But as you said that in real world anything can happen then the current version of strstr() is susceptible to denial of service attack because for long needles (32 KB, etc.), strstr() is slow. But one thing is very ironic. On one hand people say that anything can happen in real world and on the other they contradict this by saying that long needles don't occur in real world. If anything can happen in real world, then strstr() should be testef for 10 MB long needle in 100 MB long haystack. But this is not tested. Actually, strstr() has not been tested for needles beyond 1 KB and haystack beyong 65 KB. So, the argument that anything can happen in the real world in not reflected in the benchmark tests for strstr(). But, please don't ask me to fix benchmark tests. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-12 4:00 ` Amit Choudhary @ 2021-06-12 4:18 ` Amit Choudhary 2021-06-12 7:16 ` [Final testing with benchmark tests] " Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 4:18 UTC (permalink / raw) To: GNU C Library By the way, Paul Eggert thinks that this discussion is a waste of time for glibc people. So, I will request people to stop this thread of discussion now. So, please don't reply to any message in this thread. But if someone replies, then I might reply back. But please don't blame me for this discussion. People who replied / responded in this thread are equally at fault for continuing this discussion. Paul blamed me for wasting glibc developers time and I really didn't like that. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 4:18 ` Amit Choudhary @ 2021-06-12 7:16 ` Amit Choudhary 2021-06-12 8:01 ` Paul Zimmermann 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 7:16 UTC (permalink / raw) To: GNU C Library [-- Attachment #1: Type: text/plain, Size: 458 bytes --] Hi All, I ran the benchmark tests for my algorithm and for strstr(). Total number of tests = 332. My algorithm is faster 182 times. strstr() is faster 150 times. For worst case scenario my algorithm is not faster than strstr(). Please find the attached file "bench-strstr.out". The input had a variation of " and ' in unicode I guess, so I changed them to " and ' of english alphabet. Please find the attached file "bench-strstr.out". Regards, Amit [-- Attachment #2: bench-strstr.out --] [-- Type: application/octet-stream, Size: 7601 bytes --] c: This means that choudhary algorithm is faster in this case s: This means that strstr() is faster in this case choudhary algo strstr() ============================================================== c: 10453.9 19389.8 c: 1105.47 5363.28 c: 11310.9 19953.9 c: 1313.28 2221.88 c: 1392.97 1971.09 c: 1506.25 3337.5 c: 1557.81 2535.16 c: 1593.75 2062.5 c: 1663.28 2941.41 c: 16780.5 13126.6 c: 1684.38 1958.59 c: 16975 47172.7 c: 17593.8 48645.3 c: 1857.03 2639.06 c: 2064.06 4439.06 c: 20754.7 36578.9 c: 208.594 229.688 c: 20830.5 36247.7 c: 225 839.062 c: 225.781 622.656 c: 232.812 338.281 c: 232.812 487.5 c: 234.375 333.594 c: 235.938 532.812 c: 239.844 1327.34 c: 242.969 542.969 c: 245.312 641.406 c: 248.438 878.906 c: 250.781 638.281 c: 255.469 1050.78 c: 256.25 571.875 c: 256.25 949.219 c: 259.375 766.406 c: 260.938 1163.28 c: 261.719 1049.22 c: 261.719 397.656 c: 262.5 387.5 c: 2663.28 25135.2 c: 267.188 856.25 c: 2681.25 2820.31 c: 272.656 1092.19 c: 2727.34 4802.34 c: 275 289.062 c: 27609.4 208379 c: 279.688 357.031 c: 279.688 901.562 c: 280.469 887.5 c: 280.469 996.875 c: 294.531 1128.12 c: 294.531 1317.19 c: 295.312 631.25 c: 2960.16 3276.56 c: 298.438 361.719 c: 298.438 997.656 c: 302.344 1568.75 c: 304.688 396.094 c: 304.688 500.781 c: 3047.66 4343.75 c: 306.25 410.156 c: 309.375 509.375 c: 309.375 603.125 c: 310.938 625.781 c: 3138.28 25089.8 c: 315.625 574.219 c: 316.406 876.562 c: 317.969 564.062 c: 3181.25 4013.28 c: 319.531 1339.06 c: 320.312 1094.53 c: 323.438 698.438 c: 324.219 696.094 c: 332.031 493.75 c: 334.375 398.438 c: 335.156 573.438 c: 335.938 1003.91 c: 344.531 955.469 c: 345.312 1021.09 c: 347.656 650.781 c: 348.438 1131.25 c: 356.25 513.281 c: 358.594 1125.78 c: 359.375 1137.5 c: 359.375 479.688 c: 360.938 1371.88 c: 363.281 565.625 c: 363.281 679.688 c: 364.062 399.219 c: 364.062 403.125 c: 367.188 1046.88 c: 367.188 782.812 c: 373.438 848.438 c: 374.219 652.344 c: 374.219 899.219 c: 376.562 1054.69 c: 3761.72 4484.38 c: 380.469 418.75 c: 381.25 391.406 c: 383.594 435.156 c: 389.844 638.281 c: 392.188 626.562 c: 398.438 616.406 c: 405.469 482.812 c: 414.062 910.938 c: 4192.97 4861.72 c: 421.094 1138.28 c: 421.094 1151.56 c: 422.656 704.688 c: 428.125 436.719 c: 428.125 615.625 c: 428.906 1072.66 c: 428.906 1203.12 c: 432.812 1599.22 c: 433.594 1090.62 c: 434.375 617.969 c: 435.938 579.688 c: 437.5 870.312 c: 438.281 455.469 c: 439.062 468.75 c: 439.062 494.531 c: 439.062 582.812 c: 442.969 1020.31 c: 443.75 859.375 c: 446.094 522.656 c: 447.656 1488.28 c: 448.438 915.625 c: 456.25 1046.09 c: 457.812 660.938 c: 467.188 533.594 c: 479.688 748.438 c: 483.594 1385.94 c: 483.594 929.688 c: 48618.8 76608.6 c: 490.625 496.875 c: 493.75 1175 c: 493.75 641.406 c: 502.344 664.062 c: 503.125 1589.84 c: 503.906 1421.09 c: 511.719 1041.41 c: 517.969 525.781 c: 51726.6 75812.5 c: 521.094 1164.84 c: 527.344 666.406 c: 53248.4 71593.8 c: 53338.3 74159.4 c: 537.5 720.312 c: 542.969 1200.78 c: 550.781 1092.19 c: 551.562 589.844 c: 58021.9 54292.2 c: 586.719 1548.44 c: 598.438 654.688 c: 603.125 923.438 c: 607.031 1297.66 c: 607.812 1372.66 c: 628.125 732.812 c: 636.719 932.031 c: 661.719 1125.78 c: 665.625 1517.97 c: 681.25 778.906 c: 687.5 1145.31 c: 7032.81 9925.78 c: 710.938 2553.12 c: 712.5 734.375 c: 714.062 5446.09 c: 714.062 808.594 c: 717.188 1564.06 c: 725 1553.12 c: 733.594 1417.19 c: 735.156 1314.84 c: 7624.22 10057.8 c: 778.906 1423.44 c: 786.719 1428.91 c: 7864.06 9322.66 c: 804.688 1451.56 c: 807.031 1608.59 c: 807.812 1328.91 c: 888230 1.03685e+06 c: 921.094 1739.84 c: 936.719 1496.09 c: 953.125 1503.12 c: 970.312 1877.34 s: 1.02894e+08 329970 s: 1.09189e+07 44238.3 s: 1.3712e+06 1.35065e+06 s: 1.45997e+06 1.42336e+06 s: 1.51957e+06 1.47132e+06 s: 1.56013e+06 1.49793e+06 s: 1.57082e+06 1.49573e+06 s: 1.90037e+08 345765 s: 1.99525e+08 56757 s: 1002.34 57.0312 s: 1005.47 70.3125 s: 10122.7 4859.38 s: 1034.38 117.969 s: 1056.25 50 s: 1078.12 56.25 s: 1185.16 88.2812 s: 1194.53 77.3438 s: 1206.25 67.1875 s: 1244.53 97.6562 s: 12890.6 6563.28 s: 13193 10729.7 s: 133709 39225.8 s: 138122 40568 s: 14467.2 7554.69 s: 1471.09 83.5938 s: 1499.22 62.5 s: 17743 13910.2 s: 1817.19 1497.66 s: 1924.22 12392.2 s: 2.81242e+07 291209 s: 2.84542e+06 314376 s: 225.781 195.312 s: 22570.3 14451.6 s: 238.281 61.7188 s: 24030.5 14821.9 s: 242.188 48.4375 s: 242.969 46.875 s: 2447.66 1464.84 s: 246.875 54.6875 s: 253.906 74.2188 s: 2553.91 12784.4 s: 260.156 54.6875 s: 26240.6 13806.2 s: 275 185.156 s: 275 52.3438 s: 275.781 92.1875 s: 27962.5 208714 s: 281.25 227.344 s: 282.031 356.25 s: 282.031 98.4375 s: 286.719 100 s: 288.281 57.8125 s: 290.625 46.875 s: 296.094 284.375 s: 3.68725e+06 49706.2 s: 3.75006e+08 106175 s: 300.781 100 s: 302.344 246.875 s: 305.469 282.812 s: 30765.6 17076.6 s: 311.719 173.438 s: 31432.8 13260.2 s: 31812.5 17928.9 s: 325 216.406 s: 325 253.125 s: 332.812 51.5625 s: 335.938 277.344 s: 3385.94 3069.53 s: 340.625 310.938 s: 342.188 129.688 s: 342.188 96.0938 s: 3433.59 3389.84 s: 348.438 330.469 s: 358.594 108.594 s: 359.375 98.4375 s: 362.5 346.875 s: 363.281 359.375 s: 367.188 87.5 s: 367.969 209.375 s: 3692.97 2645.31 s: 370.312 369.531 s: 380.469 295.312 s: 387.5 183.594 s: 392.969 307.031 s: 39461.7 31028.1 s: 397.656 132.031 s: 398.438 92.9688 s: 39895.3 32588.3 s: 399.219 231.25 s: 4.92685e+08 526623 s: 4.99937e+07 25910.9 s: 403.125 129.688 s: 403.125 381.25 s: 407.031 47.6562 s: 416.406 92.1875 s: 421.094 131.25 s: 425.781 250.781 s: 426.562 267.969 s: 432.812 87.5 s: 443.75 353.125 s: 4471.88 3709.38 s: 450.781 375 s: 452.344 412.5 s: 460.156 119.531 s: 469.531 384.375 s: 471.875 242.969 s: 475.781 84.375 s: 480.469 274.219 s: 482.031 428.125 s: 485.156 246.094 s: 486.719 332.031 s: 517.188 351.562 s: 517.188 492.969 s: 525.781 50 s: 543.75 338.281 s: 546.094 98.4375 s: 546.875 303.906 s: 553.125 268.75 s: 56386.7 54726.6 s: 567.188 531.25 s: 568.75 62.5 s: 591.406 46.0938 s: 5931.25 2490.62 s: 608.594 285.156 s: 624.219 225 s: 627.344 342.969 s: 635.156 84.375 s: 641.406 111.719 s: 641.406 405.469 s: 642.188 612.5 s: 66678.1 22666.4 s: 68185.9 23725.8 s: 68239.8 43514.1 s: 68878.1 42614.1 s: 6993.75 4001.56 s: 715.625 321.875 s: 750 389.062 s: 752.344 246.875 s: 7521.88 4145.31 s: 762.5 561.719 s: 7743.75 2208.59 s: 793.75 60.9375 s: 8.20611e+06 307977 s: 817.188 71.0938 s: 835.156 403.125 s: 8601.56 4188.28 s: 8924.22 8514.06 s: 9.82335e+08 357959 s: 912.5 628.906 s: 9723.44 8916.41 ============================================================== ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 7:16 ` [Final testing with benchmark tests] " Amit Choudhary @ 2021-06-12 8:01 ` Paul Zimmermann 2021-06-12 8:41 ` Amit Choudhary 2021-06-12 10:56 ` Siddhesh Poyarekar 0 siblings, 2 replies; 45+ messages in thread From: Paul Zimmermann @ 2021-06-12 8:01 UTC (permalink / raw) To: Amit Choudhary; +Cc: libc-alpha Dear Amit, > I ran the benchmark tests for my algorithm and for strstr(). > > Total number of tests = 332. > > My algorithm is faster 182 times. > > strstr() is faster 150 times. and your algorithm can be up to 3531 times slower than strstr(): s: 3.75006e+08 106175 The largest strstr() time is 1.5e6, whereas the largest time of your algorithm is 9.8e8, i.e., 656 times larger. Can you improve your algorithm so that the worst case time is similar to that of strstr()? Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 8:01 ` Paul Zimmermann @ 2021-06-12 8:41 ` Amit Choudhary 2021-06-12 13:29 ` Amit Choudhary 2021-06-12 10:56 ` Siddhesh Poyarekar 1 sibling, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 8:41 UTC (permalink / raw) To: Paul Zimmermann; +Cc: GNU C Library On Sat, Jun 12, 2021 at 1:31 PM Paul Zimmermann <Paul.Zimmermann@inria.fr> wrote: > Dear Amit, > > > I ran the benchmark tests for my algorithm and for strstr(). > > > > Total number of tests = 332. > > > > My algorithm is faster 182 times. > > > > strstr() is faster 150 times. > > and your algorithm can be up to 3531 times slower than strstr(): > > s: 3.75006e+08 106175 > > The largest strstr() time is 1.5e6, whereas the largest time of your > algorithm is 9.8e8, i.e., 656 times larger. > > Can you improve your algorithm so that the worst case time is similar to > that > of strstr()? > > Paul > I don't think my algorithm can be optimized more. Anyways, I had invented my algorithm thinking about normal average case scenarios and not for worst case scenarios. I was talking with Dr. Moore (inventor of Boyer-Moore algorithm) and he suggested me to demonstrate that my algorithm is better for average case by doing exhaustive testing across a wide variety of patterns and texts. He also told me that he had invented his algorithm for average case too and not thinking much about worst case scenario. ==================What Dr. Moore said======================== If your algorithm is worse case O(mn) then you should somehow argue (and demonstrate with exhausting testing across a wide variety of patterns and texts) that its ``average'' behavior is better! The point is that the naive algorithm is worst-case O(mn) and why should a person use yours rather than that simpler algorithm? ========================================================== Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 8:41 ` Amit Choudhary @ 2021-06-12 13:29 ` Amit Choudhary 0 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 13:29 UTC (permalink / raw) To: Paul Zimmermann; +Cc: GNU C Library [-- Attachment #1: Type: text/plain, Size: 2501 bytes --] I optimized my algorithm and I ran the benchmark tests again for complex needles only. Now, because of my optimization, my algorithm is faster than strstr() in worst case scenario also. Please find the results below. (I am attaching the file also. ******************************************************************************************************************************************** S: strstr() is faster (5 times) C: choudhary algorithm is faster (25 times) twoway_strstr chal ========================================================================= S: Length 65536/ 64, complex needle 1: 49578.9 147700 S: Length 65536/256, complex needle 1: 47225 140834 S: Length 65536/1024, complex needle 1: 23112.5 138987 S: Length 65536/4096, complex needle 1: 47460.2 144605 S: Length 65536/8192, complex needle 1: 88410.9 153084 C: Length 65536/ 64, complex needle 2: 335094 133933 C: Length 65536/ 64, complex needle 3: 1.58005e+06 138910 C: Length 65536/256, complex needle 2: 329751 137798 C: Length 65536/256, complex needle 3: 1.5351e+06 137376 C: Length 65536/1024, complex needle 2: 295184 135700 C: Length 65536/1024, complex needle 3: 1.50964e+06 136388 C: Length 65536/4096, complex needle 2: 303698 145081 C: Length 65536/4096, complex needle 3: 1.45478e+06 135175 C: Length 65536/8192, complex needle 2: 328699 152917 C: Length 65536/8192, complex needle 3: 1.394e+06 133727 C: Length 65536/28672, complex needle 1: 275716 193667 C: Length 65536/28672, complex needle 2: 450194 196630 C: Length 65536/28672, complex needle 3: 1.02255e+06 126090 C: Length 65536/32768, complex needle 1: 316611 201831 C: Length 65536/32768, complex needle 2: 471559 204520 C: Length 65536/32768, complex needle 3: 948845 124384 C: Length 65536/40960, complex needle 1: 409391 220575 C: Length 65536/40960, complex needle 2: 516094 237880 C: Length 65536/40960, complex needle 3: 831214 121766 C: Length 65536/51200, complex needle 1: 495124 241209 C: Length 65536/51200, complex needle 2: 560630 241209 C: Length 65536/51200, complex needle 3: 653866 117998 C: Length 65536/65536, complex needle 1: 625619 269684 C: Length 65536/65536, complex needle 2: 649592 269709 C: Length 65536/65536, complex needle 3: 508453 113078 ========================================================================= ******************************************************************************************************************************************** Regards, Amit [-- Attachment #2: bench-strstr.out --] [-- Type: application/octet-stream, Size: 2104 bytes --] S: strstr() is faster (5 times) C: choudhary algorithm is faster (25 times) twoway_strstr chal ========================================================================= S: Length 65536/ 64, complex needle 1: 49578.9 147700 S: Length 65536/256, complex needle 1: 47225 140834 S: Length 65536/1024, complex needle 1: 23112.5 138987 S: Length 65536/4096, complex needle 1: 47460.2 144605 S: Length 65536/8192, complex needle 1: 88410.9 153084 C: Length 65536/ 64, complex needle 2: 335094 133933 C: Length 65536/ 64, complex needle 3: 1.58005e+06 138910 C: Length 65536/256, complex needle 2: 329751 137798 C: Length 65536/256, complex needle 3: 1.5351e+06 137376 C: Length 65536/1024, complex needle 2: 295184 135700 C: Length 65536/1024, complex needle 3: 1.50964e+06 136388 C: Length 65536/4096, complex needle 2: 303698 145081 C: Length 65536/4096, complex needle 3: 1.45478e+06 135175 C: Length 65536/8192, complex needle 2: 328699 152917 C: Length 65536/8192, complex needle 3: 1.394e+06 133727 C: Length 65536/28672, complex needle 1: 275716 193667 C: Length 65536/28672, complex needle 2: 450194 196630 C: Length 65536/28672, complex needle 3: 1.02255e+06 126090 C: Length 65536/32768, complex needle 1: 316611 201831 C: Length 65536/32768, complex needle 2: 471559 204520 C: Length 65536/32768, complex needle 3: 948845 124384 C: Length 65536/40960, complex needle 1: 409391 220575 C: Length 65536/40960, complex needle 2: 516094 237880 C: Length 65536/40960, complex needle 3: 831214 121766 C: Length 65536/51200, complex needle 1: 495124 241209 C: Length 65536/51200, complex needle 2: 560630 241209 C: Length 65536/51200, complex needle 3: 653866 117998 C: Length 65536/65536, complex needle 1: 625619 269684 C: Length 65536/65536, complex needle 2: 649592 269709 C: Length 65536/65536, complex needle 3: 508453 113078 ========================================================================= ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 8:01 ` Paul Zimmermann 2021-06-12 8:41 ` Amit Choudhary @ 2021-06-12 10:56 ` Siddhesh Poyarekar 2021-06-12 13:47 ` Amit Choudhary 1 sibling, 1 reply; 45+ messages in thread From: Siddhesh Poyarekar @ 2021-06-12 10:56 UTC (permalink / raw) To: Paul Zimmermann, Amit Choudhary; +Cc: libc-alpha On 6/12/21 1:31 PM, Paul Zimmermann wrote: > Dear Amit, > >> I ran the benchmark tests for my algorithm and for strstr(). >> >> Total number of tests = 332. >> >> My algorithm is faster 182 times. >> >> strstr() is faster 150 times. > > and your algorithm can be up to 3531 times slower than strstr(): > > s: 3.75006e+08 106175 > > The largest strstr() time is 1.5e6, whereas the largest time of your > algorithm is 9.8e8, i.e., 656 times larger. > > Can you improve your algorithm so that the worst case time is similar to that > of strstr()? This is going around in circles so I am going to summarize the several problems (and not just the worst case time) here that need to be addressed before this proposal can even be considered with any seriousness. You may consider this my objection to the proposal conditional these points being resolved. Starting with the technical issues: 1. In the benchmark results shared, the geomean performance of the proposed algorithm is worse by 22.62%. This is by no means (no pun intended) a better result. In general, worse cases are much worse than when they're better. 2. The worst case performances are thousands of times slower as PaulZ pointed out. But wait, it's worse because it's probably not an issue of scaling with length... 3. ... The benchmark results shared don't actually indicate which length pairs result in those worst case performances but if I had to assume that it's related to glibc benchmark inputs, (there's one input less than the glibc benchmarks) it appears that the performance starts going downhill from 224 byte haystacks, especially with misaligned strings. 4. Given that the benchmark program is outside of the benchtests infrastructure (because this is clearly not the benchtests program), there is no information about how the benchmark is run and what conditions affect its output. 5. The baseline for comparison is not clear. Is it the default strstr() in string/strstr.c? Is it the system strstr()? In the latter case, it could be anything from string/strstr.c to the various microarchitecture-specific asm ifuncs. 6. Amit has made handwavy claims about why a certain set of inputs matter over others without providing any references to prior art or reviewable experimental data that can back his claims. The proposal needs clearer commentary here with proper references. 7. Refusal to use the glibc-benchtests - either in its current form or by adding more inputs that the community agrees as relevant - is IMO a blocker to getting any changes accepted to strstr(). A better strstr() must show better results on the benchmarks, without which future maintenance of the function will be challenging. If Amit doesn't write those, someone else needs to volunteer to do it for him. Now on to the non-technical issues: 1. First and foremost, Amit has been insinuating racial discrimination or crying victim when someone makes suggestions for verification or points out flaws in the algorithm. He was warned once[1] but has refused to treat community members with respect. Let this not be the low we set for acceptable behaviour in the community. 2. There is a plain text algorithm and several C code snippets, none of which have proper copyright notices. There is no indication even of whether Amit owns the copyright for the shared content (as opposed to his employer since many employers claim ownership for everything the employee creates) or whether he has given us permission to use the content. 3. On a related point, it is unclear if Amit or his employer have signed a copyright assignment with the FSF. I personally would like to see glibc move away from the assignment like gcc did recently, but that's an orthogonal point at the moment. 4. Amit has on various occasions expressed disinterest in contributing to glibc and put conditions on what he will or will not do even before posting a single line patch. I have no confidence that if we do end up accepting his contribution at some point, he will make himself available to maintain the code he contributed, leaving the community to deal with the technical debt. Siddhesh [1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 10:56 ` Siddhesh Poyarekar @ 2021-06-12 13:47 ` Amit Choudhary 2021-06-13 3:32 ` Siddhesh Poyarekar 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-12 13:47 UTC (permalink / raw) To: Siddhesh Poyarekar; +Cc: Paul Zimmermann, GNU C Library Siddhesh, In my opinion, your comments are totally meaningless. I am not going to write more because this will again get into long discussion. If my algorithm is better and if it is not included in glibc code then I am not going to suffer, the world will suffer because the world will be missing out on a better algorithm. I will always use my string search algorithm instead of strstr(). Regards, Amit On Sat, Jun 12, 2021 at 4:26 PM Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: > On 6/12/21 1:31 PM, Paul Zimmermann wrote: > > Dear Amit, > > > >> I ran the benchmark tests for my algorithm and for strstr(). > >> > >> Total number of tests = 332. > >> > >> My algorithm is faster 182 times. > >> > >> strstr() is faster 150 times. > > > > and your algorithm can be up to 3531 times slower than strstr(): > > > > s: 3.75006e+08 106175 > > > > The largest strstr() time is 1.5e6, whereas the largest time of your > > algorithm is 9.8e8, i.e., 656 times larger. > > > > Can you improve your algorithm so that the worst case time is similar to > that > > of strstr()? > > This is going around in circles so I am going to summarize the several > problems (and not just the worst case time) here that need to be > addressed before this proposal can even be considered with any > seriousness. You may consider this my objection to the proposal > conditional these points being resolved. > > Starting with the technical issues: > > 1. In the benchmark results shared, the geomean performance of the > proposed algorithm is worse by 22.62%. This is by no means (no pun > intended) a better result. In general, worse cases are much worse than > when they're better. > > 2. The worst case performances are thousands of times slower as PaulZ > pointed out. But wait, it's worse because it's probably not an issue of > scaling with length... > > 3. ... The benchmark results shared don't actually indicate which length > pairs result in those worst case performances but if I had to assume > that it's related to glibc benchmark inputs, (there's one input less > than the glibc benchmarks) it appears that the performance starts going > downhill from 224 byte haystacks, especially with misaligned strings. > > 4. Given that the benchmark program is outside of the benchtests > infrastructure (because this is clearly not the benchtests program), > there is no information about how the benchmark is run and what > conditions affect its output. > > 5. The baseline for comparison is not clear. Is it the default strstr() > in string/strstr.c? Is it the system strstr()? In the latter case, it > could be anything from string/strstr.c to the various > microarchitecture-specific asm ifuncs. > > 6. Amit has made handwavy claims about why a certain set of inputs > matter over others without providing any references to prior art or > reviewable experimental data that can back his claims. The proposal > needs clearer commentary here with proper references. > > 7. Refusal to use the glibc-benchtests - either in its current form or > by adding more inputs that the community agrees as relevant - is IMO a > blocker to getting any changes accepted to strstr(). A better strstr() > must show better results on the benchmarks, without which future > maintenance of the function will be challenging. If Amit doesn't write > those, someone else needs to volunteer to do it for him. > > Now on to the non-technical issues: > > 1. First and foremost, Amit has been insinuating racial discrimination > or crying victim when someone makes suggestions for verification or > points out flaws in the algorithm. He was warned once[1] but has > refused to treat community members with respect. Let this not be the > low we set for acceptable behaviour in the community. > > 2. There is a plain text algorithm and several C code snippets, none of > which have proper copyright notices. There is no indication even of > whether Amit owns the copyright for the shared content (as opposed to > his employer since many employers claim ownership for everything the > employee creates) or whether he has given us permission to use the content. > > 3. On a related point, it is unclear if Amit or his employer have signed > a copyright assignment with the FSF. I personally would like to see > glibc move away from the assignment like gcc did recently, but that's an > orthogonal point at the moment. > > 4. Amit has on various occasions expressed disinterest in contributing > to glibc and put conditions on what he will or will not do even before > posting a single line patch. I have no confidence that if we do end up > accepting his contribution at some point, he will make himself available > to maintain the code he contributed, leaving the community to deal with > the technical debt. > > Siddhesh > > [1] https://sourceware.org/pipermail/libc-alpha/2021-June/127380.html > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-12 13:47 ` Amit Choudhary @ 2021-06-13 3:32 ` Siddhesh Poyarekar 2021-06-13 4:06 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Siddhesh Poyarekar @ 2021-06-13 3:32 UTC (permalink / raw) To: Amit Choudhary; +Cc: Paul Zimmermann, GNU C Library On 6/12/21 7:17 PM, Amit Choudhary wrote: > Siddhesh, > > In my opinion, your comments are totally meaningless. Therein lies the problem; you continue to not take feedback and top it off with being rude. > I am not going to write more because this will again get into long > discussion. > > If my algorithm is better and if it is not included in glibc code then I > am not going to suffer, the world will suffer because the world will be > missing out on a better algorithm. So be it. No code is worth the kind of toxic behaviour you've demonstrated in this thread. Good bye, Siddhesh ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Final testing with benchmark tests] Fastest String Search Algorithm. 2021-06-13 3:32 ` Siddhesh Poyarekar @ 2021-06-13 4:06 ` Amit Choudhary 0 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-13 4:06 UTC (permalink / raw) To: Siddhesh Poyarekar; +Cc: Paul Zimmermann, GNU C Library Toxic behaviour has been demonstrated by you the most and other glibc people. I made a big mistake by contacting glibc. Parting comments: To hell with you all. Amit On Sun, Jun 13, 2021, 9:03 AM Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: > On 6/12/21 7:17 PM, Amit Choudhary wrote: > > Siddhesh, > > > > In my opinion, your comments are totally meaningless. > > Therein lies the problem; you continue to not take feedback and top it > off with being rude. > > > I am not going to write more because this will again get into long > > discussion. > > > > If my algorithm is better and if it is not included in glibc code then I > > am not going to suffer, the world will suffer because the world will be > > missing out on a better algorithm. > > So be it. No code is worth the kind of toxic behaviour you've > demonstrated in this thread. > > Good bye, > Siddhesh > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 12:42 ` Amit Choudhary 2021-06-11 13:39 ` Amit Choudhary 2021-06-11 13:46 ` Wilco Dijkstra @ 2021-06-11 13:57 ` Paul Zimmermann 2021-06-11 14:04 ` Amit Choudhary 2 siblings, 1 reply; 45+ messages in thread From: Paul Zimmermann @ 2021-06-11 13:57 UTC (permalink / raw) To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha Dear Amit, > The worst case scenario almost never occurs in real world situations. The > worst case scenario is almost always on paper. we can not know in advance what applications the GNU libc will be used for. This is very different from the case where you are the only user of your program, and you know for which inputs you will use it. You were told how you could convince that your algorithm is better: (1) provide a GNU libc patch with new benchmarks (if you think the current ones are not sufficient) (2) then provide a GNU libc patch using your algorithm, and demonstrate on the existing and new benchmarks that it is better than the current code Otherwise, all that discussion is a waste of time both for you and the GNU libc developers. Best regards, Paul Zimmermann ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 13:57 ` Paul Zimmermann @ 2021-06-11 14:04 ` Amit Choudhary 2021-06-11 14:16 ` Paul Zimmermann 0 siblings, 1 reply; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 14:04 UTC (permalink / raw) To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library On Fri, Jun 11, 2021 at 7:27 PM Paul Zimmermann <Paul.Zimmermann@inria.fr> wrote: > Dear Amit, > > > The worst case scenario almost never occurs in real world situations. The > > worst case scenario is almost always on paper. > > we can not know in advance what applications the GNU libc will be used for. > This is very different from the case where you are the only user of your > program, and you know for which inputs you will use it. > > You were told how you could convince that your algorithm is better: > > (1) provide a GNU libc patch with new benchmarks (if you think the current > ones are not sufficient) > (2) then provide a GNU libc patch using your algorithm, and demonstrate > on the existing and new benchmarks that it is better than the current > code > > Otherwise, all that discussion is a waste of time both for you and > the GNU libc developers. > As I said before, I am not desperate to get my code in glibc. So, if you don't accept my algorithm, it is fine with me. I am doing the discussion because I am getting a reply. If I don't get a reply then I will not continue the discussion. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 14:04 ` Amit Choudhary @ 2021-06-11 14:16 ` Paul Zimmermann 2021-06-11 14:27 ` Amit Choudhary 0 siblings, 1 reply; 45+ messages in thread From: Paul Zimmermann @ 2021-06-11 14:16 UTC (permalink / raw) To: Amit Choudhary; +Cc: Wilco.Dijkstra, libc-alpha Dear Amit, > As I said before, I am not desperate to get my code in glibc. So, if you > don't accept my algorithm, it is fine with me. the code you propose is not ready to get included in glibc. If you want your code to be considered, you must send a real patch against glibc, like for example: https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html See https://sourceware.org/glibc/wiki/Contribution%20checklist for more details Best regards, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Fastest String Search Algorithm. 2021-06-11 14:16 ` Paul Zimmermann @ 2021-06-11 14:27 ` Amit Choudhary 0 siblings, 0 replies; 45+ messages in thread From: Amit Choudhary @ 2021-06-11 14:27 UTC (permalink / raw) To: Paul Zimmermann; +Cc: Wilco Dijkstra, GNU C Library On Fri, Jun 11, 2021 at 7:46 PM Paul Zimmermann <Paul.Zimmermann@inria.fr> wrote: > Dear Amit, > > > As I said before, I am not desperate to get my code in glibc. So, if you > > don't accept my algorithm, it is fine with me. > > the code you propose is not ready to get included in glibc. If you want > your > code to be considered, you must send a real patch against glibc, > like for example: > > https://sourceware.org/pipermail/libc-alpha/2021-June/127242.html > > See https://sourceware.org/glibc/wiki/Contribution%20checklist for more > details > > Best regards, > Paul > I will send a patch after couple of days. Regards, Amit ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2021-06-13 4:06 UTC | newest] Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-09 10:31 Fastest String Search Algorithm Wilco Dijkstra 2021-06-09 11:42 ` Amit Choudhary 2021-06-09 12:05 ` Phil Blundell 2021-06-09 12:11 ` Adhemerval Zanella 2021-06-09 12:22 ` Amit Choudhary 2021-06-09 12:55 ` Adhemerval Zanella 2021-06-09 13:18 ` Amit Choudhary 2021-06-09 13:32 ` Amit Choudhary 2021-06-09 13:40 ` Adhemerval Zanella 2021-06-09 14:08 ` Amit Choudhary 2021-06-09 14:25 ` Phil Blundell 2021-06-09 12:18 ` Amit Choudhary 2021-06-09 13:48 ` Phil Blundell 2021-06-09 21:44 ` Wilco Dijkstra 2021-06-10 2:42 ` Amit Choudhary 2021-06-10 3:23 ` Dan Raymond 2021-06-10 4:19 ` Amit Choudhary 2021-06-10 9:44 ` Wilco Dijkstra 2021-06-10 11:08 ` Amit Choudhary 2021-06-10 16:53 ` Amit Choudhary 2021-06-10 20:04 ` Amit Choudhary 2021-06-11 8:28 ` Amit Choudhary 2021-06-11 11:45 ` Wilco Dijkstra 2021-06-11 12:42 ` Amit Choudhary 2021-06-11 13:39 ` Amit Choudhary 2021-06-11 13:46 ` Wilco Dijkstra 2021-06-11 15:52 ` Amit Choudhary 2021-06-11 19:14 ` Amit Choudhary 2021-06-11 19:26 ` Paul Eggert 2021-06-11 19:41 ` Amit Choudhary 2021-06-11 20:00 ` Dan Raymond 2021-06-12 4:00 ` Amit Choudhary 2021-06-12 4:18 ` Amit Choudhary 2021-06-12 7:16 ` [Final testing with benchmark tests] " Amit Choudhary 2021-06-12 8:01 ` Paul Zimmermann 2021-06-12 8:41 ` Amit Choudhary 2021-06-12 13:29 ` Amit Choudhary 2021-06-12 10:56 ` Siddhesh Poyarekar 2021-06-12 13:47 ` Amit Choudhary 2021-06-13 3:32 ` Siddhesh Poyarekar 2021-06-13 4:06 ` Amit Choudhary 2021-06-11 13:57 ` Paul Zimmermann 2021-06-11 14:04 ` Amit Choudhary 2021-06-11 14:16 ` Paul Zimmermann 2021-06-11 14:27 ` Amit Choudhary
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).