public inbox for gcc-bugs@sourceware.org help / color / mirror / Atom feed
From: "jmichae3 at yahoo dot com" <gcc-bugzilla@gcc.gnu.org> To: gcc-bugs@gcc.gnu.org Subject: [Bug libstdc++/62318] optimization of string searches using Nigel Horspool's Boyer-Moore-Horpool algorithm Date: Sun, 21 Sep 2014 08:47:00 -0000 [thread overview] Message-ID: <bug-62318-4-mqwHxOzN8P@http.gcc.gnu.org/bugzilla/> (raw) In-Reply-To: <bug-62318-4@http.gcc.gnu.org/bugzilla/> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62318 --- Comment #6 from Jim Michaels <jmichae3 at yahoo dot com> --- ummm. I could get personal permission to use the algorithm. but that does not give gnu permission to use the algorithm. that's why I posted this here. that does not necessarily mean I could post the algorithm to GNU, because I am not associated with GNU (I don't intend to bristle someone), I am just a user of gcc trying to make a difference in speed of the compiler. I have an i7-3970x 64GB RAM on windows and I am doing file processing that should *not* take 5 minutes for a 1ast timing, bare or MT19937 random number writes I can do 86.437894MB/s on a hard drive. but my programs that do std::string processing are many many times slower. other compilers out there are just as bad. I was hoping to make a dent in this performance sluggishness... I am just saying this cpu I have should blaze a trail. most of that program is a fair amount of string processing, lookups in vectors, vector structures, etc. I avoided using any of my own classes. most of this stuff is processed in RAM, so it's got to be the (I got the O() notation wrong) algorithm. Boyer-Moore is a very good general string search algorithm (but only for 8 and 16-bit characters) that's been around for at least 20 years. should not be hard at all to convert. I might do it to my personal compiler myself if the ACM article and permission doesn't cost too much. I am using mingw-w64 windows x64 and x32 target, windows x64 host port of gcc. http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf http://www.cs.utexas.edu/users/moore my only concern is that if I were to post the algorithm, me not being in the GNU organization, GNU would still have to get separate permission, first. I could say "OK, I've got it, get permission, then I will post the patch." but I write commercial code. so then there's the whole FSF thing and "tainted code" (I intend to distribute my source code and some libraries if I am allowed). someone said Horspool involved startup cost building lookup tables and therein lies the difficulty, so I got that part wrong I just found out. it is used in std::search. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3905.html I already have code which converts a stream of bytes to UTF-8 and back based on http://doc.cat-v.org/bell_labs/utf-8_history but that's beside the point. I would be very happy to just jump in and render a patch. I just wanted to get the permissions thing done proper. it's a VERY SIMPLE algorithm, all of about 10-20 lines. looked like a piece of cake. I am just concerned about my "tainting" of your code (?) and permissions, etc. is this something that I, not associated with GNU, can do? yes, no? my guess is no. that is why I made this request. I know *you* can do it without causing problems. http://www.boost.org/doc/libs/1_50_0/libs/algorithm/doc/html/algorithm/Searching.html however, boyer-moore would not be recommendable for u32string, since it would allocate an array that has 2^32 32-bit (16GB) elements as an "alphabet" (part of the startup costs). 65536 16-bit (128KB) elements for u16string, 256 bytes for 8-bit char_t. you get the idea. 32-bit targets would obviously fail to compile, and significant memory loss on 64-bit systems, and they would fail to run on average joe's box. so I guess instead of horspool's algorithm for strings, your O(m+n) algorithm in strstr() would be a more effective replacement. please. who at the FSF deals with permissions? not sure if this applies: it's for "small alphabets" (as in DNA): http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf the US Department of Commerce regulates hashes I think such as Rabin-Karp above a certain number of bits. when you start getting to char16_t and char32_t that hash number of bits due to multiplies has already reached 32 bits. sigh. this is why computers are slow. I don't know if strstr is in this category. maybe std::string.find() should be left the way it is. maybe some unit-testing on that small-alphabet thing would be interesting, but I am unsure if it would have the same memory-use problem for char32_t as Boyer-Moore.
next prev parent reply other threads:[~2014-09-21 8:47 UTC|newest] Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top 2014-08-31 6:48 [Bug libstdc++/62318] New: " jmichae3 at yahoo dot com 2014-08-31 6:50 ` [Bug libstdc++/62318] " jmichae3 at yahoo dot com 2014-08-31 7:37 ` schwab@linux-m68k.org 2014-08-31 12:32 ` manu at gcc dot gnu.org 2014-09-01 15:34 ` joseph at codesourcery dot com 2014-09-21 8:47 ` jmichae3 at yahoo dot com [this message] 2014-10-29 18:33 ` redi at gcc dot gnu.org
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=bug-62318-4-mqwHxOzN8P@http.gcc.gnu.org/bugzilla/ \ --to=gcc-bugzilla@gcc.gnu.org \ --cc=gcc-bugs@gcc.gnu.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).