From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 5578 invoked by alias); 21 Sep 2014 08:47:21 -0000 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org Received: (qmail 5551 invoked by uid 48); 21 Sep 2014 08:47:15 -0000 From: "jmichae3 at yahoo dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug libstdc++/62318] optimization of string searches using Nigel Horspool's Boyer-Moore-Horpool algorithm Date: Sun, 21 Sep 2014 08:47:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: libstdc++ X-Bugzilla-Version: 4.9.1 X-Bugzilla-Keywords: X-Bugzilla-Severity: enhancement X-Bugzilla-Who: jmichae3 at yahoo dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-SW-Source: 2014-09/txt/msg02021.txt.bz2 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62318 --- Comment #6 from Jim Michaels --- ummm. I could get personal permission to use the algorithm. but that does not give gnu permission to use the algorithm. that's why I posted this here. that does not necessarily mean I could post the algorithm to GNU, because I am not associated with GNU (I don't intend to bristle someone), I am just a user of gcc trying to make a difference in speed of the compiler. I have an i7-3970x 64GB RAM on windows and I am doing file processing that should *not* take 5 minutes for a 1ast timing, bare or MT19937 random number writes I can do 86.437894MB/s on a hard drive. but my programs that do std::string processing are many many times slower. other compilers out there are just as bad. I was hoping to make a dent in this performance sluggishness... I am just saying this cpu I have should blaze a trail. most of that program is a fair amount of string processing, lookups in vectors, vector structures, etc. I avoided using any of my own classes. most of this stuff is processed in RAM, so it's got to be the (I got the O() notation wrong) algorithm. Boyer-Moore is a very good general string search algorithm (but only for 8 and 16-bit characters) that's been around for at least 20 years. should not be hard at all to convert. I might do it to my personal compiler myself if the ACM article and permission doesn't cost too much. I am using mingw-w64 windows x64 and x32 target, windows x64 host port of gcc. http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf http://www.cs.utexas.edu/users/moore my only concern is that if I were to post the algorithm, me not being in the GNU organization, GNU would still have to get separate permission, first. I could say "OK, I've got it, get permission, then I will post the patch." but I write commercial code. so then there's the whole FSF thing and "tainted code" (I intend to distribute my source code and some libraries if I am allowed). someone said Horspool involved startup cost building lookup tables and therein lies the difficulty, so I got that part wrong I just found out. it is used in std::search. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3905.html I already have code which converts a stream of bytes to UTF-8 and back based on http://doc.cat-v.org/bell_labs/utf-8_history but that's beside the point. I would be very happy to just jump in and render a patch. I just wanted to get the permissions thing done proper. it's a VERY SIMPLE algorithm, all of about 10-20 lines. looked like a piece of cake. I am just concerned about my "tainting" of your code (?) and permissions, etc. is this something that I, not associated with GNU, can do? yes, no? my guess is no. that is why I made this request. I know *you* can do it without causing problems. http://www.boost.org/doc/libs/1_50_0/libs/algorithm/doc/html/algorithm/Searching.html however, boyer-moore would not be recommendable for u32string, since it would allocate an array that has 2^32 32-bit (16GB) elements as an "alphabet" (part of the startup costs). 65536 16-bit (128KB) elements for u16string, 256 bytes for 8-bit char_t. you get the idea. 32-bit targets would obviously fail to compile, and significant memory loss on 64-bit systems, and they would fail to run on average joe's box. so I guess instead of horspool's algorithm for strings, your O(m+n) algorithm in strstr() would be a more effective replacement. please. who at the FSF deals with permissions? not sure if this applies: it's for "small alphabets" (as in DNA): http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf the US Department of Commerce regulates hashes I think such as Rabin-Karp above a certain number of bits. when you start getting to char16_t and char32_t that hash number of bits due to multiplies has already reached 32 bits. sigh. this is why computers are slow. I don't know if strstr is in this category. maybe std::string.find() should be left the way it is. maybe some unit-testing on that small-alphabet thing would be interesting, but I am unsure if it would have the same memory-use problem for char32_t as Boyer-Moore.