[Bug libstdc++/62318] optimization of string searches using Nigel Horspool's Boyer-Moore-Horpool algorithm

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

From: "jmichae3 at yahoo dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/62318] optimization of string searches using Nigel Horspool's Boyer-Moore-Horpool algorithm
Date: Sun, 21 Sep 2014 08:47:00 -0000	[thread overview]
Message-ID: <bug-62318-4-mqwHxOzN8P@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-62318-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62318

--- Comment #6 from Jim Michaels <jmichae3 at yahoo dot com> ---
ummm. I could get personal permission to use the algorithm. but that does not
give gnu permission to use the algorithm. that's why I posted this here.  that
does not necessarily mean I could post the algorithm to GNU, because I am not
associated with GNU (I don't intend to bristle someone), I am just a user of
gcc trying to make a difference in speed of the compiler. I have an i7-3970x
64GB RAM on windows and I am doing file processing that should *not* take 5
minutes for a 1ast timing, bare or MT19937 random number writes I can do
86.437894MB/s on a hard drive. but my programs that do std::string processing
are many many times slower. 

other compilers out there are just as bad. I was hoping to make a dent in this
performance sluggishness... I am just saying this cpu I have should blaze a
trail. most of that program is a fair amount of string processing, lookups in
vectors, vector structures, etc. I avoided using any of my own classes.

most of this stuff is processed in RAM, so it's got to be the (I got the O()
notation wrong) algorithm. Boyer-Moore is a very good general string search
algorithm (but only for 8 and 16-bit characters) that's been around for at
least 20 years. should not be hard at all to convert. I might do it to my
personal compiler myself if the ACM article and permission doesn't cost too
much. I am using mingw-w64 windows x64 and x32 target, windows x64 host port of
gcc. 
http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf
http://www.cs.utexas.edu/users/moore

my only concern is that if I were to post the algorithm, me not being in the
GNU organization, GNU would still have to get separate permission, first. I
could say "OK, I've got it, get permission, then I will post the patch." but I
write commercial code. so then there's the whole FSF thing and "tainted code"
(I intend to distribute my source code and some libraries if I am allowed).

someone said Horspool involved startup cost building lookup tables and therein
lies the difficulty, so I got that part wrong I just found out. it is used in
std::search. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3905.html

I already have code which converts a stream of bytes to UTF-8 and back based on
http://doc.cat-v.org/bell_labs/utf-8_history

but that's beside the point.

I would be very happy to just jump in and render a patch. I just wanted to get
the permissions thing done proper. it's a VERY SIMPLE algorithm, all of about
10-20 lines. looked like a piece of cake. I am just concerned about my
"tainting" of your code (?) and permissions, etc.

is this something that I, not associated with GNU, can do? yes, no? my guess is
no. that is why I made this request. I know *you* can do it without causing
problems.

http://www.boost.org/doc/libs/1_50_0/libs/algorithm/doc/html/algorithm/Searching.html

however, boyer-moore would not be recommendable for u32string, since it would
allocate an array that has 2^32 32-bit (16GB) elements as an "alphabet" (part
of the startup costs). 65536 16-bit (128KB) elements for u16string, 256 bytes
for 8-bit char_t. you get the idea. 32-bit targets would obviously fail to
compile, and significant memory loss on 64-bit systems, and they would fail to
run on average joe's box.

so I guess instead of horspool's algorithm for strings, your O(m+n) algorithm
in strstr() would be a more effective replacement. please.

who at the FSF deals with permissions?

not sure if this applies: it's for "small alphabets" (as in DNA):
http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf

the US Department of Commerce regulates hashes I think such as Rabin-Karp above
a certain number of bits. when you start getting to char16_t and char32_t that
hash number of bits due to multiplies has already reached 32 bits. sigh. this
is why computers are slow.  I don't know if strstr is in this category. 

maybe std::string.find() should be left the way it is.

maybe some unit-testing on that small-alphabet thing would be interesting, but
I am unsure if it would have the same memory-use problem for char32_t as
Boyer-Moore.

next prev parent reply	other threads:[~2014-09-21  8:47 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-31  6:48 [Bug libstdc++/62318] New: " jmichae3 at yahoo dot com
2014-08-31  6:50 ` [Bug libstdc++/62318] " jmichae3 at yahoo dot com
2014-08-31  7:37 ` schwab@linux-m68k.org
2014-08-31 12:32 ` manu at gcc dot gnu.org
2014-09-01 15:34 ` joseph at codesourcery dot com
2014-09-21  8:47 ` jmichae3 at yahoo dot com [this message]
2014-10-29 18:33 ` redi at gcc dot gnu.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-62318-4-mqwHxOzN8P@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).