Hashing of "switch/case" selections

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Hashing of "switch/case" selections
@ 2000-12-27 20:41 Andy Walker
  2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andy Walker @ 2000-12-27 20:41 UTC (permalink / raw)
  To: GCC general mailing list

I have been doing some research on this.  Quick definitions for those
who are not Sorting and Searching Wizards:  Hash indicies: unique
preHashed index values that each have a meaning.  A Hash function: a
function applied to a value (usually an integer, or rarely, a character
string) to get a result.  A unique Hash: a Hash function that returns a
unique result for every unique input.  A perfect Hash: similar to a
unique Hash, but not as stringent; a Hash function that returns a unique
result for every Hash index (note that invalid, ie non-index values may,
and often do, return the same result as a valid Hash index). A minimal,
perfect Hash: a Hash function that returns "n"  sequential values for
all "n" valid Hash indicies, not necessarily in the same order..  A
near-minimal, perfect Hash: as close as I can get to the the previous
definition.

Unique Hashes are usually too much to ask for. I have not found any
references that begin to suggest how to design and code such a beast.

Therefore, the perfect Hash is the minimum acceptable function.  The
first consequence of this approach is that a table of the valid indicies
must be maintained for verification, to distinguish the invalid values
from the valid indicies. If there is a table, then a high and low test
must be performed to guarantee that the Hash result falls in the range
of the table, before the verification test.  If we use a multiplication
for the Hash function, I expect the multiply instruction, plus
associated register loads, to be at least the equivalent of three test
and conditional branch instruction pairs.  (no divide instructions -- I
am allergic to those).

A binary tree search using only five test and conditional branch pairs
plus a final jump table/verification table lookup and comparision, will
compile more quickly, will run at least as fast, and will take up less
space than the simplest general Hash function..  If there are 32 or
fewer cases, binary search is king.  Note that this is totally unrelated
to the spread of the index values.

I have recently changed jobs, but at my old job I did a rough and
unscientific analysis of a modestly sized "C/C++" library.  I found
rougly one-hundred-three instances of "switch" with cases.  Ninety-two
of them had fewer than 32 cases, and most of those were sixteen cases or
less.  Of the remaining nine, all but one were handling ASCII characters
as integers for parsing functions.

Just for peace of mind, the present tool (binary tree) handles most
situations in the most efficient manner, IMHO.  Also, none of the
algorithms I found could guarantee a near-minimal, perfect Hash in
linear time.  A few wild stabs in the dark often bring success. When
they do not, the binary tree is the reliable backup plan.

I am still working on "switch" for 256 or fewer cases, and more than 256
cases.  The algorithm I am analysing is a multiplication with truncation
(/256), and mulitiplication then squaring with truncation (/256).  When
I get some reliable timing numbers, I will post them here for further
discussion.

My experience as a programmer is that programs are almost never written
with switches to more than a few hundred cases.  I personally have never
seen one.  More than that, and the programmer finds a more efficient,
more readable, more maintainable solution, and he can do so because he
has a deep knowledge of the problem that is unavailable to the
compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
even high, maximum for Hash evaluations.  A program containing a switch
statement with more than 400 cases in it probably has enough other
severe design problems that it will never be worth running anyway.

    Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* I have found an ICE
  2000-12-27 20:41 Hashing of "switch/case" selections Andy Walker
@ 2000-12-27 21:33 ` Magnus Fromreide
  2001-01-01 14:30 ` Hashing of "switch/case" selections Nix
  2001-01-04  1:39 ` Zack Weinberg
  2 siblings, 0 replies; 12+ messages in thread
From: Magnus Fromreide @ 2000-12-27 21:33 UTC (permalink / raw)
  To: gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]

I have configured the compiler as follows:

../gcc/configure --prefix=/usr/local/src/gnu/gcc-2.97/install
--enable-languages=c,c++ --enable-version-specific-runtime-libs
--disable-nls

and have found the following oddities:

make check results in

...
./mkcheck 0 `pwd` ../../../gcc/libstdc++-v3
running mkcheck
./mkcheck: testing the build directory
You need bash 2.x to run mkcheck.  Exiting.

How do I specify the path to bash 2.x? I do have one installed.

Compiling the attached file results in the following errors:

% /usr/local/src/gnu/gcc-2.97/install/bin/g++ -DHAVE_CONFIG_H -I. \
-I../2257-2 -I. -Wall -ansi -posix -pedantic -W -O3 -v \
-ftemplate-depth-23 -c ../2257-2/test.C
Reading specs from
/usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/specs
Configured with: ../gcc/configure
--prefix=/usr/local/src/gnu/gcc-2.97/install --enable-languages=c,c++
--enable-version-specific-runtime-libs --disable-nls
gcc version 2.97 20001227 (experimental)

/usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/cc1plus
-v -I. -I../2257-2 -I. -D__GNUC__=2 -D__GNUC_MINOR__=97
-D__GNUC_PATCHLEVEL__=0 -D__ELF__ -D__unix__ -D__linux__ -D__unix
-D__linux -Asystem=posix -D__OPTIMIZE__ -D__STDC_HOSTED__=1 -Wall -W
-pedantic -Acpu=i386 -Amachine=i386 -D__i386 -D__i386__ -D__tune_i586__
-D__tune_pentium__ -D_POSIX_SOURCE -DHAVE_CONFIG_H -MD .deps/test.pp
../2257-2/test.C -D__GNUG__=2 -D__GXX_ABI_VERSION=100 -D__STRICT_ANSI__
-trigraphs -$ -quiet -dumpbase test.C -ansi -O3 -Wall -W -pedantic -ansi
-version -fnew-abi -ftemplate-depth-23 -o /tmp/ccMXEkFo.s
GNU CPP version 2.97 20001227 (experimental) (cpplib) (i386 Linux/ELF)
GNU C++ version 2.97 20001227 (experimental) (i586-pc-linux-gnu) compiled
by GNU C version 2.97 20001227 (experimental).
ignoring duplicate directory "."
ignoring duplicate directory
"/usr/local/src/gnu/gcc-2.97/install/i586-pc-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 .
 ../2257-2
 /usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/include/g++
 /usr/local/src/gnu/gcc-2.97/install/i586-pc-linux-gnu/include/
 /usr/local/include
 /usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/include
 /usr/include
End of search list.
In file included from
/usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/include/g++/cstdlib:2,
                 from ../2257-2/test.C:1:
/usr/local/src/gnu/gcc-2.97/install/lib/gcc-lib/i586-pc-linux-gnu/2.97/include/g++/bits/std_cstdlib.h:39:28:
bits/c++config.h: No such file or directory
../2257-2/test.C: In function `void fun()':
../2257-2/test.C:102: Internal error: Segmentation fault.
Please submit a full bug report.
 See <URL: http://www.gnu.org/software/gcc/bugs.html > for instructions.

The compiler do install bits/c++config.h as
/usr/local/src/gnu/gcc-2.97/install/i586-pc-linux-gnu/include/g++/bits/c++config.h
but for some reason is that not on the search path.

The internal error seems to originate in the method body being optimized
away from the tree, why this happens is currently beyond me. With -O2 it
compiles.

Magnus Fromreide					+46-13 17 68 48
MÃ¥rdtorpsgatan 21, 2tr					magfr@lysator.liu.se
S-584 34  LINKÃ–PING

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2000-12-27 20:41 Hashing of "switch/case" selections Andy Walker
  2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
@ 2001-01-01 14:30 ` Nix
  2001-01-04  1:39 ` Zack Weinberg
  2 siblings, 0 replies; 12+ messages in thread
From: Nix @ 2001-01-01 14:30 UTC (permalink / raw)
  To: Andy Walker; +Cc: GCC general mailing list

On 28 Dec 2000, Andy Walker stated:
> My experience as a programmer is that programs are almost never
> written with switches to more than a few hundred cases.  I personally
> have never seen one.  More than that, and the programmer finds a more
> efficient, more readable, more maintainable solution, and he can do so
> because he has a deep knowledge of the problem that is unavailable to
> the compiler.  My seat-of-pants estimate is that 400 cases is a good,
> maybe even high, maximum for Hash evaluations.  A program containing a
> switch statement with more than 400 cases in it probably has enough
> other severe design problems that it will never be worth running
> anyway.

Like GCC, for instance? insn-attrtab.c contains switches with many, many
cases.

Don't make the error of assuming that all programs are written by
humans.  Automatically generated code can do things that would make any
sane programmer blanch --- but can also occupy hot spots and thus need
optimization.

Massive gotos, insanely vast statically initialized arrays and huge
switch statements are common. Look at yacc-generated output, for
instance.

-- 
`Umbilical cords are weird squishy rubbery things.
 Kinda like clams.' --- Dan Birchall in the Monastery

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2000-12-27 20:41 Hashing of "switch/case" selections Andy Walker
  2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
  2001-01-01 14:30 ` Hashing of "switch/case" selections Nix
@ 2001-01-04  1:39 ` Zack Weinberg
  2001-01-05 10:54   ` Michael Meissner
  2001-01-07 18:12   ` Andy Walker
  2 siblings, 2 replies; 12+ messages in thread
From: Zack Weinberg @ 2001-01-04  1:39 UTC (permalink / raw)
  To: Andy Walker; +Cc: GCC general mailing list

On Wed, Dec 27, 2000 at 10:43:40PM -0600, Andy Walker wrote:
[much interesting stuff]
> My experience as a programmer is that programs are almost never written
> with switches to more than a few hundred cases.  I personally have never
> seen one.  More than that, and the programmer finds a more efficient,
> more readable, more maintainable solution, and he can do so because he
> has a deep knowledge of the problem that is unavailable to the
> compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
> even high, maximum for Hash evaluations.  A program containing a switch
> statement with more than 400 cases in it probably has enough other
> severe design problems that it will never be worth running anyway.

This is true in my experience for code written by humans, but not
necessarily for machine-generated code.  Take a look at insn-attrtab.c
in your GCC build directory, or at cp/parse.c in the source tree.

zw

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2001-01-04  1:39 ` Zack Weinberg
@ 2001-01-05 10:54   ` Michael Meissner
  2001-01-07 18:37     ` Andy Walker
  2001-01-07 18:12   ` Andy Walker
  1 sibling, 1 reply; 12+ messages in thread
From: Michael Meissner @ 2001-01-05 10:54 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Andy Walker, GCC general mailing list

On Thu, Jan 04, 2001 at 01:39:10AM -0800, Zack Weinberg wrote:
> On Wed, Dec 27, 2000 at 10:43:40PM -0600, Andy Walker wrote:
> [much interesting stuff]
> > My experience as a programmer is that programs are almost never written
> > with switches to more than a few hundred cases.  I personally have never
> > seen one.  More than that, and the programmer finds a more efficient,
> > more readable, more maintainable solution, and he can do so because he
> > has a deep knowledge of the problem that is unavailable to the
> > compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
> > even high, maximum for Hash evaluations.  A program containing a switch
> > statement with more than 400 cases in it probably has enough other
> > severe design problems that it will never be worth running anyway.
> 
> This is true in my experience for code written by humans, but not
> necessarily for machine-generated code.  Take a look at insn-attrtab.c
> in your GCC build directory, or at cp/parse.c in the source tree.

Some of the simulators in the gdb releases have the ability to build very giant
switch tables to speed up decoding machine instructions, though we haven't
built them with the largest tables in quite some time, because the compiler's
memory was going into the gigabytes.

At a former job, I recall hearing that they had built a profiler/simulator for
the machine, and used a completely dense 65k element switch table.

-- 
Michael Meissner, Red Hat, Inc.  (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work:	  meissner@redhat.com		phone: +1 978-486-9304
Non-work: meissner@spectacle-pond.org	fax:   +1 978-692-4482

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2001-01-05 10:54   ` Michael Meissner
@ 2001-01-07 18:37     ` Andy Walker
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Walker @ 2001-01-07 18:37 UTC (permalink / raw)
  To: Michael Meissner; +Cc: Zack Weinberg, GCC general mailing list

Interesting comment. I will take it into account.

    Andy

Michael Meissner wrote:

> On Thu, Jan 04, 2001 at 01:39:10AM -0800, Zack Weinberg wrote:
> > On Wed, Dec 27, 2000 at 10:43:40PM -0600, Andy Walker wrote:
> > [much interesting stuff]
> > > My experience as a programmer is that programs are almost never written
> > > with switches to more than a few hundred cases.  I personally have never
> > > seen one.  More than that, and the programmer finds a more efficient,
> > > more readable, more maintainable solution, and he can do so because he
> > > has a deep knowledge of the problem that is unavailable to the
> > > compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
> > > even high, maximum for Hash evaluations.  A program containing a switch
> > > statement with more than 400 cases in it probably has enough other
> > > severe design problems that it will never be worth running anyway.
> >
> > This is true in my experience for code written by humans, but not
> > necessarily for machine-generated code.  Take a look at insn-attrtab.c
> > in your GCC build directory, or at cp/parse.c in the source tree.
>
> Some of the simulators in the gdb releases have the ability to build very giant
> switch tables to speed up decoding machine instructions, though we haven't
> built them with the largest tables in quite some time, because the compiler's
> memory was going into the gigabytes.
>
> At a former job, I recall hearing that they had built a profiler/simulator for
> the machine, and used a completely dense 65k element switch table.
>
> --
> Michael Meissner, Red Hat, Inc.  (GCC group)
> PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
> Work:     meissner@redhat.com           phone: +1 978-486-9304
> Non-work: meissner@spectacle-pond.org   fax:   +1 978-692-4482

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2001-01-04  1:39 ` Zack Weinberg
  2001-01-05 10:54   ` Michael Meissner
@ 2001-01-07 18:12   ` Andy Walker
  1 sibling, 0 replies; 12+ messages in thread
From: Andy Walker @ 2001-01-07 18:12 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: GCC general mailing list

Ok. I will.

    Andy

Zack Weinberg wrote:

> On Wed, Dec 27, 2000 at 10:43:40PM -0600, Andy Walker wrote:
> [much interesting stuff]
> > My experience as a programmer is that programs are almost never written
> > with switches to more than a few hundred cases.  I personally have never
> > seen one.  More than that, and the programmer finds a more efficient,
> > more readable, more maintainable solution, and he can do so because he
> > has a deep knowledge of the problem that is unavailable to the
> > compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
> > even high, maximum for Hash evaluations.  A program containing a switch
> > statement with more than 400 cases in it probably has enough other
> > severe design problems that it will never be worth running anyway.
>
> This is true in my experience for code written by humans, but not
> necessarily for machine-generated code.  Take a look at insn-attrtab.c
> in your GCC build directory, or at cp/parse.c in the source tree.
>
> zw

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
@ 2000-12-28  7:15 Robert Dewar
  2000-12-28 21:06 ` Dave Korn
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2000-12-28  7:15 UTC (permalink / raw)
  To: gcc, jawalker

unique hashes are not even desirable, note that in particular the
original input meets the definition of a unique hash. I don't find
the concept at all useful in this context.

What you are loking for is of course a near-minimal perfect hash, that
is what BCPL compilers used for case statements. Quite a bit was published
in the BCPL context about this approach, so I recommend looking into the
literature (it is quite old, you might have to go to a library - GASP :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2000-12-28  7:15 Robert Dewar
@ 2000-12-28 21:06 ` Dave Korn
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Korn @ 2000-12-28 21:06 UTC (permalink / raw)
  To: gcc

> unique hashes are not even desirable, note that in particular the
> original input meets the definition of a unique hash. I don't find
> the concept at all useful in this context.

  I think Mr. Huffman and Mr. Kern might disagree..... surely a
.ZIP file is a unique hash of a given input, and one that has a 
very useful difference from that original input.  Of course, IANAQCS.

             DaveK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
@ 2000-12-28  7:19 Robert Dewar
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Dewar @ 2000-12-28  7:19 UTC (permalink / raw)
  To: gcc, jawalker

<<associated register loads, to be at least the equivalent of three test
and conditional branch instruction pairs.  (no divide instructions -- I
am allergic to those).
>>

Conditional jumps can be extremely expensive if mispredicted as will happen
in this case (they can cost 10's of cycles), so your relative estimate of
efficiencies here is badly off for most modern architectures.

<<Just for peace of mind, the present tool (binary tree) handles most
situations in the most efficient manner, IMHO.  Also, none of the
algorithms I found could guarantee a near-minimal, perfect Hash in
linear time.  A few wild stabs in the dark often bring success. When
they do not, the binary tree is the reliable backup plan.
>>

Who cares if the compilation time algorithm is linear time, that's a 
quite inappropriate criterion.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
@ 2000-12-28 21:10 dewar
  2000-12-28 22:41 ` Dave Korn
  0 siblings, 1 reply; 12+ messages in thread
From: dewar @ 2000-12-28 21:10 UTC (permalink / raw)
  To: davek-ml, gcc

<<  I think Mr. Huffman and Mr. Kern might disagree..... surely a
.ZIP file is a unique hash of a given input, and one that has a
very useful difference from that original input.  Of course, IANAQCS.
>>

A zip file is simply a 1-1 mapping that may (but obviously does not always
) have a length less than the original file -- obviously if the zip file
is shorter than the original file in some cases, it will be longer inother
cases.

No one would call this a hash function in my experience. As I say, the
central point of a hash function is that it performs a many to one
mapping, that's what makes it interesting.

To call all 1-1 mappings hash functions is truly odd terminology :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Hashing of "switch/case" selections
  2000-12-28 21:10 dewar
@ 2000-12-28 22:41 ` Dave Korn
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Korn @ 2000-12-28 22:41 UTC (permalink / raw)
  To: Robert Dewar, gcc

----- Original Message -----
From: <dewar@gnat.com>
Sent: Friday, December 29, 2000 5:10 AM

> <<  I think Mr. Huffman and Mr. Kern might disagree..... surely a
> .ZIP file is a unique hash of a given input, and one that has a
> very useful difference from that original input.  Of course, IANAQCS.
> >>
>
> A zip file is simply a 1-1 mapping that may (but obviously does not
> always) have a length less than the original file -- obviously if the
zip file
> is shorter than the original file in some cases, it will be longer
> in other cases.
>
> No one would call this a hash function in my experience. As I say, the
> central point of a hash function is that it performs a many to one
> mapping, that's what makes it interesting.

  The definition of 'unique' hash given in Andy's original mail seems to
me to imply that it is a one to one mapping. I do agree with you about the
conventional uses of hashes in (eg) string lookups; but how about crypto
hashes? A non-reversible hashing function that produced input the same
length as its output might be very useful there. Of course, IANAQCA. And
zipping a file is a lousy way to encrypt it.

> To call all 1-1 mappings hash functions is truly odd terminology :-)

  Ah, but I wasn't referring to *all* 1-1 mappings. I chose one that had
the property that for a non-flat frequency distribution of the input
range, the output will be shorter than the input. This seemed useful and
interesting to me.  I was stretching a metaphor slightly for humour. But
the point I was making is that if there's redundancy in the set of
possible input data, there may be some value in a unique hash. Mind you,
that's probably because I misread your original point

>"unique hashes are not even desirable, note that in particular the
>original input meets the definition of a unique hash. I don't find the
>concept at all useful in this context."

 and missed out the fairly vital qualifier, " in this context".  D'oh! I
believe it's time I went offline for a few hours....

           DaveK

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2001-01-07 18:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-27 20:41 Hashing of "switch/case" selections Andy Walker
2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
2001-01-01 14:30 ` Hashing of "switch/case" selections Nix
2001-01-04  1:39 ` Zack Weinberg
2001-01-05 10:54   ` Michael Meissner
2001-01-07 18:37     ` Andy Walker
2001-01-07 18:12   ` Andy Walker
2000-12-28  7:15 Robert Dewar
2000-12-28 21:06 ` Dave Korn
2000-12-28  7:19 Robert Dewar
2000-12-28 21:10 dewar
2000-12-28 22:41 ` Dave Korn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).