public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Hashing of "switch/case" selections
@ 2000-12-27 20:41 Andy Walker
  2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andy Walker @ 2000-12-27 20:41 UTC (permalink / raw)
  To: GCC general mailing list

I have been doing some research on this.  Quick definitions for those
who are not Sorting and Searching Wizards:  Hash indicies: unique
preHashed index values that each have a meaning.  A Hash function: a
function applied to a value (usually an integer, or rarely, a character
string) to get a result.  A unique Hash: a Hash function that returns a
unique result for every unique input.  A perfect Hash: similar to a
unique Hash, but not as stringent; a Hash function that returns a unique
result for every Hash index (note that invalid, ie non-index values may,
and often do, return the same result as a valid Hash index). A minimal,
perfect Hash: a Hash function that returns "n"  sequential values for
all "n" valid Hash indicies, not necessarily in the same order..  A
near-minimal, perfect Hash: as close as I can get to the the previous
definition.

Unique Hashes are usually too much to ask for. I have not found any
references that begin to suggest how to design and code such a beast.

Therefore, the perfect Hash is the minimum acceptable function.  The
first consequence of this approach is that a table of the valid indicies
must be maintained for verification, to distinguish the invalid values
from the valid indicies. If there is a table, then a high and low test
must be performed to guarantee that the Hash result falls in the range
of the table, before the verification test.  If we use a multiplication
for the Hash function, I expect the multiply instruction, plus
associated register loads, to be at least the equivalent of three test
and conditional branch instruction pairs.  (no divide instructions -- I
am allergic to those).

A binary tree search using only five test and conditional branch pairs
plus a final jump table/verification table lookup and comparision, will
compile more quickly, will run at least as fast, and will take up less
space than the simplest general Hash function..  If there are 32 or
fewer cases, binary search is king.  Note that this is totally unrelated
to the spread of the index values.

I have recently changed jobs, but at my old job I did a rough and
unscientific analysis of a modestly sized "C/C++" library.  I found
rougly one-hundred-three instances of "switch" with cases.  Ninety-two
of them had fewer than 32 cases, and most of those were sixteen cases or
less.  Of the remaining nine, all but one were handling ASCII characters
as integers for parsing functions.

Just for peace of mind, the present tool (binary tree) handles most
situations in the most efficient manner, IMHO.  Also, none of the
algorithms I found could guarantee a near-minimal, perfect Hash in
linear time.  A few wild stabs in the dark often bring success. When
they do not, the binary tree is the reliable backup plan.

I am still working on "switch" for 256 or fewer cases, and more than 256
cases.  The algorithm I am analysing is a multiplication with truncation
(/256), and mulitiplication then squaring with truncation (/256).  When
I get some reliable timing numbers, I will post them here for further
discussion.

My experience as a programmer is that programs are almost never written
with switches to more than a few hundred cases.  I personally have never
seen one.  More than that, and the programmer finds a more efficient,
more readable, more maintainable solution, and he can do so because he
has a deep knowledge of the problem that is unavailable to the
compiler.  My seat-of-pants estimate is that 400 cases is a good, maybe
even high, maximum for Hash evaluations.  A program containing a switch
statement with more than 400 cases in it probably has enough other
severe design problems that it will never be worth running anyway.

    Andy



^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Hashing of "switch/case" selections
@ 2000-12-28  7:15 Robert Dewar
  2000-12-28 21:06 ` Dave Korn
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Dewar @ 2000-12-28  7:15 UTC (permalink / raw)
  To: gcc, jawalker

unique hashes are not even desirable, note that in particular the
original input meets the definition of a unique hash. I don't find
the concept at all useful in this context.

What you are loking for is of course a near-minimal perfect hash, that
is what BCPL compilers used for case statements. Quite a bit was published
in the BCPL context about this approach, so I recommend looking into the
literature (it is quite old, you might have to go to a library - GASP :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Hashing of "switch/case" selections
@ 2000-12-28  7:19 Robert Dewar
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Dewar @ 2000-12-28  7:19 UTC (permalink / raw)
  To: gcc, jawalker

<<associated register loads, to be at least the equivalent of three test
and conditional branch instruction pairs.  (no divide instructions -- I
am allergic to those).
>>

Conditional jumps can be extremely expensive if mispredicted as will happen
in this case (they can cost 10's of cycles), so your relative estimate of
efficiencies here is badly off for most modern architectures.

<<Just for peace of mind, the present tool (binary tree) handles most
situations in the most efficient manner, IMHO.  Also, none of the
algorithms I found could guarantee a near-minimal, perfect Hash in
linear time.  A few wild stabs in the dark often bring success. When
they do not, the binary tree is the reliable backup plan.
>>

Who cares if the compilation time algorithm is linear time, that's a 
quite inappropriate criterion.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Hashing of "switch/case" selections
@ 2000-12-28 21:10 dewar
  2000-12-28 22:41 ` Dave Korn
  0 siblings, 1 reply; 12+ messages in thread
From: dewar @ 2000-12-28 21:10 UTC (permalink / raw)
  To: davek-ml, gcc

<<  I think Mr. Huffman and Mr. Kern might disagree..... surely a
.ZIP file is a unique hash of a given input, and one that has a
very useful difference from that original input.  Of course, IANAQCS.
>>

A zip file is simply a 1-1 mapping that may (but obviously does not always
) have a length less than the original file -- obviously if the zip file
is shorter than the original file in some cases, it will be longer inother
cases.

No one would call this a hash function in my experience. As I say, the
central point of a hash function is that it performs a many to one
mapping, that's what makes it interesting.

To call all 1-1 mappings hash functions is truly odd terminology :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2001-01-07 18:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-12-27 20:41 Hashing of "switch/case" selections Andy Walker
2000-12-27 21:33 ` I have found an ICE Magnus Fromreide
2001-01-01 14:30 ` Hashing of "switch/case" selections Nix
2001-01-04  1:39 ` Zack Weinberg
2001-01-05 10:54   ` Michael Meissner
2001-01-07 18:37     ` Andy Walker
2001-01-07 18:12   ` Andy Walker
2000-12-28  7:15 Robert Dewar
2000-12-28 21:06 ` Dave Korn
2000-12-28  7:19 Robert Dewar
2000-12-28 21:10 dewar
2000-12-28 22:41 ` Dave Korn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).