public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug libc/25924] New: Very poor choice of hash function in hsearch
@ 2020-05-05 14:05 witold.baryluk+sourceware at gmail dot com
  2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-05 14:05 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

            Bug ID: 25924
           Summary: Very poor choice of hash function in hsearch
           Product: glibc
           Version: 2.30
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: witold.baryluk+sourceware at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

hsearch_r right now uses this code to computer first hash:


glibc/misc/hsearch_r.c in hsearch_r function:

====
  unsigned int hval;
  unsigned int count;
  unsigned int len = strlen (item.key);
  unsigned int idx;

  /* Compute an value for the given string. Perhaps use a better method. */
  hval = len;
  count = len;
  while (count-- > 0)
    {
      hval <<= 4;
      hval += item.key[count];
    }
  if (hval == 0)
    ++hval;

  /* First hash function: simply take the modul but prevent zero. */
  idx = hval % htab->size + 1;
====


This is extremely primitive and ad-hoc.

I started digging into it, when I switched from key of the form sprintf("%d",
i), to sprintf("%x", i), in a hope to speed up the code ("%x" is way faster
than "%d" to encode), and the performance of hsearch_r dropped significantly
(because number of collisions increased a lot). Almost twice slower.

I recommend extracting minimal code from xxHash (more specifically xxh3
version), as it is extremely fast, has very good latency even on short keys,
and superb statistical properties.

https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png

https://github.com/Cyan4973/xxHash  , BSD license.

Added benefit of xxHash is that it can be initialized with a secret (i.e.
random key), so on each execution the hashing is pseudo-random even for the
same keys. This can be used to prevent DoS attacks on hsearch_r - without
suitable attacker can feed keys that hash to known hashes (and indexes) and
makes number of collisions extremely high.

However, there are other hashes that are similarly fast has good properties and
are simpler to implement, for example see a bit dated but still useful
comparison at http://www.azillionmonkeys.com/qed/hash.html


Looking at the rest of the hsearch_r code, it looks like it is trying to
implement double hashing scheme for resolving collisions, but is reusing hval
with different modulus as a step. That is okish, but in many cases, will still
produce poor performance (big number of colissions), if the hval in the first
place is poor hash function and maps to not just same idx, but same hval. And I
think the currently used hash function is very prone to that, especially for
strings with hexdecimal characters (hex characters interplay with <<= 4 in a
way that essentially removes the mixing, i.e. for ).

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-06 21:08 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
2020-05-05 19:08 ` carlos at redhat dot com
2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 21:04 ` carlos at redhat dot com
2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).