public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "witold.baryluk+sourceware at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug libc/25924] New: Very poor choice of hash function in hsearch
Date: Tue, 05 May 2020 14:05:29 +0000	[thread overview]
Message-ID: <bug-25924-131@http.sourceware.org/bugzilla/> (raw)

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

            Bug ID: 25924
           Summary: Very poor choice of hash function in hsearch
           Product: glibc
           Version: 2.30
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: witold.baryluk+sourceware at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

hsearch_r right now uses this code to computer first hash:


glibc/misc/hsearch_r.c in hsearch_r function:

====
  unsigned int hval;
  unsigned int count;
  unsigned int len = strlen (item.key);
  unsigned int idx;

  /* Compute an value for the given string. Perhaps use a better method. */
  hval = len;
  count = len;
  while (count-- > 0)
    {
      hval <<= 4;
      hval += item.key[count];
    }
  if (hval == 0)
    ++hval;

  /* First hash function: simply take the modul but prevent zero. */
  idx = hval % htab->size + 1;
====


This is extremely primitive and ad-hoc.

I started digging into it, when I switched from key of the form sprintf("%d",
i), to sprintf("%x", i), in a hope to speed up the code ("%x" is way faster
than "%d" to encode), and the performance of hsearch_r dropped significantly
(because number of collisions increased a lot). Almost twice slower.

I recommend extracting minimal code from xxHash (more specifically xxh3
version), as it is extremely fast, has very good latency even on short keys,
and superb statistical properties.

https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png

https://github.com/Cyan4973/xxHash  , BSD license.

Added benefit of xxHash is that it can be initialized with a secret (i.e.
random key), so on each execution the hashing is pseudo-random even for the
same keys. This can be used to prevent DoS attacks on hsearch_r - without
suitable attacker can feed keys that hash to known hashes (and indexes) and
makes number of collisions extremely high.

However, there are other hashes that are similarly fast has good properties and
are simpler to implement, for example see a bit dated but still useful
comparison at http://www.azillionmonkeys.com/qed/hash.html


Looking at the rest of the hsearch_r code, it looks like it is trying to
implement double hashing scheme for resolving collisions, but is reusing hval
with different modulus as a step. That is okish, but in many cases, will still
produce poor performance (big number of colissions), if the hval in the first
place is poor hash function and maps to not just same idx, but same hval. And I
think the currently used hash function is very prone to that, especially for
strings with hexdecimal characters (hex characters interplay with <<= 4 in a
way that essentially removes the mixing, i.e. for ).

-- 
You are receiving this mail because:
You are on the CC list for the bug.

             reply	other threads:[~2020-05-05 14:05 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-05 14:05 witold.baryluk+sourceware at gmail dot com [this message]
2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
2020-05-05 19:08 ` carlos at redhat dot com
2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 21:04 ` carlos at redhat dot com
2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-25924-131@http.sourceware.org/bugzilla/ \
    --to=sourceware-bugzilla@sourceware.org \
    --cc=glibc-bugs@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).