public inbox for glibc-bugs@sourceware.org help / color / mirror / Atom feed
From: "witold.baryluk+sourceware at gmail dot com" <sourceware-bugzilla@sourceware.org> To: glibc-bugs@sourceware.org Subject: [Bug libc/25924] New: Very poor choice of hash function in hsearch Date: Tue, 05 May 2020 14:05:29 +0000 [thread overview] Message-ID: <bug-25924-131@http.sourceware.org/bugzilla/> (raw) https://sourceware.org/bugzilla/show_bug.cgi?id=25924 Bug ID: 25924 Summary: Very poor choice of hash function in hsearch Product: glibc Version: 2.30 Status: UNCONFIRMED Severity: normal Priority: P2 Component: libc Assignee: unassigned at sourceware dot org Reporter: witold.baryluk+sourceware at gmail dot com CC: drepper.fsp at gmail dot com Target Milestone: --- hsearch_r right now uses this code to computer first hash: glibc/misc/hsearch_r.c in hsearch_r function: ==== unsigned int hval; unsigned int count; unsigned int len = strlen (item.key); unsigned int idx; /* Compute an value for the given string. Perhaps use a better method. */ hval = len; count = len; while (count-- > 0) { hval <<= 4; hval += item.key[count]; } if (hval == 0) ++hval; /* First hash function: simply take the modul but prevent zero. */ idx = hval % htab->size + 1; ==== This is extremely primitive and ad-hoc. I started digging into it, when I switched from key of the form sprintf("%d", i), to sprintf("%x", i), in a hope to speed up the code ("%x" is way faster than "%d" to encode), and the performance of hsearch_r dropped significantly (because number of collisions increased a lot). Almost twice slower. I recommend extracting minimal code from xxHash (more specifically xxh3 version), as it is extremely fast, has very good latency even on short keys, and superb statistical properties. https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png https://github.com/Cyan4973/xxHash , BSD license. Added benefit of xxHash is that it can be initialized with a secret (i.e. random key), so on each execution the hashing is pseudo-random even for the same keys. This can be used to prevent DoS attacks on hsearch_r - without suitable attacker can feed keys that hash to known hashes (and indexes) and makes number of collisions extremely high. However, there are other hashes that are similarly fast has good properties and are simpler to implement, for example see a bit dated but still useful comparison at http://www.azillionmonkeys.com/qed/hash.html Looking at the rest of the hsearch_r code, it looks like it is trying to implement double hashing scheme for resolving collisions, but is reusing hval with different modulus as a step. That is okish, but in many cases, will still produce poor performance (big number of colissions), if the hval in the first place is poor hash function and maps to not just same idx, but same hval. And I think the currently used hash function is very prone to that, especially for strings with hexdecimal characters (hex characters interplay with <<= 4 in a way that essentially removes the mixing, i.e. for ). -- You are receiving this mail because: You are on the CC list for the bug.
next reply other threads:[~2020-05-05 14:05 UTC|newest] Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-05-05 14:05 witold.baryluk+sourceware at gmail dot com [this message] 2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com 2020-05-05 19:08 ` carlos at redhat dot com 2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com 2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com 2020-05-06 21:04 ` carlos at redhat dot com 2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=bug-25924-131@http.sourceware.org/bugzilla/ \ --to=sourceware-bugzilla@sourceware.org \ --cc=glibc-bugs@sourceware.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).