[Bug libc/25924] New: Very poor choice of hash function in hsearch

public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug libc/25924] New: Very poor choice of hash function in hsearch
@ 2020-05-05 14:05 witold.baryluk+sourceware at gmail dot com
  2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-05 14:05 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

            Bug ID: 25924
           Summary: Very poor choice of hash function in hsearch
           Product: glibc
           Version: 2.30
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: witold.baryluk+sourceware at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

hsearch_r right now uses this code to computer first hash:

glibc/misc/hsearch_r.c in hsearch_r function:

====
  unsigned int hval;
  unsigned int count;
  unsigned int len = strlen (item.key);
  unsigned int idx;

  /* Compute an value for the given string. Perhaps use a better method. */
  hval = len;
  count = len;
  while (count-- > 0)
    {
      hval <<= 4;
      hval += item.key[count];
    }
  if (hval == 0)
    ++hval;

  /* First hash function: simply take the modul but prevent zero. */
  idx = hval % htab->size + 1;
====

This is extremely primitive and ad-hoc.

I started digging into it, when I switched from key of the form sprintf("%d",
i), to sprintf("%x", i), in a hope to speed up the code ("%x" is way faster
than "%d" to encode), and the performance of hsearch_r dropped significantly
(because number of collisions increased a lot). Almost twice slower.

I recommend extracting minimal code from xxHash (more specifically xxh3
version), as it is extremely fast, has very good latency even on short keys,
and superb statistical properties.

https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png

https://github.com/Cyan4973/xxHash  , BSD license.

Added benefit of xxHash is that it can be initialized with a secret (i.e.
random key), so on each execution the hashing is pseudo-random even for the
same keys. This can be used to prevent DoS attacks on hsearch_r - without
suitable attacker can feed keys that hash to known hashes (and indexes) and
makes number of collisions extremely high.

However, there are other hashes that are similarly fast has good properties and
are simpler to implement, for example see a bit dated but still useful
comparison at http://www.azillionmonkeys.com/qed/hash.html

Looking at the rest of the hsearch_r code, it looks like it is trying to
implement double hashing scheme for resolving collisions, but is reusing hval
with different modulus as a step. That is okish, but in many cases, will still
produce poor performance (big number of colissions), if the hval in the first
place is poor hash function and maps to not just same idx, but same hval. And I
think the currently used hash function is very prone to that, especially for
strings with hexdecimal characters (hex characters interplay with <<= 4 in a
way that essentially removes the mixing, i.e. for ).

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
@ 2020-05-05 15:04 ` witold.baryluk+sourceware at gmail dot com
  2020-05-05 19:08 ` carlos at redhat dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-05 15:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #1 from Witold Baryluk <witold.baryluk+sourceware at gmail dot com> ---
xxh3 could be an overkill, but I found this nice and short hash function:

https://github.com/ZilongTan/fast-hash

this is entire source code:

https://github.com/ZilongTan/fast-hash/blob/master/fasthash.c

/* The MIT License
   Copyright (C) 2012 Zilong Tan (eric.zltan@gmail.com)
   Permission is hereby granted, free of charge, to any person
   obtaining a copy of this software and associated documentation
   files (the "Software"), to deal in the Software without
   restriction, including without limitation the rights to use, copy,
   modify, merge, publish, distribute, sublicense, and/or sell copies
   of the Software, and to permit persons to whom the Software is
   furnished to do so, subject to the following conditions:
   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.
   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.
*/

#include "fasthash.h"

// Compression function for Merkle-Damgard construction.
// This function is generated using the framework provided.
#define mix(h) ({                                       \
                        (h) ^= (h) >> 23;               \
                        (h) *= 0x2127599bf4325c37ULL;   \
                        (h) ^= (h) >> 47; })

uint64_t fasthash64(const void *buf, size_t len, uint64_t seed)
{
        const uint64_t    m = 0x880355f21e6d1965ULL;
        const uint64_t *pos = (const uint64_t *)buf;
        const uint64_t *end = pos + (len / 8);
        const unsigned char *pos2;
        uint64_t h = seed ^ (len * m);
        uint64_t v;

        while (pos != end) {
                v  = *pos++;
                h ^= mix(v);
                h *= m;
        }

        pos2 = (const unsigned char*)pos;
        v = 0;

        switch (len & 7) {
        case 7: v ^= (uint64_t)pos2[6] << 48;
        case 6: v ^= (uint64_t)pos2[5] << 40;
        case 5: v ^= (uint64_t)pos2[4] << 32;
        case 4: v ^= (uint64_t)pos2[3] << 24;
        case 3: v ^= (uint64_t)pos2[2] << 16;
        case 2: v ^= (uint64_t)pos2[1] << 8;
        case 1: v ^= (uint64_t)pos2[0];
                h ^= mix(v);
                h *= m;
        }

        return mix(h);
} 

uint32_t fasthash32(const void *buf, size_t len, uint32_t seed)
{
        // the following trick converts the 64-bit hashcode to Fermat
        // residue, which shall retain information from both the higher
        // and lower parts of hashcode.
        uint64_t h = fasthash64(buf, len, seed);
        return h - (h >> 32);
}

The added benefit is that is also performs good for small and big keys. It
consumes 8-bytes at the time when possible.

The seed could be random initialized on program startup, and be the same for
all tables and threads.

I did some small benchmarks, creating table with size 30_000_000, and then
inserting 20_000_000 entries with sequential integer numbers:

current hash "%d" - 4.296s
current hash "%x" - 6.012s

fasthash "%d" - 4.466s
fasthash "%x" - 4.434s

(real time, best of 3 runs).

The time includes call to hcreate_r and malloc for each entry key. The big part
of the program runtime is spend in malloc and snprintf.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
  2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
@ 2020-05-05 19:08 ` carlos at redhat dot com
  2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: carlos at redhat dot com @ 2020-05-05 19:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com
             Status|UNCONFIRMED                 |SUSPENDED
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2020-05-05

--- Comment #2 from Carlos O'Donell <carlos at redhat dot com> ---
I agree, there might be better hash implementations, but selecting one also
requires performance testing on the various architectures that glibc supports.
Therefore there are really many steps to getting performance related work
accepted in the project:

(1) Proof of performance for various architectures, or work with the
architecture maintainers to get them to run microbenchmarks for you.

(2) The patch, which has to follow our contribution checklist (more on that
later).

(3) Tests for the changes. Systems level work requires good testing.

Now back to the contribution checklist:

As a GNU project we have some specific copyright assignment requirements,
particularly:
https://sourceware.org/glib/wiki/Contribution%20checklist#FSF_copyright_Assignment
https://www.gnu.org/licenses/why-assign.en.html

You can see our full contribution checklist here:
https://sourceware.org/glibc/wiki/Contribution%20checklist

We want the original author of that code to disclaim or assign copyright to the
FSF for the project to use.

I'm going to move this feature request into SUSPENDED. We can move this out of
SUSPENDED if we have an implementation that can be assigned to the FSF for
glibc to use to improve the hash function.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
  2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
  2020-05-05 19:08 ` carlos at redhat dot com
@ 2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
  2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-06 20:36 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #3 from Witold Baryluk <witold.baryluk+sourceware at gmail dot com> ---
Re:

(1) Proof of performance for various architectures, or work with the
architecture maintainers to get them to run microbenchmarks for you.

I am not sure what you refer to, but the performance (number of colissions in
the hash table for a given set of keys and order) of this hash function is
architecture independent.

Re:
(2)

As of the licensing, the only option is to use existing hash function
(developing a new hash function is not easy, and usually ends up bad), which
implementation would definitively be already licensed and owned by somebody
else, so copyright assignment to FSF is simply impossible. So I don't know how
would you go from there. All hash functions developed and published are usually
published under permissive non-copyleft licenses.

AFAIK you can integrate MIT code into GPL / LGPL code base, without asking the
copyright owners, as long as the used code is marked with the license text. It
states so directly in the license text.

The integration itself trivial (literally 2 lines of code).

However, I can't grant FSF copyright assignment.

You are free to contact the copyright owner of any related code get one.

Re:
(3) Tests for the changes. Systems level work requires good testing.

Are you implying glibc doesn't have tests and test infrastructure at the
moment? Fix for this bug doesn't introduce any public APIs.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
                   ` (2 preceding siblings ...)
  2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
@ 2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
  2020-05-06 21:04 ` carlos at redhat dot com
  2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-06 20:40 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #4 from Witold Baryluk <witold.baryluk+sourceware at gmail dot com> ---
I did a bit more of benchmarking on amd64, (AMD Threadripper TR 2950X), and
excluded the time for malloc / snprintf. I also instrumented the while loop in
hsearch_r to count collisions during insertion (ENTER). 20M insertions into
table of size 30M. Wall clock time - best of 5 runs.

old hash:
dec_keys  1.144s   6149640 collisions
hex_keys  3.319s  87011211 collisions

new hash, fasthash64:
dec_keys  1.146s   2313573 collisions
hex_keys  1.170s   2312409 collisions

It is just an example. But you can see new hash (or other good hash) has
significantly less collisions, and "%d" and "%x" style keys having essentially
same (and fast) performance and no collisions blow-up.

I did not measure the lookup (FIND), but considering number of collisions with
existing hash, I expect old hash to perform very bad.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
                   ` (3 preceding siblings ...)
  2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
@ 2020-05-06 21:04 ` carlos at redhat dot com
  2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: carlos at redhat dot com @ 2020-05-06 21:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Witold Baryluk from comment #3)
> (1) Proof of performance for various architectures, or work with the
> architecture maintainers to get them to run microbenchmarks for you.
> 
> I am not sure what you refer to, but the performance (number of colissions
> in the hash table for a given set of keys and order) of this hash function
> is architecture independent.

Yes, number of collisions is an abstract measure of performance for the
algorithm.

To make performance changes to glibc we add microbenchmarks that model specific
workloads. We would need to add glibc/benchtests/* microbenchmarks that measure
the performance of the APIs in question with specific workloads.

Alternatively real performance numbers across a variety of architectures.

> Re:
> (2)
> 
> As of the licensing, the only option is to use existing hash function
> (developing a new hash function is not easy, and usually ends up bad), which
> implementation would definitively be already licensed and owned by somebody
> else, so copyright assignment to FSF is simply impossible. 

It is not impossible. As you note, we have to reach out to the authors and ask
if they will contribute the copyright of the code to the FSF and glibc.

> So I don't know
> how would you go from there. All hash functions developed and published are
> usually published under permissive non-copyleft licenses.
> 
> AFAIK you can integrate MIT code into GPL / LGPL code base, without asking
> the copyright owners, as long as the used code is marked with the license
> text. It states so directly in the license text.

Integration of code and the licenses is not the problem.

Please see:
"Why the FSF gets copyright assignments from contributors"
https://www.gnu.org/licenses/why-assign.en.html

We have integrated some non-LGPL code into glibc, but we would have to have a
very strong reason to do so again.

In this case I would need to see a more exhaustive search of hash functions and
which licenses they are under, and which authors would be interested in
collaborating with glibc.

> Re:
> (3) Tests for the changes. Systems level work requires good testing.
> 
> Are you implying glibc doesn't have tests and test infrastructure at the
> moment? Fix for this bug doesn't introduce any public APIs.

We have only 3 hsearch tests to verify the correct operation of hsearch.

We have empirical data from hours of operation that the *current*
implementation appears to be correct (but not as fast).

In order to change an implementation that hours of operation and few tests,
would require we increase the number of tests to validate the new
implementation.

Thus we would need more hsearch tests added.

Even if you can't provide a new hsearch implementation, it would be great to
have someone add more hsearch tests, and an hsearch microbenchmark. This would
go a long way to preparing the project to drop in a new hash function for
hsearch.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug libc/25924] Very poor choice of hash function in hsearch
  2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
                   ` (4 preceding siblings ...)
  2020-05-06 21:04 ` carlos at redhat dot com
@ 2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: witold.baryluk+sourceware at gmail dot com @ 2020-05-06 21:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #6 from Witold Baryluk <witold.baryluk+sourceware at gmail dot com> ---
Thanks for clarifications Carlos. I will look into tests first for hsearch. It
is always good to have some, including (deterministically) randomized mini
tests for example.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-06 21:08 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05 14:05 [Bug libc/25924] New: Very poor choice of hash function in hsearch witold.baryluk+sourceware at gmail dot com
2020-05-05 15:04 ` [Bug libc/25924] " witold.baryluk+sourceware at gmail dot com
2020-05-05 19:08 ` carlos at redhat dot com
2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 21:04 ` carlos at redhat dot com
2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).