public inbox for glibc-bugs@sourceware.org
From: "witold.baryluk+sourceware at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug libc/25924] Very poor choice of hash function in hsearch
Date: Tue, 05 May 2020 15:04:16 +0000
Message-ID: <bug-25924-131-20S9Hzz2GU@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-25924-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #1 from Witold Baryluk <witold.baryluk+sourceware at gmail dot com> ---
xxh3 could be overkill, but I found this nice, short hash function:


https://github.com/ZilongTan/fast-hash

This is the entire source code:

https://github.com/ZilongTan/fast-hash/blob/master/fasthash.c

/* The MIT License
   Copyright (C) 2012 Zilong Tan (eric.zltan@gmail.com)
   Permission is hereby granted, free of charge, to any person
   obtaining a copy of this software and associated documentation
   files (the "Software"), to deal in the Software without
   restriction, including without limitation the rights to use, copy,
   modify, merge, publish, distribute, sublicense, and/or sell copies
   of the Software, and to permit persons to whom the Software is
   furnished to do so, subject to the following conditions:
   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.
   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.
*/

#include "fasthash.h"

// Compression function for Merkle-Damgard construction.
// This function is generated using the framework provided.
#define mix(h) ({                                       \
                        (h) ^= (h) >> 23;               \
                        (h) *= 0x2127599bf4325c37ULL;   \
                        (h) ^= (h) >> 47; })

uint64_t fasthash64(const void *buf, size_t len, uint64_t seed)
{
        const uint64_t    m = 0x880355f21e6d1965ULL;
        const uint64_t *pos = (const uint64_t *)buf;
        const uint64_t *end = pos + (len / 8);
        const unsigned char *pos2;
        uint64_t h = seed ^ (len * m);
        uint64_t v;

        while (pos != end) {
                v  = *pos++;
                h ^= mix(v);
                h *= m;
        }

        pos2 = (const unsigned char*)pos;
        v = 0;

        switch (len & 7) {
        case 7: v ^= (uint64_t)pos2[6] << 48;
        case 6: v ^= (uint64_t)pos2[5] << 40;
        case 5: v ^= (uint64_t)pos2[4] << 32;
        case 4: v ^= (uint64_t)pos2[3] << 24;
        case 3: v ^= (uint64_t)pos2[2] << 16;
        case 2: v ^= (uint64_t)pos2[1] << 8;
        case 1: v ^= (uint64_t)pos2[0];
                h ^= mix(v);
                h *= m;
        }

        return mix(h);
} 

uint32_t fasthash32(const void *buf, size_t len, uint32_t seed)
{
        // the following trick converts the 64-bit hashcode to Fermat
        // residue, which shall retain information from both the higher
        // and lower parts of hashcode.
        uint64_t h = fasthash64(buf, len, seed);
        return h - (h >> 32);
}


An added benefit is that it also performs well for both small and large keys.
It consumes 8 bytes at a time when possible.
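
To make the "8 bytes at a time" point concrete, here is a minimal sketch of the
same algorithm that loads each word with memcpy instead of dereferencing a
uint64_t pointer, so it does not depend on the key being 8-byte aligned. The
names mix64 and fasthash64_portable are hypothetical and not part of fast-hash
or glibc. The constants and mixing steps are copied from the source quoted
above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same mixing step as the mix() macro quoted above, written as a
   function so it does not rely on GNU statement expressions. */
static inline uint64_t mix64(uint64_t h)
{
        h ^= h >> 23;
        h *= 0x2127599bf4325c37ULL;
        h ^= h >> 47;
        return h;
}

/* Hypothetical name: a sketch, not an API of fast-hash or glibc. */
uint64_t fasthash64_portable(const void *buf, size_t len, uint64_t seed)
{
        const uint64_t m = 0x880355f21e6d1965ULL;
        const unsigned char *p = buf;
        const unsigned char *end = p + (len / 8) * 8;
        uint64_t h = seed ^ (len * m);
        uint64_t v;
        size_t tail, i;

        /* Whole 8-byte words; memcpy is safe for unaligned keys. */
        for (; p != end; p += 8) {
                memcpy(&v, p, sizeof v);
                h ^= mix64(v);
                h *= m;
        }

        /* Trailing 1..7 bytes, least-significant byte first, matching
           the switch statement in the original. */
        tail = len & 7;
        if (tail != 0) {
                v = 0;
                for (i = 0; i < tail; i++)
                        v |= (uint64_t)p[i] << (8 * i);
                h ^= mix64(v);
                h *= m;
        }

        return mix64(h);
}
```

Like the original, this reads full words in native byte order, so hash values
differ between little-endian and big-endian machines. That should not matter
for an in-memory table.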

The seed could be randomly initialized at program startup and be the same for
all tables and threads.

I ran some small benchmarks, creating a table of size 30_000_000 and then
inserting 20_000_000 entries with sequential integer keys:

current hash "%d" - 4.296s
current hash "%x" - 6.012s

fasthash "%d" - 4.466s
fasthash "%x" - 4.434s

(real time, best of 3 runs).

The time includes the call to hcreate_r and a malloc for each entry key. A large
part of the program's runtime is spent in malloc and snprintf.
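
For anyone who wants to try something similar, here is a rough sketch of what
such a benchmark might look like. It is a reconstruction under assumptions,
not the program that produced the numbers above: the function name
bench_hsearch, the use of clock_gettime for timing, and the keys array kept
for cleanup are all hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for hcreate_r, hsearch_r, hdestroy_r */
#endif
#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Inserts n sequential integers as string keys ("%x" if hex is
   nonzero, "%d" otherwise) into a table created for table_size
   entries. Returns elapsed seconds, or -1.0 on failure. */
double bench_hsearch(size_t table_size, size_t n, int hex)
{
        struct hsearch_data tab;
        struct timespec t0, t1;
        char **keys;
        double secs = -1.0;
        size_t i, j;

        memset(&tab, 0, sizeof tab);    /* must be zeroed first */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (!hcreate_r(table_size, &tab))
                return -1.0;

        /* Remember the keys so they can be freed afterwards. */
        keys = calloc(n ? n : 1, sizeof *keys);
        if (keys == NULL) {
                hdestroy_r(&tab);
                return -1.0;
        }

        for (i = 0; i < n; i++) {
                char buf[32];
                ENTRY e, *found;

                if (hex)
                        snprintf(buf, sizeof buf, "%x", (unsigned)i);
                else
                        snprintf(buf, sizeof buf, "%d", (int)i);

                keys[i] = strdup(buf);  /* one malloc per key */
                if (keys[i] == NULL)
                        break;
                e.key = keys[i];
                e.data = NULL;
                if (!hsearch_r(e, ENTER, &found, &tab))
                        break;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (i == n)
                secs = (double)(t1.tv_sec - t0.tv_sec)
                     + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

        for (j = 0; j < n; j++)
                free(keys[j]);          /* free(NULL) is a no-op */
        free(keys);
        hdestroy_r(&tab);               /* frees the table, not keys */
        return secs;
}
```

Calling bench_hsearch(30000000, 20000000, 0) and
bench_hsearch(30000000, 20000000, 1) would correspond to the "%d" and "%x"
rows, though absolute times will depend on the machine and the libc.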

Thread overview: 7+ messages
2020-05-05 14:05 [Bug libc/25924] New: " witold.baryluk+sourceware at gmail dot com
2020-05-05 15:04 ` witold.baryluk+sourceware at gmail dot com [this message]
2020-05-05 19:08 ` [Bug libc/25924] " carlos at redhat dot com
2020-05-06 20:36 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 20:40 ` witold.baryluk+sourceware at gmail dot com
2020-05-06 21:04 ` carlos at redhat dot com
2020-05-06 21:08 ` witold.baryluk+sourceware at gmail dot com
