From: "witold.baryluk+sourceware at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug libc/25924] Very poor choice of hash function in hsearch
Date: Tue, 05 May 2020 15:04:16 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: libc
X-Bugzilla-Version: 2.30
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
List-Id: Glibc-bugs mailing list

https://sourceware.org/bugzilla/show_bug.cgi?id=25924

--- Comment #1 from Witold Baryluk ---
xxh3 could be overkill, but I found this nice and short hash function:

https://github.com/ZilongTan/fast-hash

This is the entire source code:

https://github.com/ZilongTan/fast-hash/blob/master/fasthash.c

/* The MIT License

   Copyright (C) 2012 Zilong Tan (eric.zltan@gmail.com)

   Permission is hereby granted, free of charge, to any person
   obtaining a copy of this software and associated documentation
   files (the "Software"), to deal in the Software without
   restriction, including without limitation the rights to use, copy,
   modify, merge, publish, distribute,
   sublicense, and/or sell copies of the Software, and to permit
   persons to whom the Software is furnished to do so, subject to the
   following conditions:

   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
   SOFTWARE.
*/

#include "fasthash.h"

// Compression function for Merkle-Damgard construction.
// This function is generated using the framework provided.
#define mix(h) ({                       \
        (h) ^= (h) >> 23;               \
        (h) *= 0x2127599bf4325c37ULL;   \
        (h) ^= (h) >> 47; })

uint64_t fasthash64(const void *buf, size_t len, uint64_t seed)
{
    const uint64_t m = 0x880355f21e6d1965ULL;
    const uint64_t *pos = (const uint64_t *)buf;
    const uint64_t *end = pos + (len / 8);
    const unsigned char *pos2;
    uint64_t h = seed ^ (len * m);
    uint64_t v;

    while (pos != end) {
        v = *pos++;
        h ^= mix(v);
        h *= m;
    }

    pos2 = (const unsigned char*)pos;
    v = 0;

    switch (len & 7) {
    case 7: v ^= (uint64_t)pos2[6] << 48;
    case 6: v ^= (uint64_t)pos2[5] << 40;
    case 5: v ^= (uint64_t)pos2[4] << 32;
    case 4: v ^= (uint64_t)pos2[3] << 24;
    case 3: v ^= (uint64_t)pos2[2] << 16;
    case 2: v ^= (uint64_t)pos2[1] << 8;
    case 1: v ^= (uint64_t)pos2[0];
            h ^= mix(v);
            h *= m;
    }

    return mix(h);
}

uint32_t fasthash32(const void *buf, size_t len, uint32_t seed)
{
    // the following trick converts the 64-bit hashcode to Fermat
    // residue, which shall retain information from both the higher
    // and lower parts of hashcode.
    uint64_t h = fasthash64(buf, len, seed);
    return h - (h >> 32);
}

The added benefit is that it also performs well for both small and big
keys. It consumes 8 bytes at a time when possible.

The seed could be randomly initialized at program startup and be the
same for all tables and threads.

I did some small benchmarks, creating a table with size 30_000_000 and
then inserting 20_000_000 entries keyed by sequential integers:

current hash "%d" - 4.296s
current hash "%x" - 6.012s

fasthash "%d" - 4.466s
fasthash "%x" - 4.434s

(real time, best of 3 runs).

The time includes the call to hcreate_r and a malloc for each entry key.
A big part of the program runtime is spent in malloc and snprintf.
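
For readers who want to try something similar, a benchmark of this shape could be sketched roughly as below. This is not the program used for the numbers above. The helper name fill_table and its parameters are hypothetical, and it uses the plain hcreate/hsearch/hdestroy interface rather than hcreate_r to keep the sketch short. It assumes glibc behaviour, where hdestroy does not free the keys.

```c
#include <assert.h>
#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from the original report.  Creates a
   table sized for NEL elements, inserts the integers 0..N-1 formatted
   as decimal ("%d") or hex ("%x") strings, and returns how many
   inserts succeeded.  */
static size_t
fill_table (size_t nel, int n, int use_hex)
{
  size_t inserted = 0;
  char **keys = calloc ((size_t) n, sizeof *keys);

  if (keys == NULL)
    return 0;
  if (hcreate (nel) == 0)
    {
      free (keys);
      return 0;
    }

  for (int i = 0; i < n; i++)
    {
      char buf[32];
      ENTRY e;
      int len = use_hex ? snprintf (buf, sizeof buf, "%x", (unsigned int) i)
                        : snprintf (buf, sizeof buf, "%d", i);

      assert (len > 0 && (size_t) len < sizeof buf);

      /* hsearch stores the key pointer itself, so every key needs its
         own allocation that stays valid while the table exists.  */
      keys[i] = malloc ((size_t) len + 1);
      if (keys[i] == NULL)
        break;
      memcpy (keys[i], buf, (size_t) len + 1);

      e.key = keys[i];
      e.data = NULL;
      if (hsearch (e, ENTER) != NULL)
        inserted++;
    }

  hdestroy ();
  /* Assumption: glibc semantics, where hdestroy leaves the keys alone,
     so they are released here.  */
  for (int i = 0; i < n; i++)
    free (keys[i]);
  free (keys);
  return inserted;
}
```

Under these assumptions, calling fill_table (30000000, 20000000, 0) from main and running the program under the shell's time command would approximate the "%d" case, and passing 1 as the last argument the "%x" case.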