From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 8FE2E388F06F; Tue, 5 May 2020 14:05:29 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8FE2E388F06F From: "witold.baryluk+sourceware at gmail dot com" To: glibc-bugs@sourceware.org Subject: [Bug libc/25924] New: Very poor choice of hash function in hsearch Date: Tue, 05 May 2020 14:05:29 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: glibc X-Bugzilla-Component: libc X-Bugzilla-Version: 2.30 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: witold.baryluk+sourceware at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: unassigned at sourceware dot org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter cc target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: glibc-bugs@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Glibc-bugs mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 May 2020 14:05:29 -0000 https://sourceware.org/bugzilla/show_bug.cgi?id=3D25924 Bug ID: 25924 Summary: Very poor choice of hash function in hsearch Product: glibc Version: 2.30 Status: UNCONFIRMED Severity: normal Priority: P2 Component: libc Assignee: unassigned at sourceware dot org Reporter: witold.baryluk+sourceware at gmail dot com CC: drepper.fsp at gmail dot com Target Milestone: --- hsearch_r right now uses this code to computer first hash: glibc/misc/hsearch_r.c in hsearch_r function: =3D=3D=3D=3D unsigned int hval; unsigned int count; unsigned int len =3D strlen (item.key); unsigned int idx; /* Compute an value for the given string. Perhaps use a better method. */ hval =3D len; count =3D len; while (count-- > 0) { hval <<=3D 4; hval +=3D item.key[count]; } if (hval =3D=3D 0) ++hval; /* First hash function: simply take the modul but prevent zero. */ idx =3D hval % htab->size + 1; =3D=3D=3D=3D This is extremely primitive and ad-hoc. I started digging into it, when I switched from key of the form sprintf("%d= ", i), to sprintf("%x", i), in a hope to speed up the code ("%x" is way faster than "%d" to encode), and the performance of hsearch_r dropped significantly (because number of collisions increased a lot). Almost twice slower. I recommend extracting minimal code from xxHash (more specifically xxh3 version), as it is extremely fast, has very good latency even on short keys, and superb statistical properties. https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e= 9-9239-e5375d6c080f.png https://github.com/Cyan4973/xxHash , BSD license. Added benefit of xxHash is that it can be initialized with a secret (i.e. random key), so on each execution the hashing is pseudo-random even for the same keys. This can be used to prevent DoS attacks on hsearch_r - without suitable attacker can feed keys that hash to known hashes (and indexes) and makes number of collisions extremely high. However, there are other hashes that are similarly fast has good properties= and are simpler to implement, for example see a bit dated but still useful comparison at http://www.azillionmonkeys.com/qed/hash.html Looking at the rest of the hsearch_r code, it looks like it is trying to implement double hashing scheme for resolving collisions, but is reusing hv= al with different modulus as a step. That is okish, but in many cases, will st= ill produce poor performance (big number of colissions), if the hval in the fir= st place is poor hash function and maps to not just same idx, but same hval. A= nd I think the currently used hash function is very prone to that, especially for strings with hexdecimal characters (hex characters interplay with <<=3D 4 i= n a way that essentially removes the mixing, i.e. for ). --=20 You are receiving this mail because: You are on the CC list for the bug.=