public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* [Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices
@ 2016-03-10  2:24 raeburn at permabit dot com
  2016-03-10 21:08 ` [Bug runtime/19802] " dsmith at redhat dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: raeburn at permabit dot com @ 2016-03-10  2:24 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=19802

            Bug ID: 19802
           Summary: bad hash value distribution and horrible performance
                    for large arrays with multiple small-integer indices
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: raeburn at permabit dot com
  Target Milestone: ---

Created attachment 9082
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9082&action=edit
stap script to report on hash-value distribution for array indexing

I've been trying to figure out why a big SystemTap script of mine is so slow.
Running "perf top" against it shows me that it's spending a huge amount of time
walking through entries in a bucket in an associative array (holding
aggregates, if that matters).

It's a big array. I've declared it as holding 500000 elements, and after
noticing that MAXMAPENTRIES also controls the range of hash values, I set that
to 500000 as well. But the script still spends a huge amount of time in that
small loop. This made me question whether the distribution across buckets was
good.

In my script I'm using two small integers as keys; the first counts up from one
and sometimes gets to 10k or so, and the second never gets above about 70.
(Some combination of the two index values are never used, and we trim some of
the array contents periodically, so we shouldn't be filling it.)

So, I wrote a loop using embedded C to pass pairs of small integers to hash_iix
and processed the result. The results were disappointing: While the results
varied from run to run because of seeding, in each run the low bits changed
fairly little.

In one run, testing hash_iix(i,j) for i in 0..2000 and j in 0..70, with
MAXMAPENTRIES=500000 (so 19 bits of hash value, table size 524288, more than 3x
the number of entries to supposedly be stored in some associative array), about
47 of the hash values were repeated 35 times each.

I see two particular problems:

First, the hash function in map-gen.c is not a terribly good mixing function.
If the per-key hashes only vary in a few bits, especially if those bit-ranges
overlap between keys, the results won't be well distributed. And with several
keys, it's shifting the earlier results to the left and doing XORs, which means
the first key doesn't contribute at all to the lower bits of the output. (If
the hash table size were prime it would be another matter, but it's a power of
two so we're just masking off the high bits when we do the modulus operation.)

Second, hash_64 isn't a good choice for a hash function if you don't have a lot
of varying input bits, and you need a lot of output bits, and you don't follow
it with a good mixing function.

A hash table size of 524288 should be adequate room for the 140000 possible
combinations.

But hash_64 returns a truncated product of the input with:

   /*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
   #define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL

So asking for 19 bits back from hash_64 discards the low 45 bits of the
product, and the contribution from the 2^51 and higher terms doesn't really
affect the lowest 6 of the remaining bits. Those bits are only influenced by
the "-2^18+1" part of the multiplier. But if the varying low input bits are
fewer than about 26 bits, the product with "-2^18+1" won't vary in high enough
bits in the result to make much difference (if any) after the right shift.

Basically, the low bits coming back from hash_64 don't vary, in this test.

Do this for two keys with small ranges, and combine them poorly, and as I said
above, the XOR preserves the fixed-ness of the low bits.

There are tweaks that could be done to try to better propagate variations in a
few bits throughout the result, but it might be better to borrow work from
people who've already gone down that road, and incorporate code from CityHash
or one of the other fast ones with suitable licensing.

In my case, I've found a workaround: If instead of using "stat[A,B]" (A
counting up from 1, B in 1..70) I use:

  stat[(A * 128 + B) << 26]

... then the distribution is much more even, and the SystemTap probe routine
I'm using is no longer taking half a core's worth of CPU time in the thread
that triggers it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug runtime/19802] bad hash value distribution and horrible performance for large arrays with multiple small-integer indices
  2016-03-10  2:24 [Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices raeburn at permabit dot com
@ 2016-03-10 21:08 ` dsmith at redhat dot com
  2016-03-10 22:50 ` raeburn at permabit dot com
  2016-03-13 22:04 ` fche at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: dsmith at redhat dot com @ 2016-03-10 21:08 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=19802

David Smith <dsmith at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsmith at redhat dot com

--- Comment #1 from David Smith <dsmith at redhat dot com> ---
(In reply to Ken Raeburn from comment #0)
> I see two particular problems:
> 
> First, the hash function in map-gen.c is not a terribly good mixing
> function. If the per-key hashes only vary in a few bits, especially if those
> bit-ranges overlap between keys, the results won't be well distributed. And
> with several keys, it's shifting the earlier results to the left and doing
> XORs, which means the first key doesn't contribute at all to the lower bits
> of the output. (If the hash table size were prime it would be another
> matter, but it's a power of two so we're just masking off the high bits when
> we do the modulus operation.)
> 
> Second, hash_64 isn't a good choice for a hash function if you don't have a
> lot of varying input bits, and you need a lot of output bits, and you don't
> follow it with a good mixing function.

I took a bit of a look at this today. We're using the kernel's hash_{64,32}
function. (We do have a copy of it down src/runtime/dyninst/linux_hash.c, but
that just for the use of "stap --dyninst".)

The advantage the kernel has when using hash_{64,32} on multiple values is that
it knows what data is it is mixing, so it can mix the data semi-intelligently.
Here are some examples from the kernel source:

from drivers/net/ethernet/mellanox/mlx4/en_netdev.c:

====
filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
                   __be16 src_port, __be16 dst_port)
{
        unsigned long l;
        int bucket_idx;

        l = (__force unsigned long)src_port |
            ((__force unsigned long)dst_port << 2);
        l ^= (__force unsigned long)(src_ip ^ dst_ip);

        bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT);
====

from fs/mbcache.c:

====
mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
                   sector_t block)
{
        unsigned int bucket;
        struct hlist_bl_node *l;
        struct mb_cache_entry *ce;
        struct hlist_bl_head *block_hash_p;

        bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
                           cache->c_bucket_bits);
====

In systemtap's case, we don't know if we're going to be mixing small integers
or pointer values, which makes things tricky when it comes to mixing the data

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug runtime/19802] bad hash value distribution and horrible performance for large arrays with multiple small-integer indices
  2016-03-10  2:24 [Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices raeburn at permabit dot com
  2016-03-10 21:08 ` [Bug runtime/19802] " dsmith at redhat dot com
@ 2016-03-10 22:50 ` raeburn at permabit dot com
  2016-03-13 22:04 ` fche at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: raeburn at permabit dot com @ 2016-03-10 22:50 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=19802

--- Comment #2 from Ken Raeburn <raeburn at permabit dot com> ---
> I took a bit of a look at this today. We're using the kernel's hash_{64,32}
> function. (We do have a copy of it down src/runtime/dyninst/linux_hash.c, but
> that just for the use of "stap --dyninst".)
> 
> The advantage the kernel has when using hash_{64,32} on multiple values is that
> it knows what data is it is mixing, so it can mix the data semi-intelligently.

Yes, that’s how I’m working around the problem for now. But you really have to
understand how hash_64 works for your use case.

> In systemtap's case, we don't know if we're going to be mixing small integers
> or pointer values, which makes things tricky when it comes to mixing the data

Right. But having either stage work well would probably save us; it’s the two
working poorly in conjunction that gives a bad outcome in cases like mine.
Really, if the second stage (combining hashes of individual keys) worked really
well, we could probably forego hash_64 and just use the supplied integer value
directly, maybe after XOR with the seed….

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug runtime/19802] bad hash value distribution and horrible performance for large arrays with multiple small-integer indices
  2016-03-10  2:24 [Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices raeburn at permabit dot com
  2016-03-10 21:08 ` [Bug runtime/19802] " dsmith at redhat dot com
  2016-03-10 22:50 ` raeburn at permabit dot com
@ 2016-03-13 22:04 ` fche at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: fche at redhat dot com @ 2016-03-13 22:04 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=19802

Frank Ch. Eigler <fche at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fche at redhat dot com
           Assignee|systemtap at sourceware dot org    |fche at redhat dot com

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-03-13 22:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-10  2:24 [Bug runtime/19802] New: bad hash value distribution and horrible performance for large arrays with multiple small-integer indices raeburn at permabit dot com
2016-03-10 21:08 ` [Bug runtime/19802] " dsmith at redhat dot com
2016-03-10 22:50 ` raeburn at permabit dot com
2016-03-13 22:04 ` fche at redhat dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).