Hi!

The following patches introduce an optional ELF hash section
replacement, optimized for speed and data cache accesses, which in our
tests improves dynamic linking by about 50%.  Prelinking of course
eliminates the costs altogether but if prelinked apps use dlopen they
benefit for this portion.  This is incidently where some apps today
incur high costs today.

The initial design was done by Ulrich Drepper and was discussed with
Michael Meeks a few months ago as well.  But nothing came off of these
discussions and the proposed approach is quite different.

The binutils patch adds a new option to ld, --hash-style, which allows
selection of which hash sections to emit.  ld can either emit the old
style SHT_HASH section, or the new style SHT_GNU_HASH, or both (and
perhaps in the future there could be a mode in which it would emit
both, but SHT_HASH for slow compatibility only (minimal number of
buckets)).

The .gnu.hash section uses always sh_entsize 4 (so doesn't repeat the
historic mistakes on Alpha/s390x with .hash).  The first word there is
nbuckets like in .hash section, followed by nbuckets offsets into the
chains area of the new section.  If the offset is 0xffffffff, it means
there are no defined symbol names with hash % nbuckets equal to the
offset's position.  Otherwise, offset N means the corresponding chain
starts at offset (1 + nbuckets + N) * 4 into .gnu.hash section.  The
chain are does not mirror in size the symbol table anymore.

Each chain starts with a symbol index word, followed by chain length
word and then length times hash value of the corresponding symbol
name.  If DT_GNU_HASH is present, .dynsym must be sorted, so that all
symbols with the same name hash value % nbuckets are grouped together.
So, if some chain contains

symindx0 3 hash0 hash1 hash2

then dynamic symbol symindx0 has name hash hash0, symindx0+1 has hash1 and
symindx0+2 has hash2 and (hash0%nbuckets)==(hash1%nbuckets)==(hash2%nbuckets).

The hash function used is

static uint_fast32_t
dl_new_hash (const char *s)
{
  uint_fast32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h & 0xffffffff;
}

(Dan Bernstein's string hash function posted eons ago on comp.lang.c.)

For an unsuccessful lookup (the usual case), the dynamic linker has to
read the buckets[hash % nbuckets] (one cache line) and then go through
the chain if it is not 0xffffffff, comparing each hashN with hash and
as the hashN values are 32-bit and the hashing function spreads the
strings well, it is very likely that if the hash is equal, then we
have found the symbol (and just need to verify .dynsym/.dynstr), if it
is different from all hashN values in the chain, ld.so can go on with
another shared library.  Usually the whole chain will be in one cache
line or at most 2 cache lines.

We have tested a bunch of different hash functions on the set of ~
530000 unique dynamic symbols in one Linux installation and Dan
Bernstein's hash had the fewest collisions (only 29), with very short
maximum common prefixes for the colliding symbols (just 3 chars) and
is also very cheap to compute (h * 33 is (h << 5) + h), also both ld
-O1 and non-optimizing ld hash sizing gave good results with that hash
function.  The standard SysV ELF hash function gave bad results, on
the same set of symbols had 1726 collisions and some of the colliding
symbols had very long common name prefixes.  One of the reasons could
be that SysV ELF hash function is 28 bit, while Bernstein's is 32 bit.

Attached stats file shows the biggest benchmark we used (a proglet
that attempts to do all dynamic linking OpenOffice.org 2.0.3
swriter.bin does), most of the libraries (except about 8 libs from the
total of 146 libraries used by the testcase) were relinked with
-Wl,--hash-style=both and glibc on top of the DT_GNU_HASH support
patch had also a hack, where if LD_X=1 was in environment, it would
pretend DT_GNU_HASH is not present to allow easier benchmarking.  The
LD_X support will not be in the final glibc patch.

Ok for trunk?

	Jakub