public inbox for binutils@sourceware.org
* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
       [not found] <BAY22-F18E31CAFD18B61AA2516BCC77D0@phx.gbl>
@ 2006-07-03  6:38 ` djamel anonymous
  2006-07-03 11:33   ` djamel anonymous
  0 siblings, 1 reply; 11+ messages in thread
From: djamel anonymous @ 2006-07-03  6:38 UTC (permalink / raw)
  To: djam8193ah, drepper, michael.meeks; +Cc: jakub, binutils, libc-alpha

Hello.  After the last post on the mailing list I found that my
suggestion was too complicated and suboptimal.  I think that I have
found a simple and efficient method to implement the additional sparse
hash table.  The proposed hash table is of size approximately m*8 bits,
where m is the nearest power of two to nsymbols.
For example, for nsymbols=3071 m will be 2048, and for nsymbols=3072 m
will be 4096.  The index into the hash table will simply be
hash%(m*8), which can be computed quickly because m*8 is a power of
two.
A bit of index x in the hash table will be zero if and only if no
symbol has hash%(m*8)==x.
The size of the hash table will be between 5.3333*nsymbols and
10.6666*nsymbols bits, which is a very small space overhead.
To look for a symbol with hash value x we have to compute i=x%(m*8)
and then check whether bit i is set.  If it is clear then the symbol
is not present; otherwise the search continues in the hash table that
is currently implemented.  I think this suggestion, which in the
literature is called a Bloom filter with parameter k=1, has several
advantages:
1-it has a very small space overhead: between 5.3333*nsymbols bits and
10.6666*nsymbols bits.
2-it is very easy to implement.
3-it will incur very few cache misses: if the search is unsuccessful
there will be only one memory access in approximately 7 cases out of 8.
4-the index into the hash table is very cheap to compute, as it is
done with a binary AND, compared to a real modulo, which is very
expensive; on an Athlon or Pentium it takes between 20 and 40 cycles.
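The filter described above can be sketched in a few lines of C (a
sketch only: the struct and helper names are invented here, and the
bitmap size is supplied by the caller rather than derived from
nsymbols):

```c
#include <assert.h>
#include <stdint.h>

/* The dl_new_hash function quoted elsewhere in this message.  */
static uint_fast32_t
dl_new_hash (const char *s)
{
  uint_fast32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h & 0xffffffff;
}

/* Bloom filter with k=1: a bitmap of m*8 bits, m a power of two,
   so hash%(m*8) is just hash & mask.  */
struct bloom1
{
  unsigned char *bits;   /* m bytes, zero-initialized.  */
  uint32_t mask;         /* m*8 - 1.  */
};

static void
bloom1_add (struct bloom1 *f, const char *sym)
{
  uint32_t i = dl_new_hash (sym) & f->mask;
  f->bits[i >> 3] |= 1 << (i & 7);
}

/* 0 means the symbol is definitely absent; 1 means "maybe present"
   and the search must continue in the normal hash table.  */
static int
bloom1_maybe (const struct bloom1 *f, const char *sym)
{
  uint32_t i = dl_new_hash (sym) & f->mask;
  return (f->bits[i >> 3] >> (i & 7)) & 1;
}
```

With m a power of two, mask is m*8-1 and the modulo reduces to the
single AND the post argues for.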

Although the index into the hash table is simple to compute, the
filter may nevertheless be quite effective, because the lower bits of
the chosen hash function are affected by every character in the
symbol:
/* The proposed hash function (Dan Bernstein's h = h*33 + c), whose
   low bits depend on every character of the name.  */
static uint_fast32_t
dl_new_hash (const char *s)
{
  uint_fast32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h & 0xffffffff;
}
In conclusion, I think that this method will improve the speed of
unsuccessful searches, because it needs to compute a power-of-two
modulo instead of an expensive normal modulo, and because it will do
only one memory access most of the time.  The space overhead will be
around nsymbols bytes (between nsymbols*0.6666 and nsymbols*1.3333).
best regards.


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03  6:38 ` DT_GNU_HASH: ~ 50% dynamic linking improvement djamel anonymous
@ 2006-07-03 11:33   ` djamel anonymous
  2006-07-03 12:01     ` Jakub Jelinek
  2006-07-03 14:46     ` Ulrich Drepper
  0 siblings, 2 replies; 11+ messages in thread
From: djamel anonymous @ 2006-07-03 11:33 UTC (permalink / raw)
  To: drepper, michael.meeks; +Cc: jakub, binutils, libc-alpha

Hello, I am writing again with a small suggestion that may reduce the
space occupied by the hash table and reduce the probability of cache
fills.
My suggestion is to align bucket chains on 8-byte boundaries and to
omit the length field when the chain is of length 1.  This is simply
done by changing the meaning of the buckets array values to:
offset=bucket[hash%nbuckets]&0xfffffffe
which corresponds to the offset ((2 + nbuckets + N)&(-2)) * 4, which
is aligned on an 8-byte boundary.  The interpretation of the chain
will depend on the lowest bit of bucket[hash%nbuckets]:
if ((bucket[hash%nbuckets]&0x1)==1) then the chain is of length 1 and
contains:
symindx hash
if ((bucket[hash%nbuckets]&0x1)==0) then the chain is of variable
length larger than 1 and contains:
symindx0 n hash0 hash1 ....   hashn
so the chains' space usage will be as follows:
1-a chain of length 1 that previously occupied 12 bytes will occupy 8
bytes.
2-a chain of length 2 that previously occupied 16 bytes will occupy 16
bytes.
3-a chain of length 3 that previously occupied 20 bytes will occupy 24
bytes.
4-a chain of length 4 that previously occupied 24 bytes will occupy 24
bytes.
5-a chain of length 5 that previously occupied 28 bytes will occupy 32
bytes.
.
.
If there are many more chains of length 1 than of length 3 or 5, there
will be a significant reduction in space usage.
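A lookup against the proposed encoding might read roughly as follows
(a sketch; the word-indexed layout and the names chain_lookup/symindx
are illustrative, and n is taken here as the count of hash words in a
multi-entry chain):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed chain encoding.  A bucket value with its low
   bit set points (after masking) at a 2-word chain { symindx, hash };
   with the low bit clear it points at { symindx0, n, hash0 .. }.
   Returns the symbol index whose stored hash matches, or -1.
   Offsets here are word indexes into 'table' rather than byte
   offsets, purely to keep the sketch short.  */
static long
chain_lookup (const uint32_t *table, uint32_t bucket_val, uint32_t hash)
{
  const uint32_t *chain = table + (bucket_val & ~1u);
  if (bucket_val & 1)              /* Length-1 chain: symindx, hash.  */
    return chain[1] == hash ? (long) chain[0] : -1;
  uint32_t symindx0 = chain[0];
  uint32_t n = chain[1];           /* Chain length, > 1.  */
  for (uint32_t i = 0; i < n; i++)
    if (chain[2 + i] == hash)
      return (long) (symindx0 + i);
  return -1;
}
```

The low-bit test is what lets a length-1 chain drop its length word
while staying distinguishable from the general case.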

For a chain of length 1 there will never be more than one cache fill,
while previously there was a 12.5% probability of a second cache fill
(assuming a cache line of 64 bytes): the chain causes a second cache
fill if it begins at position 56 or 60 inside the cache line, out of
16 possible positions, hence a probability of 2/16.
For a chain of length 2, similarly, the probability of a second cache
fill is reduced from 18.75% to 12.5%.

In conclusion, if nbuckets is of the same order as nsymbols, the space
usage will be reduced by something near 4 bytes per symbol and the
cache behavior will be better.
I hope my posting is useful.
best regards.


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03 11:33   ` djamel anonymous
@ 2006-07-03 12:01     ` Jakub Jelinek
  2006-07-03 15:07       ` djamel anonymous
  2006-07-03 14:46     ` Ulrich Drepper
  1 sibling, 1 reply; 11+ messages in thread
From: Jakub Jelinek @ 2006-07-03 12:01 UTC (permalink / raw)
  To: djamel anonymous; +Cc: drepper, michael.meeks, binutils, libc-alpha

On Mon, Jul 03, 2006 at 11:33:21AM +0000, djamel anonymous wrote:
> if ((bucket[hash%nbuckets]&0x1)==1) then the chain is of length 1 and 
> contains :
> symindx hash
> if ((bucket[hash%nbuckets]&0x1)==0) then the chain is of variable length 
> larger than 1 and contains :
> symindx0 n hash0 hash1 ....   hashn
> so the chains space usage  will be as follows :
> 1-a chain of length 1 that occupied previously 12 bytes will occupy 8 bytes.
> 2-a chain of length 2 that occupied previously 16 bytes will occupy 16 
> bytes.
> 3-a chain of length 3 that occupied previously 20 bytes will occupy 24 
> bytes.
> 4-a chain of length 4 that occupied previously 24 bytes will occupy 24 
> bytes.
> 5-a chain of length 5 that occupied previously 28 bytes will occupy 32
> bytes.

This will
1) complicate the lookup function
2) optimize for an uncommon case

The linker can size nbuckets as it wishes, of course, but trying hard
to have a huge nbuckets in order to get length-1 chains is a bad idea
with DT_GNU_HASH.  The best chain length is IMHO something like 2-6
entries; certainly it should fit into a cache line, but a chain of
length 1 is just a waste of space in the first part of the .gnu.hash
section.
The linker can easily align the whole .gnu.hash section on, say, 64
bytes and reorder the chains so that it minimizes the number of chains
crossing a 64B boundary (assuming we say 64B is the common cache line
length), and it can do so without making the section very complicated.

Say for libstdc++.so.6 we have currently:
libstdc++.so.6.0.8

Histogram for bucket list length (total of 1013 buckets):
 Length  Number     % of total  Coverage
      0  36         (  3.6%)
      1  134        ( 13.2%)      4.1%
      2  196        ( 19.3%)     16.1%
      3  231        ( 22.8%)     37.3%
      4  194        ( 19.2%)     61.0%
      5  121        ( 11.9%)     79.6%
      6  60         (  5.9%)     90.6%
      7  27         (  2.7%)     96.4%
      8  7          (  0.7%)     98.1%
      9  7          (  0.7%)    100.0%

Histogram for `.gnu.hash' bucket list length (total of 1022 buckets):
 Length  Number     % of total  Coverage
      0  52         (  5.1%)
      1  131        ( 12.8%)      4.1%
      2  225        ( 22.0%)     18.4%
      3  233        ( 22.8%)     40.5%
      4  171        ( 16.7%)     62.2%
      5  115        ( 11.3%)     80.4%
      6  64         (  6.3%)     92.5%
      7  18         (  1.8%)     96.5%
      8  8          (  0.8%)     98.5%
      9  4          (  0.4%)     99.7%
     10  1          (  0.1%)    100.0%

For libc-2.4.90.so:

Histogram for bucket list length (total of 1023 buckets):
 Length  Number     % of total  Coverage
      0  117        ( 11.4%)
      1  311        ( 30.4%)     15.1%
      2  257        ( 25.1%)     40.1%
      3  189        ( 18.5%)     67.6%
      4  97         (  9.5%)     86.5%
      5  38         (  3.7%)     95.7%
      6  10         (  1.0%)     98.6%
      7  4          (  0.4%)    100.0%

Histogram for `.gnu.hash' bucket list length (total of 1011 buckets):
 Length  Number     % of total  Coverage
      0  124        ( 12.3%)
      1  254        ( 25.1%)     12.4%
      2  296        ( 29.3%)     41.2%
      3  198        ( 19.6%)     70.2%
      4  98         (  9.7%)     89.3%
      5  29         (  2.9%)     96.4%
      6  10         (  1.0%)     99.3%
      7  2          (  0.2%)    100.0%

	Jakub


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03 11:33   ` djamel anonymous
  2006-07-03 12:01     ` Jakub Jelinek
@ 2006-07-03 14:46     ` Ulrich Drepper
  2006-07-03 17:30       ` djamel anonymous
  1 sibling, 1 reply; 11+ messages in thread
From: Ulrich Drepper @ 2006-07-03 14:46 UTC (permalink / raw)
  To: djamel anonymous; +Cc: jakub, binutils, libc-alpha


djamel anonymous wrote:
> Hello, i am writing again for a small suggestion that may reduce the
> space occupied by the hash
> table and reduce cache fills probability.

I see absolutely no benefit at all in any of this.  The second hash
table with the required second memory load would kill all the
performance advantage.  Its sole purpose would be to avoid comparing
the chain list elements at all.  But:

- it has to be pessimistic.  I.e., for the first table there are also
  collisions, and when they happen the search must still be made

- we limit the hash chain to a single cache line whenever possible,
  and this is, as I showed before, almost always the case.  The
  operations all work on the L1 cache, and even 8 or 10 memory loads
  are cheaper than loading another cache line.  And that number of
  loads is just worst-case behavior; usually there are fewer.

- each element in your second hash table must point to the beginning of
  a chain.  So here the huge overhead in the hash table sizing is
  costing a lot.  Furthermore, the division is nowadays no problem at
  all anymore.  It's fast.  If you really care, modify your linker to
  size the hash table with a power of two and change the dynamic linker
  to optimize that case.  It's easy enough to do.  The linker already
  has a function to size the table more intelligently if requested.
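The power-of-two sizing Drepper suggests comes down to replacing the
division with a mask in the dynamic linker's bucket computation; a
sketch (bucket_index is an invented name, not from any patch):

```c
#include <assert.h>
#include <stdint.h>

/* With a power-of-two bucket count the expensive division can be
   replaced by a mask; the two branches below are equivalent whenever
   nbuckets is a power of two.  */
static uint32_t
bucket_index (uint32_t hash, uint32_t nbuckets)
{
  if ((nbuckets & (nbuckets - 1)) == 0)   /* Power of two?  */
    return hash & (nbuckets - 1);         /* Single AND.  */
  return hash % nbuckets;                 /* General case: a divide.  */
}
```

The linker would emit a power-of-two nbuckets and the dynamic linker
would take the cheap branch.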

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03 12:01     ` Jakub Jelinek
@ 2006-07-03 15:07       ` djamel anonymous
  0 siblings, 0 replies; 11+ messages in thread
From: djamel anonymous @ 2006-07-03 15:07 UTC (permalink / raw)
  To: jakub; +Cc: drepper, michael.meeks, binutils

Hello, I am writing again with a potential speed improvement.  The
space used for the hash table can be reduced by almost nchains words
by using the least significant bit of the hash value stored in the
chains as a chain terminator; this is possible because of the fact
that:
(hash1%nbuckets==hash2%nbuckets and (hash1/2)==(hash2/2)) => hash1==hash2
for every nbuckets>=2.
This space improvement could be turned into speed in various ways; I
suggest 3 different ones:
1-increase nbuckets by 50%: assuming almost every bucket points to a
chain, increasing the number of buckets by 50% and deleting the length
word from each chain will result in the same space usage.
2-use a second sparse hash table as suggested in my previous post:
this sparse hash table will have an average size of nsymbols bytes,
which is exactly the minimal size of the buckets array (nsymbols/4)*4
bytes, and hence there will be a net space gain compared to the
previous implementation.
3-associate an additional word with each bucket; this word will avoid
accessing the chain at least 7/8 of the time.  To apply this solution
the bucket index will be obtained by bucket=(hash%(nbuckets*32))/32;
inside the additional word, bit i will be set if and only if an
element in the chain exists with (hash%(nbuckets*32))%32 equal to i.
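Suggestion 3 can be sketched as follows (names invented here; the
filter array holds one extra word per bucket):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of suggestion 3: each bucket carries an extra filter word.
   Bit i of filter[b] is set iff some symbol in bucket b has
   (hash % (nbuckets * 32)) % 32 == i, so a clear bit rejects the
   lookup with a single memory access and no division.  */
static void
filter_add (uint32_t *filter, uint32_t nbuckets, uint32_t hash)
{
  uint32_t x = hash % (nbuckets * 32);
  filter[x / 32] |= (uint32_t) 1 << (x % 32);
}

static int
filter_maybe (const uint32_t *filter, uint32_t nbuckets, uint32_t hash)
{
  uint32_t x = hash % (nbuckets * 32);
  return (filter[x / 32] >> (x % 32)) & 1;
}
```

A set bit only says "maybe present", so the chain is still searched in
that case; a clear bit ends the lookup immediately.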

The first solution seems the least attractive because it will most
likely give a modest improvement.  The second solution avoids the
expensive modulo-nbuckets operation, but it seems unlikely to be
accepted because it adds a second hash table.
I hope this post will be useful.
best regards


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03 14:46     ` Ulrich Drepper
@ 2006-07-03 17:30       ` djamel anonymous
  2006-07-03 17:48         ` Ulrich Drepper
  0 siblings, 1 reply; 11+ messages in thread
From: djamel anonymous @ 2006-07-03 17:30 UTC (permalink / raw)
  To: drepper; +Cc: jakub, binutils, libc-alpha

First, thank you for replying to my post.

>I see absolutely no benefit at all in any of this.  The second hash
>table with the required second memory load would kill all the
>performance advantage.  It's sole purpose would be to not compare the
>chain list elements at all.  But:
>
>- it has to be pessimistic.  I.e., for the first table there are also
>   collisions and when they happen, the search must be made
>
In fact this second hash table plays the role of a filter in front of
the original table; a search will be done in the original table if and
only if the corresponding bit is set in the new hash table.

>- each element in your second hash table must point to the beginning of
>   a chain.  So here the huge overhead in the hash table sizing is
>   costing a lot.  Furthermore, the division is nowadays no problem at
This new additional hash table will occupy on average nsymbols*8 bits,
which is nsymbols bytes; this is because we need only 1 bit per
bucket.  If this bit is set we continue the search in the original
table; otherwise the search stops after only one memory access and
without doing an actual division.

>   costing a lot.  Furthermore, the division is nowadays no problem at
>   all anymore.  It's fast.  If you really care, modify your linker to
>   size the hash table with a power of two and change the dynamic linker
>   to optimize that case.  It's easy enough to do.  The linker already
>   has a function to size the table more intelligently if requested.

The second hash table is in fact smaller than the original table,
which is why its size matters less than the original one's.  The
search in the original table will be done if and only if the search in
the new hash table is successful, and only then will an actual
division be done to compute the bucket number in the original table.
But this division will have much less impact on performance because it
happens only in the unlikely case of a successful search in the first
hash table.
best regards.


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-07-03 17:30       ` djamel anonymous
@ 2006-07-03 17:48         ` Ulrich Drepper
  0 siblings, 0 replies; 11+ messages in thread
From: Ulrich Drepper @ 2006-07-03 17:48 UTC (permalink / raw)
  To: djamel anonymous; +Cc: jakub, binutils, libc-alpha


djamel anonymous wrote:
> this new additional hash table will occupy on average nsymbols*8 bits
> which is nsymbols bytes; this is because we need only 1 bit per bucket;

It doesn't matter that the second table is small.  The CPU has to read
the entire cache line.  And if we have a conflict (which is likely) we
have to read the second hash table.  That's what's killing the
performance.  L2 access is about 5 times slower than L1.  And since
it's unlikely you hit L2 in the cases where it really counts (i.e.,
big programs with many DSOs and many symbols), you more likely hit
main memory.  And here the factor is something like 80 (in ideal
conditions).  So, the most important optimization is to reduce the
number of cache lines needed.

Prefiltering is only an advantage if using the second table is more
expensive.  It isn't here, as I explained.  All the entries of the
hash chain are most likely in one cache line and therefore, unless we
have a full hash value match, we have to read one cache line.  With
your method you can get by with reading only one cache line only if
there is no hash collision, i.e., no collision for any of the symbols
used, defined or undefined in the specific DSO.  That's not likely for
any reasonable hash table size.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-06-29 19:39 ` Michael Meeks
  2006-06-29 21:52   ` Jakub Jelinek
  2006-06-30 23:55   ` Ulrich Drepper
@ 2006-07-03  9:26   ` Jakub Jelinek
  2 siblings, 0 replies; 11+ messages in thread
From: Jakub Jelinek @ 2006-07-03  9:26 UTC (permalink / raw)
  To: Michael Meeks; +Cc: binutils, libc-alpha, Ulrich Drepper

On Thu, Jun 29, 2006 at 07:27:06PM +0100, Michael Meeks wrote:
> 	1. Extensibility
> 		+ I want to see a multiply per lookup.
> 		+ ie. since the L2 cache misses are what hammer us -
> 		  if [sic] we want to extend the .chain -Bdirect
> 		  support in future - being able to put that in
> 		  the same record is really important.
> 		+ a 'record size' field gives us that with ~no
> 		  bloat as of now.
> 		+ [ and I really think some form of direct linking
> 		  is the future - but we can discuss that later 
> 		  clearly ]
> 		+ ie. having an extensibility story that is better
> 		  than 'just add a new section [ somewhere miles
> 		  away in memory/not-in-cache ] is not so fun.

On the other side, it is IMHO much better to keep using one section
for one purpose, as ELF has always been doing.  So, use a hash section
for mapping a symbol name to a set of indexes of dynamic symbol
definition candidates, not less, not more.  That's what DT_HASH has
been doing and DT_GNU_HASH is doing too.  Direct linking is mainly
useful for OSes where there is one entity controlling the whole thing;
in the Open Source world it can IMHO work only in very limited cases,
for programs which have the vast majority of dependent libraries
coming from the same source, with the same schedule, with bugfixes
leading to shipping a new version of the whole thing.  If the
libraries on the other hand are updated independently, without one
body controlling them all, symbols keep being added and removed (for C
the latter would be an ABI break, but for C++ it often happens just
when you change the internal implementation of some function/method
and it no longer references some function/method, which suddenly does
not need to be compiled in), and direct linking will just lead to
horrible ODR and pointer equality violations that programs will sooner
or later break on.  So for direct linking, which certainly won't be
used for all programs, but perhaps just a very limited set, you really
should use a separate section.

> 		+ Furthermore - by inspection it can be seen that
> 		  1/2^32 ~=== 1/2^30 [1]
> 			+ using this theorem, we can easily steal
> 			  1/2 bits from the top of the stored hash
> 			  and use them as chain terminators.
> 			+ ie. lets have:
> 
> .hash[symhash % nbuckets] -> .chain[5]
> 
> .chain[5] -> [ (not-end) | hashval[5] & 0x3fff ]
> .chain[6] -> [ (not-end) | hashval[6] & 0x3fff ]
> .chain[7] -> [ (is-end)  | hashval[7] & 0x3fff ]
> 
> 			+ that saves us several bytes per chain,
> 			  and cf. above, no measurable loss in
> 			  performance. [ of course chain[5] 
> 			  <-> .dynsym[5] here as before ]
> 
> 		+ in fact, for extensibility a per-shlib hash
> 		  comparison 'mask' field would make for a rather
> 		  better behaviour - so we can snarf bits from
> 		  here in future if we come up with brighter ideas.

Applications keep growing every year, and while I currently counted
~ 500000 different symbol names, that can soon double, triple and keep
increasing.  So, while we have ATM just a few collisions, there can be
more and more in the future.  And handling those is expensive.
All the linker can do about this is try to keep all the chains short
enough to fit into cache lines.

	Jakub


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-06-29 19:39 ` Michael Meeks
  2006-06-29 21:52   ` Jakub Jelinek
@ 2006-06-30 23:55   ` Ulrich Drepper
  2006-07-03  9:26   ` Jakub Jelinek
  2 siblings, 0 replies; 11+ messages in thread
From: Ulrich Drepper @ 2006-06-30 23:55 UTC (permalink / raw)
  To: michael.meeks; +Cc: Jakub Jelinek, binutils, libc-alpha


Michael Meeks wrote:
> and here was I thinking I was going to have to write a lengthy
> paper to try to convince Ulrich :-) You made my month (really).

What are you talking about?  I never said that I'm opposed to changing
the hash table.  I'm not going to agree to direct bindings, that's all,
and you constantly insisted on bringing that nonsense up.


> 	Anyhow - I've been thinking about various other optimisations, many of
> which we can easily & neatly fit into this nice framework, I've numbered
> them to help chewing them over:
> 
> 	1. Extensibility
> 		+ I want to see a multiply per lookup.
> 		+ ie. since the L2 cache misses are what hammer us -
> 		  if [sic] we want to extend the .chain -Bdirect
> 		  support in future - being able to put that in
> 		  the same record is really important.
> 		+ a 'record size' field gives us that with ~no
> 		  bloat as of now.

You absolutely cannot add direct bindings as an optional feature.
Period.  A binary using direct bindings can and often will have
different effective bindings than a binary which uses normal lookup.
It's completely impossible.  A program can either be compiled and even
_designed_ for direct binding or not.

Adding support in any form for such an incompatible feature to the
hash table, which is purely an optional extension, is unacceptable.


> 		+ Since we can sort (almost all) of .dynsym -
> 		  there is no problem ensuring that ~all symbols
> 		  that we don't want in .hash occur at the end
> 		  of the .dynsym table [ cf. the next UNDEF point ].

This can only possibly work if you have spare bits...


> 		+ Furthermore - by inspection it can be seen that
> 		  1/2^32 ~=== 1/2^30 [1]
> 			+ using this theorem, we can easily steal
> 			  1/2 bits from the top of the stored hash
> 			  and use them as chain terminators.

... and this is nonsense.  You cannot compute the probabilities by
looking at the complete set of symbols.  You have to determine which
kinds of symbols have the same hash value after the change and then
make sure the probability of having those is the same as in the
complete set.  This (most) likely is not the case.

Instead what will happen is that the symbols which are used together
in a program have a higher likelihood of colliding (common prefixes
etc).  If even the reduction from 32 to 31 bits produces on the full
set a factor of >2 more duplicates (Jakub tested it), this number is
likely higher in the real world.

On the other hand, what is gained?  You already said we have rather
short hash chains.  Most likely not longer than 5 or 6.  I can count
the number of DSOs with longer chains on my hands, and there is not
more than one longer chain in each DSO.  Now do the math: even
assuming an even distribution of the chains, we have on average 6
chain elements per cache line.  If the linker sorts them into cache
lines we can fit in 14.  That's enough even for the worst chain
length.

So, on the one hand you save one 4-byte word which in almost no case
will cause an additional cache line to be loaded.  And even if it
would, the cache lines are consecutive and the processor's prefetching
takes care of it.  On the other hand you more than double the
likelihood of having to compare strings.  This is IMO reason enough to
not change the format.


> 	3. What we hash:
> 		+ it is incredible to me that UNDEF symbols are

You didn't read what Jakub wrote.  We don't hash UNDEF records in the
new format.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-06-29 19:39 ` Michael Meeks
@ 2006-06-29 21:52   ` Jakub Jelinek
  2006-06-30 23:55   ` Ulrich Drepper
  2006-07-03  9:26   ` Jakub Jelinek
  2 siblings, 0 replies; 11+ messages in thread
From: Jakub Jelinek @ 2006-06-29 21:52 UTC (permalink / raw)
  To: Michael Meeks; +Cc: binutils, libc-alpha, Ulrich Drepper

On Thu, Jun 29, 2006 at 07:27:06PM +0100, Michael Meeks wrote:
> 	3. What we hash:
> 		+ it is incredible to me that UNDEF symbols are
> 		  included in the symbol hash; I see no legitimate
> 		  use for that [ eg. 'free' is an expensive hit in the
> 		  hash for virtually every app ;-]

More comments later, but as you can see in the patch, undefined
symbols are never added to .gnu.hash chains, only defined ones
(and, as I was lazy, also some of the undefined PLT symbols [*]).

That's what the:
+  /* Ignore also local symbols and undefined symbols.  */
+  if (h->forced_local
+      || h->root.type == bfd_link_hash_undefined
+      || h->root.type == bfd_link_hash_undefweak
+      || ((h->root.type == bfd_link_hash_defined
+          || h->root.type == bfd_link_hash_defweak)
+         && h->root.u.def.section->output_section == NULL))
+    return TRUE;
in elf_collect_gnu_hash_codes and elf_renumber_gnu_hash_syms
does.  After sorting, .dynsym contains first the special symbols
(0, STT_SECTION, STB_LOCAL), then SHN_UNDEF symbols, and only
after those the defined ones, sorted by the hash % nbuckets
value.

[*] SHN_UNDEF symbols with st_value != 0 need to be in .gnu.hash
(and are there), as they are used when resolving relocations that
take address of functions.  These are at that point still
bfd_link_hash_defined with non-NULL output_section and only
during dynamic symbol finalization are turned into st_shndx SHN_UNDEF.
But, on i386/x86_64 we have an optimization where if the binary
never takes address of some function, st_value on its SHN_UNDEF
symbol is cleared (see e.g. elf64_x86_64_finish_dynamic_symbol
          /* Mark the symbol as undefined, rather than as defined in
             the .plt section.  Leave the value if there were any
             relocations where pointer equality matters (this is a clue
             for the dynamic linker, to make function pointer
             comparisons work between an application and shared
             library), otherwise set it to zero.  If a function is only
             called from a binary, there is no need to slow down
             shared libraries because of that.  */
          sym->st_shndx = SHN_UNDEF;
          if (!h->pointer_equality_needed)
            sym->st_value = 0;
).  Such symbols could be left out of .gnu.hash too, assuming we
add some target hook, as pointer_equality_needed bit is target specific.

	Jakub


* Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
  2006-06-28 17:21 [PATCH] " Jakub Jelinek
@ 2006-06-29 19:39 ` Michael Meeks
  2006-06-29 21:52   ` Jakub Jelinek
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Michael Meeks @ 2006-06-29 19:39 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: binutils, libc-alpha, Ulrich Drepper

Hi Jakub,

On Wed, 2006-06-28 at 19:09 +0200, Jakub Jelinek wrote:
> The following patches introduce an optional ELF hash section

	Dudie; you rock ! - thanks for taking this on board; this is really
nice :-) and here was I thinking I was going to have to write a lengthy
paper to try to convince Ulrich :-) You made my month (really).

> The initial design was done by Ulrich Drepper and was discussed with
> Michael Meeks a few months ago as well.  But nothing came off of these
> discussions and the proposed approach is quite different.

	This looks rather like -zdynsort + -zhash-vals prototype; but I'm too
pleased to quibble :-)

	Anyhow - I've been thinking about various other optimisations, many of
which we can easily & neatly fit into this nice framework, I've numbered
them to help chewing them over:

	1. Extensibility
		+ I want to see a multiply per lookup.
		+ ie. since the L2 cache misses are what hammer us -
		  if [sic] we want to extend the .chain -Bdirect
		  support in future - being able to put that in
		  the same record is really important.
		+ a 'record size' field gives us that with ~no
		  bloat as of now.
		+ [ and I really think some form of direct linking
		  is the future - but we can discuss that later 
		  clearly ]
		+ ie. having an extensibility story that is better
		  than 'just add a new section [ somewhere miles
		  away in memory/not-in-cache ] is not so fun.

	2. chain walking:
		+ the 'not having a .chain <-> .dynsym offset 
		  equivalence is an interesting innovation over my
		  approach; having said that I'm not sure why
		  you did that. Most chains are rather short - and
		  adding a set of bytes that are almost always
		  length 1/2/3 seems strange; may I suggest a
		  better approach ?
		+ Since we can sort (almost all) of .dynsym -
		  there is no problem ensuring that ~all symbols
		  that we don't want in .hash occur at the end
		  of the .dynsym table [ cf. the next UNDEF point ].
		+ Furthermore - by inspection it can be seen that
		  1/2^32 ~=== 1/2^30 [1]
			+ using this theorem, we can easily steal
			  1/2 bits from the top of the stored hash
			  and use them as chain terminators.
			+ ie. lets have:

.hash[symhash % nbuckets] -> .chain[5]

.chain[5] -> [ (not-end) | hashval[5] & 0x3fff ]
.chain[6] -> [ (not-end) | hashval[6] & 0x3fff ]
.chain[7] -> [ (is-end)  | hashval[7] & 0x3fff ]

			+ that saves us several bytes per chain,
			  and cf. above, no measurable loss in
			  performance. [ of course chain[5] 
			  <-> .dynsym[5] here as before ]

		+ in fact, for extensibility a per-shlib hash
		  comparison 'mask' field would make for a rather
		  better behaviour - so we can snarf bits from
		  here in future if we come up with brighter ideas.
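The chain walk with a stolen terminator bit might look like this (a
sketch only; it steals a single top bit rather than the 1/2 bits
discussed, and all names and masks here are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the chain-terminator idea: the top bit of each stored
   hash word marks the end of the chain, so no length field is needed
   and .chain[i] stays parallel to .dynsym[i].  The mask keeps 31 bits
   of the hash for comparison.  */
#define IS_END  0x80000000u
#define H_MASK  0x7fffffffu

/* Walk the chain starting at index 'start'; return the .dynsym index
   whose stored hash matches, or -1 if the terminator is hit first.  */
static long
chain_find (const uint32_t *chain, uint32_t start, uint32_t hash)
{
  for (uint32_t i = start; ; i++)
    {
      if ((chain[i] & H_MASK) == (hash & H_MASK))
        return (long) i;
      if (chain[i] & IS_END)
        return -1;
    }
}
```

A per-shlib comparison mask, as suggested above, would replace the
hard-coded H_MASK with a field read from the section header.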

	3. What we hash:
		+ it is incredible to me that UNDEF symbols are
		  included in the symbol hash; I see no legitimate
		  use for that [ eg. 'free' is an expensive hit in the
		  hash for virtually every app ;-]
		+ with 2 bits - a 'chain terminator' and a
		  'defined terminator' - if we don't bin the UNDEFs
		  from the hash completely we can at least sort
		  them to the end of the chain and often avoid
		  comparing them - [ currently they would be
		  filtered out only in check_match after a .dynsym
		  check for "if ((sym->st_value == 0 /* No value.  */...
		      || (type_class & (sym->st_shndx == SHN_UNDEF)))"
		+ This may seem an insignificant win: most
		  'normal' libs I've seen only import 1/3rd
		  of their symbols
			+ OTOH - C++ 'plugins' [ which we can't
			  load RTLD_LOCAL because of vague-linkage
			  symbols ] tend to have map files /
			  visibility markup to reduce their exposed
			  symbol count substantially, eg.
			+ as an example 'writer' lives in 'libsw' using
			  objdump -T libsw680li.so  | grep UND | wc -l:
				Undefined: 6465
				Defined:   3580
			+ ie. for every 'defined' symbol that someone 
			  might legitimately want to look up we have
			  ~2 other symbols that they just -cannot- want
			  to compare.
		+ ie. in some cases by dropping UNDEF symbols, we can
		  have our new / larger .chain section and still save 
		  space [ on disk & in cache ].
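
	The 'sort the UNDEFs to the end' idea amounts to a partition over
.dynsym, after which only the defined prefix needs .hash/.chain entries;
a sketch with a simplified stand-in for the symbol record (struct and
function names here are illustrative, not real ELF types):

```c
#include <stdlib.h>

#define SHN_UNDEF 0              /* as in <elf.h> */

/* Simplified stand-in for an Elf32_Sym entry. */
struct dsym {
    const char *name;
    unsigned shndx;              /* SHN_UNDEF for imported symbols */
};

/* Defined symbols sort before undefined ones. */
static int defined_first(const void *a, const void *b)
{
    int ua = ((const struct dsym *)a)->shndx == SHN_UNDEF;
    int ub = ((const struct dsym *)b)->shndx == SHN_UNDEF;
    return ua - ub;
}

/* Partition syms[] and return the count of defined symbols;
   only syms[0..ndef-1] would then be entered into the hash. */
static size_t partition_dynsym(struct dsym *syms, size_t n)
{
    size_t ndef = 0;
    qsort(syms, n, sizeof *syms, defined_first);
    while (ndef < n && syms[ndef].shndx != SHN_UNDEF)
        ndef++;
    return ndef;
}
```

	For the libsw numbers above that would shrink the hashed set from
~10k symbols to ~3.5k before any other change.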

	So - some comments on your patch. I notice you don't sort .dynstr as I
did ;-) that simplifies life clearly; of course - I have devious reasons
for wanting that. Wrt. -Bdirect if we sort our relocations right and use
-Bdirect, we can turn linking into a linear walk of the library
list, .hash, .chain, .dynsym, and .dynstr [ if we play our cards
right ;-]; but that's for another day [ I'd just love
to make linking completely disappear from the L2 cache miss profile
(when using C) - I think it can be done ;-].

	Reading the patch - I really like it :-) my calculation logic to look
up the hash stuff in a separate section was *really* ugly and evil, and
would have required some re-factoring to make it at all pleasant, your
chain walking code is sweet and pleasant to the eye etc. :-)

	Anyhow - we can build some other rather nice optimisations on top of
this I think; and if we can do some of the above we can save space as
well for a pure gnu-hash system I think. My only concern really is with
extensibility.

	Thanks so much for doing this - of course, as cache sizes increase this
may appear to be less useful[2]; but the faster we can make the linker,
the more we can split up OO.o into more libs, lazy load more,
componentize more and (hopefully) save size/time on startup & at
runtime.

	Regards,

		Michael.

[1] - so your paragraph on the hash:

	"530000 unique dynamic symbols ... Dan Bernstein's hash ... fewest
collisions (only 29) ... The standard SysV ELF hash function gave bad
results ... 1726 collisions"

	Is interesting - and I'm sure the hash you propose is great - but of
course, with the hash comparison the reduction in the number of L2 cache
misses is so great that the 'quality' of the hash at this level is
rather uninteresting wrt. the optimisation [ IMHO etc. ].

	ie. 29/530000 ~= 1726/530000, well nearly anyway ;-) focusing on a
0.005% hit vs. 0.3% is interesting but not worth fighting over. More
interestingly can we pick a few bits to steal and only degrade that by a
factor of 4 or so ? ie. 0.005% vs. 0.02% ;-)
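
	For reference, 'Dan Bernstein's hash' mentioned in that paragraph
is the well-known djb2 function, h = h * 33 + c seeded with 5381 (the
hash DT_GNU_HASH eventually shipped with is a close variant of it):

```c
#include <stdint.h>

/* Dan Bernstein's string hash: h = h * 33 + c, seeded with 5381. */
static uint32_t djb2_hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}
```

	It is also cheap: h * 33 is just (h << 5) + h, so the per-character
cost is a shift and two adds - no division anywhere except the final
bucket index.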

[2] - the Core Duo here with its 2MB cache is amazingly faster for OO.o
linking, but clearly at smaller sizes there is this appalling cliff to
drop-off wrt. linking performance.

-- 
 michael.meeks@novell.com  <><, Pseudo Engineer, itinerant idiot




     [not found] <BAY22-F18E31CAFD18B61AA2516BCC77D0@phx.gbl>
2006-07-03  6:38 ` DT_GNU_HASH: ~ 50% dynamic linking improvement djamel anonymous
2006-07-03 11:33   ` djamel anonymous
2006-07-03 12:01     ` Jakub Jelinek
2006-07-03 15:07       ` djamel anonymous
2006-07-03 14:46     ` Ulrich Drepper
2006-07-03 17:30       ` djamel anonymous
2006-07-03 17:48         ` Ulrich Drepper
2006-06-28 17:21 [PATCH] " Jakub Jelinek
2006-06-29 19:39 ` Michael Meeks
2006-06-29 21:52   ` Jakub Jelinek
2006-06-30 23:55   ` Ulrich Drepper
2006-07-03  9:26   ` Jakub Jelinek
