OK it took me 2 hours to make this work, and right now I do not want to
deal with figuring out the rest of it because I don't even have my
compsci AAS and have NO algorithms experience so this is brute force
learning and I am learning that debugging algorithms written by
horrible, horrible first-year compsci students is IRRITATING.

........ frustration vented, let's continue.

(yes this message is REALLY ugly, and confusing, sorry.  My thoughts are
unorganized and I don't have enough data for a good argument here YET.)

A while ago I proposed replacing the symbol strings (what section is
that, .dynstr?) with patricia tries instead of hash chains (which as I
understand are just linked lists jammed in a hash bucket) because of
Drepper's paper about DSOs and common prefixes.


Anyway, the gist of the idea is that I'm assuming the current symbol
string structure is as follows:

[Hash Table]--->[Bucket]-->[Symbol foobar]
                        |->[Symbol foobaz]
                        |->[Symbol foogonk]...

I'm looking at instead

[Hash Table]--->[Bucket]-->[Symbol foo]-->[ba]-->[r]
                                       |      |->[z]
                                       |->[gonk]

Such that when we have common prefixes, at the point where the strings
diverge, we simply have to follow a branch.


I wrote a small test program (attached) that builds a patricia trie and
then dumps it, giving an analysis of the maximum depth (branches taken)
and the longest common prefix.

Well, Drepper's right at least.  libgtk shows 10 branches when I
patricia trie up the symbols, and a maximum common prefix of 35
characters; where with libgtkmm I get 16 branches and the biggest common
prefix it can find is 229 characters.  Landing these in a hash bucket is
wasteful; to beat it I'll need a branch that's less intensive than 3.5
character comparisons... or not.

If we assume there's only 2 entries sharing the common prefix, that's
great.  Except, when we branch, we have AT LEAST two entries sharing
THIS common prefix.  If we branch AGAIN, then we have at least THREE
entries sharing the first prefix, and at least TWO entries sharing the
shorter prefix (at the first branch).  So if we branch TEN TIMES,
there's a prefix shared 10 times, a longer one shared 9 times, a longer
one shared 8 times, ... a longer one shared twice, and then a unique
string.  I have no idea what the implications are though I can guess
there are branches at namespace and class when dealing with C++ symbols.

(the real theoretical minimum is not 35 / 10; it's (1+2+3+4+5+6+7+8+9
+35) / 10 == 80/10 == 8.  It doesn't much matter)

The number of branches is easily reduced by reducing the number of items
in the hash bucket (if there's 5 items, you can only branch in 5 places
and probably won't approach that) and using Patricia Tries in place of
hash chains instead of replacing the whole hash table.  This is probably
a good idea, since the only space-efficient storage methods I can come
up with would be O(n) for n=num_branches operations per branch (each
operation is 1 character comparison*; on top of that is one additional
long addition.)

*(comparing two strings, each character comparison is a compare and a
long addition to iterate in a loop)

So far I do NOT have a way to store this such that the branches are
guaranteed less intensive than 8 character comparisons; and don't have
real numbers on what I need.  I'm still thinking this is probably a win
since we're unlikely to branch 8 times from a common prefix and even if
we do we're suddenly faced with a situation where we CAN assume we're
winning something*; Drepper are you around, I could use your input.

*(every extra direction we can take at a branch represents one more
comparison of however many characters long the prefix is; so if we
branch 8 ways after a prefix of 2 characters, we're spending 8 + 7 + 6 +
5 + 4 + 3 + 2 + 1 to avoid (8 + 7 + 6 + 5 + 4 + 3 + 2 + 1) * 2...
somebody verify this for me, I think I just did something awesome and
that shouldn't be possible)

For those who want to keep trying, my crappy analysis program is
attached.  I used the below command to divine some useful numbers:

readelf -sW /usr/lib/libgtk.so | tail -n +4 | awk '{print $8}' | grep -v
'^$' | ./ptree 


(In case you haven't noticed yet, I trust my intuitions to fill in where
my technical knowledge is lacking; and then try to find out if anyone
else has something solid for me.

-- 
John Moser <john.r.moser@gmail.com>