Patricia Trie pseudo-test

public inbox for binutils@sourceware.org
 help / color / mirror / Atom feed

* Patricia Trie pseudo-test
@ 2006-08-22  7:34 John Moser
  2006-08-22 11:45 ` John Moser
  0 siblings, 1 reply; 2+ messages in thread
From: John Moser @ 2006-08-22  7:34 UTC (permalink / raw)
  To: libc-alpha, binutils

[-- Attachment #1.1: Type: text/plain, Size: 4582 bytes --]

OK it took me 2 hours to make this work, and right now I do not want to
deal with figuring out the rest of it because I don't even have my
compsci AAS and have NO algorithms experience so this is brute force
learning and I am learning that debugging algorithms written by
horrible, horrible first-year compsci students is IRRITATING.

........ frustration vented, let's continue.

(yes this message is REALLY ugly, and confusing, sorry.  My thoughts are
unorganized and I don't have enough data for a good argument here YET.)

A while ago I proposed replacing the symbol strings (what section is
that, .dynstr?) with patricia tries instead of hash chains (which as I
understand are just linked lists jammed in a hash bucket) because of
Drepper's paper about DSOs and common prefixes.

Anyway, the gist of the idea is that I'm assuming the current symbol
string structure is as follows:

[Hash Table]--->[Bucket]-->[Symbol foobar]
                        |->[Symbol foobaz]
                        |->[Symbol foogonk]...

I'm looking at instead

[Hash Table]--->[Bucket]-->[Symbol foo]-->[ba]-->[r]
                                       |      |->[z]
                                       |->[gonk]

Such that when we have common prefixes, at the point where the strings
diverge, we simply have to follow a branch.

I wrote a small test program (attached) that builds a patricia trie and
then dumps it, giving an analysis of the maximum depth (branches taken)
and the longest common prefix.

Well, Drepper's right at least.  libgtk shows 10 branches when I
patricia trie up the symbols, and a maximum common prefix of 35
characters; where with libgtkmm I get 16 branches and the biggest common
prefix it can find is 229 characters.  Landing these in a hash bucket is
wasteful; to beat it I'll need a branch that's less intensive than 3.5
character comparisons... or not.

If we assume there's only 2 entries sharing the common prefix, that's
great.  Except, when we branch, we have AT LEAST two entries sharing
THIS common prefix.  If we branch AGAIN, then we have at least THREE
entries sharing the first prefix, and at least TWO entries sharing the
shorter prefix (at the first branch).  So if we branch TEN TIMES,
there's a prefix shared 10 times, a longer one shared 9 times, a longer
one shared 8 times, ... a longer one shared twice, and then a unique
string.  I have no idea what the implications are though I can guess
there are branches at namespace and class when dealing with C++ symbols.

(the real theoretical minimum is not 35 / 10; it's (1+2+3+4+5+6+7+8+9
+35) / 10 == 80/10 == 8.  It doesn't much matter)

The number of branches is easily reduced by reducing the number of items
in the hash bucket (if there's 5 items, you can only branch in 5 places
and probably won't approach that) and using Patricia Tries in place of
hash chains instead of replacing the whole hash table.  This is probably
a good idea, since the only space-efficient storage methods I can come
up with would be O(n) for n=num_branches operations per branch (each
operation is 1 character comparison*; on top of that is one additional
long addition.)

*(comparing two strings, each character comparison is a compare and a
long addition to iterate in a loop)

So far I do NOT have a way to store this such that the branches are
guaranteed less intensive than 8 character comparisons; and don't have
real numbers on what I need.  I'm still thinking this is probably a win
since we're unlikely to branch 8 times from a common prefix and even if
we do we're suddenly faced with a situation where we CAN assume we're
winning something*; Drepper are you around, I could use your input.

*(every extra direction we can take at a branch represents one more
comparison of however many characters long the prefix is; so if we
branch 8 ways after a prefix of 2 characters, we're spending 8 + 7 + 6 +
5 + 4 + 3 + 2 + 1 to avoid (8 + 7 + 6 + 5 + 4 + 3 + 2 + 1) * 2...
somebody verify this for me, I think I just did something awesome and
that shouldn't be possible)

For those who want to keep trying, my crappy analysis program is
attached.  I used the below command to divine some useful numbers:

readelf -sW /usr/lib/libgtk.so | tail -n +4 | awk '{print $8}' | grep -v
'^$' | ./ptree 

(In case you haven't noticed yet, I trust my intuitions to fill in where
my technical knowledge is lacking; and then try to find out if anyone
else has something solid for me.

-- 
John Moser <john.r.moser@gmail.com>

[-- Attachment #1.2: ptree.c --]
[-- Type: text/x-csrc, Size: 4791 bytes --]

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Trie structure:
 *
 * HEAD
 * |-Child -> Sibling -> Sibling
 *            |-Child -> Sibling -> Sibling
 *            ......
 */
struct patricia {
	char	*text;
	void	*data;
	struct patricia
		*sibling;
	struct patricia
		*child;
};

struct patricia phead = { "", 0, 0, 0};

int strprefix(const char *s1, const char *s2) {
	int i = 0;
	while(s1[i] == s2[i] && s1[i])
		i++;
	return i;
}

static inline struct patricia *new_patricia_node(char *text,
	void *data,
	struct patricia *sibling,
	struct patricia *child) {
	struct patricia *new_node = NULL;

	new_node = malloc(sizeof(struct patricia));
	new_node->text		= text;
	new_node->data		= data;
	new_node->sibling	= sibling;
	new_node->child		= child;

	return new_node;
}

int add_to_patricia(char *key,
	void *value) {
	struct patricia *node = &phead;
	struct patricia *new_node = NULL;
	int i=0,j=0;
	char *str;

    while(1) {
	j += i;
	i = strprefix(key + j, node->text);

	/* No move, test sibling */
	if (!i) {
		/* No sibling, we're becoming one */
		if (!node->sibling) {
			str = malloc(strlen(key + j + i) + 1);
			strcpy(str, key + j + i);

			new_node = new_patricia_node(str, value, NULL, NULL);
			node->sibling = new_node;
			break;
		}
		node = node->sibling;
		continue;
	}

	/* Holy crap we found this node already! Change value
	 * this is the only time these are equal  */
	if (!key[j + i] && !node->text[i]) { 
		node->data = value;
		break;
	}

	/* The node has to be split; we're becoming the parent */
	if (!key[j + i]) {
		/*The end of the node's text becomes our child's text*/
		str = malloc(strlen(node->text + i) + 1);
		strcpy(str, node->text + i);
		/* node's siblings become our siblings;
  		 * but it keeps its children */
		new_node = new_patricia_node(str, node->data, NULL, node->child);

		/*Shrink node->text*/
		node->text[i]	= '\0';
		node->text = realloc(node->text, strlen(node->text) + 1);
		node->data	= value;
		node->child	= new_node;

		break;
	}

	/* We completed this node; we'll be a child*/
	if (!node->text[i]) {
		if (!node->child) {
			str = malloc(strlen(key + j + i) + 1);
			strcpy(str, key + j + i);

			new_node = new_patricia_node(str, value, NULL, NULL);
			node->child	  = new_node;

			break;
		}
		node = node->child;
		continue;
	}

	/* We have to split this node and demote both the key and the
	 * node to children */
	if (node->text[i] && key[j + i]) {
		/* First, make this node its own child */
		/* Use the string suffix as the text */
		str = malloc(strlen(node->text + i) + 1);
		strcpy(str, node->text + i);

		/* Also demote the data */
		new_node = new_patricia_node(str, node->data, NULL, NULL);
		node->child	  = new_node;
		node->data	  = NULL;
		/* Truncate the parent string */
		node->text[i]	  = '\0';
		node->text = realloc(node->text, strlen(node->text) + 1);

		node = node->child;

		/* Now insert our key */
		/* Use the string suffix as the text */
		str = malloc(strlen(key + j + i) + 1);
		strcpy(str, key + j + i);

		new_node = new_patricia_node(str, value, NULL, NULL);
		node->sibling	  = new_node;

		break;
	}

    }

	return 0;
}

#if 1
int maxdepth = 0;
int cpxlen = 0;
void ptree_spill(char *s, struct patricia *node, int depth, int len) {
	printf("n:%p t:%s v:%i s:%p c:%p\n"
		"     p:%s d:%i x: %i l:%i\n",
		node,
		node->text,
		node->data ? *((int*)node->data) : 0,
		node->sibling,
		node->child,
		s ? s : "",
		depth,
		len,
		len + strlen(node->text));
	if (depth > maxdepth)
		maxdepth = depth;
	if (len > cpxlen) // common prefix length
		cpxlen = len;
	if (node->child)
		ptree_spill(node->text, node->child, depth+1, len + strlen(node->text));
	if (node->sibling)
		ptree_spill(s, node->sibling, depth, len);
}

int main() {
	int data = 5;
	int data1 = 12;
	int data2 = 6;
	int data3 = 8;
#if 0
	printf("Insert 1\n");
	add_to_patricia("patricia", &data);
	ptree_spill("", &phead);
	printf("Insert 2\n");
	add_to_patricia("patrikana", &data1);
	ptree_spill("", &phead);
	printf("Insert 3\n");
	add_to_patricia("patriciaworm", &data2);
	ptree_spill("", &phead);
	printf("Insert 4\n");
	add_to_patricia("patriciacool", &data3);
	ptree_spill("", &phead);
#else
  char *line = NULL;
  size_t len = 0;
  ssize_t n;
  int sint=0;
  unsigned int (*hash) (const char *);

  while ((n = getline (&line, &len, stdin)) > 0)
    {
      if (line[n - 1] == '\n')
	line[n - 1] = '\0';
	add_to_patricia(line, &data);
    }
	ptree_spill("", &phead, 0, 0);
	printf("Max depth:  %i\n"
		"Max common prefix length:  %i\n",
		maxdepth, cpxlen);
#endif
	return 0;
}
#endif

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 827 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Patricia Trie pseudo-test
  2006-08-22  7:34 Patricia Trie pseudo-test John Moser
@ 2006-08-22 11:45 ` John Moser
  0 siblings, 0 replies; 2+ messages in thread
From: John Moser @ 2006-08-22 11:45 UTC (permalink / raw)
  To: libc-alpha; +Cc: binutils

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

On Tue, 2006-08-22 at 01:47 -0400, John Moser wrote:

[...]
> I'm still thinking this is probably a win
> since we're unlikely to branch 8 times from a common prefix and even if
> we do we're suddenly faced with a situation where we CAN assume we're
> winning something*; Drepper are you around, I could use your input.

So I had some math run.


You have N string elements each sharing a common prefix of X characters.

If you arrange these in a trie, you have N branches to follow after the
prefix.

Each direction tested in a branch is identical in operation to a
character comparison.  (I designed the storage format that way
specifically.)

Every string is searched for.  The linker eventually goes through all
symbols, taking every possible search path.

If the strings are in a linked list, you compare the common prefixes 1+2
+...+N times, making X comparisons each time, totaling (1+2+...+N)*X or
X*((N+1) * N)/2 character comparisons.

If the strings are in a trie, you compare the common prefixes N times
and compare each branch direction ((N+1)*N)/2 times, totaling N*X + ((N
+1)*N)/2 branches.

These simplify to (N*(1 + N)*X)/2 (linked list) and (N*(1 + N))/2 + N*X
(patricia trie).

A little help from #math (who helped me figure out the simplified
versions) and we get:

<+Cale> % Reduce[{x*((n + 1)*n)/2 > n*x + ((n + 1)*n)/2, x > 0, n > 0}]
<mbot> Cale: x > 1 && n > (1 + x)/(-1 + x)

Now here's some interesting thoughts:

 X = 2; N > 3/1 (3)
 X = 3; N > 4/2 (2)
 X = 4; N > 5/3 (1.6)

Once we've passed prefixes 4 characters long, any branch will be a gain
because N will be 2 or more.  Also the more directions we branch in, the
more time we save; for example, take a prefix 3 characters long (let's
say namespace std, which is probably like _Z22std anyway) and branch it
3, 5, and 10 ways.
    LL     PT
3:  18     15
5:  45     30
10: 165    85

Or let's try a branch between 2 with prefixes 3, 5, 10, 20, and 50 long:
    LL     PT
3:  9      9
5:  15     13
10: 30     23
20: 60     43
50: 150    103

In other words, this is harmless if prefixes are longer than 2
characters; and a gain if prefixes are longer than 3 characters OR the
number of symbols in the bucket sharing that number of prefixes is
greater than 2.  This is EXTREMELY likely.

Of course, this is still all in theory.  But the numbers are pretty.
> 
-- 
John Moser <john.r.moser@gmail.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 827 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2006-08-22  6:44 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-22  7:34 Patricia Trie pseudo-test John Moser
2006-08-22 11:45 ` John Moser

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).