From: Mark Wielaard <mark@klomp.org>
To: "Martin Liška" <mliska@suse.cz>
Cc: dwz@sourceware.org
Subject: Re: [PATCH] Use xxHash hashing algorithm.
Date: Thu, 30 Jun 2022 18:13:14 +0200 [thread overview]
Message-ID: <9936e9f84660a3a47e7a87567bab131d54fdb221.camel@klomp.org> (raw)
In-Reply-To: <af538e0e-4b1d-e0b6-f325-2bd6d4264ea9@suse.cz>
Hi Martin,
On Mon, 2022-06-27 at 15:41 +0200, Martin Liška wrote:
> On 6/25/22 21:44, Mark Wielaard wrote:
> > Could you resent this patch with git send-email or rebase it
> > against
> > current master. As is it doesn't apply giving errors when trying to
> > use it with git am.
>
> Sure, rebased and attached to this email.
Thanks. That applies cleanly. And (after installing xxhash-devel)
compiles and make checks nicely.
> > We probably also need a configure change to detect whether xxhash
> > is actually available.
>
> With my patch, one will see a compilation error. Can you please help
> me with the configure detection?
Ha. I just looked at our configure file and it, uh, is almost a noop.
It doesn't no any checking, not even for having libelf. Lets make
updating configure a separate task.
> > > diff --git a/Makefile b/Makefile
> > > index 9394ef4..89546c2 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -8,7 +8,7 @@ CFLAGS = -O2 -g
> > > DWZ_VERSION := $(shell cat $(srcdir)/VERSION)
> > > CFLAGS_VERSION = -DDWZ_VERSION='"$(DWZ_VERSION)"'
> > > CFLAGS_COPYRIGHT = $(shell cat $(srcdir)/COPYRIGHT_YEARS)
> > > -CFLAGS_COMMON = -Wall -W -D_FILE_OFFSET_BITS=64
> > > +CFLAGS_COMMON = -Wall -W -D_FILE_OFFSET_BITS=64
> > > -DXXH_INLINE_ALL=1
> > > override CFLAGS += $(CFLAGS_COMMON) $(CFLAGS_VERSION)
> > > $(CFLAGS_COPYRIGHT)
> > > prefix = /usr
> > > @@ -17,7 +17,7 @@ bindir = $(exec_prefix)/bin
> > > datarootdir = $(prefix)/share
> > > mandir = $(datarootdir)/man
> > > OBJECTS = args.o dwz.o hashtab.o pool.o sha1.o dwarfnames.o
> > > -LIBS=-lelf
> > > +LIBS=-lelf -lxxhash
> > > dwz: $(OBJECTS)
> > > $(CC) $(LDFLAGS) -o $@ $^ $(LIBS)
> > > args.o: native.o
> >
> > So we use XXH_INLINE_ALL. Do we then still need to link with
> > -lxxhash?
>
> Correct, it's not needed.
Good.
> > Allocating and reusing the global state is convenient, but might
> > make
> > parellizing dwz somewhat more complicated. Was this design used
> > deliberately, or can we easily switch to stack allocating the
> > state?
>
> Yes, but the alternative approach would be making it a stack
> variable in each function where we use it. Note right now, processes
> are used for parallel execution.
OK, so nothing stands in the way of doing that if we do want to do in-
process parallelism.
> > Also I find the naming slightly confusing. We used
> > iterative_hash_object to update a hash with an integral value, and
> > iterative_hash to update a hash with a size plus pointer. Now it
> > seems
> > we swap the meaning. iterative_hash gets replace by
> > hash_update_state_object. And iterative_hash_object gets replaced
> > by
> > hash_update_state. Was the meaning of "object" deliberately
> > swapped?
>
> No, fixed that.
Thanks, that makes it a lot simpler imho.
> > Also it seems iterative_hash_object with inital zero value is
> > replaced
> > by hash, which isn't defined here but in hashtab.h. Why the split
> > in definitions?
>
> All moved to dwz.c right now.
And added comments for everything. Thanks.
> > > + hash_init_state ();
> > > + hash_update_state (die->u.p1.die_hash);
> > > + hash_update_state (s);
> > > + hash_update_state (cu_file->time);
> > > + hash_update_state (cu_file->size);
> > > + hash_update_state_object (cu_file->file, file_len +
> > > 1);
> >
> > OK.
> >
> > > if (cu_file->dir)
> > > - die->u.p1.die_hash
> > > - = iterative_hash (cu_file->dir,
> > > - strlen (cu_file->dir) + 1,
> > > - die->u.p1.die_hash);
> > > + {
> > > + hash_update_state_object (cu_file->dir,
> > > + strlen (cu_file->dir) +
> > > 1);
> > > + }
> >
> > OK, assuming the die->u.p1.die_hash below are alwasy done/correct.
> > But brackets aren't really needed.
> >
> > > /* Ignore DW_AT_comp_dir for DW_AT_*_file <built-in>
> > > etc. if immediately followed by DW_AT_*_line
> > > 0. */
> > > else if (cu_file-
> > > >file_angle_brackets_encapsulated_no_slash
> > > @@ -3427,7 +3439,13 @@ checksum_die (DSO *dso, dw_cu_ref cu,
> > > dw_die_ref top_die, dw_die_ref die)
> > > ? DW_AT_decl_line : DW_AT_call_line)
> > > && t->attr[i + 1].form == DW_FORM_data1
> > > && *new_ptr == 0)
> > > - break;
> > > + {
> > > + die->u.p1.die_hash = hash_digest ();
> > > + break;
> > > + }
> > > +
> > > + die->u.p1.die_hash = hash_digest ();
> >
> > I find the logic hard to follow here. What makes sure the
> > die->u.p1.die_hash is always updated? Isn't it updated twice
> > (redundantly) in the above case? And needless brackets.
>
> Yeah, it's hard because the 'break;' means the next call to:
> die->u.p1.die_hash = hash_digest ();
> is not executed. And die->u.p1.die_hash is used in the current version
> as initial hash value (and it's modified in each iterative_hash_object call).
Yeah, I see it is slightly easier to follow when not in patch form. It
does look correct.
> > > diff --git a/hashtab.h b/hashtab.h
> > > index cb3da01..509e627 100644
> > > --- a/hashtab.h
> > > +++ b/hashtab.h
> > > @@ -153,9 +153,11 @@ extern void htab_restore (htab_t, const
> > > char *, htab_restorefn);
> > > #endif
> > > -/* An iterative hash function for arbitrary data. */
> > > -extern hashval_t iterative_hash (const void *, size_t,
> > > hashval_t);
> > > +#include <xxhash.h>
> > > +
> > > /* Shorthand for hashing something with an intrinsic size. */
> > > +#define hash(IN,LEN) XXH64(IN, LEN, 0)
> > > +#define iterative_hash(IN,LEN,INIT) XXH64(IN, LEN, INIT)
> > > #define iterative_hash_object(OB,INIT) iterative_hash (&OB,
> > > sizeof (OB), INIT)
> > > #ifdef __cplusplus
> >
> > See above for the question why everything hash related isn't
> > defined
> > in the same place.
>
> Done.
OK. dwz.c is fine. But any reason why you picked that instead of
hashtab.h?
> >
> > How much of hashtab.{c,h} are we still using?
>
> wc -l hashtab.[ch]
> 628 hashtab.c
> 160 hashtab.h
> 788 total
>
> with my patch.
Ah, right, all the actual hashtable code is still there of course. We
only replaced the hash, not the datastructure.
> Please take a look at the updated patch.
I read it again and it looks good to me.
Thanks,
Mark
next prev parent reply other threads:[~2022-06-30 16:13 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-05 8:17 Martin Liška
2022-03-31 11:29 ` Martin Liška
2022-04-28 10:06 ` Martin Liška
2022-06-20 13:29 ` Martin Liška
2022-06-25 19:44 ` Mark Wielaard
2022-06-27 13:41 ` Martin Liška
2022-06-30 16:13 ` Mark Wielaard [this message]
2022-07-04 13:31 ` Martin Liška
2022-07-04 15:40 ` Mark Wielaard
2022-07-07 12:29 ` Martin Liška
2022-07-07 13:36 ` xxhash devel on buildbot workers (Was: [PATCH] Use xxHash hashing algorithm) Mark Wielaard
2022-07-07 15:04 ` xxhash devel on buildbot workers Thomas Fitzsimmons
2022-07-11 7:48 ` xxhash devel on buildbot workers (Was: [PATCH] Use xxHash hashing algorithm) Dan Horák
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9936e9f84660a3a47e7a87567bab131d54fdb221.camel@klomp.org \
--to=mark@klomp.org \
--cc=dwz@sourceware.org \
--cc=mliska@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).