Re: [PATCH] Use xxHash hashing algorithm.

public inbox for dwz@sourceware.org
 help / color / mirror / Atom feed

From: Mark Wielaard <mark@klomp.org>
To: "Martin Liška" <mliska@suse.cz>
Cc: dwz@sourceware.org
Subject: Re: [PATCH] Use xxHash hashing algorithm.
Date: Thu, 30 Jun 2022 18:13:14 +0200	[thread overview]
Message-ID: <9936e9f84660a3a47e7a87567bab131d54fdb221.camel@klomp.org> (raw)
In-Reply-To: <af538e0e-4b1d-e0b6-f325-2bd6d4264ea9@suse.cz>

Hi Martin,

On Mon, 2022-06-27 at 15:41 +0200, Martin Liška wrote:
> On 6/25/22 21:44, Mark Wielaard wrote:
> > Could you resent this patch with git send-email or rebase it
> > against
> > current master. As is it doesn't apply giving errors when trying to
> > use it with git am.
> 
> Sure, rebased and attached to this email.

Thanks. That applies cleanly. And (after installing xxhash-devel)
compiles and make checks nicely.

> > We probably also need a configure change to detect whether xxhash
> > is actually available.
> 
> With my patch, one will see a compilation error. Can you please help
> me with the configure detection?

Ha. I just looked at our configure file and it, uh, is almost a noop.
It doesn't no any checking, not even for having libelf. Lets make
updating configure a separate task.

> > > diff --git a/Makefile b/Makefile
> > > index 9394ef4..89546c2 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -8,7 +8,7 @@ CFLAGS = -O2 -g
> > >  DWZ_VERSION := $(shell cat $(srcdir)/VERSION)
> > >  CFLAGS_VERSION = -DDWZ_VERSION='"$(DWZ_VERSION)"'
> > >  CFLAGS_COPYRIGHT = $(shell cat $(srcdir)/COPYRIGHT_YEARS)
> > > -CFLAGS_COMMON = -Wall -W -D_FILE_OFFSET_BITS=64
> > > +CFLAGS_COMMON = -Wall -W -D_FILE_OFFSET_BITS=64
> > > -DXXH_INLINE_ALL=1
> > >  override CFLAGS += $(CFLAGS_COMMON) $(CFLAGS_VERSION)
> > > $(CFLAGS_COPYRIGHT)
> > >  prefix = /usr
> > > @@ -17,7 +17,7 @@ bindir = $(exec_prefix)/bin
> > >  datarootdir = $(prefix)/share
> > >  mandir = $(datarootdir)/man
> > >  OBJECTS = args.o dwz.o hashtab.o pool.o sha1.o dwarfnames.o
> > > -LIBS=-lelf
> > > +LIBS=-lelf -lxxhash
> > >  dwz: $(OBJECTS)
> > >  	$(CC) $(LDFLAGS) -o $@ $^ $(LIBS)
> > >  args.o: native.o
> > 
> > So we use XXH_INLINE_ALL. Do we then still need to link with
> > -lxxhash?
> 
> Correct, it's not needed.

Good.

> > Allocating and reusing the global state is convenient, but might
> > make
> > parellizing dwz somewhat more complicated. Was this design used
> > deliberately, or can we easily switch to stack allocating the
> > state?
> 
> Yes, but the alternative approach would be making it a stack
> variable in each function where we use it. Note right now, processes
> are used for parallel execution.

OK, so nothing stands in the way of doing that if we do want to do in-
process parallelism.

> > Also I find the naming slightly confusing.  We used
> > iterative_hash_object to update a hash with an integral value, and
> > iterative_hash to update a hash with a size plus pointer. Now it
> > seems
> > we swap the meaning. iterative_hash gets replace by
> > hash_update_state_object. And iterative_hash_object gets replaced
> > by
> > hash_update_state. Was the meaning of "object" deliberately
> > swapped?
> 
> No, fixed that.

Thanks, that makes it a lot simpler imho.

> > Also it seems iterative_hash_object with inital zero value is
> > replaced
> > by hash, which isn't defined here but in hashtab.h. Why the split
> > in definitions?
> 
> All moved to dwz.c right now.

And added comments for everything. Thanks.

> > > +		  hash_init_state ();
> > > +		  hash_update_state (die->u.p1.die_hash);
> > > +		  hash_update_state (s);
> > > +		  hash_update_state (cu_file->time);
> > > +		  hash_update_state (cu_file->size);
> > > +		  hash_update_state_object (cu_file->file, file_len +
> > > 1);
> > 
> > OK.
> > 
> > >  		  if (cu_file->dir)
> > > -		    die->u.p1.die_hash
> > > -		      = iterative_hash (cu_file->dir,
> > > -					strlen (cu_file->dir) + 1,
> > > -					die->u.p1.die_hash);
> > > +		    {
> > > +		      hash_update_state_object (cu_file->dir,
> > > +						strlen (cu_file->dir) +
> > > 1);
> > > +		    }
> > 
> > OK, assuming the die->u.p1.die_hash below are alwasy done/correct.
> > But brackets aren't really needed.
> > 
> > >  		  /* Ignore DW_AT_comp_dir for DW_AT_*_file <built-in>
> > >  		     etc. if immediately followed by DW_AT_*_line
> > > 0.  */
> > >  		  else if (cu_file-
> > > >file_angle_brackets_encapsulated_no_slash
> > > @@ -3427,7 +3439,13 @@ checksum_die (DSO *dso, dw_cu_ref cu,
> > > dw_die_ref top_die, dw_die_ref die)
> > >  				  ? DW_AT_decl_line : DW_AT_call_line)
> > >  			   && t->attr[i + 1].form == DW_FORM_data1
> > >  			   && *new_ptr == 0)
> > > -		    break;
> > > +		    {
> > > +		      die->u.p1.die_hash = hash_digest ();
> > > +		      break;
> > > +		    }
> > > +
> > > +		  die->u.p1.die_hash = hash_digest ();
> > 
> > I find the logic hard to follow here. What makes sure the
> > die->u.p1.die_hash is always updated? Isn't it updated twice
> > (redundantly) in the above case? And needless brackets.
> 
> Yeah, it's hard because the 'break;' means the next call to:
> die->u.p1.die_hash = hash_digest ();
> is not executed. And die->u.p1.die_hash is used in the current version
> as initial hash value (and it's modified in each iterative_hash_object call).

Yeah, I see it is slightly easier to follow when not in patch form. It
does look correct.

> > > diff --git a/hashtab.h b/hashtab.h
> > > index cb3da01..509e627 100644
> > > --- a/hashtab.h
> > > +++ b/hashtab.h
> > > @@ -153,9 +153,11 @@ extern void	htab_restore (htab_t, const
> > > char *, htab_restorefn);
> > >  #endif
> > > -/* An iterative hash function for arbitrary data.  */
> > > -extern hashval_t iterative_hash (const void *, size_t,
> > > hashval_t);
> > > +#include <xxhash.h>
> > > +
> > >  /* Shorthand for hashing something with an intrinsic size.  */
> > > +#define hash(IN,LEN) XXH64(IN, LEN, 0)
> > > +#define iterative_hash(IN,LEN,INIT) XXH64(IN, LEN, INIT)
> > >  #define iterative_hash_object(OB,INIT) iterative_hash (&OB,
> > > sizeof (OB), INIT)
> > >  #ifdef __cplusplus
> > 
> > See above for the question why everything hash related isn't
> > defined
> > in the same place.
> 
> Done.

OK. dwz.c is fine. But any reason why you picked that instead of
hashtab.h?

> > 
> > How much of hashtab.{c,h} are we still using?
> 
> wc -l hashtab.[ch]
>   628 hashtab.c
>   160 hashtab.h
>   788 total
> 
> with my patch.

Ah, right, all the actual hashtable code is still there of course. We
only replaced the hash, not the datastructure.

> Please take a look at the updated patch.

I read it again and it looks good to me.

Thanks,

Mark

next prev parent reply	other threads:[~2022-06-30 16:13 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-05  8:17 Martin Liška
2022-03-31 11:29 ` Martin Liška
2022-04-28 10:06   ` Martin Liška
2022-06-20 13:29     ` Martin Liška
2022-06-25 19:44 ` Mark Wielaard
2022-06-27 13:41   ` Martin Liška
2022-06-30 16:13     ` Mark Wielaard [this message]
2022-07-04 13:31       ` Martin Liška
2022-07-04 15:40         ` Mark Wielaard
2022-07-07 12:29           ` Martin Liška
2022-07-07 13:36             ` xxhash devel on buildbot workers (Was: [PATCH] Use xxHash hashing algorithm) Mark Wielaard
2022-07-07 15:04               ` xxhash devel on buildbot workers Thomas Fitzsimmons
2022-07-11  7:48               ` xxhash devel on buildbot workers (Was: [PATCH] Use xxHash hashing algorithm) Dan Horák

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9936e9f84660a3a47e7a87567bab131d54fdb221.camel@klomp.org \
    --to=mark@klomp.org \
    --cc=dwz@sourceware.org \
    --cc=mliska@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).