From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from gnu.wildebeest.org (gnu.wildebeest.org [45.83.234.184]) by sourceware.org (Postfix) with ESMTPS id 50AFD386F0CF for ; Thu, 30 Jun 2022 16:13:16 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 50AFD386F0CF Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=klomp.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=klomp.org Received: from tarox.wildebeest.org (83-87-18-245.cable.dynamic.v4.ziggo.nl [83.87.18.245]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by gnu.wildebeest.org (Postfix) with ESMTPSA id 1A36130146AC; Thu, 30 Jun 2022 18:13:15 +0200 (CEST) Received: by tarox.wildebeest.org (Postfix, from userid 1000) id C57A7406F36A; Thu, 30 Jun 2022 18:13:14 +0200 (CEST) Message-ID: <9936e9f84660a3a47e7a87567bab131d54fdb221.camel@klomp.org> Subject: Re: [PATCH] Use xxHash hashing algorithm. From: Mark Wielaard To: Martin =?UTF-8?Q?Li=C5=A1ka?= Cc: dwz@sourceware.org Date: Thu, 30 Jun 2022 18:13:14 +0200 In-Reply-To: References: <20220625194440.GA16194@gnu.wildebeest.org> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Evolution 3.28.5 (3.28.5-10.el7) Mime-Version: 1.0 X-Spam-Status: No, score=-11.7 required=5.0 tests=BAYES_00, GIT_PATCH_0, JMQ_SPF_NEUTRAL, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: dwz@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Dwz mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jun 2022 16:13:18 -0000 Hi Martin, On Mon, 2022-06-27 at 15:41 +0200, Martin Li=C5=A1ka wrote: > On 6/25/22 21:44, Mark Wielaard wrote: > > Could you resent this patch with git send-email or rebase it > > against > > current master. As is it doesn't apply giving errors when trying to > > use it with git am. >=20 > Sure, rebased and attached to this email. Thanks. That applies cleanly. And (after installing xxhash-devel) compiles and make checks nicely. > > We probably also need a configure change to detect whether xxhash > > is actually available. >=20 > With my patch, one will see a compilation error. Can you please help > me with the configure detection? Ha. I just looked at our configure file and it, uh, is almost a noop. It doesn't no any checking, not even for having libelf. Lets make updating configure a separate task. > > > diff --git a/Makefile b/Makefile > > > index 9394ef4..89546c2 100644 > > > --- a/Makefile > > > +++ b/Makefile > > > @@ -8,7 +8,7 @@ CFLAGS =3D -O2 -g > > > DWZ_VERSION :=3D $(shell cat $(srcdir)/VERSION) > > > CFLAGS_VERSION =3D -DDWZ_VERSION=3D'"$(DWZ_VERSION)"' > > > CFLAGS_COPYRIGHT =3D $(shell cat $(srcdir)/COPYRIGHT_YEARS) > > > -CFLAGS_COMMON =3D -Wall -W -D_FILE_OFFSET_BITS=3D64 > > > +CFLAGS_COMMON =3D -Wall -W -D_FILE_OFFSET_BITS=3D64 > > > -DXXH_INLINE_ALL=3D1 > > > override CFLAGS +=3D $(CFLAGS_COMMON) $(CFLAGS_VERSION) > > > $(CFLAGS_COPYRIGHT) > > > prefix =3D /usr > > > @@ -17,7 +17,7 @@ bindir =3D $(exec_prefix)/bin > > > datarootdir =3D $(prefix)/share > > > mandir =3D $(datarootdir)/man > > > OBJECTS =3D args.o dwz.o hashtab.o pool.o sha1.o dwarfnames.o > > > -LIBS=3D-lelf > > > +LIBS=3D-lelf -lxxhash > > > dwz: $(OBJECTS) > > > $(CC) $(LDFLAGS) -o $@ $^ $(LIBS) > > > args.o: native.o > >=20 > > So we use XXH_INLINE_ALL. Do we then still need to link with > > -lxxhash? >=20 > Correct, it's not needed. Good. > > Allocating and reusing the global state is convenient, but might > > make > > parellizing dwz somewhat more complicated. Was this design used > > deliberately, or can we easily switch to stack allocating the > > state? >=20 > Yes, but the alternative approach would be making it a stack > variable in each function where we use it. Note right now, processes > are used for parallel execution. OK, so nothing stands in the way of doing that if we do want to do in- process parallelism. > > Also I find the naming slightly confusing. We used > > iterative_hash_object to update a hash with an integral value, and > > iterative_hash to update a hash with a size plus pointer. Now it > > seems > > we swap the meaning. iterative_hash gets replace by > > hash_update_state_object. And iterative_hash_object gets replaced > > by > > hash_update_state. Was the meaning of "object" deliberately > > swapped? >=20 > No, fixed that. Thanks, that makes it a lot simpler imho. > > Also it seems iterative_hash_object with inital zero value is > > replaced > > by hash, which isn't defined here but in hashtab.h. Why the split > > in definitions? >=20 > All moved to dwz.c right now. And added comments for everything. Thanks. > > > + hash_init_state (); > > > + hash_update_state (die->u.p1.die_hash); > > > + hash_update_state (s); > > > + hash_update_state (cu_file->time); > > > + hash_update_state (cu_file->size); > > > + hash_update_state_object (cu_file->file, file_len + > > > 1); > >=20 > > OK. > >=20 > > > if (cu_file->dir) > > > - die->u.p1.die_hash > > > - =3D iterative_hash (cu_file->dir, > > > - strlen (cu_file->dir) + 1, > > > - die->u.p1.die_hash); > > > + { > > > + hash_update_state_object (cu_file->dir, > > > + strlen (cu_file->dir) + > > > 1); > > > + } > >=20 > > OK, assuming the die->u.p1.die_hash below are alwasy done/correct. > > But brackets aren't really needed. > >=20 > > > /* Ignore DW_AT_comp_dir for DW_AT_*_file > > > etc. if immediately followed by DW_AT_*_line > > > 0. */ > > > else if (cu_file- > > > >file_angle_brackets_encapsulated_no_slash > > > @@ -3427,7 +3439,13 @@ checksum_die (DSO *dso, dw_cu_ref cu, > > > dw_die_ref top_die, dw_die_ref die) > > > ? DW_AT_decl_line : DW_AT_call_line) > > > && t->attr[i + 1].form =3D=3D DW_FORM_data1 > > > && *new_ptr =3D=3D 0) > > > - break; > > > + { > > > + die->u.p1.die_hash =3D hash_digest (); > > > + break; > > > + } > > > + > > > + die->u.p1.die_hash =3D hash_digest (); > >=20 > > I find the logic hard to follow here. What makes sure the > > die->u.p1.die_hash is always updated? Isn't it updated twice > > (redundantly) in the above case? And needless brackets. >=20 > Yeah, it's hard because the 'break;' means the next call to: > die->u.p1.die_hash =3D hash_digest (); > is not executed. And die->u.p1.die_hash is used in the current version > as initial hash value (and it's modified in each iterative_hash_object ca= ll). Yeah, I see it is slightly easier to follow when not in patch form. It does look correct. > > > diff --git a/hashtab.h b/hashtab.h > > > index cb3da01..509e627 100644 > > > --- a/hashtab.h > > > +++ b/hashtab.h > > > @@ -153,9 +153,11 @@ extern void htab_restore (htab_t, const > > > char *, htab_restorefn); > > > #endif > > > -/* An iterative hash function for arbitrary data. */ > > > -extern hashval_t iterative_hash (const void *, size_t, > > > hashval_t); > > > +#include > > > + > > > /* Shorthand for hashing something with an intrinsic size. */ > > > +#define hash(IN,LEN) XXH64(IN, LEN, 0) > > > +#define iterative_hash(IN,LEN,INIT) XXH64(IN, LEN, INIT) > > > #define iterative_hash_object(OB,INIT) iterative_hash (&OB, > > > sizeof (OB), INIT) > > > #ifdef __cplusplus > >=20 > > See above for the question why everything hash related isn't > > defined > > in the same place. >=20 > Done. OK. dwz.c is fine. But any reason why you picked that instead of hashtab.h? > >=20 > > How much of hashtab.{c,h} are we still using? >=20 > wc -l hashtab.[ch] > 628 hashtab.c > 160 hashtab.h > 788 total >=20 > with my patch. Ah, right, all the actual hashtable code is still there of course. We only replaced the hash, not the datastructure. > Please take a look at the updated patch. I read it again and it looks good to me. Thanks, Mark