Subject: Re: [PATCH] find-debuginfo: remove duplicate filenames when creating debugsources.list
From: Mark Wielaard
To: Denys Vlasenko
Cc: debugedit@sourceware.org
Date: Wed, 14 Jun 2023 18:30:45 +0200
Message-ID: <4fd77ae2cbc3fdc194bcd0fc37c0c2f92efb66e6.camel@klomp.org>
In-Reply-To: <20230614145638.7830-1-dvlasenk@redhat.com>

Hi Denys,

CCing the debugedit devel list.

On Wed, 2023-06-14 at 16:56 +0200, Denys Vlasenko wrote:
> We remove duplicate filenames when we _process_ debugsources.list.
> However, this means that momentarily we may have a very large
> (in the range of *giga*bytes) debugsources.list.
>
> This is unnecessary, we can also remove dups when we *create* it.

We can also teach debugedit itself to not emit duplicate lines
(currently it simply outputs every file/dir found in the .debug_info
and .debug_line tables). But that wouldn't make this unnecessary
(debugedit cannot know about the other file lists). It might be more
efficient/create smaller temporary files though.

> Signed-off-by: Denys Vlasenko
> ---
>  scripts/find-debuginfo.in | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/scripts/find-debuginfo.in b/scripts/find-debuginfo.in
> index 7dec3c3..e7ac095 100755
> --- a/scripts/find-debuginfo.in
> +++ b/scripts/find-debuginfo.in
> @@ -575,7 +575,10 @@ else
>        exit 1
>      fi
>    done
> -  cat "$temp"/debugsources.* >"$SOURCEFILE"
> +  # List of sources may have lots of duplicates. A kernel build was seen
> +  # with this list reaching 448 megabytes in size. "sort" helps to not have
> +  # _two_ sets of 448 megabytes of temp files here.
> +  LC_ALL=C sort -z -u "$temp"/debugsources.* >"$SOURCEFILE"
>    cat "$temp"/elfbins.* >"$ELFBINSFILE"
>  fi
>

Looks good, applied as commit 41fc1335b8b364c95a8ee2ed2956bbdfe7957853

Thanks,

Mark
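
As a rough illustration of the one-line change in the patch, the sketch
below builds two small NUL-separated lists and compares plain
concatenation with "LC_ALL=C sort -z -u". The temporary directory and
the source file names are made up for the example and are not taken
from find-debuginfo; "sort -z" assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical stand-ins for "$temp"/debugsources.*; all names are illustrative.
temp=$(mktemp -d)
trap 'rm -rf "$temp"' EXIT

# Write NUL-terminated entries, which is what "sort -z" expects.
printf '%s\0' /usr/src/a.c /usr/src/b.c /usr/src/a.c >"$temp/debugsources.1"
printf '%s\0' /usr/src/b.c /usr/src/c.c >"$temp/debugsources.2"

# Old line: plain concatenation keeps every duplicate entry.
before=$(cat "$temp"/debugsources.* | tr '\0' '\n' | wc -l)

# New line: merge all lists and drop duplicates in one pass.
# LC_ALL=C gives a locale-independent byte-wise ordering.
after=$(LC_ALL=C sort -z -u "$temp"/debugsources.* | tr '\0' '\n' | wc -l)

echo "entries before: $before, after: $after"
```

With these five input entries the concatenated list has 5 lines and the
deduplicated one has 3, so the later processing step never sees the
repeated names.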