From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mailout06.t-online.de (mailout06.t-online.de [194.25.134.19]) by sourceware.org (Postfix) with ESMTPS id 69630385771A for ; Thu, 27 Apr 2023 17:20:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 69630385771A Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=t-online.de Authentication-Results: sourceware.org; spf=none smtp.mailfrom=t-online.de Received: from fwd83.dcpf.telekom.de (fwd83.aul.t-online.de [10.223.144.109]) by mailout06.t-online.de (Postfix) with SMTP id C240050A8; Thu, 27 Apr 2023 19:20:29 +0200 (CEST) Received: from [192.168.2.101] ([91.57.243.242]) by fwd83.t-online.de with (TLSv1.3:TLS_AES_256_GCM_SHA384 encrypted) esmtp id 1ps5IB-0U2BCS0; Thu, 27 Apr 2023 19:20:27 +0200 Subject: Re: [PATCH setup 2/2] Detect filename collisions between packages To: Jon Turney , "cygwin-apps@cygwin.com" References: <20230423144330.3107-1-jon.turney@dronecode.org.uk> <20230423144330.3107-3-jon.turney@dronecode.org.uk> <358f3794-ea6c-d771-731b-34ab9bffde9b@dronecode.org.uk> From: Christian Franke Message-ID: <087e05db-80e5-f800-1533-51ae93c67079@t-online.de> Date: Thu, 27 Apr 2023 19:20:25 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 SeaMonkey/2.53.16 MIME-Version: 1.0 In-Reply-To: <358f3794-ea6c-d771-731b-34ab9bffde9b@dronecode.org.uk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-TOI-EXPURGATEID: 150726::1682616027-0B7FDC11-F3022B1F/0/0 CLEAN NORMAL X-TOI-MSGID: 2e8b4454-1215-4341-b95b-880327455658 X-Spam-Status: No, score=-2.4 required=5.0 tests=BAYES_00,FREEMAIL_FROM,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,NICE_REPLY_A,RCVD_IN_BARRACUDACENTRAL,RCVD_IN_DNSWL_LOW,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Jon Turney wrote: > ... >>> Using std::set_intersection() on values from std::map() here is >>> probably >>> a mistake. It's simple to write, but the performance is not good. >> >> A faster alternative which avoids set_intersection calls in a loop is >> possibly to use one large data structure which maps filenames to sets >> of packages. Using multimap instead of the >> straightforward map> needs possibly less memory >> (not tested). But for multimap it is required that file/package name >> pairs are not inserted twice. >> >> I attached a small standalone POC source file using multimap. It >> would also detect collisions in the already installed packages. > > Thanks for the ideas. It seems I really didn't think that carefully > about this... > > It seems like maybe building a map from filename to the set of package > names which contain it, and then finding all the filenames where that > set has more than one member would be a possible better implementation. Of course this is the more obvious method and easier to implement. It is somewhat slower and needs more memory. Meantime I did a quick artificial test with 10000 "packages" with 100 "files" each and 2 collisions. This resulted in a (multi)map size if 1000000(+2), total size of strings was ~80MB. Results: Data structure: execution time, memory (working set) multimap: 1.8 seconds, 238MB map>: 2.0 seconds, 318MB The execution time is needed for the map insertions. The final scan for collisions is very fast in both cases.