From: ASSI
To: "Frank Ch. Eigler"
Cc: "Frank Ch. Eigler", Jon Turney, overseers@sourceware.org
Subject: Re: [Cygwin] package-grep on Sourceware
Date: Sun, 08 Dec 2019 11:05:00 -0000
Message-ID: <1655048.ih8JofZtUf@gertrud>
In-Reply-To: <20191127193804.GB1967@redhat.com>
References: <24997709.IERGFR1AUN@gertrud> <1810471.5kLSPG0ym2@gertrud> <20191127193804.GB1967@redhat.com>

Hi Frank,

I've done some more testing locally and have confirmed my suspicion:
output buffering causes wrong results whenever the output grows longer
than a page. This should explain the odd results of a search that
causes all file names to be returned on the server.
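The failure mode is easy to picture with two grep processes sharing one
pipe while flushing fixed-size blocks instead of whole lines: one
writer's block can end mid-line, and another writer's output lands
before the remainder arrives. A deterministic simulation of that splice
(the package names are invented for the demo, this is not the real
pipeline):

  #!/bin/sh
  # Illustration only: two block-buffered writers interleaving on a
  # shared pipe. Writer A's buffer fills mid-line; writer B flushes a
  # complete line before A's remainder is written.
  set -eu

  out=$(
    printf 'pkg/alpha-1.0.tar'       # A: block ends mid-line
    printf 'pkg/beta-2.1.tar.xz\n'   # B: complete line
    printf '.xz\n'                   # A: rest of its line
  )

  printf '%s\n' "$out"
  # The reader now sees a bogus first entry:
  #   pkg/alpha-1.0.tarpkg/beta-2.1.tar.xz
  #   .xz

With line buffering each match is written as one write(), and writes of
a line at a time (below PIPE_BUF) stay intact on a pipe, so the merged
stream keeps one file name per line.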
I propose running (GNU) grep in line-buffered mode instead, which at
least on my local system doesn't materially impact the runtime. I also
suggest using find instead of ls to create the list of files, which
slightly reduces the time to produce the file list. If you really want
to max out the parallelism, you will also need to limit the number of
arguments fed into each instance (otherwise xargs ends up starting only
13 processes). Feeding in 1715 names per invocation starts 20
processes, which should help get the most out of the available cores.
Note that xargs execs its command directly rather than through a shell,
so the locale setting needs to go through env:

  find $dir -mindepth 2 -maxdepth 2 -type f -not -name .htaccess |
    xargs -L1715 -P16 env LC_ALL=C grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

In a later iteration, the list of files to be searched could be cached
(in a file $dir.lst, say). This already helps in the local case, but is
likely more effective on a system with a lot more IO load than I can
produce locally.

  <$dir.lst xargs -L1715 -P16 env LC_ALL=C grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

The cache file would need to be refreshed each time the listing
directories get updated (although you could probably just run find in a
cron job every few minutes and nobody would notice a difference).
Having a cache file would also make it easier to determine the optimal
input length: you could simply count its lines to calculate how best to
split them among multiple processes.

Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
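P.S.: For completeness, here is a self-contained sketch of the
cached-list variant with the chunk size derived from the line count, as
suggested above. The directory layout, the worker count of 2, and the
pattern "libfoo" are invented for the demonstration; only the
find/xargs/grep/sort structure mirrors the proposal.

  #!/bin/sh
  # Sketch: cache the file list, then split it evenly across workers.
  set -eu

  dir=$(mktemp -d)
  trap 'rm -rf "$dir"' EXIT

  # Fake package listing files, two levels deep as in the real layout.
  mkdir -p "$dir/x86_64/release"
  printf 'libfoo-1.0\n'   > "$dir/x86_64/release/a"
  printf 'libbar-2.0\n'   > "$dir/x86_64/release/b"
  printf 'libfoo-devel\n' > "$dir/x86_64/release/c"

  # Refresh the cache file (this is what the cron job would do).
  find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess > "$dir.lst"

  # Round up so procs * chunk >= lines and no worker sits idle.
  procs=2
  lines=$(wc -l < "$dir.lst")
  chunk=$(( (lines + procs - 1) / procs ))

  matches=$(xargs -L"$chunk" -P"$procs" \
      env LC_ALL=C grep -l --line-buffered -- libfoo < "$dir.lst" | sort)
  printf '%s\n' "$matches"
  rm -f "$dir.lst"

With 3 cached paths and 2 workers this computes a chunk size of 2 and
prints the two files containing "libfoo"; on the server the same
arithmetic over $dir.lst would replace the hard-coded -L1715.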