From: ASSI <Stromeko@nexgo.de>
To: "Frank Ch. Eigler" <fche@redhat.com>
Cc: "Frank Ch. Eigler" <fche@elastic.org>,
Jon Turney <jon.turney@dronecode.org.uk>,
overseers@sourceware.org
Subject: Re: [Cygwin] package-grep on Sourceware
Date: Sun, 08 Dec 2019 11:05:00 -0000
Message-ID: <1655048.ih8JofZtUf@gertrud>
In-Reply-To: <20191127193804.GB1967@redhat.com>
Hi Frank,
I've done some more testing locally and have confirmed my suspicion that
output buffering produces wrong results once the output grows longer than a
page. This should explain the odd results of the search on the server that
returned all file names.
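To sketch why line buffering fixes this: parallel writers sharing one pipe
stay intact as long as every write() is a whole line well under PIPE_BUF,
which is what --line-buffered gives each grep; a fully buffered grep instead
flushes fixed-size blocks that can end mid-line, so two instances under
xargs -P can splice each other's file names together. A minimal demo with
made-up worker/file names:

```shell
#!/bin/sh
# Four parallel writers share one pipe, each emitting whole lines with
# one write() per line.  Every write is far below PIPE_BUF, so the
# kernel keeps it atomic and no line is spliced into another.
(
  for w in 1 2 3 4; do
    (
      for i in $(seq 1 50); do
        printf 'worker%s-file%s\n' "$w" "$i"
      done
    ) &
  done
  wait
) | sort | wc -l    # -> 200, every line intact
```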
I propose running (GNU) grep in line-buffered mode instead, which at least on
my local system does not materially affect the runtime. I also suggest using
find instead of ls to create the list of files, which slightly reduces the
time to produce the list. If you really want to max out the parallelism, you
will also need to limit the number of arguments fed into each instance
(otherwise xargs ends up starting only 13 processes). Feeding in 1715 names
per invocation starts 20 processes, which should help get the most out of the
available number of cores.
# env(1) is needed here: xargs treats "LC_ALL=C" as a command name to
# execute, not as an environment assignment.
find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess |
  xargs -L1715 -P16 env LC_ALL=C grep -l --line-buffered -- "$param_grep" |
  sort > "$tmpfile"
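If any listing path could ever contain whitespace, a NUL-delimited variant of
the same pipeline avoids the word splitting. This is only a sketch reusing
$dir, $param_grep and $tmpfile from the script above; note that xargs' -L is
line-oriented and does not apply with -0, so -n bounds the batch size
instead:

```shell
find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess -print0 |
  xargs -0 -n1715 -P16 env LC_ALL=C grep -l --line-buffered -- "$param_grep" |
  sort > "$tmpfile"
```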
In a later iteration, the list of files to be searched could be cached (in a
file $dir.lst, say). This already helps in the local case, but is likely even
more effective on a system with far more I/O load than I can produce locally.
< "$dir.lst" xargs -L1715 -P16 env LC_ALL=C \
  grep -l --line-buffered -- "$param_grep" |
  sort > "$tmpfile"
The cache file would need to be refreshed each time the listing directories
are updated (although you could probably just run find from a cron job every
few minutes and nobody would notice a difference). Having a cache file would
also make it easier to determine the optimal input length: you could simply
count its lines to work out how best to split them among multiple processes.
Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
Thread overview: 4+ messages
[not found] <24997709.IERGFR1AUN@gertrud>
2019-11-25 0:37 ` Frank Ch. Eigler
2019-11-25 16:48 ` ASSI
2019-11-27 19:38 ` Frank Ch. Eigler
2019-12-08 11:05 ` ASSI [this message]