public inbox for overseers@sourceware.org
From: ASSI <Stromeko@nexgo.de>
To: "Frank Ch. Eigler" <fche@redhat.com>
Cc: "Frank Ch. Eigler" <fche@elastic.org>,
	Jon Turney <jon.turney@dronecode.org.uk>,
	overseers@sourceware.org
Subject: Re: [Cygwin] package-grep on Sourceware
Date: Sun, 08 Dec 2019 11:05:00 -0000	[thread overview]
Message-ID: <1655048.ih8JofZtUf@gertrud> (raw)
In-Reply-To: <20191127193804.GB1967@redhat.com>

Hi Frank,

I've done some more testing locally.  I have confirmed my suspicion that the 
output buffering causes wrong results to be produced once the output gets 
longer than one stdio buffer (typically 4 KiB).  That would explain the odd 
results of a search that makes the server return all file names.
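The failure mode can be sketched like this; the names and sleeps are 
artificial, standing in for the moments at which two of the parallel greps 
happen to flush a partially filled output buffer into the shared pipe:

```shell
# Two parallel writers sharing one pipe, each flushing a partial line,
# splice their output together mid-line -- just as block-buffered greps
# under `xargs -P` do when their stdio buffers fill at different points.
out=$(
  { printf 'pkg-one'; sleep 0.2; printf '.txt\n'; } &
  { sleep 0.1; printf 'pkg-two.txt\n'; } &
  wait
)
# The first line of $out is now a spliced name that matches no file.
```

On a typical system the first line captured above comes out as 
"pkg-onepkg-two.txt", a file name that exists nowhere on disk.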

I propose running (GNU) grep in line-buffered mode instead, which at least on 
my local system doesn't materially impact the runtime.  I also suggest using 
find instead of ls to create the list of files, which slightly reduces the 
time to produce the file list.  If you really want to max out the 
parallelism, you will also need to limit the number of arguments fed into 
each instance (otherwise xargs ends up starting only 13 processes).  Feeding 
in 1715 names per invocation starts 20 processes, which should help get the 
most out of the available cores.

  find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess |
    LC_ALL=C xargs -L1715 -P16 grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

(Note that LC_ALL=C has to be set on xargs, which passes it down to the grep 
children; placed after the xargs options it would be taken for the command to 
execute.)

In a later iteration the list of files to be searched could be cached (in a 
file $dir.lst, say).  This already helps in the local case, but is likely 
more effective on a system with much more I/O load than I can produce 
locally.

  <"$dir.lst" LC_ALL=C xargs -L1715 -P16 grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

The cache file would need to be refreshed each time the listing directories 
are updated (although you could probably just run find from a cron job every 
few minutes and nobody would notice the difference).  Having a cache file 
would also make it easier to determine the optimal input length: just count 
the lines and calculate how best to split them among the processes.
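With the line count at hand, the split is simple arithmetic; nfiles below is 
a made-up total standing in for the real count from the cache file:

```shell
# Hypothetical total; in practice: nfiles=$(wc -l < "$dir.lst")
nfiles=34300
nproc=16                                    # parallel grep processes
chunk=$(( (nfiles + nproc - 1) / nproc ))   # ceiling division
# Passing -L$chunk to xargs then yields exactly $nproc invocations,
# keeping all cores busy: 34300 files split into 16 chunks of <= 2144.
```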



Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs



Thread overview: 4+ messages
     [not found] <24997709.IERGFR1AUN@gertrud>
2019-11-25  0:37 ` Frank Ch. Eigler
2019-11-25 16:48   ` ASSI
2019-11-27 19:38     ` Frank Ch. Eigler
2019-12-08 11:05       ` ASSI [this message]
