* Re: [Cygwin] package-grep on Sourceware
From: Frank Ch. Eigler @ 2019-11-25 0:37 UTC
To: ASSI; +Cc: Jon Turney, overseers
Hi -
> you've made changes to the package-grep CGI script on sourceware
> that pretty much remove its originally intended functionality. The
> original was searching the index of the package _contents_ as it
> would get installed on the disk, while the JSON file you search now
> only has the metadata for the packages. That's not even
> superficially the same thing. [...]
OK. As you may be aware, this simplification was done because the
previous code's workload was too high (grepping through 600MB+ of text
per CGI call), and we at overseers@ weren't told otherwise.
> I've looked into how that could be done in a more server-friendly
> way, but it turns out that either there isn't anything ready-made
> for this or I can't find it. [...]
Thanks for looking into it. A few other possibilities to
ameliorate the impact of restoring the previous version:
- using xargs with grep to spread the work across our extra cores,
  reducing latency (sketch below)
- imposing a mod_qos concurrency limit on package-grep.cgi
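A minimal sketch of the first option (the $dir, $param_grep and
$tmpfile names follow the script's existing variables; the path
layout is an assumption):

  # hand the listing files to several concurrent greps; note that
  # without -n or -L, xargs packs as many names per grep as ARG_MAX
  # allows, so the real process count may fall short of -P
  ls "$dir"/*/* | xargs -P "$(nproc)" grep -l -- "$param_grep" |
    sort > "$tmpfile"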
- FChE
* Re: [Cygwin] package-grep on Sourceware
From: ASSI @ 2019-11-25 16:48 UTC
To: Frank Ch. Eigler; +Cc: Jon Turney, overseers
Hi Frank,
On Monday, November 25, 2019 1:37:01 AM CET Frank Ch. Eigler wrote:
> > you've made changes to the package-grep CGI script on sourceware
> > that pretty much remove its originally intended functionality. The
> > original was searching the index of the package _contents_ as it
> > would get installed on the disk, while the JSON file you search now
> > only has the metadata for the packages. That's not even
> > superficially the same thing. [...]
>
> OK. As you may be aware, this simplification was done because the
> previous code's workload was too high (grepping through 600MB+ of text
> per CGI call), and we at overseers@ weren't told otherwise.
I wasn't questioning your motives; that part was pretty clear from the
change you made. I would like to understand which characteristic of
the workload you were trying to avoid, since to me it looks like
avoiding I/O comes at a significant cost in either memory or CPU
consumption (or both). I understand that the server is a fairly
well-equipped machine, but not in enough detail to make that call.
> > I've looked into how that could be done in a more server-friendly
> > way, but it turns out that either there isn't anything ready-made
> > for this or I can't find it. [...]
>
> Thanks for looking into it. A few other possibilities to
> ameliorate the impact of restoring the previous version:
>
> - using xargs with grep to spread the work across our extra cores,
>   reducing latency
Then I'd suggest just using ripgrep, which automatically spreads the
work across the available cores (and can be limited to fewer than
that if so desired). Is that available on the machine? Would it be
possible to cache the list files in a tmpfs? That doesn't improve
performance on my machine, but I have an SSD and no other I/O load.
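For illustration, a sketch of the ripgrep variant (assuming rg is
installed; variable names as in the script, the thread count is
arbitrary):

  # let rg walk $dir itself and parallelize across files;
  # --no-ignore keeps it from honoring any ignore files it meets
  rg -l --no-ignore --threads 16 -- "$param_grep" "$dir" |
    sort > "$tmpfile"

And the tmpfs cache would just be a mount (path hypothetical):

  mount -t tmpfs -o size=1g tmpfs /var/cache/package-listings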
> - imposing a mod_qos concurrency limit on package-grep.cgi
I can't really tell what that would mean in practice, but I guess
that queries would get delayed if too many of them stack up? That
would help with the load, of course.
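If it works the way I imagine, the Apache side might look something
like this (a guess on my part; the URL pattern and the limit of 2 are
assumptions):

  # mod_qos: allow at most 2 concurrent requests to the CGI;
  # requests beyond that are rejected (by default with a 500)
  QS_LocRequestLimitMatch "/package-grep\.cgi$" 2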
Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
* Re: [Cygwin] package-grep on Sourceware
From: Frank Ch. Eigler @ 2019-11-27 19:38 UTC
To: ASSI; +Cc: Frank Ch. Eigler, Jon Turney, overseers
Hi -
> [...] I would like to understand which characteristic of the
> workload you were trying to avoid, since to me it looks like
> avoiding I/O comes at a significant cost in either memory or CPU
> consumption (or both). [...]
CPU elapsed time, primarily. I restored the previous
package-grep.cgi, with the xargs parallelization speedup. We'll just
have to eat the I/O cost (or rather, let hundreds of megabytes squat
in the page cache in RAM). A malevolent enough grep regex could still
take a long time (by producing many hits), but it's less bad than
before.
- FChE
* Re: [Cygwin] package-grep on Sourceware
From: ASSI @ 2019-12-08 11:05 UTC
To: Frank Ch. Eigler; +Cc: Frank Ch. Eigler, Jon Turney, overseers
Hi Frank,
I've done some more testing locally. I have confirmed my suspicion
that the output buffering causes wrong results when the output gets
longer than a page: the parallel greps write block-buffered into the
same pipe, and once a flush exceeds the atomic pipe write size, lines
from different processes can interleave mid-name. This should explain
the odd results on the server for a search that returns all file
names.
I propose that (GNU) grep be run in line-buffering mode instead,
which at least on my local system doesn't materially impact the
runtime. I also suggest using find instead of ls to create the list
of files, which slightly reduces the time to produce the file list.
If you really want to max out the parallelism, you will also need to
limit the number of arguments fed into each instance (otherwise xargs
packs so many names per invocation that it starts only 13 processes).
Feeding in 1715 names per invocation splits the roughly 34000 list
files into 20 processes, which should help get the most out of the
available cores.
# env sets LC_ALL for each grep; xargs cannot exec a bare VAR=value
find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess |
  xargs -L1715 -P16 env LC_ALL=C \
    grep -l --line-buffered -- "$param_grep" |
  sort > "$tmpfile"
In a later iteration, the list of files to be searched could be
cached (in a file $dir.lst, say). This already helps in the local
case, but is likely more effective on a system with a lot more I/O
load than I can produce locally.
< "$dir.lst" xargs -L1715 -P16 env LC_ALL=C \
    grep -l --line-buffered -- "$param_grep" |
  sort > "$tmpfile"
The cache file would need to be refreshed each time the listing
directories get updated (although you could probably just run find in
a cron job every few minutes and nobody would notice a difference).
Having a cache file would also make it easier to determine the
optimal input length, since you could just count its lines to
calculate how best to split them among the processes; see the sketch
below.
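Putting those two pieces together, something like this might work
(just a sketch; the cron interval is arbitrary and the 16-worker
count is carried over from above):

  # crontab entry refreshing the cached list every 5 minutes
  # (/srv/cygwin-listings stands in for the real $dir)
  */5 * * * * find /srv/cygwin-listings -mindepth 2 -maxdepth 2 -type f -not -name .htaccess > /srv/cygwin-listings.lst

  # in the CGI: size each batch so all 16 workers get an even share
  nfiles=$(wc -l < "$dir.lst")
  batch=$(( (nfiles + 15) / 16 ))
  < "$dir.lst" xargs -L"$batch" -P16 env LC_ALL=C \
      grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"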
Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs