public inbox for overseers@sourceware.org
* Re: [Cygwin] package-grep on Sourceware
       [not found] <24997709.IERGFR1AUN@gertrud>
@ 2019-11-25  0:37 ` Frank Ch. Eigler
  2019-11-25 16:48   ` ASSI
  0 siblings, 1 reply; 4+ messages in thread
From: Frank Ch. Eigler @ 2019-11-25  0:37 UTC (permalink / raw)
  To: ASSI; +Cc: Jon Turney, overseers

Hi -

> you've made changes to the package-grep CGI script on sourceware
> that pretty much remove its originally intended functionality.  The
> original was searching the index of the package _contents_ as it
> would get installed on the disk, while the JSON file you search now
> only has the metadata for the packages.  That's not even
> superficially the same thing.  [...]

OK.  As you may be aware, this simplification was done because the
previous code's workload was too high (grepping through 600MB+ of text
per cgi call), and we at overseers@ weren't told otherwise.


> I've looked into what could be done to do that in a more
> server-friendly way, but it turns out that there either isn't
> anything that's ready-made for this or I can't find it. [...]

Thanks for looking into it.  A few other possibilities to
ameliorate the impact of restoring the previous version:

- using xargs grep to spread the work across our extra cores, reducing latency
- imposing a mod_qos concurrency limit on package-grep.cgi (sketched below)
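
For the mod_qos idea, something along these lines in the Apache
configuration might do it; the path and the limit of 2 here are only
placeholders, not a real config:

  # cap how many package-grep.cgi requests are served concurrently
  <IfModule mod_qos.c>
    QS_LocRequestLimitMatch "/package-grep\.cgi$" 2
  </IfModule>

That way at most a couple of grep pipelines run at once, however many
requests arrive.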


- FChE


* Re: [Cygwin] package-grep on Sourceware
  2019-11-25  0:37 ` [Cygwin] package-grep on Sourceware Frank Ch. Eigler
@ 2019-11-25 16:48   ` ASSI
  2019-11-27 19:38     ` Frank Ch. Eigler
  0 siblings, 1 reply; 4+ messages in thread
From: ASSI @ 2019-11-25 16:48 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Jon Turney, overseers

Hi Frank,

On Monday, November 25, 2019 1:37:01 AM CET Frank Ch. Eigler wrote:
> > you've made changes to the package-grep CGI script on sourceware
> > that pretty much remove its originally intended functionality.  The
> > original was searching the index of the package _contents_ as it
> > would get installed on the disk, while the JSON file you search now
> > only has the metadata for the packages.  That's not even
> > superficially the same thing.  [...]
> 
> OK.  As you may be aware, this simplification was done because the
> previous code's workload was too high (grepping through 600MB+ of text
> per cgi call), and we at overseers@ weren't told otherwise.

I wasn't questioning your motives; that much was clear from the change 
you made.  I would like to understand which characteristic of the workload 
you were trying to avoid, because from where I sit, avoiding I/O comes at a 
significant cost in either memory or CPU consumption (or both).  I understand 
that the server is a fairly well-equipped machine, but I don't know it in 
enough detail to make that call.

> > I've looked into what could be done to do that in a more
> > server-friendly way, but it turns out that there either isn't
> > anything that's ready-made for this or I can't find it. [...]
> 
> Thanks for looking into it.  A few other possibilities to
> ameliorate the impact of restoring the previous version:
> 
> - using xargs grep to spread the work across our extra cores, reducing latency

Then I'd suggest just using ripgrep, which automatically spreads the work 
across the available cores (and can be limited to fewer than that if so 
desired).  Is that available on the machine?  Would it be possible to cache 
the list files in a tmpfs?  That doesn't improve performance on my machine, 
but I have an SSD and no other I/O load.
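
Just to make the ripgrep idea concrete, a rough equivalent of the search
might look like this (where $param_grep is the search pattern, $dir the
listings directory and $tmpfile the output file; note that rg walks the
whole tree rather than just the second level, and its regex dialect
differs slightly from grep's):

  # rg parallelizes on its own; -j caps the worker threads,
  # -l lists matching files like grep -l does
  rg -l -j 8 -- "$param_grep" "$dir" | sort > "$tmpfile"

The separate ls/find step then goes away entirely, since rg does its own
directory walk.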

> - imposing a mod_qos concurrency limit on package-grep.cgi

I can't really tell what that would mean in practice, but I guess queries 
would get delayed if too many of them stacked up at once?  That would of 
course help with the load.


Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs




* Re: [Cygwin] package-grep on Sourceware
  2019-11-25 16:48   ` ASSI
@ 2019-11-27 19:38     ` Frank Ch. Eigler
  2019-12-08 11:05       ` ASSI
  0 siblings, 1 reply; 4+ messages in thread
From: Frank Ch. Eigler @ 2019-11-27 19:38 UTC (permalink / raw)
  To: ASSI; +Cc: Frank Ch. Eigler, Jon Turney, overseers

Hi -

> [...] I would like to understand which characteristic of the workload
> you were trying to avoid, because from where I sit, avoiding I/O comes
> at a significant cost in either memory or CPU consumption (or both).
> [...]

CPU elapsed time, primarily.  I restored the previous package-grep.cgi,
with the xargs parallelization speedup.  We'll just have to eat the
I/O cost (or rather, let hundreds of megabytes squat in the page cache
in RAM).  A malevolent enough grep regex could still take a long time
(produce many hits), but less so than before.
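
For the curious, the restored search boils down to a parallel grep over
the per-package listing files, roughly along these lines (a simplified
sketch, not the literal CGI code):

  # list the listing files, fan grep out across the cores,
  # and collect the names of the files that match
  ls $dir/*/* | xargs -P16 grep -l -- "$param_grep" | sort > "$tmpfile"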

- FChE


* Re: [Cygwin] package-grep on Sourceware
  2019-11-27 19:38     ` Frank Ch. Eigler
@ 2019-12-08 11:05       ` ASSI
  0 siblings, 0 replies; 4+ messages in thread
From: ASSI @ 2019-12-08 11:05 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Frank Ch. Eigler, Jon Turney, overseers

Hi Frank,

I've done some more testing locally.  I have confirmed my suspicion that 
output buffering causes wrong results once the output gets longer than a 
page.  This should explain the odd results seen on the server for a search 
that returns all file names.

I propose that (GNU) grep be run in line-buffered mode instead, which at 
least on my local system doesn't materially impact the runtime.  I also 
suggest using find instead of ls to create the list of files, which slightly 
reduces the time needed to produce the file list.  If you really want to max 
out the parallelism you will also need to limit the number of arguments fed 
into each instance (otherwise xargs ends up starting only 13 processes).  
Feeding 1715 names into each invocation starts 20 processes, which should 
help get the most out of the available cores.

  find $dir -mindepth 2 -maxdepth 2 -type f -not -name .htaccess |
    xargs -L1715 -P16 env LC_ALL=C grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

In a later iteration the list of files to be searched could be cached (in a 
file $dir.lst, say).  This already helps in the local case, but is likely more 
effective on a system that has a lot more IO load than I can produce locally.

    <$dir.lst xargs -L1715 -P16 env LC_ALL=C \
      grep -l --line-buffered -- "$param_grep" |
    sort > "$tmpfile"

The cache file would need to be refreshed each time the listing directories 
get updated (although you could probably just run find from a cron job every 
few minutes and nobody would notice a difference).  Having a cache file would 
also make it easier to determine the optimal input length, since you could 
just count its lines to work out how best to split them among the grep 
processes.
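
Run from a cron job every few minutes with $dir set appropriately,
something like the following (a sketch; schedule and naming are up to
you) would keep $dir.lst current without the CGI ever seeing a
half-written list:

  # rebuild the cached file list, then rename it into place atomically
  find "$dir" -mindepth 2 -maxdepth 2 -type f -not -name .htaccess \
    > "$dir.lst.tmp" && mv "$dir.lst.tmp" "$dir.lst"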



Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs




Thread overview: 4+ messages
     [not found] <24997709.IERGFR1AUN@gertrud>
2019-11-25  0:37 ` [Cygwin] package-grep on Sourceware Frank Ch. Eigler
2019-11-25 16:48   ` ASSI
2019-11-27 19:38     ` Frank Ch. Eigler
2019-12-08 11:05       ` ASSI
