public inbox for overseers@sourceware.org
 help / color / mirror / Atom feed
* sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08 sourceware being drowned by cvsweb.cgi Geoff Keating
@ 2000-07-29 11:51 ` Geoff Keating
  2000-12-30  6:08 ` Chris Faylor
  1 sibling, 0 replies; 12+ messages in thread
From: Geoff Keating @ 2000-07-29 11:51 UTC (permalink / raw)
  To: overseers

I kept getting the message

+ cvs -Q -d :pserver:anoncvs@anoncvs.cygnus.com:/cvs/gcc co egcs-full
Fatal error, aborting.
load average of 27 is too high

I tracked it down to the cvsweb stuff, which had dozens of copies
running.

Looking at the webserver logs:

[geoffk@sourceware ~]$ grep cvsweb /www/logs/sourceware-access_log | tail -50
crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/src/libgloss/sbrk.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=src&sortby=date HTTP/1.0" 200 3296 -
crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/~checkout~/src/libgloss/Makefile.in?rev=1.1&content-type=text/plain&cvsroot=src HTTP/1.0" 200 4812 -
crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/src/libgloss/README?annotate=1.1&cvsroot=src HTTP/1.0" 200 1721 -
crawler.googlebot.com 2000-07-29T11:40:25 "GET /cgi-bin/cvsweb.cgi/src/libgloss/sh/CVS/?cvsroot=src HTTP/1.0" 200 1293 -
...

(googlebot is the crawler for the search engine www.google.com.)
I think someone forgot to list cvsweb in robots.txt:

[geoffk@desire geoffk]$ wget http://sourceware.cygnus.com/robots.txt
[geoffk@desire geoffk]$ cat robots.txt 
# contact sourcemaster@sources.redhat.com for questions.
# see http://info.webcrawler.com/mak/projects/robots/norobots.html
# for information about the file format.

User-agent: *
Disallow: /cygwin/snapshots


Could someone add this?  Or, failing that, change cvsweb so that it
doesn't run if the load average is over, say, 10?
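
For the load-average option, I would guess a small wrapper in front of
the real script would be enough.  Here is only an untested sketch; the
path of the real script and the exact threshold are made up:

#!/bin/sh
# hypothetical cvsweb.cgi wrapper: bail out when the 1-minute load
# average (first field of /proc/loadavg) reaches 10 or more
load=`cut -d' ' -f1 /proc/loadavg`
if [ `echo "$load >= 10" | bc` -eq 1 ]; then
    echo "Content-type: text/plain"
    echo ""
    echo "Server load too high, please try again later."
    exit 0
fi
# hand off to the real script (location here is a guess)
exec /www/cgi-bin/cvsweb.real.cgi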

-- 
- Geoffrey Keating <geoffk@cygnus.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08 ` Chris Faylor
@ 2000-07-29 18:16   ` Chris Faylor
  2000-12-30  6:08   ` Gerald Pfeifer
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Faylor @ 2000-07-29 18:16 UTC (permalink / raw)
  To: Geoff Keating; +Cc: overseers

Wow.  That's a big hole.  I guess that explains the load average
on sources.

I've changed the sourceware top-level robots.txt to disallow
/cgi-bin/cvsweb.cgi.  Hopefully I got the syntax right.
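
For the record, I believe the relevant part of the file now reads
roughly as follows (please shout if the syntax is off):

User-agent: *
Disallow: /cygwin/snapshots
Disallow: /cgi-bin/cvsweb.cgi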

There are a few more of these on sources, though:

% find . -name robots.txt
./gcc/htdocs/robots.txt
./gcc/htdocs-preformatted/robots.txt
./sourceware/htdocs/pthreads-win32/robots.txt
./sourceware/htdocs/robots.txt

Do we have to do anything specific for the other cases?  I'd think that
only gcc would possibly be special.

Thanks for looking into this, Geoff.

cgf

On Sat, Jul 29, 2000 at 11:51:27AM -0700, Geoff Keating wrote:
>I kept getting the message
>
>+ cvs -Q -d :pserver:anoncvs@anoncvs.cygnus.com:/cvs/gcc co egcs-full
>Fatal error, aborting.
>load average of 27 is too high
>
>I tracked it down to the cvsweb stuff, which had dozens of copies
>running.
>
>Looking at the webserver logs:
>
>[geoffk@sourceware ~]$ grep cvsweb /www/logs/sourceware-access_log | tail -50
>crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/src/libgloss/sbrk.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=src&sortby=date HTTP/1.0" 200 3296 -
>crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/~checkout~/src/libgloss/Makefile.in?rev=1.1&content-type=text/plain&cvsroot=src HTTP/1.0" 200 4812 -
>crawler.googlebot.com 2000-07-29T11:40:21 "GET /cgi-bin/cvsweb.cgi/src/libgloss/README?annotate=1.1&cvsroot=src HTTP/1.0" 200 1721 -
>crawler.googlebot.com 2000-07-29T11:40:25 "GET /cgi-bin/cvsweb.cgi/src/libgloss/sh/CVS/?cvsroot=src HTTP/1.0" 200 1293 -
>...
>
>(googlebot is the crawler for the search engine www.google.com.)
>I think someone forgot to list cvsweb in robots.txt:
>
>[geoffk@desire geoffk]$ wget http://sourceware.cygnus.com/robots.txt
>[geoffk@desire geoffk]$ cat robots.txt 
># contact sourcemaster@sources.redhat.com for questions.
># see http://info.webcrawler.com/mak/projects/robots/norobots.html
># for information about the file format.
>
>User-agent: *
>Disallow: /cygwin/snapshots
>
>
>Could someone add this?  Or, failing that, change cvsweb so that it
>doesn't run if the load average is over, say, 10?
>
>-- 
>- Geoffrey Keating <geoffk@cygnus.com>

-- 
cgf@cygnus.com                        Cygnus Solutions, a Red Hat company
http://sourceware.cygnus.com/         http://www.redhat.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08   ` Gerald Pfeifer
@ 2000-07-30  2:47     ` Gerald Pfeifer
  2000-12-30  6:08     ` Chris Faylor
  1 sibling, 0 replies; 12+ messages in thread
From: Gerald Pfeifer @ 2000-07-30  2:47 UTC (permalink / raw)
  To: overseers

On Sat, 29 Jul 2000, Chris Faylor wrote:
> I've changed the sourceware top-level robots.txt to disallow
> /cgi-bin/cvsweb.cgi.  Hopefully I got the syntax right.

Looks okay. :)

However, is there any specific reason not just to write
  Disallow: /cgi-bin/

Well, I just realized that the Faq-O-Matic also resides in cgi-bin, so
it might make sense to add
  Allow: /cgi-bin/fom.cgi
there. (If it works like that, that is!)
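
That is, something along these lines.  Note that, as far as I know,
Allow is only an extension to the original robots.txt standard, so not
every crawler honors it, and robots that take the first matching rule
are usually said to want the Allow line listed before the Disallow:

User-agent: *
Allow: /cgi-bin/fom.cgi
Disallow: /cgi-bin/
Disallow: /cygwin/snapshots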

> There are a few more of these on sources, though:
> 
> % find . -name robots.txt
> ./gcc/htdocs/robots.txt
> ./gcc/htdocs-preformatted/robots.txt
> [...] 
> Do we have to do anything specific for the other cases?  I'd think that
> only gcc would possibly be special.

Yes, on the GCC side I added robots.txt forbidding cgi-bin over a year
ago.

GCC is different, as it is hosted on a separate domain, so we need a
different robots.txt. (Search engines will only look at webhost/robots.txt
instead of trying to access robots.txt in every subdirectory.)
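
So only a robots.txt served from the very top of a webhost has any
effect; a copy sitting under a subdirectory such as pthreads-win32/ is
simply never fetched by crawlers.  The GCC-side file is, roughly, just:

User-agent: *
Disallow: /cgi-bin/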

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/
Have a look at http://petition.eurolinux.org -- it's not about Linux, BTW!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08     ` Chris Faylor
@ 2000-07-30 10:09       ` Chris Faylor
  2000-12-30  6:08       ` Jason Molenda
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Faylor @ 2000-07-30 10:09 UTC (permalink / raw)
  To: Gerald Pfeifer; +Cc: overseers

On Sun, Jul 30, 2000 at 11:46:56AM +0200, Gerald Pfeifer wrote:
>On Sat, 29 Jul 2000, Chris Faylor wrote:
>> I've changed the sourceware top-level robots.txt to disallow
>> /cgi-bin/cvsweb.cgi.  Hopefully I got the syntax right.
>
>Looks okay. :)
>
>However, is there any specific reason not just to write
>  Disallow: /cgi-bin/

I meant to ask that question myself.  My first stab at the change was
just to disallow /cgi-bin/ entirely but then I went with the safer
entry.

Jason Molenda, any thoughts on this?

cgf

>Well, I just realized that the Faq-O-Matic also resides in cgi-bin, so
>it might make sense to add
>  Allow: /cgi-bin/fom.cgi
>there. (If it works like that, that is!)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08       ` Jason Molenda
@ 2000-07-30 22:58         ` Jason Molenda
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Molenda @ 2000-07-30 22:58 UTC (permalink / raw)
  To: overseers

On Sun, Jul 30, 2000 at 01:09:21PM -0400, Chris Faylor wrote:

> I meant to ask that question myself.  My first stab at the change was
> just to disallow /cgi-bin/ entirely but then I went with the safer
> entry.
> 
> Jason Molenda, any thoughts on this?

Disallowing cgi-bin seems like the right way to go - it looks like
an oversight on my part that the robots.txt wasn't like that from
the beginning.  cvsweb.cgi is obviously the one script you most want
to keep the robots away from - it fills their databases with useless
content, and it is very expensive in system resources when a search
engine hits the site hard.

> >Well, I just realized that the Faq-O-Matic also resides in cgi-bin, so
> >it might make sense to add
> >  Allow: /cgi-bin/fom.cgi
> >there. (If it works like that, that is!)

I agree about allowing FOM access -- having the FAQs indexed by
search engines seems desirable.

J

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: sourceware being drowned by cvsweb.cgi
  2000-12-30  6:08 Jim Kingdon
@ 2000-07-30  0:34 ` Jim Kingdon
  0 siblings, 0 replies; 12+ messages in thread
From: Jim Kingdon @ 2000-07-30  0:34 UTC (permalink / raw)
  To: overseers; +Cc: geoffk

> I've changed the sourceware top-level robots.txt to disallow
> /cgi-bin/cvsweb.cgi.  Hopefully I got the syntax right.

When I was running Cyclic.com, I went ahead and let the search engines
index cvsweb, on the theory that it would get me more hits (and I did
get a lot of search engine hits; not sure I ever looked at the logs
enough to know how many were from cvsweb).  This would cause a high
load for a while, and then it would stop because they had indexed as
much as they were going to.  The URLs which cvsweb spits out are
systematic in the sense that they don't form infinite loops and such
(as far as I know).

Not that it is particularly important to have cvsweb indexed - and I
agree that a load average of 27 is too high ;-).  Just food for
thought in case anyone wants to play with selectively re-enabling
cvsweb in robots.txt some day.

Thanks for noticing this, geoffk.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2000-12-30  6:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-07-29 11:51 sourceware being drowned by cvsweb.cgi Geoff Keating
2000-07-29 18:16 ` Chris Faylor
2000-07-30  2:47   ` Gerald Pfeifer
2000-07-30 10:09     ` Chris Faylor
2000-07-30 22:58       ` Jason Molenda
  -- strict thread matches above, loose matches on Subject: below --
2000-07-30  0:34 Jim Kingdon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).