public inbox for gcc-patches@gcc.gnu.org
* Don't let search bots look at buglist.cgi
@ 2011-05-13 19:19 Ian Lance Taylor
  2011-05-16 12:59 ` Richard Guenther
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Lance Taylor @ 2011-05-13 19:19 UTC (permalink / raw)
  To: overseers, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 570 bytes --]

I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
at some of the long running instances, and they were coming from
searchbots.  I can't think of a good reason for this, so I have
committed this patch to the gcc.gnu.org robots.txt file to not let
searchbots search through lists of bugs.  I plan to make a similar
change on the sourceware.org and cygwin.com sides.  Please let me know
if this seems like a mistake.

Does anybody have any experience with
http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
better approach.

Ian


[-- Attachment #2: patch --]
[-- Type: text/x-diff, Size: 393 bytes --]

Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.9
diff -u -r1.9 robots.txt
--- robots.txt	22 Sep 2009 19:19:30 -0000	1.9
+++ robots.txt	13 May 2011 17:08:33 -0000
@@ -5,4 +5,5 @@
 User-Agent: *
 Disallow: /viewcvs/
 Disallow: /cgi-bin/
+Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
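
How a well-behaved crawler reads the patched file can be sketched with
Python's stdlib robots.txt parser (the sample bug URLs below are made up
for illustration):

```python
from urllib.robotparser import RobotFileParser

# The gcc.gnu.org rules as patched above, fed to the stdlib parser.
rp = RobotFileParser()
rp.parse("""\
User-Agent: *
Disallow: /viewcvs/
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
""".splitlines())

# Bug *lists* are now off limits to compliant bots ...
print(rp.can_fetch("*", "https://gcc.gnu.org/bugzilla/buglist.cgi?product=gcc"))  # False
# ... but individual bug pages stay crawlable.
print(rp.can_fetch("*", "https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12345"))    # True
print(rp.crawl_delay("*"))  # 60
```

(`Crawl-Delay` is a nonstandard extension honoured by some bots, not part
of the original robots.txt convention.)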

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-13 19:19 Don't let search bots look at buglist.cgi Ian Lance Taylor
@ 2011-05-16 12:59 ` Richard Guenther
  2011-05-16 13:17   ` Andrew Haley
  2011-05-16 23:13   ` Ian Lance Taylor
  0 siblings, 2 replies; 19+ messages in thread
From: Richard Guenther @ 2011-05-16 12:59 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: overseers, gcc-patches

On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> at some of the long running instances, and they were coming from
> searchbots.  I can't think of a good reason for this, so I have
> committed this patch to the gcc.gnu.org robots.txt file to not let
> searchbots search through lists of bugs.  I plan to make a similar
> change on the sourceware.org and cygwin.com sides.  Please let me know
> if this seems like a mistake.
>
> Does anybody have any experience with
> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> better approach.

Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
can crawl the gcc-bugs mailing list archives.

Richard.

> Ian
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 12:59 ` Richard Guenther
@ 2011-05-16 13:17   ` Andrew Haley
  2011-05-16 13:18     ` Michael Matz
  2011-05-16 23:13   ` Ian Lance Taylor
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 13:17 UTC (permalink / raw)
  To: gcc-patches

On 16/05/11 10:45, Richard Guenther wrote:
> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>> at some of the long running instances, and they were coming from
>> searchbots.  I can't think of a good reason for this, so I have
>> committed this patch to the gcc.gnu.org robots.txt file to not let
>> searchbots search through lists of bugs.  I plan to make a similar
>> change on the sourceware.org and cygwin.com sides.  Please let me know
>> if this seems like a mistake.
>>
>> Does anybody have any experience with
>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>> better approach.
> 
> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> can crawl the gcc-bugs mailing list archives.

I don't understand this.  Surely it is super-useful for Google etc. to
be able to search gcc's Bugzilla.

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:17   ` Andrew Haley
@ 2011-05-16 13:18     ` Michael Matz
  2011-05-16 13:28       ` Andrew Haley
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2011-05-16 13:18 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-patches

Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> On 16/05/11 10:45, Richard Guenther wrote:
> > On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> >> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> >> at some of the long running instances, and they were coming from
> >> searchbots.  I can't think of a good reason for this, so I have
> >> committed this patch to the gcc.gnu.org robots.txt file to not let
> >> searchbots search through lists of bugs.  I plan to make a similar
> >> change on the sourceware.org and cygwin.com sides.  Please let me know
> >> if this seems like a mistake.
> >>
> >> Does anybody have any experience with
> >> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> >> better approach.
> > 
> > Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> > can crawl the gcc-bugs mailing list archives.
> 
> I don't understand this.  Surely it is super-useful for Google etc. to
> be able to search gcc's Bugzilla.

gcc-bugs provides exactly the same information, and doesn't have to 
regenerate the full web page for each access to a bug report.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:18     ` Michael Matz
@ 2011-05-16 13:28       ` Andrew Haley
  2011-05-16 13:32         ` Andreas Schwab
  2011-05-16 13:34         ` Richard Guenther
  0 siblings, 2 replies; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 13:28 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc-patches

On 05/16/2011 01:09 PM, Michael Matz wrote:
> Hi,
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>> On 16/05/11 10:45, Richard Guenther wrote:
>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>> at some of the long running instances, and they were coming from
>>>> searchbots.  I can't think of a good reason for this, so I have
>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>> if this seems like a mistake.
>>>>
>>>> Does anybody have any experience with
>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>> better approach.
>>>
>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>> can crawl the gcc-bugs mailing list archives.
>>
>> I don't understand this.  Surely it is super-useful for Google etc. to
>> be able to search gcc's Bugzilla.
> 
> gcc-bugs provides exactly the same information, and doesn't have to 
> regenerate the full web page for each access to a bug report.

It's not quite the same information, surely.  Wouldn't searchers be directed
to an email rather than the bug itself?

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:28       ` Andrew Haley
@ 2011-05-16 13:32         ` Andreas Schwab
  2011-05-16 13:34         ` Richard Guenther
  1 sibling, 0 replies; 19+ messages in thread
From: Andreas Schwab @ 2011-05-16 13:32 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Michael Matz, gcc-patches

Andrew Haley <aph@redhat.com> writes:

> It's not quite the same information, surely.  Wouldn't searchers be directed
> to an email rather than the bug itself?

The mail contains the bugzilla link, so they can easily get there if
needed.

Andreas.

-- 
Andreas Schwab, schwab@redhat.com
GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84  5EC7 45C6 250E 6F00 984E
"And now for something completely different."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:28       ` Andrew Haley
  2011-05-16 13:32         ` Andreas Schwab
@ 2011-05-16 13:34         ` Richard Guenther
  2011-05-16 13:39           ` Andrew Haley
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Guenther @ 2011-05-16 13:34 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Michael Matz, gcc-patches

On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
> On 05/16/2011 01:09 PM, Michael Matz wrote:
>> Hi,
>>
>> On Mon, 16 May 2011, Andrew Haley wrote:
>>
>>> On 16/05/11 10:45, Richard Guenther wrote:
>>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>>> at some of the long running instances, and they were coming from
>>>>> searchbots.  I can't think of a good reason for this, so I have
>>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>>> if this seems like a mistake.
>>>>>
>>>>> Does anybody have any experience with
>>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>>> better approach.
>>>>
>>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>>> can crawl the gcc-bugs mailing list archives.
>>>
>>> I don't understand this.  Surely it is super-useful for Google etc. to
>>> be able to search gcc's Bugzilla.
>>
>> gcc-bugs provides exactly the same information, and doesn't have to
>> regenerate the full web page for each access to a bug report.
>
> It's not quite the same information, surely.  Wouldn't searchers be directed
> to an email rather than the bug itself?

Yes, though there is a link in all mails.

Richard.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:34         ` Richard Guenther
@ 2011-05-16 13:39           ` Andrew Haley
  2011-05-16 13:42             ` Michael Matz
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 13:39 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Matz, gcc-patches

On 05/16/2011 02:10 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
>> On 05/16/2011 01:09 PM, Michael Matz wrote:
>>> Hi,
>>>
>>> On Mon, 16 May 2011, Andrew Haley wrote:
>>>
>>>> On 16/05/11 10:45, Richard Guenther wrote:
>>>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>>>> at some of the long running instances, and they were coming from
>>>>>> searchbots.  I can't think of a good reason for this, so I have
>>>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>>>> if this seems like a mistake.
>>>>>>
>>>>>> Does anybody have any experience with
>>>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>>>> better approach.
>>>>>
>>>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>>>> can crawl the gcc-bugs mailing list archives.
>>>>
>>>> I don't understand this.  Surely it is super-useful for Google etc. to
>>>> be able to search gcc's Bugzilla.
>>>
>>> gcc-bugs provides exactly the same information, and doesn't have to
>>> regenerate the full web page for each access to a bug report.
>>
>> It's not quite the same information, surely.  Wouldn't searchers be directed
>> to an email rather than the bug itself?
> 
> Yes, though there is a link in all mails.

Right, so we are contemplating a reduction in search quality in
exchange for a reduction in server load.  That is not an improvement
from the point of view of our users, and is therefore not the sort of
thing we should do unless the server load is so great that it impedes
our mission.

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:39           ` Andrew Haley
@ 2011-05-16 13:42             ` Michael Matz
  2011-05-16 13:42               ` Andrew Haley
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2011-05-16 13:42 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Richard Guenther, gcc-patches

Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> >> It's not quite the same information, surely.  Wouldn't searchers be 
> >> directed to an email rather than the bug itself?
> > 
> > Yes, though there is a link in all mails.
> 
> Right, so we are contemplating a reduction in search quality in exchange 
> for a reduction in server load.  That is not an improvement from the 
> point of view of our users, and is therefore not the sort of thing we 
> should do unless the server load is so great that it impedes our 
> mission.

It routinely is.  bugzilla performance is terrible most of the time for me 
(up to the point of five timeouts in sequence), svn speed is mediocre at 
best, and people with access to gcc.gnu.org often observe loads > 25, 
mostly due to I/O.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:42             ` Michael Matz
@ 2011-05-16 13:42               ` Andrew Haley
  2011-05-16 13:45                 ` Michael Matz
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 13:42 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Guenther, gcc-patches

On 05/16/2011 02:22 PM, Michael Matz wrote:
> Hi,
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>>>> It's not quite the same information, surely.  Wouldn't searchers be 
>>>> directed to an email rather than the bug itself?
>>>
>>> Yes, though there is a link in all mails.
>>
>> Right, so we are contemplating a reduction in search quality in exchange 
>> for a reduction in server load.  That is not an improvement from the 
>> point of view of our users, and is therefore not the sort of thing we 
>> should do unless the server load is so great that it impedes our 
>> mission.
> 
> It routinely is.  bugzilla performance is terrible most of the time for me 
> (up to the point of five timeouts in sequence), svn speed is mediocre at 
> best, and people with access to gcc.gnu.org often observe loads > 25, 
> mostly due to I/O .

And how have you concluded that is due to web crawlers?

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:42               ` Andrew Haley
@ 2011-05-16 13:45                 ` Michael Matz
  2011-05-16 14:18                   ` Andrew Haley
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Matz @ 2011-05-16 13:45 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Richard Guenther, gcc-patches

Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> > It routinely is.  bugzilla performance is terrible most of the time 
> > for me (up to the point of five timeouts in sequence), svn speed is 
> > mediocre at best, and people with access to gcc.gnu.org often observe 
> > loads > 25, mostly due to I/O.
> 
> And how have you concluded that is due to web crawlers?

httpd being in the top-10 always, fiddling with bugzilla URLs?
(Note, I don't have access to gcc.gnu.org; I'm relaying info from multiple 
instances of discussion on #gcc and richi poking at it.  That said, it 
still might not be web crawlers, but I'll happily accept 
_any_ load improvement on gcc.gnu.org, however unfounded it might seem.)


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 13:45                 ` Michael Matz
@ 2011-05-16 14:18                   ` Andrew Haley
  2011-05-16 14:37                     ` Richard Guenther
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 14:18 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Guenther, gcc-patches

On 05/16/2011 02:32 PM, Michael Matz wrote:
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>>> It routinely is.  bugzilla performance is terrible most of the time 
>>> for me (up to the point of five timeouts in sequence), svn speed is 
>>> mediocre at best, and people with access to gcc.gnu.org often observe 
>>> loads > 25, mostly due to I/O.
>>
>> And how have you concluded that is due to web crawlers?
> 
> httpd being in the top-10 always, fiddling with bugzilla URLs?
> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple 
> instances of discussion on #gcc and richi poking on it; that said, it 
> still might not be web crawlers, that's right, but I'll happily accept 
> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)

Well, we have to be sensible.  If blocking crawlers only results in a
small load reduction that isn't, IMHO, a good deal for our users.

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 14:18                   ` Andrew Haley
@ 2011-05-16 14:37                     ` Richard Guenther
  2011-05-16 15:28                       ` Andrew Haley
  2011-05-17  7:17                       ` Ian Lance Taylor
  0 siblings, 2 replies; 19+ messages in thread
From: Richard Guenther @ 2011-05-16 14:37 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Michael Matz, gcc-patches

On Mon, May 16, 2011 at 3:34 PM, Andrew Haley <aph@redhat.com> wrote:
> On 05/16/2011 02:32 PM, Michael Matz wrote:
>>
>> On Mon, 16 May 2011, Andrew Haley wrote:
>>
>>>> It routinely is.  bugzilla performance is terrible most of the time
>>>> for me (up to the point of five timeouts in sequence), svn speed is
>>>> mediocre at best, and people with access to gcc.gnu.org often observe
>>>> loads > 25, mostly due to I/O.
>>>
>>> And how have you concluded that is due to web crawlers?
>>
>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>> instances of discussion on #gcc and richi poking on it; that said, it
>> still might not be web crawlers, that's right, but I'll happily accept
>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
>
> Well, we have to be sensible.  If blocking crawlers only results in a
> small load reduction that isn't, IMHO, a good deal for our users.

I for example see also

66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
/viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" (35%) 2060117us

and viewvc is certainly even worse (from an I/O perspective).  I thought
we blocked all bot traffic from the viewvc stuff ...

Richard.

> Andrew.
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 14:37                     ` Richard Guenther
@ 2011-05-16 15:28                       ` Andrew Haley
  2011-05-17  7:17                       ` Ian Lance Taylor
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Haley @ 2011-05-16 15:28 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Michael Matz, gcc-patches

On 05/16/2011 02:42 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:34 PM, Andrew Haley <aph@redhat.com> wrote:
>> On 05/16/2011 02:32 PM, Michael Matz wrote:
>>>
>>> On Mon, 16 May 2011, Andrew Haley wrote:
>>>
>>>>> It routinely is.  bugzilla performance is terrible most of the time
>>>>> for me (up to the point of five timeouts in sequence), svn speed is
>>>>> mediocre at best, and people with access to gcc.gnu.org often observe
>>>>> loads > 25, mostly due to I/O.
>>>>
>>>> And how have you concluded that is due to web crawlers?
>>>
>>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>>> instances of discussion on #gcc and richi poking on it; that said, it
>>> still might not be web crawlers, that's right, but I'll happily accept
>>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
>>
>> Well, we have to be sensible.  If blocking crawlers only results in a
>> small load reduction that isn't, IMHO, a good deal for our users.
> 
> I for example see also
> 
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
> 
> and viewvc is certainly even worse (from an I/O perspective).  I thought
> we blocked all bot traffic from the viewvc stuff ...

It makes sense to block viewcvs, but I don't think it makes as
much sense to block the bugs themselves.  That's the part that
is useful to our users.

Andrew.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 12:59 ` Richard Guenther
  2011-05-16 13:17   ` Andrew Haley
@ 2011-05-16 23:13   ` Ian Lance Taylor
  2011-05-17  2:53     ` Joseph S. Myers
  1 sibling, 1 reply; 19+ messages in thread
From: Ian Lance Taylor @ 2011-05-16 23:13 UTC (permalink / raw)
  To: Richard Guenther; +Cc: overseers, gcc-patches

Richard Guenther <richard.guenther@gmail.com> writes:

> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>> at some of the long running instances, and they were coming from
>> searchbots.  I can't think of a good reason for this, so I have
>> committed this patch to the gcc.gnu.org robots.txt file to not let
>> searchbots search through lists of bugs.  I plan to make a similar
>> change on the sourceware.org and cygwin.com sides.  Please let me know
>> if this seems like a mistake.
>>
>> Does anybody have any experience with
>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>> better approach.
>
> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> can crawl the gcc-bugs mailing list archives.

I don't see anything wrong with crawling bugzilla, though, and the
resulting links should be better.

Ian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 23:13   ` Ian Lance Taylor
@ 2011-05-17  2:53     ` Joseph S. Myers
  0 siblings, 0 replies; 19+ messages in thread
From: Joseph S. Myers @ 2011-05-17  2:53 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Richard Guenther, overseers, gcc-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1267 bytes --]

On Mon, 16 May 2011, Ian Lance Taylor wrote:

> Richard Guenther <richard.guenther@gmail.com> writes:
> 
> > On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> >> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> >> at some of the long running instances, and they were coming from
> >> searchbots.  I can't think of a good reason for this, so I have
> >> committed this patch to the gcc.gnu.org robots.txt file to not let
> >> searchbots search through lists of bugs.  I plan to make a similar
> >> change on the sourceware.org and cygwin.com sides.  Please let me know
> >> if this seems like a mistake.
> >>
> >> Does anybody have any experience with
> >> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> >> better approach.
> >
> > Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> > can crawl the gcc-bugs mailing list archives.
> 
> I don't see anything wrong with crawling bugzilla, though, and the
> resulting links should be better.

Indeed.  I think the individual bugs, and the GCC-specific help texts 
(such as describekeywords.cgi and describecomponents.cgi), should be 
indexed.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-16 14:37                     ` Richard Guenther
  2011-05-16 15:28                       ` Andrew Haley
@ 2011-05-17  7:17                       ` Ian Lance Taylor
  2011-05-17 11:12                         ` Axel Freyn
  2011-05-17 13:39                         ` Michael Matz
  1 sibling, 2 replies; 19+ messages in thread
From: Ian Lance Taylor @ 2011-05-17  7:17 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Andrew Haley, Michael Matz, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1131 bytes --]

On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>>>
>>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>>> instances of discussion on #gcc and richi poking on it; that said, it
>>> still might not be web crawlers, that's right, but I'll happily accept
>>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)

I think that simply blocking buglist.cgi has dropped bugzilla off the
immediate radar.  It also seems to have lowered the load, although I'm
not sure if we are still keeping historical data.


> I for example see also
>
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
>
> and viewvc is certainly even worse (from an I/O perspective).  I thought
> we blocked all bot traffic from the viewvc stuff ...

This is only happening at the top level.  I committed this patch to fix it.

Ian

[-- Attachment #2: foo.patch --]
[-- Type: text/x-patch, Size: 517 bytes --]

Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.10
diff -u -r1.10 robots.txt
--- robots.txt	13 May 2011 17:09:11 -0000	1.10
+++ robots.txt	17 May 2011 05:19:11 -0000
@@ -2,8 +2,8 @@
 # for information about the file format.
 # Contact gcc@gcc.gnu.org for questions.
 
-User-Agent: *
-Disallow: /viewcvs/
+User-agent: *
+Disallow: /viewcvs
 Disallow: /cgi-bin/
 Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
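
The substantive change here is the dropped trailing slash: robots.txt
`Disallow` rules are plain path-prefix matches, so `Disallow: /viewcvs/`
never matched top-level URLs like `/viewcvs?view=revision&revision=169814`,
while `Disallow: /viewcvs` does.  A quick sketch with Python's stdlib
parser (the helper function is mine, for illustration only):

```python
from urllib.robotparser import RobotFileParser

def blocked(disallow_path: str, url: str) -> bool:
    """True if `url` is disallowed by a one-rule robots.txt for all agents."""
    rp = RobotFileParser()
    rp.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not rp.can_fetch("*", url)

# The Googlebot request from the quoted log line above.
hit = "https://gcc.gnu.org/viewcvs?view=revision&revision=169814"

print(blocked("/viewcvs/", hit))  # False: old rule only matched /viewcvs/...
print(blocked("/viewcvs", hit))   # True: new rule matches the bare prefix too
```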

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-17  7:17                       ` Ian Lance Taylor
@ 2011-05-17 11:12                         ` Axel Freyn
  2011-05-17 13:39                         ` Michael Matz
  1 sibling, 0 replies; 19+ messages in thread
From: Axel Freyn @ 2011-05-17 11:12 UTC (permalink / raw)
  To: gcc-patches

On Mon, May 16, 2011 at 10:27:44PM -0700, Ian Lance Taylor wrote:
> On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> >>>
> >>> httpd being in the top-10 always, fiddling with bugzilla URLs?
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
> >>> instances of discussion on #gcc and richi poking on it; that said, it
> >>> still might not be web crawlers, that's right, but I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the
> immediate radar.  It also seems to have lowered the load, although I'm
> not sure if we are still keeping historical data.
> 
> 
> > I for example see also
> >
> > 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> > /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> > "Mozilla/5.0 (compatible; Googlebot/2.1;
> > +http://www.google.com/bot.html)" (35%) 2060117us
> >
> > and viewvc is certainly even worse (from an I/O perspective).  I thought
> > we blocked all bot traffic from the viewvc stuff ...
> 
> This is only happening at the top level.  I committed this patch to fix it.
You probably know this much better than I do, but would it be possible
to allow only some of Google's crawlers (if all of them try to crawl
bugzilla)?
As I read
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1061943
it would be possible to block the crawlers Googlebot-Mobile,
Mediapartners-Google and AdsBot-Google (which seem to be independent
crawlers?) while allowing the main Googlebot.  (Well, I don't know how
often each crawler shows up on bugzilla...)
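
For reference, per-crawler blocking would look roughly like this:
robots.txt groups are keyed by User-agent, and a bot that finds a group
naming it follows that group instead of the `*` group.  Which Google
agents actually hit bugzilla is exactly the open question, so the agents
singled out below are only hypothetical:

```text
# Hypothetical: keep two secondary Google crawlers out of bugzilla
# entirely, while the main Googlebot follows the general rules.
User-agent: Googlebot-Mobile
User-agent: Mediapartners-Google
Disallow: /bugzilla/

User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
```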

Axel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Don't let search bots look at buglist.cgi
  2011-05-17  7:17                       ` Ian Lance Taylor
  2011-05-17 11:12                         ` Axel Freyn
@ 2011-05-17 13:39                         ` Michael Matz
  1 sibling, 0 replies; 19+ messages in thread
From: Michael Matz @ 2011-05-17 13:39 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Richard Guenther, Andrew Haley, gcc-patches

Hi,

On Mon, 16 May 2011, Ian Lance Taylor wrote:

> >>> httpd being in the top-10 always, fiddling with bugzilla URLs? 
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from 
> >>> multiple instances of discussion on #gcc and richi poking on it; 
> >>> that said, it still might not be web crawlers, that's right, but 
> >>> I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the 
> immediate radar. It also seems to have lowered the load, although I'm 
> not sure if we are still keeping historical data.

Btw, FWIW, I had a quick look at one of the httpd log files, and in seven 
hours last Saturday (from 5:30 to 12:30), there were 435203 GET requests 
overall, and 391319 of them came from our own MnoGoSearch engine; that's 
90%.  Granted, many of those are in fact 304 (not modified) responses, but 
still, perhaps the eagerness of our own crawler can be turned down a bit.
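
A check of that kind can be sketched as a small scan over the access log
(assuming Apache combined log format; the sample lines and the
MnoGoSearch agent string below are illustrative, not taken from the real
logs):

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent fields of a
# combined-log-format line.
LINE = re.compile(r'"(?P<method>[A-Z]+) [^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

def agent_counts(lines):
    """Count GET requests per user-agent string."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m and m.group("method") == "GET":
            counts[m.group("agent")] += 1
    return counts

# Two made-up sample lines in the same shape as the one quoted earlier.
sample = [
    '66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET /viewcvs?view=revision&revision=169814 '
    'HTTP/1.1" 200 1334 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '127.0.0.1 - - [14/May/2011:05:31:02 +0000] "GET /bugzilla/show_bug.cgi?id=1 '
    'HTTP/1.1" 304 - "-" "MnoGoSearch/3.3.12"',
]
for agent, n in agent_counts(sample).most_common():
    print(n, agent)
```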


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2011-05-17 11:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-13 19:19 Don't let search bots look at buglist.cgi Ian Lance Taylor
2011-05-16 12:59 ` Richard Guenther
2011-05-16 13:17   ` Andrew Haley
2011-05-16 13:18     ` Michael Matz
2011-05-16 13:28       ` Andrew Haley
2011-05-16 13:32         ` Andreas Schwab
2011-05-16 13:34         ` Richard Guenther
2011-05-16 13:39           ` Andrew Haley
2011-05-16 13:42             ` Michael Matz
2011-05-16 13:42               ` Andrew Haley
2011-05-16 13:45                 ` Michael Matz
2011-05-16 14:18                   ` Andrew Haley
2011-05-16 14:37                     ` Richard Guenther
2011-05-16 15:28                       ` Andrew Haley
2011-05-17  7:17                       ` Ian Lance Taylor
2011-05-17 11:12                         ` Axel Freyn
2011-05-17 13:39                         ` Michael Matz
2011-05-16 23:13   ` Ian Lance Taylor
2011-05-17  2:53     ` Joseph S. Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).