public inbox for overseers@sourceware.org
* [v-amwilc at microsoft period com: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt]
From: Chris Faylor @ 2011-11-17 17:56 UTC
  To: overseers

[Reply-To set to overseers]
Should we change the Crawl-Delay?

cgf

----- Forwarded message from "Amy Wilcox (Murphy & Associates)"  -----

From: "Amy Wilcox (Murphy & Associates)" <v-amwilc at microsoft period com>
To: gcc
Subject: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt
Date: Wed, 16 Nov 2011 20:05:26 +0000

Hi,

I am contacting you from the Microsoft Corporation and its Internet search engine Bing (http://www.bing.com) regarding your robots.txt file at http://gcc.gnu.org/robots.txt. Our customers have alerted us that some of your site's content is not visible in our results. We have discovered that the following crawl-delay setting in your robots.txt is preventing us from crawling this content.

User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60

Your current crawl-delay setting of 60 allows us to crawl only around 1,440 URLs per day (86,400 seconds per day / 60 seconds per URL), which is not enough to guarantee that new URLs are crawled and indexed. This rate also does not let us recrawl older URLs to verify whether they have been updated or whether they are still available on your site.

Since you have a large number of URLs on your site, we would appreciate it if you removed the crawl-delay setting from your robots.txt; doing so would also increase traffic to your site via Bing and Yahoo! search results. If you would like to use a slower or faster crawl rate at different times of the day, our Bing Webmaster Tools let you configure these settings (http://www.bing.com/community/site_blogs/b/webmaster/archive/2011/06/08/updates-to-bing-webmaster-tools-data-and-content.aspx) and can assist you further in obtaining the best possible results for your business or website (http://www.bing.com/toolbox/webmaster/).

If you have further questions, please let me know.

Best regards,

Amy Wilcox
Web Analyst, Bing from Microsoft
v-amwilc at microsoft period com


----- End forwarded message -----


* Re: [v-amwilc at microsoft period com: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt]
From: Jonathan Larmour @ 2011-11-18  0:22 UTC
  To: overseers

On 17/11/11 17:56, Chris Faylor wrote:
> [Reply-To set to overseers]
> Should we change the Crawl-Delay?

Probably yes, as the sheer volume of sourceware's pages makes using a search
engine all the more important.

But on my own website, I find that the occasional problems caused by crawler
activity usually have less to do with any individual crawler than with
several of them hitting the server at the same time. In one sense that's not
their fault, since they can't know about server load or about each other,
but there is a potential solution...

We already use mod_bw version 0.6 in Apache, but according to
<http://ivn.cl/>, if we upgraded to version 0.8 or later we could use a
regexp to match the user-agents of the various search engines and therefore
restrict them collectively.

For example:
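# Cap the combined bandwidth of matched crawlers at 100,000 bytes/s: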
BandWidth "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 100000
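# Allow at most 10 simultaneous connections from those crawlers: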
MaxConnection "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 10

This is what I've been meaning to do on my own site, although I've never
got round to it. I'm sure there are many more user agents that could be
added to that.
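
For instance, a broader pattern might look like the following. This is an
untested sketch: bingbot and YandexBot are agent names I am assuming here,
not ones I have checked against our logs.

BandWidth "u:(Slurp|BaiduSpider|Googlebot|msnbot|bingbot|YandexBot)" 100000
MaxConnection "u:(Slurp|BaiduSpider|Googlebot|msnbot|bingbot|YandexBot)" 10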

Jifl


* Re: [v-amwilc at microsoft period com: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt]
From: Ian Lance Taylor @ 2011-11-18  1:00 UTC
  To: overseers

Chris Faylor <cgf-use-the-mailinglist-please@cgf.cx> writes:

> [Reply-To set to overseers]
> Should we change the Crawl-Delay?

I think it would be fine to try.

Ian

> From: "Amy Wilcox (Murphy & Associates)" <v-amwilc at microsoft period com>
> Subject: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt
> To: gcc
> Date: Wed, 16 Nov 2011 20:05:26 +0000
>
> Hi,
>
> I am contacting you from the Microsoft Corporation and its Internet search engine Bing (http://www.bing.com) in regards to your robots.txt file at http://gcc.gnu.org/robots.txt. Our customers have alerted us that some of your site content was not visible in our results. We have discovered that you are preventing us from crawling this content by the following crawl-delay settings in your robots.txt.
>
> User-agent: *
> Disallow: /viewcvs
> Disallow: /cgi-bin/
> Disallow: /bugzilla/buglist.cgi
> Crawl-Delay: 60
>
> Your current crawl-delay setting of 60 authorizes us to crawl around 1440 URLs per day (86,400 seconds per day / 60 crawl-delay ) which is not enough to guarantee that new URLs are crawled and indexed. Also this rate will not allow us to crawl older URLs to verify if they have been updated or if they are still available on your site.
>
> Since you have a large number of URLs on your site, we would be pleased if you remove the crawl delay settings in your robots.txt which additionally will increase traffic to your site via Bing and Yahoo search results. If you would like to use a slower or faster crawl rate at different times of the day our Bing Webmaster Tools will allow you to configure these settings (http://www.bing.com/community/site_blogs/b/webmaster/archive/2011/06/08/updates-to-bing-webmaster-tools-data-and-content.aspx) and also assist you further in obtaining the best results possible for your business or website (http://www.bing.com/toolbox/webmaster/ ).
>
> If you have further questions please let me know.
>
> Best regards,
>
> Amy Wilcox
> Web Analyst, Bing from Microsoft
> v-amwilc at microsoft period com
>
>
> ----------

