Date: Fri, 18 Nov 2011 00:22:00 -0000
From: Jonathan Larmour
To: overseers@sourceware.org
Subject: Re: [v-amwilc at microsoft period com: Robots.txt file restricting msnbot with crawl-delay at http://gcc.gnu.org/robots.txt]
Message-ID: <4EC5A52E.1040402@jifvik.org>
In-Reply-To: <20111117175629.GA3124@sourceware.org>
References: <20111117175629.GA3124@sourceware.org>

On 17/11/11 17:56, Chris Faylor wrote:
> [Reply-To set to overseers]
> Should we change the Crawl-Delay ?

Probably yes, as the sheer volume of links on sourceware makes a usable
search engine all the more important.

But on my own website, I find that the occasional problems caused by
crawler activity usually have less to do with any individual crawler
than with several of them happening to hit the server simultaneously.
In one sense that's not their fault, since they can't know about server
load or about each other, but there is a potential solution...

We already use mod_bw version 0.6 in Apache, but apparently if we
upgraded to version 0.8 or later we could use a regexp to match the
user-agent of the various search engines and therefore restrict them
collectively. For example:

  BandWidth "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 100000
  MaxConnection "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 10

This is what I've been meaning to do on my own site, although I've
never got round to it. I'm sure there are many more user agents that
could be added to that list.

Jifl
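
To make the collective-restriction idea concrete, here is a rough sketch
of how those two directives might sit inside an Apache virtual host. The
surrounding directive names (LoadModule bw_module, BandWidthModule,
ForceBandWidthModule, BandWidth all 0) are taken from the mod_bw docs as
I remember them, the hostname and the numeric limits are placeholders,
and the "u:" user-agent syntax is the 0.8+ feature described above, so
treat this as an untested assumption rather than a working config:

  # Load mod_bw (path assumes the module is installed as modules/mod_bw.so)
  LoadModule bw_module modules/mod_bw.so

  <VirtualHost *:80>
      # Hypothetical vhost; substitute the real server name
      ServerName gcc.gnu.org

      # Enable bandwidth limiting for this vhost
      BandWidthModule On
      ForceBandWidthModule On

      # Ordinary visitors: no limit (0 means unrestricted in mod_bw)
      BandWidth all 0

      # Major crawlers, matched collectively by User-Agent regexp:
      # cap them at roughly 100 KB/s and at most 10 simultaneous
      # connections in total, so several bots arriving at once cannot
      # swamp the server even though each one behaves reasonably.
      BandWidth "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 100000
      MaxConnection "u:(Slurp|BaiduSpider|Googlebot|msnbot)" 10
  </VirtualHost>

The point of matching them with one regexp rather than per-bot rules is
that the limit then applies to the crawlers as a group, which is exactly
the "all kicking the server simultaneously" case described above.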