Gentoo Archives: gentoo-user

From: Michael Mol <mikemol@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Google privacy changes
Date: Wed, 08 Feb 2012 15:54:52
Message-Id: CA+czFiCYUS6R-9NR6WJyjRK5Yx+nnHC3ZVbn1oTpABYPUh8YAQ@mail.gmail.com
In Reply to: Re: [gentoo-user] Google privacy changes by Paul Hartman
On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
<paul.hartman+gentoo@×××××.com> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gentoo@×××××.com>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>>
>>> BTW, the Baidu spider hits my site more than all of the others combined...
>>>
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
>> reason why my company decided to change our webhosting company: Its
>> spidering brought our previous webhosting to its knees...
>>
>> Rgds,
>
> I wonder if Baidu crawler honors the Crawl-delay directive in robots.txt?
>
> Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit rules. ;)

I don't remember if it respects Crawl-delay, but it respects forbidden
paths, etc. I've never been DDoS'd by Baidu crawlers, but I did get
DDoS'd by Yahoo a number of times. The solution turned out to be
disallowing access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this robots.txt works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:

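If you want to sanity-check rules like these before deploying them, here is
a rough sketch using Python's standard urllib.robotparser. The host
wiki.example.org, the agent name ExampleBot, the sample paths, and the
helper names build_parser and is_allowed are made up for illustration;
only the rules themselves come from the robots.txt above. Real crawlers
may resolve Allow/Disallow conflicts differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Ask whether the (hypothetical) agent may fetch the given path."""
    # wiki.example.org is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "http://wiki.example.org" + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Illustrative paths only; substitute URLs from your own wiki.
    for path in [
        "/wiki/Main_Page",
        "/wiki/Special:RecentChanges",
        "/mw/index.php?title=Foo&action=history",
        "/mw/images/logo.png",
    ]:
        verdict = "allowed" if is_allowed(parser, path) else "blocked"
        print(f"{path}: {verdict}")
```

The idea is that ordinary article pages and static assets stay crawlable,
while the script entry points and Special: pages (the expensive ones to
render) are off limits.
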
--
:wq
