On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
<paul.hartman+gentoo@×××××.com> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gentoo@×××××.com>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>>
>>> BTW, the Baidu spider hits my site more than all of the others combined...
>>>
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
>> reason why my company decided to change our webhosting company: its
>> spidering brought our previous webhosting to its knees...
>>
>> Rgds,
>
> I wonder if the Baidu crawler honors the Crawl-delay directive in robots.txt?
>
> Or I wonder if the Baidu crawler IPs need to be covered by firewall tarpit rules. ;)

I don't remember if it respects Crawl-delay, but it respects forbidden
paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
DDOS'd by Yahoo a number of times. Turned out the solution was to
disallow access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
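
And if a crawler does turn out to honor Crawl-delay, a stanza along
these lines should throttle it (the user-agent string is Baidu's
usual one, as I recall, and the 10-second value is just an example):

User-agent: Baiduspider
Crawl-delay: 10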

--
:wq