On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
<paul.hartman+gentoo@×××××.com> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gentoo@×××××.com>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>>
>>> BTW, the Baidu spider hits my site more than all of the others combined...
>>>
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
>> reason why my company decided to change our webhosting company: its
>> spidering brought our previous webhosting to its knees...
>>
>> Rgds,
>
> I wonder if the Baidu crawler honors the Crawl-delay directive in robots.txt?
>
> Or I wonder if the Baidu crawler IPs need to be covered by firewall tarpit rules. ;)

I don't remember if it respects Crawl-delay, but it respects forbidden
paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
DDOS'd by Yahoo a number of times. Turned out the solution was to
disallow access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
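
And if a crawler does turn out to honor Crawl-delay, a stanza along
these lines should throttle it (the user-agent string is Baidu's
usual one, as I recall, and the 10-second value is just an example):

User-agent: Baiduspider
Crawl-delay: 10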

--
:wq