On Feb 8, 2012 10:57 PM, "Michael Mol" <mikemol@×××××.com> wrote:
>
> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
> <paul.hartman+gentoo@×××××.com> wrote:
> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
> >>
> >> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gentoo@×××××.com>
> >> wrote:
> >>>
> >>
> >> ---- >8 snippage
> >>
> >>>
> >>> BTW, the Baidu spider hits my site more than all of the others
> >>> combined...
> >>>
18 |
> >>
> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu
> >> was the reason why my company decided to change our webhosting
> >> company: its spidering brought our previous webhost to its knees...
> >>
> >> Rgds,
> >
> > I wonder if the Baidu crawler honors the Crawl-delay directive in
> > robots.txt?
> >
> > Or I wonder if the Baidu crawler IPs need to be covered by firewall
> > tarpit rules. ;)
31 |
>
> I don't remember if it respects Crawl-delay, but it respects forbidden
> paths, etc. I've never been DDoS'd by Baidu crawlers, but I did get
> DDoS'd by Yahoo a number of times. It turned out the solution was to
> disallow access to expensive-to-render pages. If you're using
> MediaWiki with prettified URLs, this works great:
>
> User-agent: *
> Allow: /mw/images/
> Allow: /mw/skins/
> Allow: /mw/title.png
> Disallow: /w/
> Disallow: /mw/
> Disallow: /wiki/Special:
>

*slaps forehead*

Now why didn't I think of that before?!

Thanks for reminding me!

Rgds,
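
The Crawl-delay question above can be checked programmatically: Python's standard-library urllib.robotparser parses robots.txt and exposes both the per-agent crawl delay and the allow/disallow rules. A minimal sketch, using a hypothetical robots.txt (the user-agent name and paths are illustrative, not from an actual Baidu robots.txt):

```python
# Minimal sketch: inspect a robots.txt Crawl-delay and path rules with
# Python's stdlib urllib.robotparser. The robots.txt content below is a
# hypothetical example for illustration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Baiduspider
Crawl-delay: 10
Disallow: /w/
Disallow: /mw/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Crawl-delay (in seconds) declared for this user agent, or None.
print(parser.crawl_delay("Baiduspider"))                # -> 10
# Disallowed paths are refused for that agent regardless of delay.
print(parser.can_fetch("Baiduspider", "/w/index.php"))  # -> False
```

Whether a given crawler actually honors the parsed value is, of course, up to the crawler; this only shows what the robots.txt declares.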