On Wed, Feb 8, 2012 at 12:17 PM, Pandu Poluan <pandu@××××××.info> wrote:
>
> On Feb 8, 2012 10:57 PM, "Michael Mol" <mikemol@×××××.com> wrote:
>>
>> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
>> <paul.hartman+gentoo@×××××.com> wrote:
>> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
>> >>
>> >> On Jan 27, 2012 11:18 PM, "Paul Hartman"
>> >> <paul.hartman+gentoo@×××××.com>
>> >> wrote:
>> >>>
>> >>
>> >> ---- >8 snippage
>> >>
>> >>>
>> >>> BTW, the Baidu spider hits my site more than all of the others
>> >>> combined...
>> >>>
>> >>
>> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu
>> >> was the reason why my company decided to change our webhosting
>> >> company: its spidering brought our previous webhosting to its knees...
>> >>
>> >> Rgds,
>> >
>> > I wonder if the Baidu crawler honors the Crawl-delay directive in
>> > robots.txt?
>> >
>> > Or I wonder if the Baidu crawler IPs need to be covered by firewall
>> > tarpit rules. ;)
>>
>> I don't remember if it respects Crawl-delay, but it respects forbidden
>> paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
>> DDOS'd by Yahoo a number of times. Turned out the solution was to
>> disallow access to expensive-to-render pages. If you're using
>> MediaWiki with prettified URLs, this works great:
>>
>> User-agent: *
>> Allow: /mw/images/
>> Allow: /mw/skins/
>> Allow: /mw/title.png
>> Disallow: /w/
>> Disallow: /mw/
>> Disallow: /wiki/Special:
>>
>
> *slaps forehead*
>
> Now why didn't I think of that before?!
>
> Thanks for reminding me!
|
I didn't think of it until I watched the logs live and saw it crawling
through page histories during one of the events. MediaWiki stores page
histories as a series of diffs from the current version, so it has to
assemble old versions by reverse-applying the diffs of all the edits
made to the page between the current version and the version you're
asking for. If you have a bot retrieve ten versions of a page that has
ten revisions, that's 210 reverse diff operations. Grabbing all
versions of a page with 20 revisions would result in over 1500 reverse
diffs. My 'hello world' page has over five hundred revisions.
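
To put rough numbers on the growth, here's a minimal sketch. It assumes
a simplified model (mine, not MediaWiki's actual code path) where
serving the version k revisions behind the current one costs k
reverse-diff applications; the real work depends on exactly which
history and diff URLs the bot hits, so it won't line up exactly with
the figures above:

def reverse_diffs_for_full_history(revisions: int) -> int:
    # Assumed cost model: fetching every old version of an n-revision
    # page triggers 0 + 1 + ... + (n - 1) reverse-diff applications.
    return revisions * (revisions - 1) // 2

for n in (10, 20, 100, 500):
    cost = reverse_diffs_for_full_history(n)
    print(f"{n} revisions -> {cost} reverse diffs")

# 10 -> 45, 20 -> 190, 100 -> 4950, 500 -> 124750: the cost grows
# quadratically with revision count, which is why a bot walking full
# page histories hurts so much more than one fetching current pages.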

So the page history crawling was pretty quickly obvious...
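
On Paul's Crawl-delay question: the stdlib urllib.robotparser module is
a quick way to sanity-check rules like the ones above before a crawler
does. A minimal sketch, assuming Baidu's usual "Baiduspider" user-agent
and the usual MediaWiki URL split; the Crawl-delay line and the example
paths are mine, added for illustration, and Crawl-delay itself is a
non-standard directive that only some crawlers honor:

import urllib.robotparser

# The rules quoted above, plus a hypothetical Crawl-delay line.
rules = """\
User-agent: *
Crawl-delay: 10
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# With pretty URLs, histories and other expensive views go through
# /w/index.php, so the /w/ rule covers them.
print(rp.can_fetch("Baiduspider", "/w/index.php?title=Hello_world&action=history"))  # False
print(rp.can_fetch("Baiduspider", "/wiki/Special:RecentChanges"))  # False
print(rp.can_fetch("Baiduspider", "/mw/images/logo.png"))  # True
print(rp.crawl_delay("Baiduspider"))  # 10 (Python 3.6+), if the bot bothers to read it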

--
:wq |