Gentoo Archives: gentoo-user

From: Michael Mol <mikemol@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Google privacy changes
Date: Wed, 08 Feb 2012 18:32:00
Message-Id: CA+czFiBvdq5_VUw2VNK3HqJ9Bfp3Ndk+xoUBAMVW9qnwez_34A@mail.gmail.com
In Reply to: Re: [gentoo-user] Google privacy changes by Pandu Poluan
On Wed, Feb 8, 2012 at 12:17 PM, Pandu Poluan <pandu@××××××.info> wrote:
>
> On Feb 8, 2012 10:57 PM, "Michael Mol" <mikemol@×××××.com> wrote:
>>
>> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
>> <paul.hartman+gentoo@×××××.com> wrote:
>> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pandu@××××××.info> wrote:
>> >>
>> >> On Jan 27, 2012 11:18 PM, "Paul Hartman"
>> >> <paul.hartman+gentoo@×××××.com>
>> >> wrote:
>> >>>
>> >>
>> >> ---- >8 snippage
>> >>
>> >>>
>> >>> BTW, the Baidu spider hits my site more than all of the others
>> >>> combined...
>> >>>
>> >>
>> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was
>> >> the
>> >> reason why my company decided to change our webhosting company: Its
>> >> spidering brought our previous webhosting to its knees...
>> >>
>> >> Rgds,
>> >
>> > I wonder if the Baidu crawler honors the Crawl-delay directive in
>> > robots.txt?
>> >
>> > Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit
>> > rules. ;)
>>
>> I don't remember if it respects Crawl-Delay, but it respects forbidden
>> paths, etc. I've never been DDoS'd by Baidu crawlers, but I did get
>> DDoS'd by Yahoo a number of times. It turned out the solution was to
>> disallow access to expensive-to-render pages. If you're using
>> MediaWiki with prettified URLs, this works great:
>>
>> User-agent: *
>> Allow: /mw/images/
>> Allow: /mw/skins/
>> Allow: /mw/title.png
>> Disallow: /w/
>> Disallow: /mw/
>> Disallow: /wiki/Special:
>>
>
> *slaps forehead*
>
> Now why didn't I think of that before?!
>
> Thanks for reminding me!

I didn't think of it until I watched the logs live and saw it crawling
through page histories during one of the events. MediaWiki stores page
histories as a series of diffs from the current version, so it has to
assemble an old version by reverse-applying the diffs of all the edits
made to the page between the current version and the version you're
asking for. If you have a bot retrieve all ten versions of a page that
has ten revisions, that's 45 reverse-diff operations; grabbing all
versions of a page with 20 revisions takes 190, and the cost keeps
growing quadratically from there. My 'hello world' page has over five
hundred revisions.

So the page history crawling was pretty quickly obvious...
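
A quick back-of-the-envelope sketch of that arithmetic, assuming one
reverse diff per step back from the current revision (the helper below
is purely illustrative, not anything MediaWiki actually exposes):

# Cost of a crawler fetching every old version of a page, assuming the
# version k steps behind current takes k reverse diffs to reconstruct.
def total_reverse_diffs(revisions):
    # 0 + 1 + ... + (revisions - 1) == revisions * (revisions - 1) / 2
    return sum(range(revisions))

for n in (10, 20, 500):
    print(n, "revisions ->", total_reverse_diffs(n), "reverse diffs")
# 10 revisions -> 45 reverse diffs
# 20 revisions -> 190 reverse diffs
# 500 revisions -> 124750 reverse diffs

That quadratic growth is why a crawler walking history links can
generate far more load than ordinary page views.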

--
:wq