Gentoo Archives: gentoo-server

From: jos houtman <jos@×××××.nl>
To: gentoo-server@l.g.o
Subject: [gentoo-server] Mirroring/backing-up a large 15Million (6TB) file collection
Date: Wed, 19 Apr 2006 12:10:43
Message-Id: 44462858.7010706@hyves.nl
Hello list,

I have to back up a huge, changing and growing collection of files (~15
million, ~6TB), and the existing rsync solution just doesn't cut it any
more. Below I will explain the situation, what the problem is, and my
current train of thought. Where you people can help is by providing me
with a sounding board, and hopefully some out-of-the-box thinking :D.

Current situation:
- We are a community website providing photo/video sharing as part of
the overall package.
- Our collection of videos and photos is surpassing 15.3 million.
- Growth rate is estimated at 50,000 files a day, a rate which of course
grows itself over time.
- The collection is stored on a 9TB system.
- The backups are two off-site 4TB systems; the collection needs to be
split over these.
- Our backup window is the whole day, as long as the backup does not
cause a performance drain. In reality we need to use the quiet night
hours, 0 to 8.
- The collection is stored in a set of subdirectories, each containing
50,000 files (1-50000, 50001-100000, etc.). There are ~300 subdirs in
use now.
- Files are never deleted.
- In the future it can happen that files change. My expectation is that
at most a few thousand files a day will change, scattered over the whole
collection with an emphasis on the most recent files.
- We only need a mirrored version of the collection; no need for weekly
or incremental snapshots, just a copy of the collection.
- It takes rsync 30 minutes to check that a directory needs no updating.
It will take longer if it does.
- The collection cannot be taken offline during backups; it needs to be
accessible all the time.

What we use now is an rsync on the most recent directories only. This
works by the grace that files are currently immutable. This will change
though.

Train of thought:
Remember that (in the near future) changes can happen anywhere in the
collection, so we will need to compare the whole collection to its
backed-up/mirrored counterpart.
Running rsync on 300 subdirectories at 30 minutes each will take at
least 150 hours, which makes it not a viable option.
We can assume that the rsync people are smart and that there is no
faster way to compare such huge collections.

Another option is just copying the whole collection every night.
Assuming a sustained transfer rate of 50MB/sec (20 is more likely I
think, but I have no experience with this) it will take 33 hours.
This method generates a huge disk load, which is unacceptable,
especially for more than a few hours.

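As a rough check of the two estimates above, here is a minimal Python
sketch that uses only the figures quoted in this mail. The one assumption
is the unit convention (1TB = 10**12 bytes, 1MB = 10**6 bytes); the
function names are made up for illustration.

```python
# Back-of-the-envelope estimates using only the figures quoted above.
SUBDIRS = 300             # subdirectories currently in use
RSYNC_CHECK_MIN = 30      # minutes for rsync to verify one unchanged subdir
COLLECTION_BYTES = 6e12   # ~6TB, taking 1TB = 10**12 bytes (assumption)
NIGHT_WINDOW_H = 8        # quiet hours, 0 to 8


def rsync_scan_hours(subdirs=SUBDIRS, minutes_each=RSYNC_CHECK_MIN):
    """Hours for rsync to check every subdirectory, one after another."""
    return subdirs * minutes_each / 60


def full_copy_hours(rate_mb_per_s, total_bytes=COLLECTION_BYTES):
    """Hours to copy everything at a sustained rate (1MB = 10**6 bytes)."""
    return total_bytes / (rate_mb_per_s * 1e6) / 3600


if __name__ == "__main__":
    print(f"rsync scan of all subdirs: {rsync_scan_hours():.0f} h")
    for rate in (50, 20):
        print(f"full copy at {rate} MB/s: {full_copy_hours(rate):.1f} h")
    print(f"nightly window: {NIGHT_WINDOW_H} h")
```

Both approaches overshoot the 8 hour night window by a wide margin, and
at the more likely 20MB/sec the full copy is slower still.
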
So we cannot construct a list of differences within an acceptable time,
and neither can we copy all the files.
Combining these two, I got the following brainstorm:
It is inefficient to reconstruct the changes you made during the day, so
why not save this knowledge when it is already available (while
editing/saving the files)?
Saving all the changed filenames in a queue/list allows us to read this
list at the end of the day and copy only the needed files, at most
60,000 (new + changed).
It is also possible to make this a daemon process, keeping the backup
up to date almost in real time with relatively little load. You could
build in rate limits to decrease the performance drain during peak
hours. This could just work :D

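The end-of-day step could be sketched roughly as follows, in Python.
This is only an illustration under assumptions: the list is a plain text
file with one path per line, relative to the collection root, and the
copy itself is handed to rsync through its --files-from and --bwlimit
options. The function names, the list format and the paths in the usage
note are hypothetical.

```python
import subprocess
import tempfile


def read_changes(journal_path):
    """Return the unique relative paths from the list, in first-seen order."""
    seen = {}
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            path = line.strip()
            if path:
                seen.setdefault(path, None)  # dict keeps insertion order
    return list(seen)


def build_rsync_command(list_file, source_root, destination, bwlimit=None):
    """Build an rsync call that transfers only the files named in list_file."""
    cmd = ["rsync", "-a", f"--files-from={list_file}"]
    if bwlimit:
        cmd.append(f"--bwlimit={bwlimit}")  # throttle, e.g. during peak hours
    cmd += [source_root, destination]
    return cmd


def mirror_changes(journal_path, source_root, destination, bwlimit=None):
    """Copy every listed file to the mirror and return how many were listed."""
    paths = read_changes(journal_path)
    if not paths:
        return 0
    with tempfile.NamedTemporaryFile("w", suffix=".lst", encoding="utf-8") as lst:
        lst.write("\n".join(paths) + "\n")
        lst.flush()
        subprocess.run(
            build_rsync_command(lst.name, source_root, destination, bwlimit),
            check=True,  # raise if rsync reports an error
        )
    return len(paths)
```

A nightly run would then be something like
mirror_changes("changes.lst", "/collection/", "backuphost:/mirror/", 5000),
where all three locations are placeholders. A file that changed several
times during the day is only copied once, because the list is
de-duplicated first.
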
The only problem is constructing the list and capturing the knowledge
while it is available. Two options exist:
At system level this can be done using, for example, inotify [1]. This
requires a userspace daemon. If the daemon crashes, changes will be
missed though.
At application level (the application making the changes) this can also
be done. When the application crashes no changes are made, so nothing is
missed. But it does require making the backup dependent on the
application. Not an ideal situation.

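For the application-level option, a minimal sketch of the recording side
could look like this in Python. It assumes the application can call a
small hook right after it has saved a file, and that the list is an
append-only text file; record_change, rotate_journal and the ".pending"
suffix are hypothetical names, not part of any existing tool.

```python
import os


def record_change(journal_path, relative_path):
    """Append one changed file to the list and force it to disk."""
    if "\n" in relative_path:
        raise ValueError("file names containing newlines are not supported")
    line = (relative_path + "\n").encode("utf-8")
    # O_APPEND makes every write land at the end of the file, so several
    # writer processes can share one list.
    fd = os.open(journal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)
        os.fsync(fd)  # the entry should survive a crash right after the save
    finally:
        os.close(fd)


def rotate_journal(journal_path):
    """Rename the list so the backup run works on a stable snapshot.

    Returns the snapshot path, or None if nothing has been recorded.
    A snapshot left over from an unfinished run is returned as-is, so its
    entries are not overwritten.
    """
    snapshot = journal_path + ".pending"
    if os.path.exists(snapshot):
        return snapshot
    try:
        os.rename(journal_path, snapshot)
    except FileNotFoundError:
        return None
    return snapshot
```

The backup run would call rotate_journal, copy the files named in the
snapshot, and delete the snapshot only after the copy succeeded. New
changes keep going into a fresh list in the meantime. A writer that is
in the middle of record_change during the rename is a small race this
sketch does not handle.
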
Do you have any ideas/hints? Maybe options I have not seen, even if they
sound silly. They might trigger a fresh idea :D

[1] http://www-128.ibm.com/developerworks/linux/library/l-inotify.html

with regards,

Jos Houtman
jos@×××××.nl

--
gentoo-server@g.o mailing list