Hello list,

I have to back up a huge, changing and growing collection of files (~15
million / ~6TB), and the existing rsync solution just doesn't cut it any
more. Below I will explain the situation, what the problem is, and my
current train of thought. Where you people can help is by providing a
sounding board, and hopefully some out-of-the-box thinking :D.

Current situation:
- We are a community website providing photo/video sharing as part of
the overall package.
- Our collection of videos and photos is surpassing 15.3 million files.
- Growth rate is estimated at 50,000 files a day, a rate which of course
grows itself over time.
- The collection is stored on a 9TB system.
- The backups are two off-site 4TB systems; the collection needs to be
split over these.
- Our backup window is the whole day, as long as the backup does not
cause a performance drain. In reality that means we need to use the
quiet night hours, 0:00 to 8:00.
- The collection is stored in a set of subdirectories, each containing
50,000 files (1-50000, 50001-100000, etc.). There are ~300 subdirs in
use now.
- Files are never deleted.
- In the future files may change. My expectation is that at most a few
thousand files a day will change, scattered over the whole collection
with an emphasis on the most recent files.
- We only need a mirrored version of the collection; there is no need
for weekly or incremental snapshots, just a copy of the collection.
- It takes rsync 30 minutes to check that a directory needs no updating;
it will take longer if it does.
- The collection cannot be taken offline during backups; it needs to be
accessible all the time.
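
As an aside, with the fixed-size buckets above, finding which
subdirectory holds a given file is simple arithmetic. A small sketch,
assuming ids start at 1 and the directories are literally named
"1-50000", "50001-100000", and so on:

```shell
# Given a numeric file id, compute the "start-end" bucket directory
# that holds it, with 50,000 files per bucket and ids starting at 1.
bucket_dir() {
    local id=$1
    local start=$(( (id - 1) / 50000 * 50000 + 1 ))
    local end=$(( start + 49999 ))
    echo "${start}-${end}"
}

bucket_dir 42        # prints 1-50000
bucket_dir 15300000  # prints 15250001-15300000
```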

What we use now is rsync on only the most recent directories; this works
only by the grace of the fact that files are currently immutable. That
will change, though.
|
38 |
Train of thought: |
39 |
Remembering that (in the recent future) changes can happen in the whole |
40 |
collection. |
41 |
We will need to compare the collection to its back-upped/mirrored |
42 |
counterpart. |
43 |
Figuring that running rsync on 300 subdirectory's will take atleast 150 |
44 |
hours makes it not a viable option. |
45 |
We can assume that the rsync people are smart and that there is no |
46 |
faster ways to compare such huge collections. |
47 |
|
48 |
Another option is just copying the whole collection every night. |
49 |
Assuming a sustained transfer rate of 50MB/sec (20 is more likely i |
50 |
think, but no experience) it will take 33 hours. |
51 |
This method generates a huge disk load, which is unacceptable especially |
52 |
for more then a few hours. |
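
For reference, the 33-hour figure checks out with decimal units (6TB =
6,000,000 MB):

```shell
# 6,000,000 MB at a sustained 50 MB/s:
seconds=$(( 6000000 / 50 ))    # 120000 s
hours=$(( seconds / 3600 ))    # 33 h (integer); at 20 MB/s it is ~83 h
echo "$hours"                  # prints 33
```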

So we cannot construct a list of differences within an acceptable time,
and neither can we copy all the files. Combining these two observations,
I had the following brainstorm: it is inefficient to reconstruct at
night the changes that were made during the day, so why not record this
knowledge when it is already available (while editing/saving the
files)? Saving all the changed filenames in a queue/list allows us to
read this list at the end of the day and copy only the needed files, at
most 60,000 (new + changed). It is also possible to make this a daemon
process, keeping the backup up to date almost in realtime with
relatively little load; you could build in rate limits to decrease the
performance drain during peak hours. This could just work :D
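
The copy step for such a queue is something rsync already supports via
its --files-from option, which skips the full-tree scan entirely, and
--bwlimit can throttle the transfer during peak hours. A rough sketch
(the queue path, limits, and host name are all made up):

```shell
# Deduplicate the day's queue of changed/new filenames (one path per
# line, relative to the collection root; path is hypothetical).
sort -u /var/spool/backup/changed.queue > /tmp/changed.list

# Copy only the listed files; --bwlimit (KB/s) caps the I/O load so the
# mirror could even run outside the quiet hours.
rsync -a --files-from=/tmp/changed.list --bwlimit=20000 \
    /data/collection/ backuphost:/data/collection/
```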

The only problem is constructing the list and capturing the knowledge
while it is available. Two options exist:
At the system level this can be done using, for example, inotify [1],
which requires a user-space daemon. If the daemon crashes, changes will
be missed, though.
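
If the inotify-tools package is available, the system-level daemon could
be as small as a loop around inotifywait. A sketch, with made-up watch
and queue paths:

```shell
# Recursively watch the collection and append every written or moved-in
# file to the change queue. -m keeps inotifywait running indefinitely;
# the kernel must allow enough watches for ~300 subdirs
# (sysctl fs.inotify.max_user_watches).
inotifywait -m -r \
    -e close_write -e moved_to \
    --format '%w%f' \
    /data/collection >> /var/spool/backup/changed.queue
```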
At the application level (the application making the changes) this can
also be done; when the application crashes, no changes are made, so
nothing is missed. But it does make the backup dependent on the
application, which is not an ideal situation.
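
At the application level the capture itself can be tiny: append the
saved path to the queue, with flock serializing concurrent writers. A
sketch with hypothetical paths (mktemp stands in for a persistent queue
file):

```shell
# Append one changed path to the day's queue; flock prevents two
# application processes from interleaving their writes.
queue=$(mktemp)            # a real queue would live on persistent storage
log_change() {
    flock "$queue" -c "echo '$1' >> '$queue'"
}

log_change /data/collection/1-50000/42.jpg
log_change /data/collection/1-50000/42.jpg   # repeated saves are fine:
sort -u "$queue"                             # dedupe happens at copy time
```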

Do you have any ideas/hints? Maybe there are options I have not seen;
even if they sound silly, they might trigger a fresh idea :D

[1] http://www-128.ibm.com/developerworks/linux/library/l-inotify.html

With regards,

Jos Houtman
jos@×××××.nl

--
gentoo-server@g.o mailing list