
From: Rich Freeman <rich0@g.o>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Re: ceph on gentoo?
Date: Wed, 24 Dec 2014 12:41:10
Message-Id: CAGfcS_m2h1=AMBCV31QF+zfAUOvTstPUac_6J0kd3QqObdNVBA@mail.gmail.com
In Reply to: [gentoo-user] Re: ceph on gentoo? by "Holger Hoffstätte"
On Wed, Dec 24, 2014 at 5:16 AM, Holger Hoffstätte
<holger.hoffstaette@××××××××××.com> wrote:
>
> There's light and deep scrub; the former does what you described,
> while deep does checksumming. In case of mismatch it should create
> a quorum. Whether that actually happens and/or works is another
> matter. ;)
>

If you have 10 copies of a file and 9 are identical and 1 differs,
then there is little risk in this approach. The problem is that if
you have two copies of the file and they differ, all it can do is
pick one, which is what I believe it does. So not only is this less
efficient than n+2 raid or 2*n raid, you also end up needing 3-4*n
redundancy. That is a LOT of wasted space simply to avoid having a
checksum.

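Roughly speaking, the voting logic can only ever amount to something
like this (a toy python sketch to illustrate the point, not anything
from the actual ceph code):

    from collections import Counter

    def resolve_replicas(copies):
        # Pick a winner by majority vote among the replica payloads.
        # Returns the winning payload, or None when there is no clear
        # majority (e.g. two replicas that simply disagree).
        payload, votes = Counter(copies).most_common(1)[0]
        if votes > len(copies) / 2:
            return payload
        return None  # a tie - without a checksum we can only guess

    resolve_replicas(["A"] * 9 + ["B"])  # -> "A", easy call
    resolve_replicas(["A", "B"])         # -> None, no safe answer

With only two copies there is nothing to vote on, which is exactly why
you need so many replicas to get comparable safety.
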
> Unfortunately a full point-in-time deep scrub and the resulting creation
> of checksums is more or less economically unviable with growing amounts
> of data; this really should be incremental.

Since checksums aren't stored anywhere, you end up having to scan
every node and compare all the checksums across them. Depending on
how that works it is likely to be a fairly synchronous operation,
which makes it much harder to deal with file access during the
operation. If they just sequentially scan each disk, create an index,
sort the index, and then pass it on to some central node to do all
the comparisons, that would be better than doing it completely
synchronously.

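Something along these lines, say (a hypothetical sketch of that
scan-then-compare idea, not how ceph actually implements its scrub):

    import hashlib

    def scan_node(objects):
        # Per-node step: sequentially hash everything on the disk and
        # build a sorted (object_id, checksum) index.  'objects' is a
        # dict of object_id -> bytes standing in for the local store.
        return sorted((oid, hashlib.sha256(data).hexdigest())
                      for oid, data in objects.items())

    def find_mismatches(indexes):
        # Central step: walk the per-node indexes and flag any object
        # whose checksums disagree between nodes.
        seen, suspect = {}, set()
        for index in indexes:
            for oid, csum in index:
                if oid in seen and seen[oid] != csum:
                    suspect.add(oid)
                seen[oid] = csum
        return suspect

That way the expensive sequential reads stay local to each node and
only the small indexes have to travel to the comparing node.
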
>
> I know how btrfs scrub works, but it too (and in fact every storage system)
> suffers from the problem of having to decide which copy is "good"; they
> all have different points in their timeline where they need to make a
> decision at which a checksum is considered valid. When we're talking
> about preventing bitrot, just having another copy is usually enough.
>
> On top of that btrfs will at least tell you which file is suspected,
> thanks to its wonderful backreferences.

btrfs maintains checksums for every block on the disk, stored apart
from those blocks. Sure, if your metadata and data all get corrupted
at once you could have problems, but you'll at least know that you
have problems. A btrfs scrub is asynchronous - each disk can be
checked independently of the others, as there is no need to compare
checksums for files across disks, since the checksums are
pre-calculated. If a bad extent is found, it is re-copied from one of
the good disks (which of course is synchronous).

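Conceptually the per-device pass is as simple as this (a rough sketch
of the idea, obviously nothing like the real btrfs code):

    import zlib

    def scrub_device(device, mirror, checksums):
        # Toy per-device scrub: verify each block against the checksum
        # recorded when it was written, and on a mismatch re-copy the
        # block from the mirror.  device/mirror are lists of byte
        # blocks; checksums holds the stored crc32 values.
        for i, block in enumerate(device):
            if zlib.crc32(block) != checksums[i]:
                # bad extent - the repair is the only step that has to
                # touch the other disk
                device[i] = mirror[i]

There is no cross-device comparison anywhere in that loop, which is
what lets each disk be scrubbed on its own schedule.
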
Since the scans are asynchronous, this performs a lot better than a
RAID scrub: a read against a mirror only has to disrupt one of the
device scrubs while the other proceeds. Indeed, you could just scrub
the devices one at a time, and then only writes or parallel reads
take a hit (for mirrored mode).

Btrfs is of course immature and can't yet recover errors for raid5/6
modes, and those raid modes would not perform as well when being
scrubbed, since a read requires access to n disks and a write
requires access to n+1 or n+2 disks (note though that the use of
checksums makes it safe to do a read without reading full parity - I
have no idea if the btrfs implementation takes advantage of this).

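What I mean by that parity shortcut, assuming a raid5-style XOR parity
(purely hypothetical - as I said, I don't know whether btrfs does this):

    import zlib

    def read_block(data_blocks, parity, idx, stored_csum):
        # With a stored checksum we can trust the single disk we read;
        # only on a mismatch do we fall back to rebuilding the block
        # from the remaining data blocks plus parity (XOR).
        block = data_blocks[idx]
        if zlib.crc32(block) == stored_csum:
            return block  # fast path: one disk touched
        rebuilt = parity
        for j, other in enumerate(data_blocks):
            if j != idx:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, other))
        return rebuilt    # slow path: every disk in the stripe
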
For a single-node system, btrfs (and of course zfs) has a much more
robust design IMHO. Now, of course the node itself becomes the
bottleneck, and that is what ceph is intended to handle. The problem
is that, like pre-zfs RAID, it handles total failure well and data
corruption less well. Indeed, unless it always checks multiple nodes
on every read, a silent corruption is probably not going to be
detected without a scrub (while btrfs and zfs compare checksums on
EVERY read, since that is much less expensive than reading multiple
devices).

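The difference between the two read paths, sketched out (hypothetical
code, just to show why one of them is cheap enough to do on every read):

    import zlib

    def read_verified(local_copy, stored_csum):
        # btrfs/zfs style: one device read plus a cheap hash compare
        # against a checksum that was written ahead of time.
        if zlib.crc32(local_copy) != stored_csum:
            raise IOError("silent corruption caught at read time")
        return local_copy

    def read_by_comparison(copies):
        # Without stored checksums, catching corruption at read time
        # means fetching the object from several nodes and comparing -
        # expensive enough that realistically it only happens in a scrub.
        if len(set(copies)) != 1:
            raise IOError("replicas disagree, and nothing to arbitrate")
        return copies[0]
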
I'm sure this could be fixed in ceph, but it doesn't seem like anybody
is prioritizing that.

--
Rich