On Wed, Dec 24, 2014 at 5:16 AM, Holger Hoffstätte
<holger.hoffstaette@××××××××××.com> wrote:
>
> There's light and deep scrub; the former does what you described,
> while deep does checksumming. In case of mismatch it should create
> a quorum. Whether that actually happens and/or works is another
> matter. ;)
>

If you have 10 copies of a file and 9 are identical and 1 differs,
then there is little risk in this approach. The problem is that if
you have two copies of the file and they differ, all it can do is
pick one, which is what I believe it does. So not only is this less
efficient than n+2 raid or 2*n raid, but you end up needing 3-4*n
redundancy. That is a LOT of wasted space simply to avoid having a
checksum.

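In other words, quorum repair only works when there is a clear
majority to vote with. A minimal sketch of the voting logic (my
illustration, not Ceph's actual code):

from collections import Counter

def pick_good_copy(copies):
    # Majority vote across replica contents: fine when 9 of 10 agree,
    # useless with two differing copies - there is no majority to trust.
    best, votes = Counter(copies).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority - cannot tell which copy is good")
    return best
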
> Unfortunately a full point-in-time deep scrub and the resulting creation
> of checksums is more or less economically unviable with growing amounts
> of data; this really should be incremental.

Since checksums aren't stored anywhere, you end up having to scan
every node and compare all the checksums across them. Depending on
how that works, it is likely to be a fairly synchronous operation,
which makes it much harder to deal with file access during the
operation. If they just sequentially scan each disk, create an index,
sort the index, and then pass it on to some central node to do all
the comparisons, that would be better than doing it completely
synchronously.

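Something like this is what I have in mind - each node builds its
index independently and only the (small) indexes cross the network.
The layout is hypothetical, not how Ceph actually stores objects:

import hashlib, os
from collections import defaultdict

def scan_disk(root):
    # Walk one disk on its own and build a sorted (name, checksum)
    # index; nothing here needs to talk to any other node.
    index = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index.append((os.path.relpath(path, root), digest))
    index.sort()
    return index

def find_mismatches(indexes):
    # Central node: merge the pre-sorted per-disk indexes and flag
    # any object whose replicas disagree.
    seen = defaultdict(set)
    for index in indexes:
        for name, digest in index:
            seen[name].add(digest)
    return sorted(n for n, digests in seen.items() if len(digests) > 1)
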
>
> I know how btrfs scrub works, but it too (and in fact every storage system)
> suffers from the problem of having to decide which copy is "good"; they
> all have different points in their timeline where they need to make a
> decision at which a checksum is considered valid. When we're talking
> about preventing bitrot, just having another copy is usually enough.
>
> On top of that btrfs will at least tell you which file is suspected,
> thanks to its wonderful backreferences.

btrfs maintains checksums for every block on the disk, stored apart
from those blocks. Sure, if your metadata and data all get corrupted
at once you could have problems, but you'll at least know that you
have problems. A btrfs scrub is asynchronous - each disk can be
checked independently of the others, as there is no need to compare
checksums for files across disks: the checksums are pre-calculated.
If a bad extent is found, it is re-copied from one of the good disks
(which of course is synchronous).

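The shape of that scrub, as I understand it (simplified - btrfs
really keeps crc32c sums in a dedicated csum tree, not a flat table):

import zlib

def scrub_device(stored_sums, read_block, repair_from_mirror):
    # Scrub one device entirely on its own: verify each block against
    # its pre-calculated checksum. Only a mismatch forces any contact
    # with another disk.
    bad = []
    for block_no, expected in stored_sums.items():
        if zlib.crc32(read_block(block_no)) != expected:
            bad.append(block_no)
            repair_from_mirror(block_no)  # the only synchronous step
    return bad
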
Since the scans are asynchronous, this performs a lot better than a
RAID scrub: a read against a mirror can be allowed to disrupt just
one of the device scrubs while the other proceeds. Indeed, you could
just scrub the devices one at a time, and then only writes or
parallel reads take a hit (for mirrored mode).

Btrfs is of course immature and can't recover errors for raid5/6
modes, and of course those raid modes would not perform as well when
being scrubbed, since a read requires access to n disks and a write
requires access to n+1 or n+2 disks (note though that the use of
checksums makes it safe to do a read without reading full parity - I
have no idea if the btrfs implementation takes advantage of this).

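What I mean by that parity note, as a sketch (my illustration - I
don't know whether btrfs actually does this):

import zlib

def read_stripe_unit(read_data, rebuild_from_parity, stored_sum):
    # A per-block checksum lets you trust a single-disk read and only
    # pay for parity reconstruction on a mismatch, instead of reading
    # the full stripe on every read just to cross-check it.
    data = read_data()                # common case: one disk touched
    if zlib.crc32(data) == stored_sum:
        return data
    return rebuild_from_parity()      # rare case: touch the other n disks
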
For a single-node system, btrfs (and of course zfs) has a much more
robust design IMHO. Now, of course the node itself becomes the
bottleneck, and that is what ceph is intended to handle. The problem
is that, like pre-zfs RAID, it handles total failure well and data
corruption less well. Indeed, unless it always checks multiple nodes
on every read, a silent corruption is probably not going to be
detected without a scrub (while btrfs and zfs compare checksums on
EVERY read, since that is much less expensive than reading multiple
devices).

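That read-path cost difference is the crux; schematically
(hypothetical interfaces, not real APIs):

import zlib

def read_btrfs_style(read_local, stored_sum):
    # One local read plus a cheap CPU hash catches silent corruption.
    data = read_local()
    if zlib.crc32(data) != stored_sum:
        raise IOError("checksum mismatch - fetch a good copy instead")
    return data

def read_voting_style(read_replicas):
    # Without stored checksums, catching silent corruption on read
    # means fetching and comparing multiple replicas over the network.
    copies = read_replicas()
    if len(set(copies)) != 1:
        raise IOError("replicas disagree - which one is good?")
    return copies[0]
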
I'm sure this could be fixed in ceph, but it doesn't seem like anybody
is prioritizing that.

--
Rich