On Wed, Dec 24, 2014 at 5:16 AM, Holger Hoffstätte
<holger.hoffstaette@××××××××××.com> wrote:
>
> There's light and deep scrub; the former does what you described,
> while deep does checksumming. In case of mismatch it should create
> a quorum. Whether that actually happens and/or works is another
> matter. ;)
>

If you have 10 copies of a file and 9 are identical and 1 differs,
then there is little risk in this approach. The problem is that if
you have two copies of the file and they differ, all it can do is
pick one, which is what I believe it does. So not only is this less
efficient than n+2 raid or 2*n raid, but you end up needing 3-4*n
redundancy. That is a LOT of wasted space simply to avoid having a
checksum.

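In other words, quorum repair only works when there is a clear
majority to vote with. A minimal sketch of the voting logic (my
illustration, not Ceph's actual code):

from collections import Counter

def pick_good_copy(copies):
    # Majority vote across replica contents: fine when 9 of 10 agree,
    # useless with two differing copies - there is no majority to trust.
    best, votes = Counter(copies).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority - cannot tell which copy is good")
    return best
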
> Unfortunately a full point-in-time deep scrub and the resulting creation
> of checksums is more or less economically unviable with growing amounts
> of data; this really should be incremental.

Since checksums aren't stored anywhere, you end up having to scan
every node and compare all the checksums across them. Depending on
how that works, it is likely to be a fairly synchronous operation,
which makes it much harder to deal with file access during the
operation. If they just sequentially scan each disk, create an index,
sort the index, and then pass it on to some central node to do all
the comparisons, that would be better than doing it completely
synchronously.

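Something like this is what I have in mind - each node builds its
index independently and only the (small) indexes cross the network.
The layout is hypothetical, not how Ceph actually stores objects:

import hashlib, os
from collections import defaultdict

def scan_disk(root):
    # Walk one disk on its own and build a sorted (name, checksum)
    # index; nothing here needs to talk to any other node.
    index = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index.append((os.path.relpath(path, root), digest))
    index.sort()
    return index

def find_mismatches(indexes):
    # Central node: merge the pre-sorted per-disk indexes and flag
    # any object whose replicas disagree.
    seen = defaultdict(set)
    for index in indexes:
        for name, digest in index:
            seen[name].add(digest)
    return sorted(n for n, digests in seen.items() if len(digests) > 1)
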
>
> I know how btrfs scrub works, but it too (and in fact every storage system)
> suffers from the problem of having to decide which copy is "good"; they
> all have different points in their timeline where they need to make a
> decision at which a checksum is considered valid. When we're talking
> about preventing bitrot, just having another copy is usually enough.
>
> On top of that btrfs will at least tell you which file is suspected,
> thanks to its wonderful backreferences.

btrfs maintains checksums for every block on the disk, stored apart
from those blocks. Sure, if your metadata and data all get corrupted
at once you could have problems, but you'll at least know that you
have problems. A btrfs scrub is asynchronous - each disk can be
checked independently of the others, as there is no need to compare
checksums for files across disks: the checksums are pre-calculated.
If a bad extent is found, it is re-copied from one of the good disks
(which of course is synchronous).

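The shape of that scrub, as I understand it (simplified - btrfs
really keeps crc32c sums in a dedicated csum tree, not a flat table):

import zlib

def scrub_device(stored_sums, read_block, repair_from_mirror):
    # Scrub one device entirely on its own: verify each block against
    # its pre-calculated checksum. Only a mismatch forces any contact
    # with another disk.
    bad = []
    for block_no, expected in stored_sums.items():
        if zlib.crc32(read_block(block_no)) != expected:
            bad.append(block_no)
            repair_from_mirror(block_no)  # the only synchronous step
    return bad
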
Since the scans are asynchronous, this performs a lot better than a
RAID scrub: a read against a mirror can be allowed to disrupt just
one of the device scrubs while the other proceeds. Indeed, you could
just scrub the devices one at a time, and then only writes or
parallel reads take a hit (for mirrored mode).

Btrfs is of course immature and can't recover errors for raid5/6
modes, and of course those raid modes would not perform as well when
being scrubbed, since a read requires access to n disks and a write
requires access to n+1 or n+2 disks (note though that the use of
checksums makes it safe to do a read without reading full parity - I
have no idea if the btrfs implementation takes advantage of this).

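What I mean by that parity note, as a sketch (my illustration - I
don't know whether btrfs actually does this):

import zlib

def read_stripe_unit(read_data, rebuild_from_parity, stored_sum):
    # A per-block checksum lets you trust a single-disk read and only
    # pay for parity reconstruction on a mismatch, instead of reading
    # the full stripe on every read just to cross-check it.
    data = read_data()                # common case: one disk touched
    if zlib.crc32(data) == stored_sum:
        return data
    return rebuild_from_parity()      # rare case: touch the other n disks
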
For a single-node system, btrfs (and of course zfs) has a much more
robust design IMHO. Now, of course the node itself becomes the
bottleneck, and that is what ceph is intended to handle. The problem
is that, like pre-zfs RAID, it handles total failure well and data
corruption less well. Indeed, unless it always checks multiple nodes
on every read, a silent corruption is probably not going to be
detected without a scrub (while btrfs and zfs compare checksums on
EVERY read, since that is much less expensive than reading multiple
devices).

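That read-path cost difference is the crux; schematically
(hypothetical interfaces, not real APIs):

import zlib

def read_btrfs_style(read_local, stored_sum):
    # One local read plus a cheap CPU hash catches silent corruption.
    data = read_local()
    if zlib.crc32(data) != stored_sum:
        raise IOError("checksum mismatch - fetch a good copy instead")
    return data

def read_voting_style(read_replicas):
    # Without stored checksums, catching silent corruption on read
    # means fetching and comparing multiple replicas over the network.
    copies = read_replicas()
    if len(set(copies)) != 1:
        raise IOError("replicas disagree - which one is good?")
    return copies[0]
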
I'm sure this could be fixed in ceph, but it doesn't seem like anybody
is prioritizing that.

--
Rich