On Thu, Dec 7, 2017 at 11:04 AM, Frank Steinmetzger <Warp_7@×××.de> wrote:

> On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote:
>
>> […] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
>> solutions. Maybe you can get by with less, but finding ARM systems with
>> even 4GB of RAM is tough, and even that means only one hard drive per
>> node, which means a lot of $40+ nodes to go on top of the cost of the
>> drives themselves.
>
> You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
> have a tiny (probably one of the smallest available) M-ITX case for four
> 3.5″ bays and an internal 2.5″ mount:
> https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100
>

I don't think ECC is readily available on ARM (most of those boards
are SoCs where the RAM is integrated and can't be expanded). If CephFS
were designed with end-to-end checksums, that wouldn't matter much:
the client would detect any error in a storage node, obtain a good
copy from another node, and trigger a resilver. However, I don't think
Ceph is quite there; checksums are used at various points, but I
believe there are gaps where no checksum is protecting the data. That
is one of the things I don't like about it.
|
If I were designing the checksums for it, I'd have the client compute
the checksum and send it along with the data; at every step the
checksum would be verified, and it would be stored in the metadata on
permanent storage. When the ack goes back to the client that the data
is written, the checksum would be returned to the client from the
metadata, and the client would compare it against its own. Any
retrieval would include the client obtaining the checksum from the
metadata and comparing it to the data from the storage nodes. I don't
think this approach would add much overhead (the metadata needs to be
recorded when writing and read when reading anyway). It just ensures
there is a checksum on storage separate from the data, and that it is
the one captured when the data was first written. A storage node could
be completely unreliable in this scenario, since it exists apart from
the checksum used to verify it. Storage nodes would still do their own
checksum verification, since that allows errors to be detected sooner
and reduces latency, but it is not essential to reliability.
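To make the flow concrete, here's a toy Python sketch of the scheme I'm
describing. All of the names (StorageNode, MetadataServer, etc.) are
invented for illustration; this is not Ceph's actual API, just the shape
of the write and read paths:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class StorageNode:
    """Holds object data; assumed untrusted."""
    def __init__(self):
        self.blocks = {}
    def put(self, key, data, csum):
        # Each hop re-verifies the checksum before accepting the data.
        if checksum(data) != csum:
            raise IOError("corruption detected in transit")
        self.blocks[key] = data
    def get(self, key):
        return self.blocks[key]

class MetadataServer:
    """Holds checksums on storage separate from the data."""
    def __init__(self):
        self.csums = {}
    def record(self, key, csum):
        self.csums[key] = csum
        return csum  # returned to the client with the write ack
    def lookup(self, key):
        return self.csums[key]

def client_write(node, mds, key, data):
    csum = checksum(data)          # client computes the checksum
    node.put(key, data, csum)      # data and checksum travel together
    acked = mds.record(key, csum)  # checksum lands in metadata
    assert acked == csum           # client verifies the ack round-trip

def client_read(node, mds, key):
    data = node.get(key)
    # Compare the data against the checksum captured at write time.
    if checksum(data) != mds.lookup(key):
        raise IOError("corrupt replica; fetch another copy and resilver")
    return data
```

The point of the sketch is that the storage node never holds the only
copy of the checksum, so its corruption is always detectable by the
client.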
|
Instead, I think Ceph does not store checksums in the metadata. The
client checksum is used to verify accurate transfer over the network,
but then the various nodes forget about it and just record the data.
If the data is backed by ZFS/btrfs/bluestore, the filesystem computes
its own checksum to detect silent corruption at rest. However, if the
data were corrupted by faulty software or a memory failure after it
was verified on reception but before it was re-checksummed prior to
storage, then you would have a problem. In that case a scrub would
detect non-matching data between nodes, but with no way to determine
which node is correct.
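A tiny sketch of that scrub ambiguity (again, invented names, not
Ceph code): two replicas disagree, and without an independently stored
write-time checksum there is nothing to arbitrate between them:

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two replicas of the same object have diverged after a silent fault.
replica_a = b"the original payload"
replica_b = b"the originaX payload"  # one corrupted byte

# A scrub alone only sees that the replicas disagree; neither side
# is self-evidently the good one.
assert sha(replica_a) != sha(replica_b)

# With the checksum captured at write time and kept in metadata,
# arbitration is trivial: the matching replica wins, the other is
# resilvered from it.
metadata_csum = sha(b"the original payload")
good = [r for r in (replica_a, replica_b) if sha(r) == metadata_csum]
assert good == [replica_a]
```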
|
If somebody with more knowledge of Ceph knows otherwise, I'm all ears,
because this is one of those things that gives me a bit of pause.
Don't get me wrong - most other approaches have the same issues, and I
can reduce some of that risk with ECC, but that isn't practical when
you want many RAM-intensive storage nodes in the solution.
|
--
Rich