Gentoo Archives: gentoo-user

From: Rich Freeman <rich0@g.o>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] OT: btrfs raid 5/6
Date: Thu, 07 Dec 2017 23:09:11
Message-Id: CAGfcS_=tDn-DzzkNL18LCSgQLKH3A7DgBRFKi9QTSXAjMsjEFA@mail.gmail.com
In Reply to: Re: [gentoo-user] OT: btrfs raid 5/6 by Frank Steinmetzger
On Thu, Dec 7, 2017 at 11:04 AM, Frank Steinmetzger <Warp_7@×××.de> wrote:
> On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote:
>
>> […] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
>> solutions. Maybe you can get by with less, but finding ARM systems with
>> even 4GB of RAM is tough, and even that means only one hard drive per
>> node, which means a lot of $40+ nodes to go on top of the cost of the
>> drives themselves.
>
> You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
> have a tiny (probably one of the smallest available) M-ITX case for four
> 3.5″ bays and an internal 2.5″ mount:
> https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100
>

I don't think ECC is readily available on ARM (most of those boards
are SoCs where the RAM is integral and can't be expanded). If CephFS
were designed with end-to-end checksums, that wouldn't really matter
much, because the client would detect any error in a storage node,
obtain a good copy from another node, and trigger a resilver.
However, I don't think Ceph is quite there: checksums are used at
various points, but there appear to be gaps where no checksum is
protecting the data. That is one of the things I don't like about it.

If I were designing the checksums for it, I'd have the client compute
the checksum and send it with the data; at every step the checksum
would be verified, and it would be stored in the metadata on
permanent storage. When the ack that the data is written goes back to
the client, the checksum would be returned from the metadata and the
client would compare it to its own. Any retrieval would involve the
client obtaining the checksum from the metadata and comparing it to
the data from the storage nodes. I don't think this approach would
add much extra overhead (the metadata needs to be recorded when
writing and read when reading anyway). It just ensures there is a
checksum on storage separate from the data, and that it is the one
captured when the data was first written. A storage node could be
completely unreliable in this scenario, since it exists apart from
the checksum used to verify it. Storage nodes would still do their
own checksum verification, since that lets errors be detected sooner
and reduces latency, but it is not essential to reliability.

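Roughly, the write/read flow I have in mind looks something like the
sketch below (hypothetical Python with made-up names like
store_object and get_recorded_checksum - not Ceph's actual code or
API, just the shape of the protocol):

    import hashlib

    def client_write(client, data):
        # The client computes the checksum once, up front.
        checksum = hashlib.sha256(data).hexdigest()

        # Every hop re-verifies the data against this checksum, and
        # the final hop records the checksum in the cluster metadata,
        # separate from the data itself.
        oid = client.store_object(data, checksum)

        # The ack carries the checksum as recorded in the metadata, so
        # the client can confirm what got stored is what it sent.
        if client.get_recorded_checksum(oid) != checksum:
            raise IOError("write not verified end-to-end")
        return oid

    def client_read(client, oid):
        # On read, the client gets the checksum from the metadata and
        # the data from a storage node, and compares the two itself.
        checksum = client.get_recorded_checksum(oid)
        data = client.fetch_object(oid)
        if hashlib.sha256(data).hexdigest() != checksum:
            # Bad copy: fetch from another replica and trigger repair.
            data = client.fetch_object(oid, avoid_last_node=True)
        return data

The point is just that the checksum the client computed at write time
is the only authority, and it never lives on the same node as the
data it protects.
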
Instead, I believe Ceph does not store checksums in the metadata. The
client checksum is used to verify accurate transfer over the network,
but then the various nodes forget about it and just record the data.
If the data is backed by ZFS/btrfs/BlueStore, the filesystem computes
its own checksum to detect silent corruption at rest. However, if the
data were corrupted by faulty software or a memory failure after it
was verified on reception but before it was re-checksummed for
storage, you would have a problem: a scrub would detect non-matching
data between nodes, but with no way to determine which node is
correct.

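As a toy illustration of that window (again a hypothetical sketch,
not how Ceph or those filesystems are actually implemented):

    import hashlib

    def receive_and_store(data, wire_checksum, bit_flip=False):
        # The transfer checksum is verified on receipt, then forgotten.
        assert hashlib.sha256(data).hexdigest() == wire_checksum

        # The window: faulty RAM or software corrupts the buffer after
        # verification but before the backing store checksums it.
        if bit_flip:
            buf = bytearray(data)
            buf[0] ^= 0x01
            data = bytes(buf)

        # ZFS/btrfs/BlueStore checksum whatever they are handed, so
        # the at-rest checksum matches the corrupted data.
        return {"data": data, "local": hashlib.sha256(data).hexdigest()}

    payload = b"some object data"
    wire = hashlib.sha256(payload).hexdigest()
    replica_a = receive_and_store(payload, wire)
    replica_b = receive_and_store(payload, wire, bit_flip=True)

    # Each replica passes its own local scrub...
    for r in (replica_a, replica_b):
        assert hashlib.sha256(r["data"]).hexdigest() == r["local"]

    # ...but a cross-node scrub sees a mismatch, with no authoritative
    # checksum left anywhere to say which copy is the good one.
    print(replica_a["data"] == replica_b["data"])  # False

If the client's checksum were kept in the metadata, that last
comparison wouldn't be ambiguous.
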
If somebody with more knowledge of Ceph knows otherwise I'm all ears,
because this is one of those things that gives me a bit of pause.
Don't get me wrong - most other approaches have the same issues, and
I can reduce some of that risk with ECC, but that isn't practical
when you want many RAM-intensive storage nodes in the solution.

--
Rich