On Thu, Dec 7, 2017 at 11:04 AM, Frank Steinmetzger <Warp_7@×××.de> wrote:

> On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote:
>
>> […] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
>> solutions. Maybe you can get by with less, but finding ARM systems with
>> even 4GB of RAM is tough, and even that means only one hard drive per
>> node, which means a lot of $40+ nodes to go on top of the cost of the
>> drives themselves.
>
> You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
> have a tiny (probably one of the smallest available) M-ITX case for four
> 3.5″ bays and an internal 2.5″ mount:
> https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100
>

I don't think ECC is readily available on ARM (most of those boards
are SoCs where the RAM is integrated and can't be expanded). If CephFS
were designed with end-to-end checksums, that wouldn't matter much:
the client would detect any error in a storage node, obtain a good
copy from another node, and trigger a resilver. However, I don't think
Ceph is quite there; checksums are used at various points, but I
believe there are gaps where no checksum is protecting the data. That
is one of the things I don't like about it.
|
If I were designing the checksums for it, I'd have the client compute
the checksum and send it along with the data; at every step the
checksum would be verified, and it would be stored in the metadata on
permanent storage. When the ack goes back to the client that the data
is written, the checksum would be returned to the client from the
metadata, and the client would compare it against its own. Any
retrieval would include the client obtaining the checksum from the
metadata and comparing it to the data from the storage nodes. I don't
think this approach would add much overhead (the metadata needs to be
recorded when writing and read when reading anyway). It just ensures
there is a checksum on storage separate from the data, and that it is
the one captured when the data was first written. A storage node could
be completely unreliable in this scenario, since it exists apart from
the checksum used to verify it. Storage nodes would still do their own
checksum verification, since that allows errors to be detected sooner
and reduces latency, but it is not essential to reliability.
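To make the flow concrete, here's a toy Python sketch of the scheme I'm
describing. All of the names (StorageNode, MetadataServer, etc.) are
invented for illustration; this is not Ceph's actual API, just the shape
of the write and read paths:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class StorageNode:
    """Holds object data; assumed untrusted."""
    def __init__(self):
        self.blocks = {}
    def put(self, key, data, csum):
        # Each hop re-verifies the checksum before accepting the data.
        if checksum(data) != csum:
            raise IOError("corruption detected in transit")
        self.blocks[key] = data
    def get(self, key):
        return self.blocks[key]

class MetadataServer:
    """Holds checksums on storage separate from the data."""
    def __init__(self):
        self.csums = {}
    def record(self, key, csum):
        self.csums[key] = csum
        return csum  # returned to the client with the write ack
    def lookup(self, key):
        return self.csums[key]

def client_write(node, mds, key, data):
    csum = checksum(data)          # client computes the checksum
    node.put(key, data, csum)      # data and checksum travel together
    acked = mds.record(key, csum)  # checksum lands in metadata
    assert acked == csum           # client verifies the ack round-trip

def client_read(node, mds, key):
    data = node.get(key)
    # Compare the data against the checksum captured at write time.
    if checksum(data) != mds.lookup(key):
        raise IOError("corrupt replica; fetch another copy and resilver")
    return data
```

The point of the sketch is that the storage node never holds the only
copy of the checksum, so its corruption is always detectable by the
client.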
|
Instead, I think Ceph does not store checksums in the metadata. The
client checksum is used to verify accurate transfer over the network,
but then the various nodes forget about it and just record the data.
If the data is backed by ZFS/btrfs/bluestore, the filesystem computes
its own checksum to detect silent corruption at rest. However, if the
data were corrupted by faulty software or a memory failure after it
was verified on reception but before it was re-checksummed prior to
storage, then you would have a problem. In that case a scrub would
detect non-matching data between nodes, but with no way to determine
which node is correct.
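A tiny sketch of that scrub ambiguity (again, invented names, not
Ceph code): two replicas disagree, and without an independently stored
write-time checksum there is nothing to arbitrate between them:

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two replicas of the same object have diverged after a silent fault.
replica_a = b"the original payload"
replica_b = b"the originaX payload"  # one corrupted byte

# A scrub alone only sees that the replicas disagree; neither side
# is self-evidently the good one.
assert sha(replica_a) != sha(replica_b)

# With the checksum captured at write time and kept in metadata,
# arbitration is trivial: the matching replica wins, the other is
# resilvered from it.
metadata_csum = sha(b"the original payload")
good = [r for r in (replica_a, replica_b) if sha(r) == metadata_csum]
assert good == [replica_a]
```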
|
If somebody with more knowledge of Ceph knows otherwise, I'm all ears,
because this is one of those things that gives me a bit of pause.
Don't get me wrong - most other approaches have the same issues, and I
can reduce some of that risk with ECC, but that isn't practical when
you want many RAM-intensive storage nodes in the solution.
|
--
Rich