On Sun, May 3, 2020 at 6:50 PM hitachi303
<gentoo-user@××××××××××××××××.de> wrote:
>
> The only person I know who is running a really huge raid (I guess 2000+
> drives) is comfortable with some spare drives. His raid did fail and can
> fail. Data will be lost. Everything important has to be stored at a
> secondary location. But they are using the raid to store data for some
> days or weeks when a server is calculating stuff. If the raid fails they
> have to restart the program for the calculation.

So, if you have thousands of drives, you really shouldn't be using a
conventional RAID solution. If you're just using "RAID" loosely to mean
any technology that stores data redundantly, that's one thing. But if
you wanted to stick 2000 drives into a single host using something like
mdadm/zfs, or heaven forbid a bazillion LSI HBAs with some kind of
hacked-up solution for PCIe port replication plus SATA bus
multipliers/etc, you're probably doing it wrong. (Even with mdadm/zfs
you'd still need some terribly non-optimal way of attaching all those
drives to a single host.)

At that scale you really should be using a distributed filesystem, or
some application-level solution that accomplishes the same thing on top
of a bunch of more modest hosts running zfs/etc (the Backblaze approach,
at least in the past).

The most mainstream FOSS solution at this scale is Ceph. It achieves
redundancy at the host level. That is, if you have it set up to
tolerate two failures, then you can take two random hosts in the
cluster, smash their motherboards with a hammer in the middle of
operation, and the cluster will keep working and quickly restore its
redundancy. Each host can have multiple drives, and losing any or all
of the drives within a single host counts as a single failure. You can
even do clever things like tell it which hosts are attached to which
circuit breakers, and then you could lose every host on a single power
circuit at once and it would be fine.
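
For the curious: that failure-domain mapping lives in Ceph's CRUSH map.
A minimal sketch (host/bucket names here are made up, and I'm omitting
the ids/weights Ceph fills in for you) of grouping hosts by power
circuit and replicating across circuits looks roughly like:

    # buckets: two hosts hang off one power circuit (pdu is a
    # built-in CRUSH bucket type)
    pdu circuit-a {
            alg straw2
            hash 0
            item node1 weight 2.000
            item node2 weight 2.000
    }

    root default {
            alg straw2
            hash 0
            item circuit-a weight 4.000
            item circuit-b weight 4.000
    }

    # placement rule: put each replica under a different pdu,
    # so losing one whole circuit costs only one replica
    rule replicated_pdu {
            type replicated
            step take default
            step chooseleaf firstn 0 type pdu
            step emit
    }

The key line is the chooseleaf step: swap "pdu" for "host" or "rack"
and you've changed the failure domain.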

This also covers you when one of your flaky drives causes weird bus
issues that affect other drives, or when one host crashes, and so on.
Because the redundancy is entirely at the host level, you're protected
against a much larger set of failure modes.

This sort of solution also performs much better, since data requests
aren't bottlenecked on the CPU/NIC/HBA of any particular host. The
software is obviously more complex, but the hardware can be simpler:
if you want to expand storage you just buy more servers and plug them
into the LAN, versus trying to figure out how to cram an extra dozen
hard drives into a single host with all kinds of port-multiplier games.
You can also do maintenance and reboot an entire host while the cluster
stays online, as long as you aren't messing with them all at once.

I've gone in this general direction because I was tired of having to
deal with massive cases, being limited to motherboards with 6 SATA
ports, and adding LSI HBAs that require an 8x slot and often conflict
with using an NVMe drive, and so on.

--
Rich