On 2/3/20 10:40 am, Rich Freeman wrote:
> On Sun, Mar 1, 2020 at 8:52 PM William Kenworthy <billk@×××××××××.au> wrote:
>> For those wanting to run a lot of drives on a single host - that defeats
>> the main advantage of using a chunkserver based filesystem -
>> redundancy. Its far more common to have a host fail than a disk drive.
>> Losing the major part of your storage in one go means the cluster is
>> effectively dead - hence having a lot of completely separate systems is
>> much more reliable
> Of course. You should have multiple hosts before you start putting
> multiple drives on a single host.
>
> However, once you have a few hosts the performance improves by adding
> more, but you're not really getting THAT much additional redundancy.
> You would get faster rebuild times by having more hosts since there
> would be less data to transfer when one fails and more hosts doing the
> work.
>
> So, it is about finding a balance. You probably don't want 30 drives
> on 2 hosts. However, you probably also don't need 15-30 hosts for
> that many drives either. I wouldn't be putting 16 drives onto a
> single host until I had a fair number of hosts.
>
> As far as the status of lizardfs goes - as far as I can tell it is
> mostly developed by a company and they've wavered a bit on support in
> the last year. I share your observation that they seem to be picking
> up again. In any case, I'm running the latest stable and it works
> just fine, but it lacks the high availability features. I can have
> shadow masters, but they won't automatically fail over, so maintenance
> on the master is still a pain. Recovery due to failure of the master
> should be pretty quick though even if manual - just have to run a
> command on each shadow to determine which has the most recent
> metadata, then adjust DNS for my master CNAME to point to the new
> master, and then edit config on the new master to tell it that it is
> the master and no longer a shadow, and after restarting the daemon the
> cluster should be online again.
>
> The latest release candidate has the high availability features (used
> to be paid, is now free), however it is still a release candidate and
> I'm not in that much of a rush. There was a lot of griping on the
> forums/etc by users who switched to the release candidate and ran into
> bugs that ate their data. IMO that is why you don't go running
> release candidates for distributed filesystems with a dozen hard
> drives on them - if you want to try them out just run them in VMs with
> a few GB of storage to play with and who cares if your test data is
> destroyed. It is usually wise to be conservative with your
> filesystems. Makes no difference to me if they take another year to
> do the next release - I'd like the HA features but it isn't like the
> old code goes stale.
>
> Actually, the one thing that it would be nice if they fixed is the
> FUSE client - it seems to leak RAM.
>
> Oh, and the docs seem to hint at a windows client somewhere which
> would be really nice to have, but I can't find any trace of it. I
> only normally run a single client but it would obviously perform well
> as a general-purpose fileserver.
>
> There has been talk of a substantial rewrite, though I'm not sure if
> that will actually happen now. If it does I hope they do keep the RAM
> requirements low on the chunkservers. That was the main thing that
> turned me off from ceph - it is a great platform in general but
> needing 1GB RAM per 1TB disk adds up really fast, and it basically
> precludes ARM SBCs as OSDs as you can't get those with that much RAM
> for any sane price - even if you were only running one drive per host
> good luck finding a SBC with 13GB+ of RAM. You can tune ceph to use
> less RAM but I've heard that bad things happen if you have some hosts
> shuffle during a rebuild and you don't have gobs of RAM - all the OSDs
> end up with an impossible backlog and they keep crashing until you run
> around like Santa Claus filling every stocking with a handful of $60
> DIMMs.
>
> Right now lizardfs basically uses almost no ram at all on
> chunkservers, so an ARM SBC could run dozens of drives without an
> issue.
>
Everything bad you hear about ceph is true ... and then some! I did
try, but this was some years ago, so hopefully it's better now. The two
biggies were excessive network requirements (bandwidth, separation) and
recovery times, with frequent crash and burn. There are ceph features I
would really like to use (rbd, local copies with much simpler config,
...), but moosefs is a lot more bulletproof on far smaller resource
requirements, though I did find that properly pruned vlans on a smart
switch, separating the intra-cluster traffic from external requests,
made a noticeable difference.

moosefs has a windows client, but it's only available with the paid
version. The master/shadow-master setup and automatic failover are also
paid-only - with the community edition you have to stop the master,
copy the metadata files across, then change DNS etc. before starting
the new master (rough sketch below). You can't really do it online even
when scripted - it's painful, with downtime, and I had DNS caching
issues that took time to work their way out of the system. I thought
lizardfs was much more community-minded, but you are characterising it
as similar to moosefs - a taster offering by a commercial company
holding back some of the non-essential but juicier features for the
paid version - is that how you see them?
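
Scripted up, that promotion dance looks roughly like the sketch below.
The hostname, paths and the exact mfsmaster/rsync calls are from my own
setup and quoted from memory, so treat it as illustrative rather than
the official moosefs procedure:

#!/usr/bin/env python3
# Rough sketch of the community-edition master move described above,
# run on the machine that is about to become the new master.
import subprocess

OLD_MASTER = "mfsmaster-old.lan"   # assumed hostname of the outgoing master
DATA_DIR = "/var/lib/mfs/"         # default metadata directory in my install

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop the old master so the metadata stops changing
#    (skip this if the old host is already dead).
run(["ssh", OLD_MASTER, "mfsmaster", "stop"])

# 2. Pull metadata.mfs and the changelog files onto this host.
run(["rsync", "-a", OLD_MASTER + ":" + DATA_DIR, DATA_DIR])

# 3. Repoint the mfsmaster DNS record at this host (zone edit or API
#    call goes here), then wait out any cached lookups - the caching is
#    what bit me.

# 4. Bring the master up here; chunkservers and clients reconnect on
#    their own.
run(["mfsmaster", "start"])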

By the way, to keep to the rpi subject, I did have an rpi3B with a USB2
SATA drive attached, but it was hopeless as a chunkserver and impacted
the whole cluster. Having the USB traffic and the network traffic go
through the same hub just didn't work out. I started with the odroids
before the rpi4 was released, or I might have experimented with that
first (using a SATA HAT) - anyone have a comment on how that compares
with an HC2?

BillK