On 09/11/18 10:29, Rich Freeman wrote:
> On Thu, Nov 8, 2018 at 8:16 PM Dale <rdalek1967@×××××.com> wrote:
>> I'm trying to come up with a
>> plan that allows me to grow easier and without having to worry about
>> running out of motherboard based ports.
>>
> So, this is an issue I've been changing my mind on over the years.
> There are a few common approaches:
>
> * Find ways to cram a lot of drives on one host
> * Use a patchwork of NAS devices or improvised hosts sharing over
> samba/nfs/etc and end up with a mess of mount points.
> * Use a distributed FS
>
> Right now I'm mainly using the first approach, and I'm trying to move
> to the last. The middle option has never appealed to me.
>
> So, to do more of what you're doing in the most efficient way
> possible, I recommend finding used LSI HBA cards. These have mini-SAS
> ports on them, and one of these can be attached to a breakout cable
> that gets you 4 SATA ports. I just picked up two of these for $20
> each on ebay (used) and they have 4 mini-SAS ports each, which is
> capacity for 16 SATA drives per card. Typically these have 4x or
> larger PCIe interfaces, so you'll need a large slot, or one with a
> cutout. You'd have to do the math but I suspect that if the card+MB
> supports PCIe 3.0 you're not losing much if you cram it into a smaller
> slot. If most of the drives are idle most of the time then that also
> demands less bandwidth. 16 fully busy hard drives obviously can put
> out a lot of data if reading sequentially. |
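>
> A quick back-of-the-envelope for that math, as a Python sketch - the
> ~200MB/s sustained per drive is an assumption, so plug in your own
> numbers:
>
>   # PCIe 3.0 is ~985MB/s usable per lane (8GT/s, 128b/130b encoding)
>   lane_mb_s = 985
>   slot_mb_s = 4 * lane_mb_s      # a x4 slot: ~3940MB/s
>   need_mb_s = 16 * 200           # 16 busy drives: ~3200MB/s
>   print(need_mb_s <= slot_mb_s)  # True: a x4 slot roughly keeps up
>
> So even 16 drives reading sequentially at once roughly fit in a x4
> PCIe 3.0 slot.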
>
> You can of course get more consumer-oriented SATA cards, but you're
> lucky to get 2-4 SATA ports on a card that runs you $30. The mini-SAS
> HBAs get you a LOT more drives per PCIe slot, and your PCIe slots are
> your main limiting factor assuming you have power and case space.
> |
> Oh, and those HBA cards need to be flashed into "IT" mode - they're
> often sold this way, but if they support RAID you want to flash the IT
> firmware that just makes them into a bunch of standalone SATA slots.
> This is usually a PITA that involves DOS or whatever, but I have
> noticed some of the software needed in the Gentoo repo. |
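>
> As a rough sketch, the crossflash usually goes something like the
> steps below - the firmware image name (2118it.bin) is just an example
> for a SAS2008-based card, so check what your particular card needs
> before erasing anything:
>
>   import subprocess
>
>   def run(*cmd):
>       # echo each step, then hand it to LSI's sas2flash utility
>       print("+", " ".join(cmd))
>       subprocess.run(cmd, check=True)
>
>   run("sas2flash", "-listall")        # confirm the adapter is visible
>   run("sas2flash", "-o", "-e", "6")   # erase the existing IR/RAID flash
>   run("sas2flash", "-o", "-f", "2118it.bin")  # write the IT firmware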
>
> If you go that route it is just like having a ton of SATA ports in
> your system - they just show up as sda...sdz and so on (no idea where
> it goes after that). Software-wise you just keep doing what you're
> already doing (though you should be seriously considering
> mdadm/zfs/btrfs/whatever at that point). |
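>
> (Incidentally, it turns out the naming just rolls over to two letters
> after sdz - sdaa, sdab, and so on. A quick Python check of the
> sequence:
>
>   import itertools, string
>   names = ["sd" + "".join(t) for n in (1, 2)
>            for t in itertools.product(string.ascii_lowercase, repeat=n)]
>   print(names[24:28])   # ['sdy', 'sdz', 'sdaa', 'sdab']
>
> so you'll run out of slots long before you run out of names.)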
>
> That is the more traditional route.
>
> Now let me talk about distributed filesystems, which is the more
> scalable approach. I'm getting tired of being limited by SATA ports,
> and cases, and such. I'm also frustrated with some of zfs's
> inflexibility around removing drives. These are constraints that make
> upgrading painful, and often inefficient. Distributed filesystems
> offer a different solution.
>
> A distributed filesystem spreads its storage across many hosts, with
> an arbitrary number of drives per host (more or less). So, you can
> add more hosts, add more drives to a host, and so on. That means
> you're never forced to try to find a way to cram a few more drives in
> one host. The resulting filesystem appears as one gigantic filesystem
> (unless you want to split it up), which means no mess of nfs
> mountpoints, and none of the other headaches of nfs. Just as
> with RAID these support redundancy, except now you can lose entire
> hosts without issue. With many you can even tell it which
> PDU/rack/whatever each host is plugged into, and it will make sure you
> can lose all the hosts in one rack without losing data. You can also
> mount the filesystem
> on as many hosts as you want at the same time. |
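>
> A toy sketch of that kind of topology-aware placement - the rule is
> just "never put two copies of a chunk in the same rack" (the host/rack
> layout here is hypothetical):
>
>   # hypothetical hosts tagged with the rack they live in
>   racks = {"host1": "rack-a", "host2": "rack-a",
>            "host3": "rack-b", "host4": "rack-c"}
>
>   def place_replicas(n):
>       # take at most one host per rack, so losing a whole rack
>       # costs at most one copy of any given chunk
>       chosen, used = [], set()
>       for host, rack in racks.items():
>           if rack not in used and len(chosen) < n:
>               chosen.append(host)
>               used.add(rack)
>       return chosen
>
>   print(place_replicas(3))   # ['host1', 'host3', 'host4']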
>
> They do tend to be a bit more complex. The big players can scale VERY
> large - thousands of drives easily. Everything seems to be moving
> towards Ceph/CephFS. If you were hosting a datacenter full of
> VMs/containers/etc I'd be telling you to host it on Ceph. However,
> for small scale (which you definitely are right now), I'm not thrilled
> with it. Due to the way it allocates data (hash-based), anytime
> anything changes you end up having to move all the data around in the
> cluster, and all the reports I've read suggest it doesn't perform all
> that great if you only have a few nodes. Ceph storage nodes are also
> RAM-hungry, and I want to run these on ARM to save power, and few ARM
> boards have that kind of RAM, and they're very expensive. |
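>
> To make the data-movement point concrete, here's a toy version of
> hash-based placement (naive modulo for illustration - Ceph's CRUSH is
> much smarter, but topology changes still trigger rebalancing):
>
>   objects = range(10000)
>   before = {o: hash(o) % 4 for o in objects}  # 4 storage nodes
>   after = {o: hash(o) % 5 for o in objects}   # add a fifth node
>   moved = sum(1 for o in objects if before[o] != after[o])
>   print(f"{moved / len(objects):.0%} moved")  # 80% of objects relocate
>
> Adding one node forces most of the existing data to shuffle between
> hosts.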
>
> Personally I'm working on deploying a cluster of a few nodes running
> LizardFS, which is basically a fork/derivative of MooseFS. While it
> won't scale nearly as well, below 100 nodes should be fine, and in
> particular it sounds like it works fairly well with only a few nodes.
> It has its pros and cons, but for my needs it should be sufficient.
> It also isn't RAM-hungry. I'm going to be testing it on some
> RockPro64s, with the LSI HBAs.
>
> I did note that Gentoo lacks a LizardFS client. I suspect I'll be
> looking to fix that - I'm sure the moosefs ebuild would be a good
> starting point. I'm probably going to be a wimp and run the storage
> nodes on Ubuntu or whatever upstream targets - they're basically
> appliances as far as I'm concerned.
> |
> So, those are the two routes I'd recommend. Just get yourself an HBA
> if you only want a few more drives. If you see your needs expanding
> then consider a distributed filesystem. The advantage of the latter
> is that you can keep expanding it however you want with additional
> drives/nodes/whatever. If you're going over 20 nodes I'd use Ceph for
> sure - IMO that seems to be the future of this space.
> |
I'll second your comments on ceph after my experience - it's a great
idea for large-scale systems, but performance is quite poor on small
ones. It needs at least gigabit connections on two separate networks,
as well as only one or two drives per host, to work properly.

I think I'll give lizardfs a go - an interesting read.

BillK |