On 21/01/15 00:03, Rich Freeman wrote:
> On Tue, Jan 20, 2015 at 10:07 AM, James <wireless@×××××××××××.com> wrote:
>> Bill Kenworthy <billk <at> iinet.net.au> writes:
>>
>>> You can turn off COW and go single on btrfs to speed it up, but bugs in
>>> ceph and btrfs lose data real fast!
>>
>> Interesting idea, since I'll have raid1 underneath each node. I'll need to
>> dig into this idea a bit more.
>>
> |
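
(For anyone following along: "turn off COW" means mounting with
nodatacow or setting the +C attribute (chattr +C) on the image
directory, and "single" is the single data profile, e.g.
mkfs.btrfs -d single. The +C bit can also be set from code - a rough
sketch, assuming the x86_64 ioctl values from linux/fs.h and a made-up
path; btrfs only honours it on empty files, or on directories for
files created afterwards:

    import array, fcntl, os

    FS_IOC_GETFLAGS = 0x80086601  # linux/fs.h, 64-bit values
    FS_IOC_SETFLAGS = 0x40086602
    FS_NOCOW_FL     = 0x00800000  # the chattr +C bit

    # set +C on the images directory so new VM images inherit nodatacow
    fd = os.open("/mnt/btrfs/images", os.O_RDONLY)
    flags = array.array("l", [0])
    fcntl.ioctl(fd, FS_IOC_GETFLAGS, flags, True)
    flags[0] |= FS_NOCOW_FL
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, flags, True)
    os.close(fd)

Note that nodatacow also switches off data checksumming, which is
exactly the trade-off Rich gets at below.)
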
> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways. In general, adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance. I think the right
> solution is going to depend on the circumstances.
>
> If ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely. It isn't going to add any additional value,
> and COW filesystems do have a performance cost. If I had mirroring at
> the ceph level I'd probably just run the OSDs on ext4 on lvm with no
> mdadm/btrfs/whatever below that. Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load. If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.
>
> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot. However, right
> now there is no way to turn off COW while leaving checksumming
> enabled. It would be nice if you could leave the checksumming on:
> then, if there was bitrot, btrfs would just return an error when you
> tried to read the file, and ceph would handle it like any other
> disk error and use a mirrored copy on another node. The problem with
> ceph+ext4 is that if there is bitrot, neither layer will detect it.
> |
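
That interaction is worth spelling out. A toy model of checksummed
reads under replication - nothing below is real btrfs or ceph code,
all names are invented:

    import hashlib

    def fs_read(data, stored_sum):
        # model a checksumming fs: an I/O error instead of bad data
        if hashlib.sha256(data).hexdigest() != stored_sum:
            raise IOError("checksum mismatch - bitrot")
        return data

    def replicated_read(replicas):
        # model the ceph side: on any read error, try the next node
        for data, stored_sum in replicas:
            try:
                return fs_read(data, stored_sum)
            except IOError:
                continue
        raise IOError("all replicas bad")

    good = b"vm image block"
    csum = hashlib.sha256(good).hexdigest()
    # first replica rotted on disk, second is intact:
    print(replicated_read([(b"vm image bl0ck", csum), (good, csum)]))

With ext4 underneath, fs_read would hand back the rotted block without
complaint and no layer would ever notice.
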
> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph? I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies). It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.
> |
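
No argument on that last point - schematically, the COW path does
strictly more work per overwrite. A toy model (invented structures,
not btrfs internals):

    def write_in_place(disk, extent_map, block_no, data):
        disk[extent_map[block_no]] = data      # one write, done

    def write_cow(disk, extent_map, free_list, checksums, block_no, data):
        new_loc = free_list.pop()              # allocate fresh space
        disk[new_loc] = data                   # write the new copy
        old_loc = extent_map[block_no]
        extent_map[block_no] = new_loc         # rewrite metadata ...
        checksums[new_loc] = hash(data)        # ... and checksums
        free_list.append(old_loc)              # retire the old extent

    disk = {0: b"old", 1: b"-"}; free = [2]; extents = {7: 0}; sums = {}
    write_cow(disk, extents, free, sums, 7, b"new")

Every overwrite in the COW case is the data write plus at least two
metadata updates that land elsewhere on the disk.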
|
More broadly, an answer to both you and James here:

I think it was pre 8.0 when I dropped out. It's Ceph that suffers from
bitrot - I use the "golden master" approach to generating the VMs, so
corruption was obvious. I did report one bug in the early days that
turned out to be btrfs, but I think it was largely ceph. That has since
been borne out by consolidating the ceph trial hardware and using it
with btrfs on the same storage - problems are now rare, and I can point to
hardware/power when it happened. |
|
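The check itself is nothing clever - hash everything in a clone against
a manifest taken from the master. Roughly (paths invented; this is
just the idea, not my actual script):

    import hashlib, os

    def manifest(root):
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                sums[os.path.relpath(path, root)] = digest.hexdigest()
        return sums

    master = manifest("/srv/golden-master")
    clone = manifest("/srv/vm-clone")
    for path, digest in master.items():
        if clone.get(path) != digest:
            print("corrupt or missing:", path)
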
The performance hit was not due to lack of horsepower (cpu, ram, etc.)
but due to I/O - both network bandwidth and the internal bus on the
hosts. That is why a small number of systems, no matter how powerful,
won't work well. For real performance, I saw people using SSDs and
large numbers of hosts in order to distribute the data flows - this
does work, and I saw some insane numbers posted. It also requires
multiple networks (internal and external) to separate the flows (not
VLANs but dedicated pipes - see the ceph.conf snippet below) due to the
extreme burstiness of the traffic. As well as VM images, I had backups
(using dirvish) and thousands of security camera images. Deletes of a
directory with a lot of files would take many hours. The same applied
to using ceph as a mail store (it came up on the ceph list under "why
is it so slow") - as a chunk server it's just not suitable for lots of
small files.
|
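On the separate-networks point: this is just ceph's standard
public/cluster split in ceph.conf (addresses invented):

    [global]
        public network  = 192.168.1.0/24   # client-facing traffic
        cluster network = 10.0.0.0/24      # replication and recovery

Replication and recovery storms go over the cluster network, so clients
don't stall every time an OSD resyncs.
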
Towards the end of my use, I stopped seeing bitrot on systems that held
data but sat idle; it was limited to occurring during heavy use. My
overall conclusion is that lots of small hosts with no more than a
couple of drives each, and multiple networks with lots of bandwidth, is
what it's designed for.
|
I had two reasons for looking at ceph: distributed storage where data
in use was held close to the user but could be redistributed easily
with multiple copies (think two small data stores with an intermittent
WAN link, storing high- and low-priority data), and high performance
with high availability on HW failure.
|
Ceph was not the answer for me at the scale I have.
|
BillK |