Gentoo Archives: gentoo-user

From: Bill Kenworthy <billk@×××××××××.au>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Re: btrfs fails to balance
Date: Tue, 20 Jan 2015 23:03:53
Message-Id: 54BEDEC9.3050402@iinet.net.au
In Reply to: Re: [gentoo-user] Re: btrfs fails to balance by Rich Freeman
On 21/01/15 00:03, Rich Freeman wrote:
> On Tue, Jan 20, 2015 at 10:07 AM, James <wireless@×××××××××××.com> wrote:
>> Bill Kenworthy <billk <at> iinet.net.au> writes:
>>
>>> You can turn off COW and go single on btrfs to speed it up but bugs in
>>> ceph and btrfs lose data real fast!
>>
>> Interesting idea, since I'll have raid1 underneath each node. I'll need to
>> dig into this idea a bit more.
>>
>
> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways. In general adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance. I think the right
> solution is going to depend on the circumstances.
>
> if ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely. It isn't going to add any additional value,
> and they do have a performance cost. If I had mirroring at the ceph
> level I'd probably just run them on ext4 on lvm with no
> mdadm/btrfs/whatever below that. Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load. If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.
>
> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot. However, right
> now there is no way to turn off COW while leaving checksumming
> enabled. It would be nice if you could leave the checksumming on.
> Then if there was bitrot btrfs would just return an error when you
> tried to read the file, and then ceph would handle it like any other
> disk error and use a mirrored copy on another node. The problem with
> ceph+ext4 is that if there is bitrot neither layer will detect it.
>
> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph? I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies). It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.
>

answer to both you and James here:
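
For James, since you said you'd dig into the no-COW idea - the rough
recipe, from memory and untested here (the device and mount point are
just examples), is:

  # either turn off COW for new files created under a directory...
  chattr +C /srv/osd0

  # ...or mount the whole filesystem with COW off
  mount -o nodatacow /dev/sdb /srv/osd0

  # NB: no-COW files also lose data checksumming, which is the gap
  # Rich mentions above

  # convert existing data chunks to the "single" profile
  btrfs balance start -dconvert=single /srv/osd0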

I think it was pre 8.0 when I dropped out. It's Ceph that suffers from
bitrot - I used the "golden master" approach to generating the VMs, so
corruption was obvious. I did report one bug in the early days that
turned out to be btrfs, but I think it was largely ceph. That has been
borne out by consolidating the ceph trial hardware and reusing it with
btrfs on the same storage - problems are now rare, and when they did
happen I could point to hardware/power.

The performance hit was not due to lack of horsepower (CPU, RAM, etc.)
but due to I/O - both network bandwidth and the internal bus on the
hosts. That is why a small number of systems, no matter how powerful,
won't work well. For real performance, I saw people using SSDs and
large numbers of hosts in order to distribute the data flows - this
does work, and I saw some insane numbers posted. It also requires
multiple networks (internal and external) to separate the flows (not
VLANs but dedicated pipes) due to the extreme burstiness of the
traffic. As well as VM images, I had backups (using dirvish) and
thousands of security camera images. Deletes of a directory with a lot
of files would take many hours. The same goes for using ceph as a mail
store (it came up on the ceph list under "why is it so slow") - as a
chunk server it's just not suitable for lots of small files.
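
For reference, the internal/external split I mean is ceph's public vs
cluster network, set in ceph.conf along these lines (the addresses are
just examples, not my actual setup):

  [global]
      # clients and VMs talk to the cluster over the public network
      public network = 192.168.1.0/24
      # OSD replication and recovery stays on the cluster network
      cluster network = 10.1.1.0/24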

Towards the end of my use, I stopped seeing bitrot on a system that
held data but was otherwise idle, which narrows it down to occurring
during heavy use. My overall conclusion is that lots of small hosts
with no more than a couple of drives each, plus multiple networks with
lots of bandwidth, is what it's designed for.

I had two reasons for looking at ceph - distributed storage where data
in use was held close to the user but could be redistributed easily
with multiple copies (think two small data stores with an intermittent
WAN link, storing high- and low-priority data), and high performance
with high availability on HW failure.

Ceph was not the answer for me at the scale I have.

BillK