On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikhan77@×××××.com> wrote:
>
> Actually, I'm running across 3x 1TB here on my desktop, with mraid1 and
> draid0. Combined with bcache it gives confident performance.
>

Not entirely sure I'd use the word "confident" to describe a
filesystem where the loss of one disk guarantees that:
1. You will lose data (no data redundancy).
2. But the filesystem will be able to tell you exactly what data you
lost (as metadata will be fine).

>
> I was very happy with XFS for a long time but switched to btrfs when
> it became usable, due to compression and such. But compression
> performance seems to have gotten worse lately: IO performance drops
> due to hogged CPUs even though my system really isn't that incapable.
>

Btrfs performance is pretty bad in general right now. The problem is
that the developers simply haven't gotten around to optimizing it
fully, mainly because they're more focused on getting rid of the data
corruption bugs (which is of course the right priority). For example,
in raid1 mode btrfs picks which mirror to read from based on whether
the requesting process's PID is even or odd, without any regard to
disk utilization.

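To illustrate, the read-balancing policy amounts to something like
the toy Python sketch below (just a model of the idea as described,
not the actual kernel code):

    # Toy model of btrfs raid1 read balancing as described above --
    # NOT the real kernel code, just the idea: the reader's PID
    # parity picks the mirror, ignoring how busy each disk is.
    import os

    def pick_mirror(num_mirrors: int = 2) -> int:
        return os.getpid() % num_mirrors

    print(f"PID {os.getpid()} reads from mirror {pick_mirror()}")

So every read issued by a given process lands on the same mirror for
the life of that process, no matter how loaded that disk already is.
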
When I moved to zfs I noticed a huge performance boost.

Fundamentally I don't see why btrfs can't perform just as well as the
others. It just isn't there yet.

> What's still cool is that I don't need to manage volumes since the
> volume manager is built into btrfs. XFS on LVM was not that
> flexible. If btrfs didn't have this feature, I probably would have
> switched back to XFS already.

My main concern with xfs/ext4 is that neither provides on-disk
checksums or protection against the raid write hole (where a crash
between the data and parity writes leaves a stripe silently
inconsistent).

I just switched motherboards a few weeks ago, and either a connection
or a SATA port was bad, because one of my drives was getting a TON of
checksum errors on zfs. I moved it to an LSI card and scrubbed, and
while it took forever and zfs degraded the array more than once due
to the high error rate, eventually it patched up all the errors and
now the array is working without issue. I didn't suffer more than a
bit of inconvenience, but with even mdadm raid1 I'd have had a HUGE
headache trying to recover from that (doing who knows how much
troubleshooting before realizing I had to do a slow full restore from
backup with the system down).

I just don't see how a modern filesystem can get away without full
checksum support. It is a bit odd that it has taken so long for Ceph
to introduce it, and I'm still not sure whether it is truly
end-to-end, or whether there is some point in the data's life where
it isn't protected by checksums. If I were designing something like
Ceph, I'd checksum the data at the client the moment it enters
storage, store the checksum and the data independently, and then
retrieve both and verify at the client when the data leaves storage.
Then you're protected against corruption at any layer below that.
You could of course add further checks to catch errors sooner, before
the client even sees them. I think the issue is that Ceph was
originally designed for object storage and they just figured the
application would be responsible for data integrity.

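A minimal sketch of that end-to-end scheme, with a toy in-memory
"backend" standing in for whatever storage stack sits between the two
client-side hops:

    # Sketch only: checksum on the way in, verify on the way out, and
    # store the checksum independently of the data. The dicts below
    # are stand-ins for the real storage layers being distrusted here.
    import hashlib

    _data = {}       # toy backend for object payloads
    _checksums = {}  # checksums kept separately from the data

    def put(key: str, payload: bytes) -> None:
        # Checksum at the client the moment the data enters storage.
        _checksums[key] = hashlib.sha256(payload).hexdigest()
        _data[key] = payload

    def get(key: str) -> bytes:
        # Verify at the client the moment the data leaves storage;
        # corruption in any layer in between shows up right here.
        payload = _data[key]
        if hashlib.sha256(payload).hexdigest() != _checksums[key]:
            raise IOError(f"end-to-end checksum mismatch for {key!r}")
        return payload

    put("obj1", b"some payload")
    assert get("obj1") == b"some payload"
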
The other benefit of checksums is that, if they're done right, scrubs
can go a lot faster, because you don't have to scrub all the
redundancy data synchronously. You can just start an idle-priority
read thread on every drive and pause it anytime that drive is
accessed, so an access on one drive won't slow down the others. With
traditional RAID you have to read all the redundancy data
synchronously, because you can't check the integrity of any of it
without the full set. I think even ZFS is stuck doing synchronous
reads due to how it stores/computes the checksums. This is something
btrfs got right.

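Here's a toy sketch of that per-drive scrub model, assuming every
block carries its own checksum so each device can be verified in
isolation (the ToyDevice class and its contents are invented for
illustration):

    # Each device gets its own scrub worker; a busy device pauses
    # only its own worker, because per-block checksums make each
    # device independently verifiable -- no full-stripe reads needed.
    import hashlib
    import threading
    import time

    class ToyDevice:
        def __init__(self, name, blocks):
            self.name = name
            # Store (data, checksum) pairs, as a checksumming fs would.
            self.blocks = [(b, hashlib.sha256(b).hexdigest())
                           for b in blocks]
            self.busy = threading.Event()  # set while foreground I/O runs

        def scrub(self):
            for i, (data, csum) in enumerate(self.blocks):
                while self.busy.is_set():
                    time.sleep(0.05)  # yield to foreground I/O on this disk
                if hashlib.sha256(data).hexdigest() != csum:
                    print(f"{self.name}: block {i} corrupt; "
                          "rebuild it from another copy")

    devices = [ToyDevice(f"disk{i}", [b"block-a", b"block-b"])
               for i in range(3)]
    workers = [threading.Thread(target=d.scrub) for d in devices]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
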
>
>> For the moment I'm
>> relying more on zfs.
>
> How does it perform memory-wise? In particular, I'm currently using
> bees[1] for deduplication: it uses a 1G memory-mapped file (you can
> choose other sizes if you want), and it picks up new files really
> fast, within a minute. I don't think zfs can do anything like that
> with the same resources.

I'm not using deduplication, but my understanding is that zfs
deduplication:
1. Works just fine.
2. Uses a TON of RAM.

So, it might not be your cup of tea. There is no way to do
semi-offline dedup as with btrfs ("semi-offline" in the sense that
the filesystem stays fully running - you just periodically scan for
dups and fix them after the fact, versus detecting them in realtime).
With a semi-offline mode the performance hit only comes at a time of
my choosing, versus using gobs of RAM all the time to detect what are
probably fairly rare dups. The sketch below shows roughly what that
after-the-fact scan amounts to.

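This is only a toy sketch of the detection step: it hashes whole
files, whereas real tools like bees or duperemove work at extent
granularity and then deduplicate in place via clone/reflink ioctls.
The path is a placeholder.

    # Toy version of the offline "scan for dups" step: hash every
    # file under a root and report groups with identical contents.
    import hashlib
    import os
    from collections import defaultdict

    def find_dups(root):
        by_hash = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        digest = hashlib.sha256(f.read()).hexdigest()
                except OSError:
                    continue  # skip unreadable/special files
                by_hash[digest].append(path)
        return [paths for paths in by_hash.values() if len(paths) > 1]

    for group in find_dups("/some/path"):
        print("duplicates:", group)
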
That aside, I find zfs works fine memory-wise (again, I don't use
dedup). It has its own cache (the ARC) that isn't fully integrated
into the kernel's native page cache, so it tends to hold on to a lot
more RAM than other filesystems, but you can tune this behavior so
that it stays fairly tame.

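For example, with ZFS on Linux the ARC's ceiling can be capped via
the zfs_arc_max module parameter (the 4 GiB value below is just an
illustration - pick whatever fits your machine):

    # /etc/modprobe.d/zfs.conf -- cap the ARC at 4 GiB (4 * 2^30 bytes)
    options zfs zfs_arc_max=4294967296

The same parameter can also be changed at runtime through
/sys/module/zfs/parameters/zfs_arc_max.
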
--
Rich