On Sat, 16 Sep 2017 10:05:21 -0700, Rich Freeman <rich0@g.o> wrote:

> On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikhan77@×××××.com>
> wrote:
> >
> > Actually, I'm running across 3x 1TB here on my desktop, with mraid1
> > and draid 0. Combined with bcache it gives confident performance.
> >
>
> Not entirely sure I'd use the word "confident" to describe a
> filesystem where the loss of one disk guarantees that:
> 1. You will lose data (no data redundancy).
> 2. But the filesystem will be able to tell you exactly what data you
> lost (as metadata will be fine).

I take daily backups with borg backup. A run takes only 15 minutes,
and the backups have been successfully tested twice. The only
breakdowns I had were due to btrfs bugs, not hardware faults.

This is confident enough for my desktop system.


> > I was very happy a long time with XFS but switched to btrfs when it
> > became usable due to compression and stuff. But performance of
> > compression seems to get worse lately, IO performance drops due to
> > hogged CPUs even if my system really isn't that incapable.
> >
>
> Btrfs performance is pretty bad in general right now. The problem is
> that they just simply haven't gotten around to optimizing it fully,
> mainly because they're more focused on getting rid of the data
> corruption bugs (which is of course the right priority). For example,
> with raid1 mode btrfs picks the disk to use for raid based on whether
> the PID is even or odd, without any regard to disk utilization.
>
> When I moved to zfs I noticed a huge performance boost.

Interesting... While I never tried it, I always feared that it would
perform worse unless you throw lots of RAM and a ZIL/L2ARC at it.

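By the way, the even/odd read balancing mentioned above really is just
PID parity. A minimal sketch of the idea (not the actual kernel code):

def pick_mirror(pid, num_copies=2):
    # Roughly the current btrfs raid1 read policy: the copy to read
    # from follows the reader's PID, ignoring queue depth and load.
    return pid % num_copies

So a single-threaded reader keeps hitting the same disk while the
other copy sits idle, which explains part of the performance gap.
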
> Fundamentally I don't see why btrfs can't perform just as well as the
> others. It just isn't there yet.

And it will still take a long time, because the devs keep throwing
new features at it which then need to stabilize.


> > What's still cool is that I don't need to manage volumes since the
> > volume manager is built into btrfs. XFS on LVM was not that
> > flexible. If btrfs wouldn't have this feature, I probably would
> > have switched back to XFS already.
>
> My main concern with xfs/ext4 is that neither provides on-disk
> checksums or protection against the raid write hole.

Btrfs has suffered from the same RAID5 write hole problem for years. I
had always planned on moving to RAID5 later (which is why I have 3
disks), but I fear this won't be fixed any time soon due to design
decisions made too early.

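For anyone following along, the write hole in a nutshell (a toy
example with single-byte chunks, nothing btrfs-specific):

a, b = 0b1010, 0b0110        # two data chunks in one RAID5 stripe
parity = a ^ b               # parity written while the stripe was whole

a = 0b1111                   # chunk A gets rewritten...
# ...and the box crashes before the matching parity update hits disk.

# Later the disk holding B dies and B is rebuilt from A and parity:
b_rebuilt = a ^ parity
print(bin(b_rebuilt), bin(b))  # 0b11 vs 0b110: silently wrong data

Without checksums nothing ever notices that the rebuilt chunk is
garbage.
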
> I just switched motherboards a few weeks ago and either a connection
> or a SATA port was bad because one of my drives was getting a TON of
> checksum errors on zfs. I moved it to an LSI card and scrubbed, and
> while it took forever and the system degraded the array more than once
> due to the high error rate, eventually it patched up all the errors
> and now the array is working without issue. I didn't suffer more than
> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
> headache trying to recover from that (doing who knows how much
> troubleshooting before realizing I had to do a slow full restore from
> backup with the system down).

I found md raid not very reliable in the past, but I haven't tried it
again in years, so this may have changed. I just remember that it
destroyed a file system after an unclean shutdown more than once, which
is not what I expect from RAID1. Other servers with file systems on
bare metal survived the same kind of shutdown just fine.


> I just don't see how a modern filesystem can get away without having
> full checksum support. It is a bit odd that it has taken so long for
> Ceph to introduce it, and I'm still not sure if it is truly
> end-to-end, or if at any point in its life the data isn't protected by
> checksums. If I were designing something like Ceph I'd checksum the
> data at the client the moment it enters storage, then independently
> store the checksum and data, and then retrieve both and check it at
> the client when the data leaves storage. Then you're protected
> against corruption at any layer below that. You could of course have
> additional protections to catch errors sooner before the client even
> sees them. I think that the issue is that Ceph was really designed
> for object storage originally and they just figured the application
> would be responsible for data integrity.

I'd at least pass the checksum down through all the layers and re-check
it at each one, so you could detect which transport or layer is broken.

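Something like this is what I mean, as a minimal sketch (the layer
names are made up):

import hashlib

def client_write(data):
    # The checksum is computed once at the client, before anything
    # else touches the data, and travels alongside it from then on.
    return data, hashlib.sha256(data).hexdigest()

def relay(layer, data, csum):
    # Every layer re-verifies the checksum it received; the first hop
    # that fails the check is where the corruption crept in.
    if hashlib.sha256(data).hexdigest() != csum:
        raise IOError("data corrupted arriving at layer: " + layer)
    return data, csum

payload, csum = client_write(b"some payload")
for layer in ("network", "block cache", "disk"):
    payload, csum = relay(layer, payload, csum)
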
> The other benefit of checksums is that if they're done right scrubs
> can go a lot faster, because you don't have to scrub all the
> redundancy data synchronously. You can just start an idle-priority
> read thread on every drive and then pause it anytime a drive is
> accessed, and an access on one drive won't slow down the others. With
> traditional RAID you have to read all the redundancy data
> synchronously because you can't check the integrity of any of it
> without the full set. I think even ZFS is stuck doing synchronous
> reads due to how it stores/computes the checksums. This is something
> btrfs got right.

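The difference boils down to roughly this (a very simplified sketch;
read_block, read_csum and read_chunk are made-up stand-ins, not real
APIs):

import hashlib
from functools import reduce

def scrub_device(read_block, read_csum, nblocks):
    # Checksummed FS: every block is self-describing, so each drive
    # can be scrubbed by its own idle-priority reader, independently.
    return [i for i in range(nblocks)
            if hashlib.sha256(read_block(i)).hexdigest() != read_csum(i)]

def scrub_stripe(read_chunk, nmembers, stripe):
    # Plain parity RAID: a stripe can only be verified by reading the
    # data and parity chunks from *all* members of the set together.
    chunks = [read_chunk(m, stripe) for m in range(nmembers)]
    data, parity = chunks[:-1], chunks[-1]
    xor = reduce(lambda x, y: bytes(i ^ j for i, j in zip(x, y)), data)
    return xor == parity
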
That's one more point that made me decide for btrfs, though I don't
make much use of it currently. I used to do regular scrubs a while ago,
but combined with bcache that is an SSD killer... I killed my old 128G
SSD within one year, although I used overprovisioning. Well, I didn't
actually kill it: I swapped it out at 99% of its rated lifetime
according to smartctl. It would probably still have worked for a long
time under normal workloads.


> >> For the moment I'm
> >> relying more on zfs.
> >
> > How does it perform memory-wise? Especially, I'm currently using
> > bees[1] for deduplication: It uses a 1G memory mapped file (you can
> > choose other sizes if you want), and it picks up new files really
> > fast, within a minute. I don't think zfs can do anything like that
> > within the same resources.
>
> I'm not using deduplication, but my understanding is that zfs
> deduplication:
> 1. Works just fine.

No doubt...

> 2. Uses a TON of RAM.

That's the problem. And I don't think there is any near-line dedup tool
available for it?


> So, it might not be your cup of tea. There is no way to do
> semi-offline dedup as with btrfs (not really offline in that the
> filesystem is fully running - just that you periodically scan for dups
> and fix them after the fact, vs detect them in realtime). With a
> semi-offline mode then the performance hits would only come at a time
> of my choosing, vs using gobs of RAM all the time to detect what are
> probably fairly rare dups.

I'm using bees, and I'd call it near-line. Changes to files are picked
up at commit time, when a new generation is made; it then walks the new
extents, maps them back to files, and deduplicates the blocks. I was
surprised how fast it detects new duplicate blocks. But it is still
working through the rest of the file system (it has been at it for
days), though with little impact on performance. Giving up 1G of RAM
for this is totally okay.

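In rough terms I picture a scan pass like this (a sketch of how I
understand it, with made-up helpers; as far as I know the real tool
walks new extents via btrfs tree-search ioctls and dedupes through the
extent-same ioctl):

import hashlib

def bees_pass(fs, last_gen, hash_table):
    new_gen = fs.current_generation()
    for extent in fs.extents_newer_than(last_gen):   # only new data
        for offset, block in extent.blocks():
            digest = hashlib.sha256(block).digest()
            match = hash_table.lookup(digest)        # the 1G mmap'd table
            if match is not None and match.data == block:
                fs.dedupe(src=match, dst=extent, offset=offset)
            else:
                hash_table.insert(digest, extent, offset)
    return new_gen                                   # resume point next time
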
Once it has finished its first full scan, I'm thinking about starting
it at timed intervals instead. But it looks like the impact will be so
low that I can just keep it running all the time. Using cgroups to
limit its CPU and IO shares works really well.

I still haven't evaluated how it interferes with defragmentation,
though, or how big the impact of bees fragmenting extents is.


> That aside, I find it works fine memory-wise (I don't use dedup). It
> has its own cache system not integrated fully into the kernel's native
> cache, so it tends to hold on to a lot more ram than other
> filesystems, but you can tune this behavior so that it stays fairly
> tame.

I think the reasoning behind using its own caching is that block
caching at the VFS layer just cannot be done efficiently for a CoW file
system with scrubbing and everything. You need good cache hinting
throughout the whole pipeline, and that is only slowly being integrated
into the kernel.

E.g., when btrfs does a CoW operation, bcache doesn't get notified that
it can discard the now-freed block from its cache. I don't know if this
is handled in the kernel cache layer...

--
Regards,
Kai

Replies to list-only preferred.