Rich Freeman <rich0@g.o> writes:

> On Tue, Jan 5, 2016 at 5:16 PM, lee <lee@××××××××.de> wrote:
>> Rich Freeman <rich0@g.o> writes:
>>
>>>
>>> I would run btrfs on bare partitions and use btrfs's raid1
>>> capabilities. You're almost certainly going to get better
>>> performance, and you get more data integrity features.
>>
>> That would require me to set up software raid with mdadm as well, for
>> the swap partition.
>
> Correct, if you don't want a panic if a single swap drive fails.
>
>>
>>> If you have a silent corruption with mdadm doing the raid1 then btrfs
>>> will happily warn you of your problem and you're going to have a
>>> really hard time fixing it,
>>
>> BTW, what do you do when you have silent corruption on a swap partition?
>> Is that possible, or does swapping use its own checksums?
>
> If the kernel pages in data from the good mirror, nothing happens. If
> the kernel pages in data from the bad mirror, then whatever data
> happens to be there is what will get loaded and used and/or executed.
> If you're lucky the modified data will be part of unused heap or
> something. If not, well, just about anything could happen.
>
> Nothing in this scenario will check that the data is correct, except
> for a forced scrub of the disks. A scrub would probably detect the
> error, but I don't think mdadm has any ability to recover it. Your
> best bet is probably to try to immediately reboot and save what you
> can, or a less-risky solution assuming you don't have anything
> critical in RAM is to just do an immediate hard reset so that there is
> no risk of bad data getting swapped in and overwriting good data on
> your normal filesystems.

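(For what it's worth, I assume the forced scrub you mean is the md
"check" action; the array name here is made up:

    # ask md to read both halves of the mirror and compare them
    echo check > /sys/block/md0/md/sync_action
    # afterwards, a non-zero count means the copies disagree
    cat /sys/block/md0/md/mismatch_cnt

That only detects a mismatch; on a two-disk raid1, md has no way of
telling which copy is the good one.)
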
Then you might be better off with no swap unless you put it on a file
system that uses checksums.
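
If I do end up wanting mirrored swap, the setup is at least small and
self-contained. A minimal sketch, with made-up device names:

    # mirror two swap partitions with mdadm
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/sda2 /dev/sdb2
    mkswap /dev/md1
    swapon /dev/md1

plus the matching fstab entry and the array in mdadm.conf.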

>> It's still odd. I already have two different file systems and the
>> overhead of one kind of software raid while I would rather stick to one
>> file system. With btrfs, I'd still have two different file systems ---
>> plus mdadm and the overhead of three different kinds of software raid.
>
> I'm not sure why you'd need two different filesystems.

btrfs and zfs.

I won't put my data on btrfs for at least quite a while.

> Just btrfs for your data. I'm not sure where you're counting three
> types of software raid either - you just have your swap.

btrfs raid is software raid, zfs raid is software raid, mdadm is
software raid. That makes three different software raids.

> And I don't think any of this involves any significant overhead, other
> than configuration.

mdadm does have a very significant performance overhead. ZFS mirror
performance seems to be rather poor. I don't know how much overhead is
involved with zfs and btrfs software raid, but since they basically all
do the same thing, I have my doubts that the overhead is significantly
lower than the overhead of mdadm.
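
At least the raw read side is easy enough to measure; something like
this (untested, device names made up) shows what the md layer costs:

    # read 4GB straight from the array, bypassing the page cache
    dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct
    # the same test against a single member disk, for comparison
    dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct

Random and write-heavy workloads are where the differences really
show, though.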

>> How would it be so much better to triple the software raids and to still
>> have the same number of file systems?
>
> Well, the difference would be more data integrity insofar as hardware
> failure goes, but certainly more risk of logical errors (IMO).

There would be a possibility of more data integrity for the root file
system, assuming that btrfs is as reliable as ext4 on hardware raid. Is
it?

That's about 10GB, mostly read and not written to. It would be a
very minor improvement, if any.

>>>> When you use hardware raid, it
>>>> can be disadvantageous compared to btrfs-raid --- and when you use it
>>>> anyway, things are suddenly much more straightforward because everything
>>>> is on raid to begin with.
>>>
>>> I'd stick with mdadm. You're never going to run mixed
>>> btrfs/hardware-raid on a single drive,
>>
>> A single disk doesn't make for a raid.
>
> You misunderstood my statement. If you have two drives, you can't run
> both hardware raid and btrfs raid across them. Hardware raid setups
> don't generally support running across only part of a drive, and in
> this setup you'd have to run hardware raid on part of each of two
> single drives.

I have two drives to hold the root file system and the swap space. The
raid controller they'd be connected to does not support using disks
partially.

>>> and the only time I'd consider
>>> hardware raid is with a high quality raid card. You'd still have to
>>> convince me not to use mdadm even if I had one of those lying around.
>>
>> From my own experience, I can tell you that mdadm already does have
>> significant overhead when you use a raid1 of two disks and a raid5 with
>> three disks. This overhead may be somewhat due to the SATA controller
>> not being as capable as one would expect --- yet that doesn't matter
>> because one thing you're looking at, besides reliability, is the overall
>> performance. And the overall performance very noticeably increased when
>> I migrated from mdadm raids to hardware raids, with the same disks and
>> the same hardware, except that the raid card was added.
>
> Well, sure, the raid card probably had battery-backed cache if it was
> decent, so linux could complete its commits to RAM and not have to
> wait for the disks.

Yes.

>> And that was only 5 disks. I also know that the performance with a ZFS
>> mirror with two disks was disappointingly poor. Those disks aren't
>> exactly fast, but still. I haven't tested yet if it changed after
>> adding 4 mirrored disks to the pool. And I know that the performance of
>> another hardware raid5 with 6 disks was very good.
>
> You're probably going to find the performance of a COW filesystem to
> be inferior to that of an overwrite-in-place filesystem, simply
> because the latter has to do less work.

Reading isn't as fast as I would expect, either.

>> Thus I'm not convinced that software raid is the way to go. I wish they
>> would make hardware ZFS (or btrfs, if it ever becomes reliable)
>> controllers.
>
> I doubt it would perform any better. What would that controller do
> that your CPU wouldn't do?

The CPU wouldn't need to do what the controller does and would have
time to do other things instead.

> Well, other than have battery-backed cache, which would help in any
> circumstance. If you stuck 5 raid cards in your PC and put one drive
> on each card and put mdadm or ZFS across all five it would almost
> certainly perform better because you're adding battery-backed cache.

It's probably not only that. A 512MB cache probably doesn't make that
much difference. I'm guessing that the SATA controller might be
overwhelmed when it has to handle 5 disks simultaneously, while the
hardware raid controller is designed to handle up to 256 disks
simultaneously and thus does a much better job with a couple of disks,
taking the load off the rest of the system.
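
That guess would be easy enough to check: read from all disks at once
and see whether the per-disk numbers collapse (disk names made up):

    # time sequential reads on every disk simultaneously
    for d in sda sdb sdc sdd sde; do hdparm -t /dev/$d & done; wait

If each disk alone manages its full speed but the parallel numbers add
up to much less, the controller is the bottleneck.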

In the end, it doesn't really matter what exactly causes the difference
in performance. What matters is that the performance is so much better.

>> The relevant advantage of btrfs is being able to make snapshots. Is
>> that worth all the (potential) trouble? Snapshots are worthless when
>> the file system destroys them with the rest of the data.
>
> And that is why I wouldn't use btrfs on a production system unless the
> use case mitigated this risk and there was benefit from the snapshots.
> Of course you're taking on more risk using an experimental filesystem.

Yes, and I'd have other disadvantages. I've come to think that being
able to make snapshots isn't worth all the trouble.

>>> btrfs does not support swap files at present.
>>
>> What happens when you try it?
>
> No idea. Should be easy to test in a VM. I suspect either an error
> or a kernel bug/panic/etc.

If it's that bad, that doesn't sound like a file system ready to be used
yet.
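
The test itself would only be a few lines in a throwaway VM (file name
made up):

    dd if=/dev/zero of=/swapfile bs=1M count=512
    mkswap /swapfile
    swapon /swapfile    # on btrfs, expect an error or worse

If swapon merely fails with an error, fine; if it takes the kernel
down, that says something about readiness, too.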

>>> When it does you'll need to disable COW for them (using chattr)
>>> otherwise they'll be fragmented until your system grinds to a halt. A
>>> swap file is about the worst case scenario for any COW filesystem -
>>> I'm not sure how ZFS handles them.
>>
>> Well, then they need to make special provisions for swap files in btrfs
>> so that we can finally get rid of the swap partitions.
>
> I'm sure they'll happily accept patches. :)

I'm sure they won't. The thing is that everyone says they appreciate
contributions, bug reports and patches, while in practice they make it
more or less impossible to contribute, show no interest in getting
contributions, don't look at bug reports or close them automatically
and prematurely, show a great deal of disinterest in them, and decline
any patches should you have ventured to provide some.

You'd be misguided to think that anyone cares about or wants your
contribution. If you make one, you're making it only for yourself. I
don't even make bug reports anymore because it's useless.

>>> If I had done that in the past I think I would have completely avoided
>>> that issue that required me to restore from backups. That happened in
>>> the 3.15/3.16 timeframe and I'd have never even run those kernels.
>>> They were stable kernels at the time, and a few versions in when I
>>> switched to them (I was probably just following gentoo-sources stable
>>> keywords back then), but they still had regressions (fixes were
>>> eventually backported).
>>
>> How do you know if an old kernel you pick because you think the btrfs
>> part works well enough is the right pick? You can either encounter a
>> bug that has been fixed or a regression that hasn't been
>> discovered/fixed yet. That way, you can't win.
>
> You read the lists closely. If you want to be bleeding-edge it will
> take more work than if you just go with the flow. That's why I'm not
> on 4.1 yet - I read the lists and am not quite sure they're ready yet.

That sounds like a lot of work. You seem to really be going to great
lengths to use btrfs.

>>> I think btrfs is certainly usable today, though I'd be hesitant to run
>>> it on production servers depending on the use case (I'd be looking for
>>> a use case that actually has a significant benefit from using btrfs,
>>> and which somehow mitigates the risks).
>>
>> There you go, it's usable, and the risk of using it is too high.
>
> That is a judgement that everybody has to make based on their
> requirements. The important thing is to make an informed decision. I
> don't get paid if you pick btrfs.

Being more informed doesn't magically result in better decisions.
Information, like knowledge, is volatile and fluid; software is power,
while making decisions is only a freedom.

>>> Right now I keep a daily rsnapshot (rsync on steroids - it's in the
>>> Gentoo repo) backup of my btrfs filesystems on ext4. I occasionally
>>> debate whether I still need it, but I sleep better knowing I have it.
>>> This is in addition to my daily duplicity cloud backups of my most
>>> important data (so, /etc and /home are in the cloud, and mythtv's
>>> /var/video is just on a local rsync backup).
>>
>> I wouldn't give my data out of my hands.
>
> Somehow I doubt the folks at Amazon are going to break RSA anytime soon.

Which means?

>> Snapper? I've never heard of that ...
>>
>
> http://snapper.io/
>
> Basically snapshots+crontab and some wrappers to set retention
> policies and such. That and some things like package-manager plugins
> so that you get snapshots before you install stuff.

Does this make things easier or more complicated? For example, I fail
to understand what's supposed to be so great about using zfs
incremental snapshots for backups. Apparently you'd have to pile up an
indefinite number of snapshots so you can keep incrementing them
indefinitely. And it gets extremely scary when you want to remove some
of them to get back to something sane.
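
As I understand the scheme, it would look something like this (pool
and host names invented):

    # first run: full copy
    zfs snapshot tank/data@mon
    zfs send tank/data@mon | ssh backuphost zfs recv backup/data
    # later runs: send only the delta between two snapshots
    zfs snapshot tank/data@tue
    zfs send -i tank/data@mon tank/data@tue \
        | ssh backuphost zfs recv backup/data

Every incremental needs its predecessor to still exist on both sides,
which is exactly the pile-up I mean.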

>> Queuing up the data when there's more data than the system can deal with
>> only works when the system has sufficient time to catch up with the
>> queue. Otherwise, you have to block something at some point, or you
>> must drop the data. At that point, it doesn't matter how you arrange
>> the contents of the queue within it.
>
> Absolutely true. You need to throttle the data before it gets into
> the queue, so that the busyness of the queue is exposed to the
> applications so that they behave appropriately (falling back to
> lower-bandwidth alternatives, etc). In my case if mythtv's write
> buffers are filling up and I'm also running an emerge install phase
> the correct answer (per ionice) is for emerge to block so that my
> realtime video capture buffers are safely flushed. What you don't
> want is for the kernel to let emerge dump a few GB of low-priority
> data into the write cache alongside my 5Mbps HD recording stream.
> Granted, it isn't as big a problem as it used to be now that RAM sizes
> have increased.

You could re-arrange the queue, and when it's long enough, you don't
need to freeze anything. But what does, for example, a web browser do
when it cannot receive data as fast as it can display it, or what does a
VOIP application do when it cannot send the data as fast as it wants to?
I don't want my web browser to freeze, and a speaker whose voice is
supposed to be transmitted over a network cannot be frozen in their
speech to give the queue sufficient time to become empty.
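
For the local disk case, the ionice trick you mention is simple
enough; presumably something like

    # run emerge in the idle I/O class so it yields to everything else
    ionice -c 3 emerge --update --deep @world

is all it takes. It's the network case I don't see a good answer for.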

>> Gentoo /is/ fire-and-forget in that it works fine. Btrfs is not in that
>> it may work or not.
>>
>
> Well, we certainly must have come a long way then. :) I still
> remember the last time the glibc ABI changed and I was basically
> rebuilding everything from single-user mode holding my breath.

Did it work?