Rich Freeman <rich0@g.o> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikhan77@×××××.com> wrote:
>> I don't see where you could lose the volume management features. You just
>> add device on top of the bcache device after you initialized the raw
>> device with a bcache superblock and attached it. The rest works the same,
>> just that you use bcacheX instead of sdX devices.
>
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?

I'm not sure whether multiple partitions can share the same cache device
partition, but more or less that's it: initialize bcache, attach your
backing devices, then add those bcache devices to your btrfs.
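
For the archives, a rough sketch of what that could look like (device names
are placeholders only, and the cache set UUID is the one reported by
bcache-super-show):

  # make a backing device from the HDD partition, a cache device from the SSD
  make-bcache -B /dev/sdb1        # shows up as /dev/bcache0 once registered
  make-bcache -C /dev/sdc1        # cache set on the SSD
  bcache-super-show /dev/sdc1     # note the cset.uuid

  # attach the backing device to the cache set
  echo <cset.uuid> > /sys/block/bcache0/bcache/attach

  # use the bcache device instead of the raw partition
  mkfs.btrfs /dev/bcache0             # new filesystem, or...
  btrfs device add /dev/bcache0 /mnt  # ...add it to an existing one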
|
I don't know how errors are handled, though. But as with every caching
technique (even in ZFS), your data is likely toast if the cache device dies
in the middle of action. Thus, you should put bcache on LVM RAID if you are
going to use it for write caching (i.e. write-back mode). Read caching
should be okay (write-through mode). Bcache is a little slower than other
flash-cache implementations because it only reports data as written back to
the FS once it has reached stable storage (which can be the cache device,
though, if you are using write-back mode). It was also designed with
unexpected reboots in mind: it will replay transactions from its log on
reboot. This means you can have inconsistent data on the raw backing
device, which is why you should never try to use that directly, e.g. from a
rescue disk. But since bcache wraps the partition in its own superblock,
this mistake should be impossible.
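
If I read the bcache documentation right, the mode can be switched at
runtime via sysfs, roughly like this (assuming the device registered as
bcache0):

  # write-through: writes only confirmed once they hit the backing device
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  # write-back: writes confirmed from the SSD, dirty data flushed later
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # reports "clean" or "dirty" depending on whether dirty data sits on the cache
  cat /sys/block/bcache0/bcache/state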
|
I'm not sure how gracefully device failures are handled. I suppose in
write-back mode you can get into trouble because it's too late for bcache
to tell the FS about a write error when it has already confirmed that
stable storage was hit. Maybe it will just keep the data around so you can
swap devices, or it will report the error the next time data is written to
that location. It probably interferes with btrfs RAID logic on that matter.
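
If you needed to swap the cache device out, my understanding (untested) is
that you would first flush the dirty data and then detach, along these
lines:

  # stop accumulating dirty data, then wait for the state to become "clean"
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  cat /sys/block/bcache0/bcache/state
  # detach the backing device from its cache set
  echo 1 > /sys/block/bcache0/bcache/detach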
|
> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware. If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will write double the data to the cache then - but only if btrfs
actually read both copies (which it probably does not, because it has
checksums and does not need to compare data; and let's just ignore the case
that another process reads the same data from the other RAID member later,
that should be optimized away by the OS page cache). Otherwise both caches
should work pretty much individually, each with its own set of data,
depending on how btrfs uses each device. Remember that btrfs RAID is not a
block-based RAID where block locations match 1:1 on each device. Btrfs RAID
can place the two mirrors of a piece of data in completely different
locations on each member device (which is actually a good thing in case
block errors accumulate in specific locations for a "faulty" disk model).
In the case of write caching it will of course cache double the data
(because both members are written to). But I think that's okay for the same
reasons, except it will wear your cache device faster. In that case I
suggest using individual SSDs for each btrfs member device anyway. It's not
optimal, I know. It would be useful to see some best practices and
pros/cons on that topic (an individual cache device per btrfs member vs.
bcache partitions on one LVM RAID shared by all members). I think the best
strategy depends on whether you are write-mostly or read-mostly.
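
The "one SSD per member" variant should be fairly simple, if I'm not
mistaken - make-bcache can create and attach backing and cache device in
one go (device names are again placeholders):

  # each btrfs member gets its own cache SSD
  make-bcache -B /dev/sda2 -C /dev/sdc1   # member 1 -> e.g. /dev/bcache0
  make-bcache -B /dev/sdb2 -C /dev/sdd1   # member 2 -> e.g. /dev/bcache1
  # btrfs RAID1 across the two cached devices
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1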
|
Thanks for mentioning it. Interesting thoughts. ;-)
|
> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write
caching no barrier pass-through is needed. Bcache will confirm the barriers
and that's all the FS needs to know (because bcache sits underneath the FS,
all requests go through the bcache layer, with no direct access to the
backing device). Of course, it's then bcache's job to ensure everything
gets written out correctly in the background (whenever it feels like doing
so). But it can use its own write barriers to ensure that for the
underlying device - that's nothing the FS has to care about. Performance
should be faster anyway because, well, you are writing to a faster device -
that is what bcache is all about, isn't it? ;-)
|
I don't think write barriers are needed for read caching, at least not
from the point of view of the FS. The caching layer, though, will use them
internally for its own caching structures. Whether that has a bad effect on
performance probably depends on the implementation, but my intuition says:
no performance impact, because putting read data into the cache can be
deferred and written out in the background (write-behind).
|
> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure, for bcache. But I think in the case of FS-internal write
caching devices that case could be handled gracefully (the approach you'd
prefer). Since in the internal case the cache has knowledge of the FS bad
block handling, it can just retry writing the data to another
location/disk, or keep it around until the admin has fixed the problem with
the backing device.
|
BTW: SSD firmwares usually suffer from similar problems as outlined above,
because they do writes in the background after they have already confirmed
persistence to the OS layer. This is why SSD failures are usually much more
severe than HDD failures. Do some research and you should find tests about
that topic; especially consumer SSD firmwares have a big problem with that.
So I'm not sure it should really be bcache's job to fix that particular
problem. You should just ensure good firmware and proper failure protection
at the hardware level if you want to do fancy caching stuff - the FTL
should be able to hide those problems before the whole thing explodes, and
report errors before it can no longer ensure correct persistence. I suppose
that is also a detail where enterprise-grade SSDs behave differently. HDDs
have related issues (SATA vs. enterprise SCSI vs. SAS, keywords: IO
timeouts and bad blocks, and why you should not use consumer hardware for
RAID). I think all the same holds true for ZFS.
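
That's the ERC/TLER story: consumer drives may retry a bad sector for
minutes while the RAID layer gives up on the whole disk. If the drive
supports it, smartctl can bound the recovery time, e.g.:

  # query error recovery control support
  smartctl -l scterc /dev/sda
  # limit read/write error recovery to 7.0 seconds (values in tenths of a second)
  smartctl -l scterc,70,70 /dev/sda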
|
>> Having to prepare devices for bcache is kind of a show-stopper because it
>> is no drop-in component that way. But OTOH I like that approach better
>> than dm- cache because it protects from using the backing device without
>> going through the caching layer which could otherwise severely damage
>> your data, and you get along with fewer devices and don't need to size a
>> meta device (which probably needs to grow later if you add devices, I
>> don't know).
>
> And this is the main thing keeping me away from it. It is REALLY
> painful to migrate to/from. Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I would
just buy one of those fancy new cheap but still speedy Crucial SSDs and
"just enable" bcache... :-\
|
> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours). I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try "echo w >
/proc/sysrq-trigger" and look at the blocked-processes list in dmesg
afterwards. I'm sure one important btrfs thread will be in blocked state
then...
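
I.e. something like this (sysrq may need to be enabled first):

  echo 1 > /proc/sys/kernel/sysrq    # allow all sysrq functions
  echo w > /proc/sysrq-trigger       # dump tasks in uninterruptible (blocked) state
  dmesg | tail -n 100                # look for btrfs worker threads stuck in D state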

--
Replies to list only preferred.