Gentoo Archives: gentoo-user

From: Kai Krakow <hurikhan77@×××××.com>
To: gentoo-user@l.g.o
Subject: [gentoo-user] Re: Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
Date: Tue, 24 Jun 2014 18:51:32
Message-Id: luso7b-h43.ln1@hurikhan77.spdns.de
In Reply to: Re: [gentoo-user] Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD? by Rich Freeman
Rich Freeman <rich0@g.o> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikhan77@×××××.com> wrote:
>> I don't see where you could lose the volume management features. You just
>> add device on top of the bcache device after you initialized the raw
>> device with a bcache superblock and attached it. The rest works the same,
>> just that you use bcacheX instead of sdX devices.
>
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?

I'm not sure if multiple partitions can share the same cache device
partition, but more or less that's it: initialize bcache, then attach your
backing devices, then add those bcache devices to your btrfs.
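
Roughly like this (all device names are placeholders and the sysfs paths are
from memory, so double-check against the bcache documentation before copying
anything):

  # /dev/sdc = SSD used as cache, /dev/sdd and /dev/sde = HDDs as backing devices
  make-bcache -C /dev/sdc               # create a cache set on the SSD
  make-bcache -B /dev/sdd /dev/sde      # wrap both HDDs with a bcache superblock
  # attach both backing devices to the cache set
  # (cset UUID comes from "bcache-super-show /dev/sdc")
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach
  # btrfs then sits on the bcache devices, never on the raw sdd/sde
  mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1

If the bcache devices don't show up by themselves, echoing the device path to
/sys/fs/bcache/register should register them (usually udev does that for you).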

I don't know how errors are handled, tho. But as with every caching
technique (even in ZFS), your data is likely toast if the cache device dies
in the middle of action. Thus, you should put bcache on LVM RAID if you are
going to use it for write caching (i.e. write-back mode). Read caching
should be okay (write-through mode). Bcache is a little slower than other
flash-cache implementations because it only reports data back to the FS as
written once it has reached stable storage (which can be the cache device,
tho, if you are using write-back mode). It was also designed with unexpected
reboots in mind: it will replay transactions from its log on reboot. This
means the raw backing device can be in an inconsistent state, which is why
you should never try to use it directly, e.g. from a rescue disk. But since
bcache wraps the partition with its own superblock, this mistake should be
impossible.
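
For reference, the mode can be switched per backing device at runtime via
sysfs (path from memory; the currently active mode is shown in brackets):

  cat /sys/block/bcache0/bcache/cache_mode
  # -> writethrough [writeback] writearound none
  # write-through: reads are cached, writes are acknowledged only after the HDD has them
  echo writethrough > /sys/block/bcache0/bcache/cache_mode
  # write-back: writes are acknowledged once they hit the SSD - so protect the SSD
  echo writeback > /sys/block/bcache0/bcache/cache_mode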

I'm not sure how gracefully device failures are handled. I suppose in
write-back mode you can get into trouble because it's too late for bcache to
tell the FS about a write error when it has already confirmed that stable
storage has been hit. Maybe it will just keep the data around so you can
swap devices, or it will report the error the next time data is written to
that location. It probably interferes with btrfs's RAID logic on that matter.

> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware. If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will then write double the data to the cache - but only if btrfs
actually read both copies (which it probably does not, because it has
checksums and does not need to compare the data; and let's ignore the case
that another process later tries to read the same data from the other RAID
member, which should be optimized away by the OS cache). Otherwise, both
caches should work pretty much independently, each with its own set of data,
depending on how btrfs uses each device. Remember that btrfs RAID is not a
block-based RAID where block locations match 1:1 on each device: btrfs RAID
can place the two copies of the same data at completely different locations
on each member device (which is actually a good thing in case block errors
accumulate at specific locations for a "faulty" disk model). For write
caching it will of course cache double the data (because both members are
written to). But I think that's okay for the same reasons, except that it
wears out your cache device faster. In that case I'd suggest using an
individual SSD for each btrfs member device anyway, as sketched below. It's
not optimal, I know. It would be useful to see some best practices and
pros/cons on that topic (individual cache device per btrfs member vs. bcache
partitions on one LVM RAID for all members). I think the best strategy
depends on whether your workload is write-mostly or read-mostly.
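
The per-member layout I mean would look roughly like this (placeholder device
names again; as far as I know make-bcache attaches the backing device
automatically when cache and backing device are created in one call):

  # SSD 1 caches HDD 1, SSD 2 caches HDD 2 - one cache set per btrfs member
  make-bcache -C /dev/sdc -B /dev/sdd     # -> /dev/bcache0
  make-bcache -C /dev/sde -B /dev/sdf     # -> /dev/bcache1
  # a dead SSD then only degrades one member instead of the whole cache layer
  mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1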

Thanks for mentioning it. Interesting thoughts. ;-)

> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write
caching, no barrier pass-through is needed. Bcache will confirm the barriers
and that's all the FS needs to know (because bcache sits between the FS and
the backing device: all requests go through the bcache layer, with no direct
access to the raw device). Of course, it's then bcache's job to ensure that
everything gets written out correctly in the background (whenever it sees
fit to do so). But it can use its own write barriers to ensure that for the
underlying device - that's nothing the FS has to care about. Performance
should be faster anyway because, well, you are writing to a faster device -
that is what bcache is all about, isn't it? ;-)
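
You can watch that background writeout per backing device in sysfs, e.g.
(paths from memory):

  # amount of dirty data that still has to be flushed from the SSD to the HDD
  cat /sys/block/bcache0/bcache/dirty_data
  # share of the cache allowed to stay dirty before writeback gets more aggressive
  cat /sys/block/bcache0/bcache/writeback_percent
  # overall state: "clean" means nothing is pending on the SSD anymore
  cat /sys/block/bcache0/bcache/state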

I don't think write barriers for read caching are needed, at least not from
the point of view of the FS. The caching layer, tho, will use them
internally for its own caching structures. Whether that has a bad effect on
performance probably depends on the implementation, but my intuition says
no performance impact, because putting read data into the cache can be
deferred and then written out in the background (write-behind).

> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure for bcache. But I think in the case of FS-internal write-caching
devices, that case could be handled gracefully (the approach you'd prefer).
Since in the internal case the cache has knowledge of the FS's bad-block
handling, it can just retry writing the data to another location/disk, or
keep it around until the admin has fixed the problem with the backing device.

BTW: SSD firmware usually suffers from similar problems to those outlined
above, because it does writes in the background after it has already
confirmed persistence to the OS layer. This is why SSD failures are usually
much more severe than HDD failures. Do some research and you should find
tests on that topic. Especially consumer SSD firmware has a big problem with
that. So I'm not sure if it really should be bcache's job to fix that
particular problem. You should just ensure good firmware and proper failure
protection at the hardware level if you want to do fancy caching stuff - the
FTL should be able to hide those problems before the whole thing explodes,
and report errors before it can no longer ensure correct persistence. I
suppose that is also the detail where enterprise-grade SSDs behave
differently. HDDs have related issues (consumer SATA vs. enterprise
SCSI/SAS; keywords: IO timeouts and bad blocks, and why you should not use
consumer hardware for RAIDs). I think all the same holds true for ZFS.
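
That's also where SCT ERC comes in on the HDD side - if the drive supports
it, you can check and cap the internal error recovery time with smartctl
(example device name only):

  # show the current error recovery timeouts (often disabled on consumer drives)
  smartctl -l scterc /dev/sda
  # cap read/write recovery at 7 seconds (the values are in units of 100 ms)
  smartctl -l scterc,70,70 /dev/sda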

>> Having to prepare devices for bcache is kind of a show-stopper because it
>> is no drop-in component that way. But OTOH I like that approach better
>> than dm-cache because it protects from using the backing device without
>> going through the caching layer which could otherwise severely damage
>> your data, and you get along with fewer devices and don't need to size a
>> meta device (which probably needs to grow later if you add devices, I
>> don't know).
>
> And this is the main thing keeping me away from it. It is REALLY
> painful to migrate to/from. Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I would
just buy one of those fancy new cheap but still speedy Crucial SSDs and
"just enable" bcache... :-\

> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours). I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try "echo w >
/proc/sysrq-trigger" and look at the blocked-task list in dmesg afterwards.
I'm sure one important btrfs thread will be in a blocked state then...
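
Something like this (assuming sysrq is not completely disabled on your
kernel):

  # allow sysrq task dumps (1 enables all sysrq functions)
  echo 1 > /proc/sys/kernel/sysrq
  # "w" dumps all tasks stuck in uninterruptible (D) state to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 100     # look for btrfs-transaction or btrfs worker threads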

--
Replies to list only preferred.
