From: Marc Joliet <marcec@×××.de>
To: Gentoo-User ML <gentoo-user@l.g.o>
Subject: [gentoo-user] Good BTRFS summary by Duncan (forwarded from gentoo-amd64)
Date: Thu, 29 May 2014 16:48:53
Message-Id: 20140529184837.5b82971a@marcec
Hi list,

interestingly enough, the topic of BTRFS came up on gentoo-amd64 in the context
of a discussion on RAID. Naturally, ZFS and BTRFS emerged as part of the
discussion, and Duncan gave a good point-by-point summary that might save
someone some time in the future (you know, so you don't have to trawl archives
to get all the pieces together). Most of it has been mentioned before, but
I think there are some details and points that haven't been.

(I also thought that James, who mentioned in my BTRFS thread that he wanted to
collect material for an eventual wiki entry, might find this interesting, too.)

So anyway, I hope somebody finds this useful!

Begin forwarded message:

Date: Wed, 28 May 2014 03:12:05 +0000 (UTC)
From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] btrfs Was: Soliciting new RAID ideas

thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:

> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.

Given that I'm a regular on the btrfs list as well as running it myself,
I'm likely to know more about it than most. Here's a whirlwind rundown
with a strong emphasis on practical points a lot of people miss (IOW, I'm
skipping a lot of the commonly covered and obvious stuff). Point 6 below
directly answers your move/enlarge question. Meanwhile, points 1, 7 and
8 are critically important, as we see a lot of people on the btrfs list
getting them wrong.

1) Since there's raid5/6 discussion on the thread... Don't use btrfs
raid56 modes at this time, except purely for playing around with trashable
or fully backed up data. The implementation as introduced isn't code-
complete, and while the operational runtime side works, recovery from
dropped devices, not so much. Thus, in terms of data safety you're
effectively running a slow raid0 with lots of extra overhead, and it can
be considered trash if a device drops. The sole benefit is that when the
raid56 recovery code gets merged (and has been tested for a kernel cycle
or two to work out the initial bugs), you'll get what amounts to a "free"
upgrade to the raid5 or raid6 mode you originally configured, since btrfs
was doing the operational parity calculation and writes all along; it
just couldn't yet use them for actual recovery, as the code simply wasn't
there to do so.

2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on a
single device or multiple devices) and dup mode (on a single device;
metadata is duplicated by default -- two copies -- except on ssd, where
the default is a single copy since some ssds dedup anyway), are
reasonably mature and stable -- to the same point as btrfs in general,
anyway, which is to say "mostly stable, keep your backups fresh, but
you're not /too/ likely to have to use them." There are still enough
bugs being fixed in each kernel release, however, that running the
latest stable series is /strongly/ recommended, as your data is at risk
from known-fixed bugs (even if at this point they only tend to hit the
corner-cases) if you're not doing so.

3) It's worth noting that btrfs treats data and metadata separately --
when you do a mkfs.btrfs, you can configure redundancy modes separately
for each. The single-device default is (as above) dup metadata (except
on ssd) and single data; the multi-device default is raid1 metadata and
single data.
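
For illustration, here's roughly what that looks like at mkfs time
(device names are placeholders):

    # -m sets the metadata profile, -d the data profile,
    # so the two can differ, as described above:
    mkfs.btrfs -m raid1 -d single /dev/sdX /dev/sdY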

4) FWIW, most of my btrfs-formatted partitions are dual-device raid1 mode
for both data and metadata, on ssd. (The second backup is reiserfs on
spinning rust, just in case some Armageddon bug eats all the btrfs --
working copy and first backup -- at the same time; tho btrfs is stable
enough now that that's extremely unlikely, I didn't consider it so back
when I set things up, nearly a year ago now.)

The reason for my raid1 mode choice isn't that of ordinary raid1; it's
specifically btrfs' checksumming and data integrity features -- if one
copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY, check
the second copy, and if it's good, use it and rewrite the bad copy.
Btrfs scrub allows checking the entire filesystem for checksum errors and
restoring any errors it finds from good copies where possible.

Obviously, the default single data mode (or raid0) won't have a second
copy to check and rewrite from, while raid1 (and raid10) modes will, as
will dup-mode metadata on a single device. With one exception, dup mode
isn't allowed for data, only metadata; the exception is mixed-blockgroup
mode, which mixes data and metadata together and is the default on
filesystems under 1 GiB, but isn't recommended on large filesystems for
performance reasons.

So I wanted a second copy of both data and metadata to take advantage of
btrfs' data integrity and scrub features, and with btrfs raid1 mode, I
get both that and the traditional raid1 device-loss protection as well.
=:^)
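
To make that concrete, a minimal sketch of such a setup plus a periodic
integrity check (devices and mount point are placeholders):

    # raid1 for both data and metadata across two devices:
    mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY
    # scrub verifies checksums filesystem-wide, repairing from
    # the good copy where one exists:
    btrfs scrub start /mnt
    btrfs scrub status /mnt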

5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
mirrored, no matter how many devices are configured into the filesystem.
N-way mirroring is the next feature on the roadmap after the raid56 work
is completed, but given how nearly every btrfs feature has taken much
longer to complete than originally planned, I'm not expecting it until
sometime next year, now.

Which is unfortunate, as my risk vs. cost sweet spot would be 3-way
mirroring, covering the case where *TWO* copies of a block fail their
checksums. Oh well, it's coming, even if at this point it seems like the
proverbial carrot dangling from a stick held in front of the donkey.

6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs
device add/delete, to add or delete a device to/from a filesystem
(moving the content off a to-be-deleted device in the process), plus
btrfs balance, to restripe/convert/rebalance between devices as well as
to free allocated but empty data and metadata chunks back to unallocated
space. There's also btrfs filesystem resize, but that's more like the
conventional filesystem resize command, resizing the part of the
filesystem on an individual device (partitioned/virtual or whole
physical device).

So to add a device, you'd btrfs device add, then btrfs balance, with an
optional conversion to a different redundancy mode if desired, to
rebalance the existing data and metadata onto that device. (Without the
rebalance it would be used for new chunks, but existing data and metadata
chunks would stay where they were. I'll omit the "chunk definition"
discussion in the interest of brevity.)

To delete a device, you'd btrfs device delete, which would move all the
data on that device onto other existing devices in the filesystem, after
which it could be removed.
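
Concretely, the sequence looks something like this (mount point and
device names are placeholders):

    # grow: add a device, then rebalance existing chunks onto it,
    # optionally converting to a different redundancy profile:
    btrfs device add /dev/sdZ /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
    # shrink: migrate everything off a device, then remove it:
    btrfs device delete /dev/sdY /mnt
    # per-device resize, parallel to a conventional fs resize:
    btrfs filesystem resize max /mnt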

7) Given the thread, I'd be remiss to omit this one. VM images and other
large "internal-rewrite-pattern" files (large database files, etc.) need
special treatment on btrfs, at least currently. As such, btrfs may not
be the greatest solution for Mark (tho it would work fine with special
procedures), given the several VMs he runs. This one unfortunately hits
a lot of people. =:^( But here's a heads-up, so it doesn't have to hit
anyone reading this! =:^)

As a property of the technology, any copy-on-write-based filesystem is
going to find files where various bits of existing data within the file
are repeatedly rewritten (as opposed to new data simply being appended --
think of a log file or a live-stored audio/video stream) extremely
challenging to deal with. The problem is that unlike ordinary
filesystems, which rewrite data in place such that a file continues to
occupy the same extents as it did before, copy-on-write filesystems
write a changed block to a different location. COW does mean atomic
updates and thus more reliability, since either the new data or the old
data will exist, never an unpredictable mixture of the two; but as a
result, with the above rewrite pattern, this type of internally-rewritten
file gets **HEAVILY** fragmented over time.

We've had filefrag reports of several-GiB files with over 100K extents!
Obviously, that isn't going to be the most efficient file in the world to
access!

For smaller files, up to a couple hundred MiB or perhaps a bit more,
btrfs has the autodefrag mount option, which can help a lot. With this
option enabled, whenever a block of a file is changed and rewritten, thus
written elsewhere, btrfs queues up a rewrite of the entire file to happen
in the background. The rewrite will be done sequentially, thus defragging
the file. This works quite well for firefox's sqlite database files, for
instance, as they're internal-rewrite-pattern, but they're small enough
that autodefrag handles them reasonably nicely.
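
In practice that's just a mount option, e.g. an fstab line along these
lines (device and mount point are placeholders):

    # autodefrag queues background sequential rewrites of
    # internally-rewritten files:
    /dev/sdX  /home  btrfs  defaults,autodefrag  0 0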

But this solution doesn't scale so well as the file size increases toward
and past a GiB, particularly for files with a continuous stream of
internal rewrites, such as can happen with a running VM writing to its
virtual storage device. At some point, the stream of writes comes in
faster than the file can be rewritten, and things start to back up!

To deal with this case, there's the NOCOW file attribute, set with chattr
+C. However, to be effective, this attribute must be set while the file
is still empty, before it has any content. The easiest way to do that is
to set the attribute on the directory that will contain the files.
While it doesn't affect the directory itself, newly created files
within that directory inherit the NOCOW attribute before they have data,
thus allowing it to work without having to worry about it that much. For
existing files, create a new directory, set its NOCOW attribute, and COPY
(don't move, and don't use cp --reflink) the existing files into it.

Once you have your large internal-rewrite-pattern files set NOCOW, btrfs
will rewrite them in place as an ordinary filesystem would, thus avoiding
the problem.
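
A minimal sketch of that procedure, with placeholder paths:

    # set NOCOW on the (empty) directory; new files inherit it:
    mkdir /mnt/vm-images
    chattr +C /mnt/vm-images
    # existing files must be COPIED in (no mv, no cp --reflink):
    cp /old/place/disk.img /mnt/vm-images/
    # verify -- the 'C' attribute should show up:
    lsattr /mnt/vm-images/disk.img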

Except for one thing. I haven't mentioned btrfs snapshots yet, as that
feature, but for this caveat, is covered well enough elsewhere. But
here's the problem: a snapshot locks the existing file data in place.
As a result, the first write to a block within a file after a snapshot
MUST be COW, even if the file is otherwise set NOCOW.

If only the occasional one-off snapshot is done, it's not /too/ bad, as
all the internal file writes between snapshots are NOCOW; it's only the
first write to each file block after a snapshot that must be COW. But
many people and distros are script-automating their snapshots in order
to have rollback capabilities, and on btrfs, snapshots are (ordinarily)
light enough that people sometimes configure a snapshot a minute! If
only a minute's worth of changes can be written to the existing location
before there's a snapshot and changes must be written to a new location,
then another snapshot and yet another location... basically, the NOCOW
we set on that file isn't doing us any good!

8) I'm making this a separate point, as it's important and a lot of
people get it wrong: NOCOW and snapshots don't mix!

There is, however, a (partial) workaround. Because snapshots stop at
btrfs subvolume boundaries, if you put your large VM images and similar
large internal-rewrite-pattern files (databases, etc.) in subvolumes --
making that directory I suggested above a full subvolume, not just a
NOCOW directory -- snapshots of the parent subvolume will not include
the VM images subvolume, thus leaving the VM images alone. This solves
the snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES
mean that those VM images must be backed up using more conventional
methods, since snapshotting won't work for them.
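
Sketched out, reusing the placeholder paths from above:

    # a subvolume instead of a plain directory:
    btrfs subvolume create /mnt/vm-images
    chattr +C /mnt/vm-images
    # snapshots of the parent stop at the subvolume boundary,
    # so they leave /mnt/vm-images alone:
    btrfs subvolume snapshot /mnt /mnt/snap-YYYYMMDD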

9) Some other still partially broken bits of btrfs include:

9a) Quotas: Just don't use them on btrfs at this point. Performance
doesn't scale (altho there's a rewrite in progress), and they are buggy.
Additionally, the scaling interaction with snapshots is geometrically
negative, sometimes requiring 64 GiB of RAM or more, and coming to a
near standstill even then, for users with enough quota groups and enough
snapshots. If you need quotas, use a more traditional filesystem with
stable quota support. Hopefully by this time next year...

9b) Snapshot-aware defrag: This was enabled at one point, but it simply
didn't scale when it turned out people were doing things like per-minute
snapshots and thus had thousands and thousands of snapshots. So it has
been disabled for the time being. Btrfs defrag will defrag the working
copy it is run on, but currently doesn't account for snapshots, so data
that was fragmented at snapshot time gets duplicated as it is
defragmented. However, they plan to re-enable the feature once they have
rewritten various bits to scale far better than they do at present.

9c) Send and receive: Btrfs send and receive are a very nice feature
that can make backups far faster, with far less data transferred.
They're great when they work. Unfortunately, there are still various
corner-cases where they don't. (As an example, a recent fix was for the
case where subdir B was nested inside subdir A for the first, full send/
receive, but later the relationship was reversed, with subdir B made the
parent of subdir A. Until the recent fix, send/receive couldn't handle
that sort of corner-case.) You can go ahead and use them if they're
working for you; if a run finishes without error, the copy should be
100% reliable. However, have an alternate plan for backups in case you
suddenly hit one of those corner-cases and send/receive quits working.
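
For reference, the basic pattern is (snapshot names and the backup mount
point are placeholders):

    # send operates on read-only snapshots:
    btrfs subvolume snapshot -r /mnt /mnt/snap-new
    # full send into a separate btrfs filesystem:
    btrfs send /mnt/snap-new | btrfs receive /backup
    # later, incremental relative to an earlier snapshot:
    btrfs send -p /mnt/snap-old /mnt/snap-new | btrfs receive /backup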

Of course, it's worth mentioning that (9b) and (9c) deal with features
that most filesystems don't have at all, so with the exception of quotas,
it's not like something's broken on btrfs that works on other
filesystems. Instead, these features are (nearly) unique to btrfs, so
even if they come with certain limitations, that's still better than not
having the option of using the feature at all, because it simply doesn't
exist on the other filesystem!

10) Btrfs in general is headed toward stable now, and a lot of people,
including me, have used it for a significant amount of time without
problems. But it's still new enough that you're strongly urged to make
and test your backups, because by not doing so, you're stating, by your
actions if not your words, that you simply don't care if some as-yet
undiscovered and unfixed bug in the filesystem eats your data.

For similar reasons, altho already mentioned above: run the latest stable
kernel from the latest stable kernel series, at the oldest, and consider
running rc kernels from at least rc2 or so (by which time any real data-
eating bugs, in btrfs or elsewhere, should be found and fixed, or at
least published). With anything older, you are literally risking your
data to known and fixed bugs.

As is said, take reasonable care and you're much less likely to be the
statistic!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
