Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] btrfs Was: Soliciting new RAID ideas
Date: Wed, 28 May 2014 03:12:28
Message-Id: pan$64532$35068efb$291c3acc$915b8eda@cox.net
In Reply to: Re: [gentoo-amd64] Soliciting new RAID ideas by thegeezer@thegeezer.net
1 thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:
2
3 > depending on your budget a pair of large sata drives + mdadm will be
4 > ideal, if you had lvm already you could simply 'move' then 'enlarge'
5 > your existing stuff (tm) : i'd like to know how btrfs would do the same
6 > for anyone who can let me know.
7 > you have raid6 because you probably know that raid5 is just waiting for
8 > trouble, so i'd probably start looking at btrfs for your finanical data
9 > to be checksummed.
10
11 Given that I'm a regular on the btrfs list as well as running it myself,
12 I'm likely to know more about it than most. Here's a whirlwind rundown
13 with a strong emphasis on practical points a lot of people miss (IOW, I'm
14 skipping a lot of the commonly covered and obvious stuff). Point 6 below
15 directly answers your move/enlarge question. Meanwhile, points 1, 7 and
16 8 are critically important, as we see a lot of people on the btrfs list
17 getting them wrong.
18
19 1) Since there's raid5/6 discussion on the thread... Don't use btrfs
20 raid56 modes at this time, except purely for playing around with trashable
21 or fully backed up data. The implementation as introduced isn't code-
22 complete, and while the operational runtime side works, recovery from
23 dropped devices, not so much. Thus, in terms of data safety you're
24 effectively running a slow raid0 with lots of extra overhead that can be
25 considered trash if a device drops. The sole benefit is that when
26 the raid56 mode recovery implementation code gets merged (and is tested
27 for a kernel cycle or two to work out the initial bugs), you'll then get
28 what amounts to a "free" upgrade to the raid5 or raid6 mode you had
29 originally configured. Btrfs was doing the operational parity
30 calculation and writes all along; the parity just couldn't yet be
31 used for actual recovery, as the code simply wasn't there to do so.
32
33 2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on a
34 single or multiple devices) and dup mode (on a single device, metadata is
35 by default duplicated -- two copies, except on ssd where the default is
36 only a single copy since some ssds dedup anyway) are reasonably mature
37 and stable, to the same point as btrfs in general, anyway, which is to
38 say it's "mostly stable, keep your backups fresh but you're not /too/
39 likely to have to use them." There are still enough bugs being fixed in
40 each kernel release, however, that running latest stable series is
41 /strongly/ recommended, as your data is at risk to known-fixed bugs (even
42 if at this point they only tend to hit the corner-cases) if you're not
43 doing so.
44
45 3) It's worth noting that btrfs treats data and metadata separately --
46 when you do a mkfs.btrfs, you can configure redundancy modes separately
47 for each, the single-device default being (as above) dup metadata (except
48 for ssd), single data, the multi-device default being raid1 metadata,
49 single data.
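
As a rough sketch of point 3: the two profiles are picked separately
with mkfs.btrfs's -m (metadata) and -d (data) options. The device names
below are placeholders I made up, and the run helper is my own; it only
prints the command unless DRY_RUN=0 is set, since mkfs wipes whatever
is on the devices.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_btrfs METADATA_PROFILE DATA_PROFILE DEVICE...
# -m selects the metadata profile, -d the data profile.
make_btrfs() {
    meta=$1
    data=$2
    shift 2
    run mkfs.btrfs -m "$meta" -d "$data" "$@"
}

# Hypothetical two-device filesystem, raid1 for both data and metadata.
make_btrfs raid1 raid1 /dev/sdX1 /dev/sdY1
```

Leaving out -m and -d gives whatever defaults your btrfs-progs version
ships with, so it's worth being explicit if the redundancy matters.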
50
51 4) FWIW, most of my btrfs formatted partitions are dual-device raid1 mode
52 for both data and metadata, on ssd. (Second backup is reiserfs on
53 spinning-rust, just in case some Armageddon bug eats all the btrfs at the
54 same time, working copy and first backup, tho btrfs is stable enough now
55 that's extremely unlikely, but I didn't consider it so back when I set
56 things up nearly a year ago now.)
57
58 The reason for my raid1 mode choice isn't that of ordinary raid1, it's
59 specifically due to btrfs' checksumming and data integrity features -- if
60 one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY,
61 check the second copy and if it's good, will use it and rewrite the bad
62 copy. Btrfs scrub allows checking the entire filesystem for checksum
63 errors and restoring any errors it finds from good copies where possible.
64
65 Obviously, the default single data mode (or raid0) won't have a second
66 copy to check and rewrite from, while raid1 (and raid10) modes will. So
67 will dup-mode metadata on a single device, but dup mode isn't allowed
68 for data, only metadata. The one exception is the mixed-blockgroup mode
69 that mixes data and metadata together; that's the default on
70 filesystems under 1 GiB but isn't recommended on large filesystems for
71 performance reasons.
72
73 So I wanted a second copy of both data and metadata to take advantage of
74 btrfs' data integrity and scrub features, and with btrfs raid1 mode, I
75 get both that and the traditional raid1 device-loss protection as well.
76 =:^)
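
A minimal sketch of the scrub side of point 4, assuming a filesystem
mounted at a made-up /mnt/data. The run helper is my own and prints the
commands unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# scrub_fs MOUNTPOINT
scrub_fs() {
    # -B keeps the scrub in the foreground, so the status query
    # below reports on the finished run.
    run btrfs scrub start -B "$1"
    run btrfs scrub status "$1"
}

scrub_fs /mnt/data   # hypothetical mountpoint
```

On single or raid0 data the scrub can still detect checksum errors; it
just has no second copy to repair them from.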
77
78 5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
79 mirrored, no matter how many devices are configured into the filesystem.
80 N-way-mirrored is the next feature on the roadmap after the raid56 work
81 is completed, but given how nearly every btrfs feature has taken much
82 longer to complete than originally planned, I'm not expecting it until
83 sometime next year, now.
84
85 Which is unfortunate, as my risk vs. cost sweet spot would be 3-way-
86 mirroring, covering in case *TWO* copies of a block failed checksum. Oh,
87 well, it's coming, even if it seems at this point like the proverbial
88 carrot dangling off a stick held in front of the donkey.
89
90 6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs
91 device add/delete, to add or delete a device to/from a filesystem (moving
92 the content from a to-be-deleted device in the process), plus btrfs
93 balance, to restripe/convert/rebalance between devices as well as to free
94 allocated but empty data and metadata chunks back to unallocated. There's
95 also btrfs filesystem resize, but that's more like the conventional
96 filesystem resize command, resizing the part of the filesystem on an
97 individual device (partitioned/virtual or whole physical device).
98
99 So to add a device, you'd btrfs device add, then btrfs balance, with an
100 optional conversion to a different redundancy mode if desired, to
101 rebalance the existing data and metadata onto that device. (Without the
102 rebalance it would be used for new chunks, but existing data and metadata
103 chunks would stay where they were. I'll omit the "chunk definition"
104 discussion in the interest of brevity.)
105
106 To delete a device, you'd btrfs device delete, which would move all the
107 data on that device onto other existing devices in the filesystem, after
108 which it could be removed.
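
The add/balance/delete steps in point 6 might look roughly like this.
Device names and the mountpoint are placeholders, the function names
are mine, and the run helper only prints the commands unless DRY_RUN=0
is set.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# grow_fs NEW_DEVICE MOUNTPOINT [PROFILE]
# Add a device, then balance so existing chunks spread onto it.
# With PROFILE given, the balance also converts data and metadata.
grow_fs() {
    dev=$1
    mnt=$2
    profile=${3:-}
    run btrfs device add "$dev" "$mnt"
    if [ -n "$profile" ]; then
        run btrfs balance start -dconvert="$profile" -mconvert="$profile" "$mnt"
    else
        run btrfs balance start "$mnt"
    fi
}

# shrink_fs OLD_DEVICE MOUNTPOINT
# Delete migrates the device's chunks elsewhere before releasing it.
shrink_fs() {
    run btrfs device delete "$1" "$2"
}

# Hypothetical: add a second device and convert to raid1.
grow_fs /dev/sdY1 /mnt/data raid1
# Hypothetical: later drop a third device; this only works if enough
# devices remain for the profile in use.
shrink_fs /dev/sdZ1 /mnt/data
```

For a plain "move to a bigger drive", a device add of the new drive
followed by a device delete of the old one should be enough, since the
delete does the data migration itself.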
109
110 7) Given the thread, I'd be remiss to omit this one. VM images and other
111 large "internal-rewrite-pattern" files (large database files, etc) need
112 special treatment on btrfs, at least currently. As such, btrfs may not
113 be the greatest solution for Mark (tho it would work fine with special
114 procedures), given the several VMs he runs. This one unfortunately hits
115 a lot of people. =:^( But here's a heads-up, so it doesn't have to hit
116 anyone reading this! =:^)
117
118 As a property of the technology, any copy-on-write-based filesystem is
119 going to find files where various bits of existing data within the file
120 are repeatedly rewritten (as opposed to new data simply being appended,
121 think a log file or live-stored audio/video stream) extremely challenging
122 to deal with. The problem is that unlike ordinary filesystems that
123 rewrite the data in place such that a file continues to occupy the same
124 extents as it did before, copy-on-write filesystems will write a changed
125 block to a different location. While COW does mean atomic updates and
126 thus more reliability since either the new data or the old data should
127 exist, never an unpredictable mixture of the two, as a result of the
128 above rewrite pattern, this type of internally-rewritten file gets
129 **HEAVILY** fragmented over time.
130
131 We've had filefrag reports of multi-GiB files with over 100K extents!
132 Obviously, this isn't going to be the most efficient file in the world to
133 access!
134
135 For smaller files, up to a couple hundred MiB or perhaps a bit more,
136 btrfs has the autodefrag mount option, which can help a lot. With this
137 option enabled, whenever a block of a file is changed and rewritten, thus
138 written elsewhere, btrfs queues up a rewrite of the entire file to happen
139 in the background. The rewrite will be done sequentially, thus defragging
140 the file. This works quite well for firefox's sqlite database files, for
141 instance, as they're internal-rewrite-pattern, but they're small enough
142 that autodefrag handles them reasonably nicely.
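
A small sketch of checking fragmentation and turning autodefrag on.
The file path, device and mountpoint are placeholders, and the run
helper is my own; it prints the commands unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# check_frag FILE: filefrag reports how many extents FILE occupies.
check_frag() {
    run filefrag "$1"
}

# mount_autodefrag DEVICE MOUNTPOINT: mount with autodefrag enabled.
mount_autodefrag() {
    run mount -o autodefrag "$1" "$2"
}

check_frag /path/to/places.sqlite       # hypothetical file
mount_autodefrag /dev/sdX1 /mnt/data    # hypothetical device and mountpoint
```

To make it stick across reboots, autodefrag would go in the options
field of the filesystem's fstab entry instead.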
143
144 But this solution doesn't scale so well as the file size increases toward
145 and past a GiB, particularly for files with a continuous stream of
146 internal rewrites such as can happen with an operating VM writing to its
147 virtual storage device. At some point, the stream of writes comes in
148 faster than the file can be rewritten, and things start to back up!
149
150 To deal with this case, there's the NOCOW file attribute, set with chattr
151 +C. However, to be effective, this attribute must be set when the file
152 is empty, before it has existing content. The easiest way to do that is
153 to set the attribute on the directory which will contain the files.
154 While it doesn't affect the directory itself any, newly created files
155 within that directory inherit the NOCOW attribute before they have data,
156 thus allowing it to work without having to worry about it that much. For
157 existing files, create a new directory, set its NOCOW attribute, and COPY
158 (don't move, and don't use cp --reflink) the existing files into it.
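
Roughly, the NOCOW-directory procedure just described could go like
this. The paths are invented for illustration, the run helper is my
own (it prints the commands unless DRY_RUN=0 is set), and I'm assuming
a cp that accepts --reflink=never; on an older cp that never reflinks
by default, a plain cp does the same job.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_nocow_dir DIR: create DIR and set NOCOW on it, so files
# created inside inherit the attribute while still empty.
make_nocow_dir() {
    run mkdir -p "$1"
    run chattr +C "$1"
    run lsattr -d "$1"   # the C flag should show up here
}

# copy_into_nocow FILE DIR: a real data copy (no reflink, no mv),
# so the new file is created empty under NOCOW and then filled.
copy_into_nocow() {
    run cp --reflink=never "$1" "$2/"
}

make_nocow_dir /srv/vm-nocow                        # hypothetical path
copy_into_nocow /srv/vm-old/guest.img /srv/vm-nocow # hypothetical image
```

After checking the copy, the old COW original can be deleted.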
159
160 Once you have your large internal-rewrite-pattern files set NOCOW, btrfs
161 will rewrite them in-place as an ordinary filesystem would, thus avoiding
162 the problem.
163
164 Except for one thing. I haven't mentioned btrfs snapshots yet as that
165 feature, but for this caveat, is covered well enough elsewhere. But
166 here's the problem. A snapshot locks the existing file data in place.
167 As a result, the first write to a block within a file after a snapshot
168 MUST be COW, even if the file is otherwise set NOCOW.
169
170 If only the occasional one-off snapshot is done it's not /too/ bad, as
171 all the internal file writes between snapshots are NOCOW, it's only the
172 first one to each file block after a snapshot that must be COW. But many
173 people and distros are script-automating their snapshots in order to
174 have rollback capacities, and on btrfs, snapshots are (ordinarily) light
175 enough that people are sometimes configuring a snapshot a minute! If
176 only a minute's changes can be written to the existing location, then
177 there's a snapshot and changes must be written to a new location, then
178 another snapshot and yet another location... Basically the NOCOW we set
179 on that file isn't doing us any good!
180
181 8) So making this a separate point as it's important and a lot of people
182 get it wrong. NOCOW and snapshots don't mix!
183
184 There is, however, a (partial) workaround. Because snapshots stop at
185 btrfs subvolume boundaries, if you put your large VM images and similar
186 large internal-rewrite-pattern files (databases, etc) in subvolumes,
187 making that directory I suggested above a full subvolume, not just a NOCOW
188 directory, snapshots of the parent subvolume will not include the VM
189 images subvolume, thus leaving the VM images alone. This solves the
190 snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES mean
191 that those VM images must be backed up using more conventional methods
192 since snapshotting won't work for them.
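
The subvolume workaround can be sketched as below, assuming a made-up
path for the VM images. The run helper is my own and prints the
commands unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_nocow_subvol PATH: a dedicated subvolume, so snapshots of the
# parent stop at its boundary, with NOCOW set so new files inherit it.
make_nocow_subvol() {
    run btrfs subvolume create "$1"
    run chattr +C "$1"
}

make_nocow_subvol /var/lib/vm-images   # hypothetical path
```

Existing images still need to be copied in (not moved or reflinked)
for the NOCOW attribute to take effect on them.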
193
194 9) Some other still partially broken bits of btrfs include:
195
196 9a) Quotas: Just don't use them on btrfs at this point. Performance
197 doesn't scale (altho there's a rewrite in progress), and they are buggy.
198 Additionally, the scaling interaction with snapshots is geometrically
199 negative, sometimes requiring 64 GiB of RAM or more and coming to a near
200 standstill at that, for users with enough quota-groups and enough
201 snapshots. If you need quotas, use a more traditional filesystem with
202 stable quota support. Hopefully by this time next year...
203
204 9b) Snapshot-aware-defrag: This was enabled at one point but simply
205 didn't scale, when it turned out people were doing things like per-minute
206 snapshots and thus had thousands and thousands of snapshots. So this has
207 been disabled for the time being. Btrfs defrag will defrag the working
208 copy it is run on, but currently doesn't account for snapshots, so data
209 that was fragmented at snapshot time gets duplicated as it is
210 defragmented. However, they plan to re-enable the feature once they have
211 rewritten various bits to scale far better than they do at present.
212
213 9c) Send and receive. Btrfs send and receive are a very nice feature
214 that can make backups far faster, with far less data transferred.
215 They're great when they work. Unfortunately, there are still various
216 corner-cases where they don't. (As an example, a recent fix was for the
217 case where subdir B was nested inside subdir A for the first, full send/
218 receive, but later, the relationship was reversed, with subdir B made the
219 parent of subdir A. Until the recent fix, send/receive couldn't handle
220 that sort of corner-case.) You can go ahead and use it if it's working
221 for you, as if it finishes without error, the copy should be 100%
222 reliable. However, have an alternate plan for backups if you suddenly
223 hit one of those corner-cases and send/receive quits working.
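
For reference, a bare-bones full send/receive might look like this.
Send needs a read-only snapshot as its source, hence the -r. All paths
are placeholders and the function is my own sketch; it prints the
commands unless DRY_RUN=0 is set.

```shell
# send_backup SUBVOL SNAPSHOT DEST_DIR
# Take a read-only snapshot of SUBVOL, then stream it to DEST_DIR,
# which has to be on a btrfs filesystem too.
send_backup() {
    subvol=$1
    snap=$2
    dest=$3
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "btrfs subvolume snapshot -r $subvol $snap"
        printf '%s\n' "btrfs send $snap | btrfs receive $dest"
    else
        # In plain sh the pipeline's exit status is that of receive,
        # so also watch for error messages from send itself.
        btrfs subvolume snapshot -r "$subvol" "$snap" &&
            btrfs send "$snap" | btrfs receive "$dest"
    fi
}

# Hypothetical paths.
send_backup /home /home/.snap-1 /mnt/backup
```

If this errors out, that's the cue to fall back to whatever alternate
backup method you've kept around.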
224
225 Of course it's worth mentioning that b and c deal with features that most
226 filesystems don't have at all, so with the exception of quotas, it's not
227 like something's broken on btrfs that works on other filesystems.
228 Instead, these features are (nearly) unique to btrfs, so even if they
229 come with certain limitations, that's still better than not having the
230 option of using the feature at all, because it simply doesn't exist on
231 the other filesystem!
232
233 10) Btrfs in general is headed toward stable now, and a lot of people,
234 including me, have used it for a significant amount of time without
235 problems, but it's still new enough that you're strongly urged to make
236 and test your backups, because by not doing so, you're stating by your
237 actions if not your words, that you simply don't care if some as yet
238 undiscovered and unfixed bug in the filesystem eats your data.
239
240 For similar reasons altho already mentioned above, run the latest stable
241 kernel from the latest stable kernel series, at the oldest, and consider
242 running rc kernels from at least rc2 or so (by which time any real data
243 eating bugs, in btrfs or elsewhere, should be found and fixed, or at
244 least published). With anything older, you are literally risking
245 your data to known and fixed bugs.
246
247 As is said, take reasonable care and you're much less likely to be the
248 statistic!
249
250 --
251 Duncan - List replies preferred. No HTML msgs.
252 "Every nonfree program has a lord, a master --
253 and if you use the program, he is your master." Richard Stallman
