Hi list,

interestingly enough, the topic of BTRFS came up on gentoo-amd64 in the
context of a discussion on RAID. Naturally, ZFS and BTRFS emerged as part
of the discussion, and Duncan gave a good point-by-point summary that
might save someone some time in the future (you know, so you don't have
to trawl archives to get all the pieces together). So most of it has been
mentioned before, but I think there are some details and points that
haven't been.

(I also thought that James, who mentioned in my BTRFS thread that he
wanted to collect stuff for an eventual Wiki entry, might find this
interesting, too.)

So anyway, I hope somebody finds this useful!

Begin forwarded message:

Date: Wed, 28 May 2014 03:12:05 +0000 (UTC)
From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] btrfs Was: Soliciting new RAID ideas


thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:

> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.

Given that I'm a regular on the btrfs list as well as running it myself,
I'm likely to know more about it than most. Here's a whirlwind rundown
with a strong emphasis on practical points a lot of people miss (IOW, I'm
skipping a lot of the commonly covered and obvious stuff). Point 6 below
directly answers your move/enlarge question. Meanwhile, points 1, 7 and
8 are critically important, as we see a lot of people on the btrfs list
getting them wrong.

41 |
|
42 |
1) Since there's raid5/6 discussion on the thread... Don't use btrfs |
43 |
raid56 modes at this time, except purely for playing around with trashable |
44 |
or fully backed up data. The implementation as introduced isn't code- |
45 |
complete, and while the operational runtime side works, recovery from |
46 |
dropped devices, not so much. Thus, in terms of data safety you're |
47 |
effectively running a slow raid0 with lots of extra overhead that can be |
48 |
considered trash if a device drops, with the sole benefit being that when |
49 |
the raid56 mode recovery implementation code gets merged (and is tested |
50 |
for a kernel cycle or two to work out the initial bugs), you'll then get |
51 |
what amounts to a "free" upgrade to the raid5 or raid6 mode you had |
52 |
originally configured, since it was doing the operational parity |
53 |
calculation and writes to track it all along, it just couldn't yet be |
54 |
used for actual recovery as the code simply wasn't there to do so. |
55 |
|
2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on one
or multiple devices) and dup mode (on a single device, metadata is
duplicated by default -- two copies, except on ssd, where the default is
a single copy since some ssds dedup anyway), are reasonably mature and
stable -- to the same point as btrfs in general, anyway, which is to say
"mostly stable, keep your backups fresh, but you're not /too/ likely to
have to use them." There are still enough bugs being fixed in each
kernel release, however, that running the latest stable series is
/strongly/ recommended; if you're not doing so, your data is at risk
from known-fixed bugs (even if at this point they only tend to hit the
corner-cases).

3) It's worth noting that btrfs treats data and metadata separately --
when you do a mkfs.btrfs, you can configure redundancy modes separately
for each. The single-device default is (as above) dup metadata (except
on ssd) and single data; the multi-device default is raid1 metadata and
single data.

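For illustration, that looks something like this at mkfs time (the
device names here are just placeholder examples):

    # Single device: duplicated metadata, single data (the defaults):
    mkfs.btrfs -m dup -d single /dev/sdX1

    # Two devices: raid1 for both metadata and data:
    mkfs.btrfs -m raid1 -d raid1 /dev/sdX1 /dev/sdY1
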
4) FWIW, most of my btrfs formatted partitions are dual-device raid1
mode for both data and metadata, on ssd. (The second backup is reiserfs
on spinning rust, just in case some Armageddon bug eats all the btrfs --
working copy and first backup -- at the same time, tho btrfs is stable
enough now that's extremely unlikely; but I didn't consider it so back
when I set things up, nearly a year ago now.)

The reason for my raid1 mode choice isn't that of ordinary raid1; it's
specifically due to btrfs' checksumming and data integrity features --
if one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO
TRY, check the second copy, and if it's good, use it and rewrite the bad
copy. Btrfs scrub allows checking the entire filesystem for checksum
errors and repairing any it finds from good copies, where possible.

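A scrub runs against the mounted filesystem, something like the
following (the mount point is just an example):

    btrfs scrub start /mnt    # kicks off a background scrub
    btrfs scrub status /mnt   # check progress and error counts
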
Obviously, the default single data mode (or raid0) won't have a second
copy to check and rewrite from, while raid1 (and raid10) modes will, as
will dup-mode metadata on a single device. Note that dup mode isn't
allowed for data, only metadata, with one exception: the
mixed-blockgroup mode, which stores data and metadata together. That's
the default on filesystems under 1 GiB, but isn't recommended on large
filesystems for performance reasons.

So I wanted a second copy of both data and metadata to take advantage of
btrfs' data integrity and scrub features, and with btrfs raid1 mode, I
get both that and the traditional raid1 device-loss protection as well.
=:^)

5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
mirrored, no matter how many devices are configured into the filesystem.
N-way mirroring is the next feature on the roadmap after the raid56 work
is completed, but given how nearly every btrfs feature has taken much
longer to complete than originally planned, I'm not expecting it until
sometime next year, now.

Which is unfortunate, as my risk vs. cost sweet spot would be three-way
mirroring, covering the case where *TWO* copies of a block fail
checksum. Oh, well, it's coming, even if it seems at this point like the
proverbial carrot dangling off a stick held in front of the donkey.

6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs
device add/delete, to add or delete a device to/from a filesystem
(moving the content off a to-be-deleted device in the process), plus
btrfs balance, to restripe/convert/rebalance between devices as well as
to free allocated-but-empty data and metadata chunks back to
unallocated. There's also btrfs filesystem resize, but that's more like
the conventional filesystem resize command, resizing the part of the
filesystem on an individual device (partitioned/virtual or whole
physical device).

So to add a device, you'd btrfs device add, then btrfs balance, with an
optional conversion to a different redundancy mode if desired, to
rebalance the existing data and metadata onto that device. (Without the
rebalance it would be used for new chunks, but existing data and
metadata chunks would stay where they were. I'll omit the "chunk
definition" discussion in the interest of brevity.)

To delete a device, you'd btrfs device delete, which moves all the data
on that device onto the other devices in the filesystem, after which it
can be removed.

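Concretely, the whole cycle looks something like this (device names and
mount point are placeholder examples):

    # Add a device, then rebalance existing chunks onto it, optionally
    # converting both data and metadata to raid1 in the same pass:
    btrfs device add /dev/sdY1 /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

    # Delete a device; its chunks are migrated off before removal:
    btrfs device delete /dev/sdX1 /mnt

    # Resize the filesystem's slice of one device (devid 1) to maximum:
    btrfs filesystem resize 1:max /mnt
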
7) Given the thread, I'd be remiss to omit this one. VM images and other
large "internal-rewrite-pattern" files (large database files, etc.) need
special treatment on btrfs, at least currently. As such, btrfs may not
be the greatest solution for Mark (tho it would work fine with special
procedures), given the several VMs he runs. This one unfortunately hits
a lot of people. =:^( But here's a heads-up, so it doesn't have to hit
anyone reading this! =:^)

As a property of the technology, any copy-on-write-based filesystem is
going to find files where existing data within the file is repeatedly
rewritten (as opposed to new data simply being appended; think of a log
file or a live-stored audio/video stream) extremely challenging to deal
with. The problem is that unlike ordinary filesystems, which rewrite the
data in place such that a file continues to occupy the same extents as
it did before, copy-on-write filesystems write a changed block to a
different location. COW does mean atomic updates and thus more
reliability, since either the new data or the old data should exist,
never an unpredictable mixture of the two. But as a result of the above
rewrite pattern, this type of internally-rewritten file gets **HEAVILY**
fragmented over time.

We've had filefrag reports of several-gig files with over 100K extents!
Obviously, this isn't going to be the most efficient file in the world
to access!

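You can check a file's extent count yourself with filefrag (from
e2fsprogs); the path here is just an example:

    filefrag /var/lib/libvirt/images/vm.img
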
For smaller files, up to a couple hundred MiB or perhaps a bit more,
btrfs has the autodefrag mount option, which can help a lot. With this
option enabled, whenever a block of a file is changed and rewritten,
thus written elsewhere, btrfs queues up a rewrite of the entire file to
happen in the background. The rewrite is done sequentially, thus
defragging the file. This works quite well for firefox's sqlite database
files, for instance, as they're internal-rewrite-pattern, but they're
small enough that autodefrag handles them reasonably nicely.

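Enabling it is just a mount option, for example in /etc/fstab (the
device and mount point are placeholders):

    /dev/sdX1  /home  btrfs  defaults,autodefrag  0 0

Or on an already-mounted filesystem:

    mount -o remount,autodefrag /home
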
But this solution doesn't scale so well as the file size increases
toward and past a GiB, particularly for files with a continuous stream
of internal rewrites, such as can happen with a running VM writing to
its virtual storage device. At some point, the stream of writes comes in
faster than the file can be rewritten, and things start to back up!

To deal with this case, there's the NOCOW file attribute, set with
chattr +C. However, to be effective, this attribute must be set while
the file is empty, before it has content. The easiest way to do that is
to set the attribute on the directory that will contain the files.
While it doesn't affect the directory itself, newly created files within
that directory inherit the NOCOW attribute before they have data, thus
allowing it to work without your having to worry about it much. For
existing files, create a new directory, set its NOCOW attribute, and
COPY (don't move, and don't use cp --reflink) the existing files into
it.

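In practice, that looks something like this (the paths are placeholder
examples):

    mkdir /var/lib/vm-images
    chattr +C /var/lib/vm-images   # new files here inherit NOCOW
    lsattr -d /var/lib/vm-images   # verify: the 'C' flag should appear
    # Copy (don't move) existing images in, so the new copies get NOCOW:
    cp /old/path/image.img /var/lib/vm-images/
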
Once you have your large internal-rewrite-pattern files set NOCOW, btrfs
will rewrite them in place as an ordinary filesystem would, thus
avoiding the problem.

Except for one thing. I haven't mentioned btrfs snapshots yet, as that
feature is, but for this caveat, covered well enough elsewhere. But
here's the problem: a snapshot locks the existing file data in place. As
a result, the first write to a block within a file after a snapshot MUST
be COW, even if the file is otherwise set NOCOW.

If only the occasional one-off snapshot is done, it's not /too/ bad, as
all the internal file writes between snapshots are NOCOW; it's only the
first one to each file block after a snapshot that must be COW. But many
people and distros are script-automating their snapshots in order to
have rollback capabilities, and on btrfs, snapshots are (ordinarily)
light enough that people are sometimes configuring a snapshot a minute!
If only a minute's changes can be written to the existing location
before there's a snapshot and changes must be written to a new location,
then another snapshot and yet another location... Basically, the NOCOW
we set on that file isn't doing us any good!

8) I'm making this a separate point, as it's important and a lot of
people get it wrong: NOCOW and snapshots don't mix!

There is, however, a (partial) workaround. Because snapshots stop at
btrfs subvolume boundaries, if you put your large VM images and similar
large internal-rewrite-pattern files (databases, etc.) in subvolumes --
making that directory I suggested above a full subvolume, not just a
NOCOW directory -- snapshots of the parent subvolume will not include
the VM images subvolume, thus leaving the VM images alone. This solves
the snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES
mean that those VM images must be backed up using more conventional
methods, since snapshotting won't work for them.

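A minimal sketch of that setup (the paths again being examples):

    # A subvolume instead of a plain directory, so snapshots of the
    # parent stop at its boundary:
    btrfs subvolume create /home/vm-images
    chattr +C /home/vm-images   # files created inside inherit NOCOW
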
9) Some other still partially broken bits of btrfs include:

9a) Quotas: Just don't use them on btrfs at this point. Performance
doesn't scale (altho there's a rewrite in progress), and they are buggy.
Additionally, the scaling interaction with snapshots is geometrically
negative, sometimes requiring 64 GiB of RAM or more, and coming to a
near standstill at that, for users with enough quota-groups and enough
snapshots. If you need quotas, use a more traditional filesystem with
stable quota support. Hopefully by this time next year...

9b) Snapshot-aware defrag: This was enabled at one point, but simply
didn't scale, when it turned out people were doing things like
per-minute snapshots and thus had thousands and thousands of snapshots.
So it has been disabled for the time being. Btrfs defrag will defrag the
working copy it is run on, but currently doesn't account for snapshots,
so data that was fragmented at snapshot time gets duplicated as it is
defragmented. However, they plan to re-enable the feature once they have
rewritten various bits to scale far better than they do at present.

9c) Send and receive: Btrfs send and receive are a very nice feature
pair that can make backups far faster, with far less data transferred.
They're great when they work. Unfortunately, there are still various
corner-cases where they don't. (As an example, a recent fix was for the
case where subdir B was nested inside subdir A for the first, full
send/receive, but later the relationship was reversed, with subdir B
made the parent of subdir A. Until the recent fix, send/receive couldn't
handle that sort of corner-case.) You can go ahead and use it if it's
working for you; if it finishes without error, the copy should be 100%
reliable. However, have an alternate plan for backups in case you
suddenly hit one of those corner-cases and send/receive quits working.

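For reference, a typical backup cycle looks something like this (the
paths are placeholder examples):

    # Initial full backup: send a read-only snapshot:
    btrfs subvolume snapshot -r /mnt/data /mnt/data.snap1
    btrfs send /mnt/data.snap1 | btrfs receive /backup

    # Later, send only the differences relative to the first snapshot:
    btrfs subvolume snapshot -r /mnt/data /mnt/data.snap2
    btrfs send -p /mnt/data.snap1 /mnt/data.snap2 | btrfs receive /backup
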
Of course, it's worth mentioning that 9b and 9c deal with features that
most filesystems don't have at all, so with the exception of quotas,
it's not like something's broken on btrfs that works on other
filesystems. Instead, these features are (nearly) unique to btrfs, so
even if they come with certain limitations, that's still better than not
having the option of using the feature at all, because it simply doesn't
exist on the other filesystem!

10) Btrfs in general is headed toward stable now, and a lot of people,
including me, have used it for a significant amount of time without
problems, but it's still new enough that you're strongly urged to make
and test your backups, because by not doing so, you're stating, by your
actions if not your words, that you simply don't care if some as-yet
undiscovered and unfixed bug in the filesystem eats your data.

For similar reasons (altho already mentioned above), run the latest
stable kernel from the latest stable kernel series at the oldest, and
consider running rc kernels from at least rc2 or so (by which time any
real data-eating bugs, in btrfs or elsewhere, should be found and fixed,
or at least published). With anything older, you are literally risking
your data to known and fixed bugs.

As is said, take reasonable care and you're much less likely to be the
statistic!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman


--
Marc Joliet
--
"People who think they know everything really annoy those of us who know
we don't" - Bjarne Stroustrup