thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:

> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.

Given that I'm a regular on the btrfs list as well as running it myself,
I'm likely to know more about it than most. Here's a whirlwind rundown
with a strong emphasis on practical points a lot of people miss (IOW, I'm
skipping a lot of the commonly covered and obvious stuff). Point 6 below
directly answers your move/enlarge question. Meanwhile, points 1, 7 and
8 are critically important, as we see a lot of people on the btrfs list
getting them wrong.

1) Since there's raid5/6 discussion on the thread... Don't use btrfs
raid56 modes at this time, except purely for playing around with
trashable or fully backed up data. The implementation as introduced
isn't code-complete: the operational runtime side works, but recovery
from dropped devices does not. Thus, in terms of data safety you're
effectively running a slow raid0 with lots of extra overhead, one that
can be considered trash if a device drops. The sole benefit is that when
the raid56 recovery code gets merged (and has been tested for a kernel
cycle or two to work out the initial bugs), you'll get what amounts to a
"free" upgrade to the raid5 or raid6 mode you originally configured,
since btrfs was doing the operational parity calculation and writes to
track it all along; the code to use that parity for actual recovery
simply wasn't there yet.

2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on a
single or multiple devices) and dup mode (on a single device, metadata
is by default duplicated -- two copies, except on ssd where the default
is only a single copy since some ssds dedup anyway) are reasonably
mature and stable -- to the same point as btrfs in general, anyway,
which is to say "mostly stable, keep your backups fresh but you're not
/too/ likely to have to use them." There are still enough bugs being
fixed in each kernel release, however, that running the latest stable
series is /strongly/ recommended; your data is at risk from known-fixed
bugs (even if at this point they only tend to hit the corner-cases) if
you're not doing so.

3) It's worth noting that btrfs treats data and metadata separately --
when you do a mkfs.btrfs, you can configure redundancy modes separately
for each. The single-device default is (as above) dup metadata (except
on ssd) with single data; the multi-device default is raid1 metadata
with single data.
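
For illustration, with hypothetical devices /dev/sdb1 and /dev/sdc1,
spelling those modes out explicitly at mkfs time looks something like:

```shell
# Single device: duplicate metadata, single data (the defaults).
mkfs.btrfs -m dup -d single /dev/sdb1

# Two devices: raid1 for both metadata and data.
mkfs.btrfs -m raid1 -d raid1 /dev/sdb1 /dev/sdc1
```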

4) FWIW, most of my btrfs-formatted partitions are dual-device raid1
mode for both data and metadata, on ssd. (My second backup is reiserfs
on spinning rust, just in case some Armageddon bug eats the btrfs
working copy and first backup at the same time; btrfs is stable enough
now that that's extremely unlikely, but I didn't consider it so when I
set things up nearly a year ago.)

The reason for my raid1 mode choice isn't that of ordinary raid1; it's
specifically due to btrfs' checksumming and data integrity features. If
one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY,
check the second copy, and if it's good, use it and rewrite the bad
copy. Btrfs scrub allows checking the entire filesystem for checksum
errors and restoring any errors it finds from good copies where possible.
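
A scrub run looks something like this (the mountpoint is a placeholder;
scrub operates on a mounted filesystem and needs root):

```shell
# Kick off a background scrub of the whole filesystem; checksum
# errors are repaired from the good copy where one exists.
btrfs scrub start /mnt

# Check progress and the error summary later:
btrfs scrub status /mnt
```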

Obviously, the default single data mode (or raid0) won't have a second
copy to check and rewrite from, while raid1 (and raid10) modes will, as
will dup-mode metadata on a single device. With one exception, dup mode
isn't allowed for data, only metadata; the exception is the mixed-
blockgroup mode that mixes data and metadata together, which is the
default on filesystems under 1 GiB but isn't recommended on large
filesystems for performance reasons.

So I wanted a second copy of both data and metadata to take advantage of
btrfs' data integrity and scrub features, and with btrfs raid1 mode, I
get both that and the traditional raid1 device-loss protection as well.
=:^)

5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
mirrored, no matter how many devices are configured into the filesystem.
N-way-mirroring is the next feature on the roadmap after the raid56 work
is completed, but given how nearly every btrfs feature has taken much
longer to complete than originally planned, I'm not expecting it until
sometime next year, now.

Which is unfortunate, as my risk vs. cost sweet spot would be 3-way-
mirroring, covering the case where *TWO* copies of a block fail
checksum. Oh, well, it's coming, even if at this point it seems like the
proverbial carrot dangling off a stick held in front of the donkey.

6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs
device add/delete, to add or delete a device to/from a filesystem
(moving the content off a to-be-deleted device in the process), plus
btrfs balance, to restripe/convert/rebalance between devices as well as
to free allocated but empty data and metadata chunks back to
unallocated. There's also btrfs resize, but that's more like the
conventional filesystem resize command, resizing the part of the
filesystem on an individual device (partitioned/virtual or whole
physical device).

So to add a device, you'd btrfs device add, then btrfs balance (with an
optional conversion to a different redundancy mode if desired) to
rebalance the existing data and metadata onto that device. Without the
rebalance, the new device would be used for new chunks, but existing
data and metadata chunks would stay where they were. (I'll omit the
"chunk definition" discussion in the interest of brevity.)
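
As a sketch, with a hypothetical /dev/sdc1 being added to a filesystem
mounted at /mnt, and an optional conversion to raid1 folded into the
balance:

```shell
# Add the new device; only new chunks would land on it so far.
btrfs device add /dev/sdc1 /mnt

# Rebalance existing chunks across all devices, converting both
# data (-d) and metadata (-m) to raid1 in the same pass.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```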

To delete a device, you'd btrfs device delete, which moves all the data
on that device onto the other devices in the filesystem, after which the
device can be removed.
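
Again as a sketch with hypothetical names:

```shell
# Migrate everything off /dev/sdb1, then drop it from the filesystem.
# The device can be physically removed once this completes.
btrfs device delete /dev/sdb1 /mnt
```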

7) Given the thread, I'd be remiss to omit this one. VM images and other
large "internal-rewrite-pattern" files (large database files, etc.) need
special treatment on btrfs, at least currently. As such, btrfs may not
be the greatest solution for Mark (tho it would work fine with special
procedures), given the several VMs he runs. This one unfortunately hits
a lot of people. =:^( But here's a heads-up, so it doesn't have to hit
anyone reading this! =:^)

As a property of the technology, any copy-on-write-based filesystem is
going to find files where various bits of existing data within the file
are repeatedly rewritten (as opposed to new data simply being appended;
think a log file or a live-stored audio/video stream) extremely
challenging to deal with. The problem is that unlike ordinary
filesystems, which rewrite data in place so that a file continues to
occupy the same extents as before, copy-on-write filesystems write a
changed block to a different location. COW does mean atomic updates and
thus more reliability, since either the new data or the old data should
exist, never an unpredictable mixture of the two. But as a result of the
rewrite pattern above, this type of internally-rewritten file gets
**HEAVILY** fragmented over time.

We've had filefrag reports of several-gig files with over 100K extents!
Obviously, this isn't going to be the most efficient file in the world
to access!

For smaller files, up to a couple hundred MiB or perhaps a bit more,
btrfs has the autodefrag mount option, which can help a lot. With this
option enabled, whenever a block of a file is changed and rewritten,
thus written elsewhere, btrfs queues up a rewrite of the entire file to
happen in the background. The rewrite is done sequentially, thus
defragging the file. This works quite well for firefox's sqlite
database files, for instance: they're internal-rewrite-pattern, but
they're small enough that autodefrag handles them reasonably nicely.
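
Enabling it is just a mount option, e.g. (hypothetical device and
mountpoint):

```shell
# One-off mount with autodefrag enabled:
mount -o autodefrag /dev/sdb1 /mnt

# Or persistently, via an /etc/fstab entry along these lines:
# /dev/sdb1  /mnt  btrfs  autodefrag,noatime  0 0
```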

But this solution doesn't scale so well as the file size increases
toward and past a GiB, particularly for files with a continuous stream
of internal rewrites, such as can happen with a running VM writing to
its virtual storage device. At some point, the stream of writes comes in
faster than the file can be rewritten, and things start to back up!

To deal with this case, there's the NOCOW file attribute, set with
chattr +C. However, to be effective, this attribute must be set while
the file is empty, before it has existing content. The easiest way to do
that is to set the attribute on the directory that will contain the
files. While it doesn't affect the directory itself, newly created files
within that directory inherit the NOCOW attribute before they have data,
thus allowing it to work without your having to worry about it much. For
existing files, create a new directory, set its NOCOW attribute, and
COPY (don't move, and don't use cp --reflink) the existing files into it.
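
With hypothetical paths, that procedure looks like:

```shell
# Create the directory and set NOCOW on it while it's empty;
# files created inside will inherit the attribute.
mkdir /mnt/vm-images
chattr +C /mnt/vm-images

# COPY existing images in (a real copy, not a reflink or a move),
# so the new files are created NOCOW from the start.
cp --reflink=never /srv/old-vms/disk0.img /mnt/vm-images/

# Verify: the 'C' flag should show up.
lsattr /mnt/vm-images/disk0.img
```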

Once you have your large internal-rewrite-pattern files set NOCOW, btrfs
will rewrite them in place as an ordinary filesystem would, thus
avoiding the problem.

Except for one thing. I haven't covered btrfs snapshots here, as that
feature is, but for this caveat, covered well enough elsewhere. But
here's the problem: a snapshot locks the existing file data in place.
As a result, the first write to a block within a file after a snapshot
MUST be COW, even if the file is otherwise set NOCOW.

If only the occasional one-off snapshot is done, it's not /too/ bad, as
all the internal file writes between snapshots are NOCOW; it's only the
first write to each file block after a snapshot that must be COW. But
many people and distros are script-automating their snapshots in order
to have rollback capability, and on btrfs, snapshots are (ordinarily)
light enough that people sometimes configure a snapshot a minute! If
only a minute's changes can be written to the existing location before
there's a snapshot and changes must be written to a new location, then
another snapshot and yet another location... Basically, the NOCOW we set
on that file isn't doing us any good!

8) I'm making this a separate point as it's important and a lot of
people get it wrong: NOCOW and snapshots don't mix!

There is, however, a (partial) workaround. Because snapshots stop at
btrfs subvolume boundaries, if you put your large VM images and similar
large internal-rewrite-pattern files (databases, etc.) in subvolumes --
making that directory I suggested above a full subvolume, not just a
NOCOW directory -- snapshots of the parent subvolume will not include
the VM images subvolume, thus leaving the VM images alone. This solves
the snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES
mean that those VM images must be backed up using more conventional
methods, since snapshotting won't work for them.
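
A sketch of that layout, with hypothetical paths:

```shell
# A dedicated subvolume (plus NOCOW) for the VM images:
btrfs subvolume create /mnt/vm-images
chattr +C /mnt/vm-images

# A snapshot of the parent stops at the subvolume boundary,
# so /mnt/vm-images is not included:
btrfs subvolume snapshot /mnt /mnt/snap-$(date +%F)
```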

9) Some other still partially broken bits of btrfs include:

9a) Quotas: Just don't use them on btrfs at this point. Performance
doesn't scale (altho there's a rewrite in progress), and they are buggy.
Additionally, the scaling interaction with snapshots is geometrically
negative, sometimes requiring 64 GiB of RAM or more and coming to a near
standstill at that, for users with enough quota-groups and enough
snapshots. If you need quotas, use a more traditional filesystem with
stable quota support. Hopefully by this time next year...

9b) Snapshot-aware defrag: This was enabled at one point but simply
didn't scale, once it turned out people were doing things like
per-minute snapshots and thus had thousands and thousands of snapshots.
So it has been disabled for the time being. Btrfs defrag will defrag the
working copy it is run on, but currently doesn't account for snapshots,
so data that was fragmented at snapshot time gets duplicated as it is
defragmented. However, they plan to re-enable the feature once they have
rewritten various bits to scale far better than they do at present.

9c) Send and receive: Btrfs send and receive are a very nice feature
that can make backups far faster, with far less data transferred.
They're great when they work. Unfortunately, there are still various
corner-cases where they don't. (As an example, a recent fix was for the
case where subdir B was nested inside subdir A for the first, full
send/receive, but later the relationship was reversed, with subdir B
made the parent of subdir A. Until the recent fix, send/receive couldn't
handle that sort of corner-case.) You can go ahead and use it if it's
working for you; if it finishes without error, the copy should be 100%
reliable. However, have an alternate plan for backups in case you
suddenly hit one of those corner-cases and send/receive quits working.
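
The basic flow, with hypothetical paths, is a read-only snapshot
followed by a full send, then incrementals against the previous
snapshot:

```shell
# Initial full transfer to a backup filesystem mounted at /backup:
btrfs subvolume snapshot -r /mnt /mnt/snap1
btrfs send /mnt/snap1 | btrfs receive /backup

# Later, send only the differences relative to snap1:
btrfs subvolume snapshot -r /mnt /mnt/snap2
btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /backup
```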

Of course it's worth mentioning that b and c deal with features that
most filesystems don't have at all, so with the exception of quotas,
it's not like something's broken on btrfs that works on other
filesystems. Instead, these features are (nearly) unique to btrfs, so
even if they come with certain limitations, that's still better than not
having the option of using the feature at all, because it simply doesn't
exist on the other filesystem!

10) Btrfs in general is headed toward stable now, and a lot of people,
including me, have used it for a significant amount of time without
problems, but it's still new enough that you're strongly urged to make
and test your backups. By not doing so, you're stating by your actions,
if not your words, that you simply don't care if some as-yet
undiscovered and unfixed bug in the filesystem eats your data.

For similar reasons, altho already mentioned above, run the latest
stable kernel from the latest stable kernel series at the oldest, and
consider running rc kernels from at least rc2 or so (by which time any
real data-eating bugs, in btrfs or elsewhere, should be found and fixed,
or at least published). Anything older, and you are literally risking
your data to known and already-fixed bugs.

As is said, take reasonable care, and you're much less likely to be the
statistic!

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman