Marc Joliet posted on Wed, 25 Feb 2015 19:56:32 +0100 as excerpted:

> But regardless of what you use, I think that the worst offenders are
> services that write logs themselves (I'm looking at you, samba).

>> c) I use btrfs for my primary filesystems, and btrfs and journald's
>> binary-format journals don't play so well together. [...]
>
> Well, I'm on an SSD, but even on the laptop I haven't noticed any
> performance issues (yet). Then again, I use autodefrag, so that
> probably helps.

Autodefrag does help.

There are two related issues at work here.

The primary one is that pretty much any COW-based filesystem, including
btrfs, is going to have problems with internal-rewrite-pattern files (as
opposed to append-only files) of any significant size. At the small end
this includes sqlite database files such as those firefox and other
mozilla products use. These, autodefrag manages well.
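
For those who haven't tried it, autodefrag is simply a btrfs mount
option; a minimal sketch of enabling it (the / mountpoint here is just
an example, adjust to your own layout):

  # Enable autodefrag on an already-mounted btrfs filesystem:
  mount -o remount,autodefrag /

  # Or make it persistent via the options field in /etc/fstab:
  # UUID=<your-fs-uuid>  /  btrfs  defaults,autodefrag  0 0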
22 |
|
23 |
At the larger end are multi-gig VM images and similarly sized database |
24 |
files. These, autodefrag doesn't manage so well, particularly if writes |
25 |
are coming in at any significant rate, because at some point it's going |
26 |
to take longer to rewrite the entire file (or even the affected normally |
27 |
one-gig data chunk) than the time between incoming writes. |
28 |
|
29 |
And the place where such fragmentation REALLY shows up is trying to run |
30 |
btrfs filesystem maintenance commands like balance. On a sufficiently |
31 |
fragmented filesystemsystem, particularly with quotas on too as their |
32 |
tracking significantly complicates things, balance can take WEEKS on a |
33 |
single-digits terabyte filesystem. |
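
As a point of reference, balance can at least be told to limit how much
it rewrites per run via filters; a sketch, with the / mountpoint again
just an example:

  # Rebalance only data chunks under 50% full, limiting how much
  # data gets rewritten in a single pass:
  btrfs balance start -dusage=50 /

  # Check on a running balance from another terminal:
  btrfs balance status /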

IOW, a lot of people don't notice it until something goes wrong and
they're trying to replace a failed device with one of the btrfs raid
modes, etc. That's a nasty time to find out how tangled things were,
and realize it'll take weeks to sort out, during which another device
could well fail, leaving you high and dry!

The immediate (partial) solution to the problem with these large files,
typically over a gig, is to set them nocow (which on btrfs must be done
at creation time, while the file is still zero-sized, in order to take
proper effect; this is normally accomplished by setting the directory
they'll be in to nocow, which doesn't affect the directory itself, but
does cause any newly created files or subdirs in it to inherit the
nocow attribute).
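
In practice that's chattr's C attribute; a minimal sketch, with the
directory path purely illustrative:

  # Mark a directory nocow BEFORE the big files are created in it;
  # new files and subdirs inside will inherit the attribute:
  mkdir -p /var/lib/vm-images
  chattr +C /var/lib/vm-images

  # Verify: the C attribute should appear in the flags column
  lsattr -d /var/lib/vm-images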

And this is actually what systemd-219 is doing with the journal files
now.

But setting nocow automatically disables both transparent compression
(if otherwise enabled) and checksumming. The latter isn't actually as
bad as one might expect, because most applications (including systemd/
journald) that deal with such files already have some sort of builtin
corruption detection and possible repair functionality -- they have to,
in order to work acceptably on traditional filesystems that didn't do
filesystem-level checksumming, and letting them have at it would indeed
seem to be the best policy in this case.
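
Journald's builtin integrity checking can in fact be exercised
directly:

  # Walk all journal files and report any corruption found:
  journalctl --verify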

The second, related problem is snapshotting. Because snapshotting
relies on COW, snapshotting a nocow file forces it to effectively cow-1
-- the first time a block is rewritten after a snapshot, it is cowed,
despite the ordinary nocow. Now set up, say, hourly auto-snapshotting
using snapper or the like, and continue to write to that "nocow" file,
and pretty soon it'll be as fragmented as if it weren't nocow at all!
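
You can actually watch this happen; a sketch, with all paths
illustrative:

  # Snapshot the subvolume holding a nocow file:
  btrfs subvolume snapshot /var/lib/vm-images /snapshots/vm-images.1

  # Keep writing to the file, then count its extents; each block
  # rewritten after the snapshot gets cowed once, so the extent
  # count climbs with every snapshot/write cycle:
  filefrag /var/lib/vm-images/disk.img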
66 |
|
67 |
With careful planning, separate subvolumes for the nocow files so they |
68 |
aren't snapshotted with the rest of the system, snapshotting the nocow |
69 |
subvolume with a period near the low frequency end of your target range |
70 |
(say every other day or weekly instead of daily or twice a day), and if |
71 |
they aren't rotated out regularly, periodic scripted btrfs defrags (say |
72 |
weekly or monthly) of the affected files, good admins generally can keep |
73 |
fragmentation from this source at least within reason. |
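
A minimal sketch of such a scripted defrag, run weekly or monthly from
cron (the path is again just an example):

  #!/bin/sh
  # Defragment the nocow files in place. Note that defrag is not
  # snapshot-aware on current kernels, so it breaks sharing with
  # existing snapshots; run it when snapshot churn is low.
  btrfs filesystem defragment -r /var/lib/vm-images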
74 |
|
75 |
And systemd-219 is actually creating a separate subvolume for its journal |
76 |
files now, by default, thus keeping them out of the general system (or |
77 |
/var) snapshot. But while both that and nocowing the journal files now |
78 |
does help, it's still a reasonably fragile solution, as long as admins |
79 |
don't realize what's going on, and can be tempted to set daily or more |
80 |
frequent snapshotting on the journal subvolume too (or if the subvolume |
81 |
doesn't take, say because it's an existing installation where there's |
82 |
already a directory by that name and thus there can't be a subvolume at |
83 |
the same place with the same name). |
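
On such an existing installation the conversion can be done by hand; a
sketch, assuming the standard journal location:

  # Replace the existing journal directory with a subvolume
  # (journald is socket-activated, so do this from single-user
  # mode or right before a reboot to keep the files quiescent):
  mv /var/log/journal /var/log/journal.old
  btrfs subvolume create /var/log/journal
  cp -a /var/log/journal.old/. /var/log/journal/
  rm -rf /var/log/journal.old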


**BUT A BIG CAVEAT** lest anyone on stable with btrfs and systemd jump
onto 219 too fast. Yes, 219 DOES have some nice new features.
Unfortunately, it's broken in a few new ways as well.

* Apparently, systemd-219's networkd breaks with at least static IPv4-
only configurations, as my network failed to come up with it. From the
errors, it was trying IPv6, and because that failed (it's not even in
my kernel), it gave up and didn't even try IPv4, instead trying to set
the IPv4 IP and gateway values into IPv6, which obviously isn't going
to work at all!
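
For reference, the sort of static IPv4-only networkd configuration
that's affected looks roughly like this (interface name and addresses
are examples):

  # /etc/systemd/network/50-static.network
  [Match]
  Name=eth0

  [Network]
  Address=192.168.1.10/24
  Gateway=192.168.1.1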

* There are also issues with the new tmpfiles.d configuration that has
replaced d lines (create a directory if it doesn't exist) with new v
lines (create a subvolume if on btrfs and possible, else fall back to d
behavior and create a directory), because subvolume creation fails
differently than directory creation, and the differences aren't all
sorted yet.
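
The difference is a single type character in the tmpfiles.d line; the
journal directory entry, for example, went from roughly:

  # old: plain directory
  d /var/log/journal 2755 root systemd-journal - -

  # new: btrfs subvolume if possible, directory otherwise
  v /var/log/journal 2755 root systemd-journal - -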

Hopefully, systemd-220 will fix the IPv4 issue and bring a bit more
maturity to the tmpfiles.d subvolume-creation feature by properly
falling back to d/directories if need be, instead of erroring out.
Meanwhile, hopefully a gentoo systemd-219-rX release will fix some of
these issues as well. But for right now, I'd suggest staying away from
it, as it's definitely not prime-time ready in its current form.

FWIW, I'm back on 218-r3 for now, done with a quick emerge --usepkgonly
'<systemd-219'. I've not yet masked 219, however, so an update will try
to bring it back in, and I will thus have to see what changes have
happened and either mask it or try building it again, next time I
update.
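
For anyone wanting to hold it back meanwhile, the mask itself is a
one-liner:

  # /etc/portage/package.mask/systemd
  >=sys-apps/systemd-219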

> What's funny though is that the systemd news file
> (http://cgit.freedesktop.org/systemd/systemd/tree/NEWS) occasionally
> refers to non-btrfs file systems as "legacy file systems". At least,
> as a btrfs user I think it's funny :) .

Indeed. They've definitely adopted btrfs and are running with it. If
you've read anything about their plans, the features of btrfs really do
provide a filesystem-side ready-made solution for them to adopt, altho
I'd still not call btrfs itself exactly mature -- even more than with
other filesystems, if an admin is putting data on btrfs and doesn't
have tested backups available, they really do NOT value that data,
claims to the contrary notwithstanding.

And in a way, it's good, because systemd pushing it like that means
systemd-based distros will be pushing it too, which will bring far
wider deployment of btrfs, ready or not, which will in turn help btrfs
mature faster with all those additional strange-corner-case bug reports
and hopefully fixes. I just feel for the poor admins trusting their
distro as they head into this without the backups they really should
have... as ultimately, a lot of them are unfortunately going to have to
learn, the HARD way, that no backups really DOES mean you'd rather lose
that data than bother with backups! =:^(

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman