Gentoo Archives: gentoo-amd64

From: Duncan <1i5t5.duncan@×××.net>
To: gentoo-amd64@l.g.o
Subject: [gentoo-amd64] Re: Systemd migration: opinion and questions
Date: Thu, 26 Feb 2015 01:56:02
Message-Id: pan$63dac$8fbeccdf$99e3327c$a5c109ed@cox.net
In Reply to: Re: [gentoo-amd64] Re: Systemd migration: opinion and questions by Marc Joliet
1 Marc Joliet posted on Wed, 25 Feb 2015 19:56:32 +0100 as excerpted:
2
3 > But regardless of what you use, I think that the worst offenders are
4 > services that write logs themselves (I'm looking at you, samba).
5
6 >> c) I use btrfs for my primary filesystems, and btrfs and journald's
7 >> binary-format journals don't play so well together. [...]
8 >
9 > Well, I'm on an SSD, but even on the laptop I haven't noticed any
10 > performance issues (yet). Then again, I use autodefrag, so that
11 > probably helps.
12
13 Autodefrag does help.
14
15 There are two related issues at work, here.
16
17 The primary one is that pretty much any COW-based filesystem, including
18 btrfs, is going to have problems with internal-rewrite-pattern (as
19 opposed to append-only rewrites) files of any significant size. At the
20 small end this includes sqlite database files such as those firefox and
21 other mozilla products use. These, autodefrag manages well.
22
23 At the larger end are multi-gig VM images and similarly sized database
24 files. These, autodefrag doesn't manage so well, particularly if writes
25 are coming in at any significant rate, because at some point it's going
26 to take longer to rewrite the entire file (or even the affected normally
27 one-gig data chunk) than the time between incoming writes.
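
To put rough numbers on that break-even point, here is a back-of-envelope sketch in Python. It follows the reasoning above (a defrag pass has to finish before the next write lands) and nothing more. The 100 MiB/s throughput and the write intervals are assumed figures for illustration, not measurements, and the function names are made up for this example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def rewrite_seconds(size_bytes, throughput_bytes_per_s):
    """Time to rewrite size_bytes once at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s


def defrag_keeps_up(size_bytes, throughput_bytes_per_s, write_interval_s):
    """True if one full rewrite finishes before the next write arrives."""
    return rewrite_seconds(size_bytes, throughput_bytes_per_s) < write_interval_s
```

With those assumed figures, `rewrite_seconds(GIB, 100 * MIB)` comes to roughly ten seconds for a single one-gig chunk, so `defrag_keeps_up(GIB, 100 * MIB, 1.0)` is false: a file taking a write every second can never be caught up by rewriting a whole chunk each time.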
28
29 And the place where such fragmentation REALLY shows up is trying to run
30 btrfs filesystem maintenance commands like balance. On a sufficiently
31 fragmented filesystem, particularly with quotas on too, as their
32 tracking significantly complicates things, balance can take WEEKS on a
33 single-digits terabyte filesystem.
34
35 IOW, a lot of people don't notice it until something goes wrong and
36 they're trying to replace a failed device with one of the btrfs raid
37 modes, etc. That's a nasty time to find out how tangled things were, and
38 realize it'll take weeks to sort out, during which another device could
39 well fail, leaving you high and dry!
40
41 The immediate (partial) solution to the problem with these large files,
42 typically over a gig, is to set them nocow (which on btrfs must be done
43 at creation time, while the file is still zero-sized, in order to take
44 proper effect; this is normally accomplished by setting the directory
45 they'll be in to nocow, which doesn't affect the directory itself, but
46 does cause any newly created files or subdirs in it to inherit the nocow
47 attribute).
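
A minimal sketch of that directory-level setup, written as a small Python helper that only builds the commands rather than running them. chattr +C and lsattr -d are the standard tools for setting and showing the nocow attribute; the helper names and the example path are hypothetical.

```python
import shlex


def nocow_dir_commands(directory):
    """Return the commands (as argv lists) that prepare a nocow directory."""
    return [
        ["mkdir", "-p", directory],
        ["chattr", "+C", directory],  # files created in it afterwards inherit nocow
        ["lsattr", "-d", directory],  # the listing should now include the 'C' flag
    ]


def as_shell(commands):
    """Render the argv lists as copy-pasteable shell lines."""
    return "\n".join(shlex.join(argv) for argv in commands)
```

Calling `print(as_shell(nocow_dir_commands("/var/lib/vm-images")))` prints the three commands to run as root. Files that already exist in the directory are not converted; they would have to be copied into new files created after the attribute was set.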
48
49 And this is actually what systemd-219 is doing with the journal files now.
50
51 But, setting nocow automatically disables both transparent compression
52 (if otherwise enabled) and checksumming. The latter isn't actually as
53 bad as one might expect, because most applications (including systemd/
54 journald) that deal with such files already have some sort of builtin
55 corruption detection and possible repair functionality -- they have to in
56 order to work acceptably on traditional filesystems that didn't do
57 filesystem level checksumming, and letting them have at it would indeed
58 seem to be the best policy in this case.
59
60 The second, related problem, is snapshotting. Because snapshotting
61 relies on COW, snapshotting a nocow file forces it to effectively cow-1
62 -- the first time a block is rewritten after a snapshot, it is cowed,
63 despite the ordinary nocow. Now set up, say, hourly auto-snapshotting using
64 snapper or the like, and continue to write to that "nocow" file, and
65 pretty soon it'll be as fragmented as if it weren't nocow at all!
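
The cow-1 effect can be seen in a toy model. The Python sketch below is not how the btrfs allocator actually works; it assumes a file of fixed-size blocks, uniformly random single-block writes, and new blocks handed out sequentially from free space. All names and parameters are made up for illustration.

```python
import random


def count_extents(phys):
    """Number of runs of physically consecutive blocks."""
    return 1 + sum(1 for a, b in zip(phys, phys[1:]) if b != a + 1)


def simulate(blocks=1024, writes=5000, snapshot_every=None, seed=0):
    """Return the extent count of a nocow file after random rewrites."""
    rng = random.Random(seed)
    phys = list(range(blocks))   # logical block i lives at physical address phys[i]
    shared = [False] * blocks    # True while a snapshot still references the block
    next_free = blocks           # next unused physical address
    for n in range(1, writes + 1):
        i = rng.randrange(blocks)
        if shared[i]:
            # First write after a snapshot: the block is cowed once (cow-1).
            phys[i] = next_free
            next_free += 1
            shared[i] = False
        # Otherwise the nocow write lands in place and nothing moves.
        if snapshot_every and n % snapshot_every == 0:
            shared = [True] * blocks  # a snapshot pins every current block
    return count_extents(phys)
```

In this model `simulate()` with no snapshots leaves the file as a single extent, while `simulate(snapshot_every=500)` ends up with many, because every snapshot makes the next write to each block relocate it.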
66
67 With careful planning, separate subvolumes for the nocow files so they
68 aren't snapshotted with the rest of the system, snapshotting the nocow
69 subvolume with a period near the low frequency end of your target range
70 (say every other day or weekly instead of daily or twice a day), and if
71 they aren't rotated out regularly, periodic scripted btrfs defrags (say
72 weekly or monthly) of the affected files, good admins generally can keep
73 fragmentation from this source at least within reason.
74
75 And systemd-219 is actually creating a separate subvolume for its journal
76 files now, by default, thus keeping them out of the general system (or
77 /var) snapshot. But while both that and nocowing the journal files now
78 do help, it's still a reasonably fragile solution, as long as admins
79 don't realize what's going on, and can be tempted to set daily or more
80 frequent snapshotting on the journal subvolume too (or if the subvolume
81 doesn't take, say because it's an existing installation where there's
82 already a directory by that name and thus there can't be a subvolume at
83 the same place with the same name).
84
85
86 **BUT A BIG CAVEAT** lest anyone on stable with btrfs and systemd jump
87 onto 219 too fast. Yes, 219 DOES have some nice new features.
88 Unfortunately, it's broken in a few new ways as well.
89
90 * Apparently, systemd-219's networkd breaks with at least static IPv4-
91 only configurations, as my network failed to come up with it. From the
92 errors it was trying IPv6 and because that failed (it's not even in my
93 kernel), it gave up and didn't even try IPv4, instead trying to set the
94 IPv4 IP and gateway values into IPv6, which obviously isn't going to work
95 at all!
96
97 * There are also issues with the new tmpfiles.d configuration that has
98 replaced d lines (create a directory if it doesn't exist) with new v
99 lines (create a subvolume if on btrfs and possible, else fall back to d
100 behavior and create a directory), because subvolume creation fails
101 differently than directory creation, and the differences aren't all
102 sorted, yet.
103
104 Hopefully, systemd-220 will fix the IPv4 issue and bring a bit more
105 maturity to the tmpfiles.d subvolumes-creation feature by properly
106 falling back to d/directories if need be, instead of erroring out.
107 Meanwhile, hopefully a gentoo systemd-219-rX release will fix some of
108 these issues as well. But for right now, I'd suggest staying away from
109 it, as it's definitely not prime-time ready in its current form.
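
For clarity, the fallback being hoped for looks roughly like the Python sketch below. This is not systemd's code (tmpfiles is written in C) and the function names are hypothetical; it only illustrates "try a subvolume, and on any failure quietly create a plain directory instead". The one external command used, btrfs subvolume create, is the normal btrfs-progs way of creating a subvolume.

```python
import os
import subprocess


def btrfs_subvolume_create(path):
    """Create a btrfs subvolume at path; raises if that is not possible."""
    subprocess.run(
        ["btrfs", "subvolume", "create", path],
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def ensure_subvolume_or_dir(path, create_subvolume=btrfs_subvolume_create):
    """Mimic a 'v' line: prefer a subvolume, fall back to 'd' behavior."""
    if os.path.isdir(path):
        return "exists"        # existing directory or subvolume: leave it alone
    try:
        create_subvolume(path)
        return "subvolume"
    except (OSError, subprocess.CalledProcessError):
        os.makedirs(path)      # not btrfs, no btrfs tool, or creation refused
        return "directory"
```

The existing-installation case mentioned earlier is the first branch: when a directory by that name is already there, nothing is converted, so the data stays in an ordinary directory inside whatever subvolume contains it.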
110
111 FWIW, I'm back on 218-r3 for now, done with a quick emerge --usepkgonly
112 '<sys-apps/systemd-219'. I've not yet masked 219, however, so an update will try to
113 bring it back in, and I will thus have to see what changes have happened
114 and either mask it or try building it again, next time I update.
115
116 > What's funny though is that the systemd news file
117 > (http://cgit.freedesktop.org/systemd/systemd/tree/NEWS) occasionally
118 > refers to non-btrfs file systems as "legacy file systems". At least, as
119 > a btrfs user I think it's funny :) .
120
121 Indeed. They've definitely adopted btrfs and are running with it. If
122 you've read anything about their plans, the features of btrfs really do
123 provide a filesystem-side ready-made solution for them to adopt, altho
124 I'd still not call btrfs itself exactly mature -- even more than with
125 other filesystems, if an admin is putting data on btrfs and doesn't have
126 tested backups available, they really do NOT value that data, claims to
127 the contrary notwithstanding.
128
129 And in a way, it's good, because systemd pushing it like that means
130 systemd based distros will be pushing it too, which will bring far wider
131 deployment of btrfs, ready or not, which will in turn help btrfs mature
132 faster with all those additional strange-corner-case bug reports and
133 hopefully fixes. I just feel for the poor admins trusting their distro
134 as they head into this without the backups they really should have... as
135 ultimately, a lot of them are unfortunately going to have to learn the
136 lesson that no backups really DOES mean you'd rather lose that data than
137 bother with backups, the HARD way! =:^(
138
139 --
140 Duncan - List replies preferred. No HTML msgs.
141 "Every nonfree program has a lord, a master --
142 and if you use the program, he is your master." Richard Stallman