Marc Joliet posted on Wed, 25 Feb 2015 19:56:32 +0100 as excerpted:

> But regardless of what you use, I think that the worst offenders are
> services that write logs themselves (I'm looking at you, samba).

>> c) I use btrfs for my primary filesystems, and btrfs and journald's
>> binary-format journals don't play so well together. [...]
>
> Well, I'm on an SSD, but even on the laptop I haven't noticed any
> performance issues (yet). Then again, I use autodefrag, so that
> probably helps.

Autodefrag does help.

There are two related issues at work here.

The primary one is that pretty much any COW-based filesystem, including
btrfs, is going to have problems with internal-rewrite-pattern files (as
opposed to append-only files) of any significant size. At the small end
this includes sqlite database files such as those firefox and other
mozilla products use. These, autodefrag manages well.
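
For those who haven't tried it, autodefrag is simply a btrfs mount
option; a minimal sketch of enabling it (the / mountpoint here is just
an example, adjust to your own layout):

  # Enable autodefrag on an already-mounted btrfs filesystem:
  mount -o remount,autodefrag /

  # Or make it persistent via the options field in /etc/fstab:
  # UUID=<your-fs-uuid>  /  btrfs  defaults,autodefrag  0 0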
22 |
|
23 |
At the larger end are multi-gig VM images and similarly sized database |
24 |
files. These, autodefrag doesn't manage so well, particularly if writes |
25 |
are coming in at any significant rate, because at some point it's going |
26 |
to take longer to rewrite the entire file (or even the affected normally |
27 |
one-gig data chunk) than the time between incoming writes. |
28 |
|
29 |
And the place where such fragmentation REALLY shows up is trying to run |
30 |
btrfs filesystem maintenance commands like balance. On a sufficiently |
31 |
fragmented filesystemsystem, particularly with quotas on too as their |
32 |
tracking significantly complicates things, balance can take WEEKS on a |
33 |
single-digits terabyte filesystem. |
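
As a point of reference, balance can at least be told to limit how much
it rewrites per run via filters; a sketch, with the / mountpoint again
just an example:

  # Rebalance only data chunks under 50% full, limiting how much
  # data gets rewritten in a single pass:
  btrfs balance start -dusage=50 /

  # Check on a running balance from another terminal:
  btrfs balance status /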

IOW, a lot of people don't notice it until something goes wrong and
they're trying to replace a failed device with one of the btrfs raid
modes, etc. That's a nasty time to find out how tangled things were,
and realize it'll take weeks to sort out, during which another device
could well fail, leaving you high and dry!

The immediate (partial) solution to the problem with these large files,
typically over a gig, is to set them nocow (which on btrfs must be done
at creation time, while the file is still zero-sized, in order to take
proper effect; this is normally accomplished by setting the directory
they'll be in to nocow, which doesn't affect the directory itself, but
does cause any newly created files or subdirs in it to inherit the
nocow attribute).
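
In practice that's chattr's C attribute; a minimal sketch, with the
directory path purely illustrative:

  # Mark a directory nocow BEFORE the big files are created in it;
  # new files and subdirs inside will inherit the attribute:
  mkdir -p /var/lib/vm-images
  chattr +C /var/lib/vm-images

  # Verify: the C attribute should appear in the flags column
  lsattr -d /var/lib/vm-images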

And this is actually what systemd-219 is doing with the journal files
now.

But setting nocow automatically disables both transparent compression
(if otherwise enabled) and checksumming. The latter isn't actually as
bad as one might expect, because most applications (including systemd/
journald) that deal with such files already have some sort of builtin
corruption detection and possible repair functionality -- they have to,
in order to work acceptably on traditional filesystems that didn't do
filesystem-level checksumming, and letting them have at it would indeed
seem to be the best policy in this case.
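
Journald's builtin integrity checking can in fact be exercised
directly:

  # Walk all journal files and report any corruption found:
  journalctl --verify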

The second, related problem is snapshotting. Because snapshotting
relies on COW, snapshotting a nocow file forces it to effectively cow-1
-- the first time a block is rewritten after a snapshot, it is cowed,
despite the ordinary nocow. Now set up, say, hourly auto-snapshotting
using snapper or the like, and continue to write to that "nocow" file,
and pretty soon it'll be as fragmented as if it weren't nocow at all!
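
You can actually watch this happen; a sketch, with all paths
illustrative:

  # Snapshot the subvolume holding a nocow file:
  btrfs subvolume snapshot /var/lib/vm-images /snapshots/vm-images.1

  # Keep writing to the file, then count its extents; each block
  # rewritten after the snapshot gets cowed once, so the extent
  # count climbs with every snapshot/write cycle:
  filefrag /var/lib/vm-images/disk.img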
66 |
|
67 |
With careful planning, separate subvolumes for the nocow files so they |
68 |
aren't snapshotted with the rest of the system, snapshotting the nocow |
69 |
subvolume with a period near the low frequency end of your target range |
70 |
(say every other day or weekly instead of daily or twice a day), and if |
71 |
they aren't rotated out regularly, periodic scripted btrfs defrags (say |
72 |
weekly or monthly) of the affected files, good admins generally can keep |
73 |
fragmentation from this source at least within reason. |
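
A minimal sketch of such a scripted defrag, run weekly or monthly from
cron (the path is again just an example):

  #!/bin/sh
  # Defragment the nocow files in place. Note that defrag is not
  # snapshot-aware on current kernels, so it breaks sharing with
  # existing snapshots; run it when snapshot churn is low.
  btrfs filesystem defragment -r /var/lib/vm-images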
74 |
|
75 |
And systemd-219 is actually creating a separate subvolume for its journal |
76 |
files now, by default, thus keeping them out of the general system (or |
77 |
/var) snapshot. But while both that and nocowing the journal files now |
78 |
does help, it's still a reasonably fragile solution, as long as admins |
79 |
don't realize what's going on, and can be tempted to set daily or more |
80 |
frequent snapshotting on the journal subvolume too (or if the subvolume |
81 |
doesn't take, say because it's an existing installation where there's |
82 |
already a directory by that name and thus there can't be a subvolume at |
83 |
the same place with the same name). |
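
On such an existing installation the conversion can be done by hand; a
sketch, assuming the standard journal location:

  # Replace the existing journal directory with a subvolume
  # (journald is socket-activated, so do this from single-user
  # mode or right before a reboot to keep the files quiescent):
  mv /var/log/journal /var/log/journal.old
  btrfs subvolume create /var/log/journal
  cp -a /var/log/journal.old/. /var/log/journal/
  rm -rf /var/log/journal.old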


**BUT A BIG CAVEAT** lest anyone on stable with btrfs and systemd jump
onto 219 too fast. Yes, 219 DOES have some nice new features.
Unfortunately, it's broken in a few new ways as well.

* Apparently, systemd-219's networkd breaks with at least static IPv4-
only configurations, as my network failed to come up with it. From the
errors, it was trying IPv6, and because that failed (it's not even in
my kernel), it gave up and didn't even try IPv4, instead trying to set
the IPv4 IP and gateway values into IPv6, which obviously isn't going
to work at all!
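
For reference, the sort of static IPv4-only networkd configuration
that's affected looks roughly like this (interface name and addresses
are examples):

  # /etc/systemd/network/50-static.network
  [Match]
  Name=eth0

  [Network]
  Address=192.168.1.10/24
  Gateway=192.168.1.1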

* There are also issues with the new tmpfiles.d configuration that has
replaced d lines (create a directory if it doesn't exist) with new v
lines (create a subvolume if on btrfs and possible, else fall back to d
behavior and create a directory), because subvolume creation fails
differently than directory creation, and the differences aren't all
sorted yet.
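
The difference is a single type character in the tmpfiles.d line; the
journal directory entry, for example, went from roughly:

  # old: plain directory
  d /var/log/journal 2755 root systemd-journal - -

  # new: btrfs subvolume if possible, directory otherwise
  v /var/log/journal 2755 root systemd-journal - -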

Hopefully, systemd-220 will fix the IPv4 issue and bring a bit more
maturity to the tmpfiles.d subvolume-creation feature by properly
falling back to d/directories if need be, instead of erroring out.
Meanwhile, hopefully a gentoo systemd-219-rX release will fix some of
these issues as well. But for right now, I'd suggest staying away from
it, as it's definitely not prime-time ready in its current form.

FWIW, I'm back on 218-r3 for now, done with a quick emerge --usepkgonly
'<systemd-219'. I've not yet masked 219, however, so an update will try
to bring it back in, and I will thus have to see what changes have
happened and either mask it or try building it again, next time I
update.
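
For anyone wanting to hold it back meanwhile, the mask itself is a
one-liner:

  # /etc/portage/package.mask/systemd
  >=sys-apps/systemd-219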

> What's funny though is that the systemd news file
> (http://cgit.freedesktop.org/systemd/systemd/tree/NEWS) occasionally
> refers to non-btrfs file systems as "legacy file systems". At least,
> as a btrfs user I think it's funny :) .

Indeed. They've definitely adopted btrfs and are running with it. If
you've read anything about their plans, the features of btrfs really do
provide a filesystem-side ready-made solution for them to adopt, altho
I'd still not call btrfs itself exactly mature -- even more than with
other filesystems, if an admin is putting data on btrfs and doesn't
have tested backups available, they really do NOT value that data,
claims to the contrary notwithstanding.

And in a way, it's good, because systemd pushing it like that means
systemd-based distros will be pushing it too, which will bring far
wider deployment of btrfs, ready or not, which will in turn help btrfs
mature faster with all those additional strange-corner-case bug reports
and hopefully fixes. I just feel for the poor admins trusting their
distro as they head into this without the backups they really should
have... as ultimately, a lot of them are unfortunately going to have to
learn, the HARD way, that no backups really DOES mean you'd rather lose
that data than bother with backups! =:^(

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman