Re: [gentoo-dev] Re: rfc: Does OpenRC really need mount-ro - gentoo-dev

From:	Richard Yao <ryao@g.o>
To:	gentoo-dev@l.g.o
Subject:	Re: [gentoo-dev] Re: rfc: Does OpenRC really need mount-ro
Date:	Wed, 17 Feb 2016 14:24:58
Message-Id:	`4800E3E6-4D70-4DF8-8F40-705C6B77882B@gentoo.org`
In Reply to:	[gentoo-dev] Re: rfc: Does OpenRC really need mount-ro by Duncan <1i5t5.duncan@cox.net>

> On Feb 16, 2016, at 9:20 PM, Duncan <1i5t5.duncan@×××.net> wrote:
>
> William Hubbs posted on Tue, 16 Feb 2016 12:41:29 -0600 as excerpted:
>
>> What I'm trying to figure out is, what to do about re-mounting file
>> systems read-only.
>>
>> How does systemd do this? I didn't find an equivalent of the mount-ro
>> service there.
>
> For quite some time now, systemd has actually had a mechanism whereby the
> main systemd process reexecs (with a pivot-root) the initr* systemd and
> returns control to it during the shutdown process, thereby allowing a
> more controlled shutdown than traditional init systems because the final
> stages are actually running from the virtual-filesystem of the initr*,
> such that after everything running on the main root is shutdown, the main
> root itself can actually be unmounted, not just mounted read-only,
> because there is literally nothing running on it any longer.
>
> There's still a fallback to read-only mounting if an initr* isn't used or
> if reinvoking the initr* version fails for some reason, but with an
> initr*, when everything's working properly, while there are still some
> bits of userspace running, they're no longer actually running off of the
> main root, so main root can actually be unmounted much like any other
> filesystem.

Systemd installs that go back into the initramfs at shutdown are rare, because the initramfs has to use a hook to tell systemd to re-exec it, and very few configurations do that. Even fewer of those that do it actually need it.

The biggest user of that mechanism of which I am aware is ZFS on EL/Fedora when booted with Dracut. It does not need it; it was only implemented because someone who did not understand how ZFS was designed to integrate with the boot and startup processes thought it was a good idea.

As it turns out, that behavior actually breaks the mechanism intended to make multipath sane: it marks the pool in a way that tells every system with access to the disks that a pool that will be used on the next boot is not going to be used by anyone. If one of them imports it and the system then boots, the pool can be damaged beyond repair.

Thankfully, no one seems to boot EL/Fedora systems off ZFS pools in multipath environments. The code to hook into this special behavior will be removed in the future, but none of the developers' employers care about it, and the almost negligible possibility that the mechanism would save someone from data loss has made it too low a priority for any of us to spend our free time on it.

> The process is explained a bit better in the copious blogposted systemd
> documentation.  Let's see if I can find a link...
>
> OK, this isn't where I originally read about it, which IIRC was aimed
> more at admins, while this is aimed at initr* devs, but that's probably a
> good thing as it includes more specific detail...
>
> https://www.freedesktop.org/wiki/Software/systemd/InitrdInterface/
>
> And here's some more, this time in the storage daemon controlled root and
> initr* context...
>
> https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
>
>
> But... all that doesn't answer the original question directly, does it?
> Where there's no return to initr*, how /does/ systemd handle read-only
> mounting?
>
> First, the nice ascii-diagram flow charts in the bootup (7) manpage may
> be useful, in particular here, the shutdown diagram (tho IDK if you can
> find such things useful or not??).
>
> https://www.freedesktop.org/software/systemd/man/bootup.html
>
> Here's the shutdown diagram described in words:
>
> Initial shutdown is via two targets (as opposed to specific services),
> shutdown.target, which conflicts with all (normal) system services
> thereby shutting them down, and umount.target, which conflicts with file
> mounts, swaps, cryptsetup device, etc.  Here, we're obviously interested
> in umount.target.  Then after those two targets are reached, various low
> level services are run or stopped, in order to reach final.target.
> After final.target, the appropriate systemd-(reboot|poweroff|halt|kexec)
> service is run, to hit the ultimate (reboot|poweroff|halt|kexec).target,
> which of course is never actually evaluated, since the service actually
> does the intended action.
>
> The primary takeaway is that you might not be finding a specific systemd
> remount-ro service, because it might be a target, defined in terms of
> conflicts with mount units, etc, rather than a specific service.
>
> Neither shutdown.target nor umount.target have any wants or requires by
> default, but the various normal services and mount units conflict with
> them, either via default or specifically, so are shut down before the
> target can be reached.
>
> final.target has the After=shutdown.target umount.target setting, so
> won't be reached until they are reached.
>
> The respective (reboot|poweroff|halt|kexec).target units Requires= and
> After= their respective systemd-*.service units, and reboot and poweroff
> (but not halt and kexec) have 30-minute timeouts after which they run
> reboot-force or poweroff-force, respectively.
>
> The respective systemd-(reboot|poweroff|halt|kexec).service units
> Requires= and After= shutdown.target, umount.target and final.target, all
> three, so won't be run until those complete.  They simply
> ExecStart=/usr/bin/systemctl --force their respective actions.
>
> And here's what the systemd.special (7) manpage says about umount.target:
>
>  umount.target
>    A special target unit that umounts all mount and automount points
>    on system shutdown.
>
>    Mounts that shall be unmounted on system shutdown shall add
>    Conflicts dependencies to this unit for their mount unit,
>    which is implicitly done when DefaultDependencies=yes is set
>    (the default).
>
> But that /still/ doesn't reveal what actually does the remount-ro, as
> opposed to umount.  I don't see that either, at the unit level, nor do I
> see anything related to it in for instance my auto-generated from fstab
> /run/systemd/generators/-.mount file or in the systemd-fstab-generator
> (8) manpage.
>
> Thus I must conclude that it's actually resolved in the mount-unit
> conflicts handling in systemd's source code, itself.
>
> And indeed... in systemd's tarball, we see in src/core/umount.c, in
> mount_points_list_umount...
>
> That the function actually remounts /everything/ (well, everything not in
> a container) read-only, before actually trying to umount them.  Indention
> restandardized on two-space here, to avoid unnecessary wrapping as
> posted.  This is from systemd-228:
>
> static int mount_points_list_umount(MountPoint **head, bool *changed, bool log_error) {
>   MountPoint *m, *n;
>   int n_failed = 0;
>
>   assert(head);
>
>   LIST_FOREACH_SAFE(mount_point, m, n, *head) {
>
>     /* If we are in a container, don't attempt to
>        read-only mount anything as that brings no real
>        benefits, but might confuse the host, as we remount
>        the superblock here, not the bind mound. */
>     if (detect_container() <= 0)  {
>       _cleanup_free_ char *options = NULL;
>       /* MS_REMOUNT requires that the data parameter
>        * should be the same from the original mount
>        * except for the desired changes. Since we want
>        * to remount read-only, we should filter out
>        * rw (and ro too, because it confuses the kernel) */
>       (void) fstab_filter_options(m->options, "rw\0ro\0", NULL, NULL, &options);
>
>       /* We always try to remount directories read-only
>        * first, before we go on and umount them.
>        *
>        * Mount points can be stacked. If a mount
>        * point is stacked below / or /usr, we
>        * cannot umount or remount it directly,
>        * since there is no way to refer to the
>        * underlying mount. There's nothing we can do
>        * about it for the general case, but we can
>        * do something about it if it is aliased
>        * somehwere else via a bind mount. If we
>        * explicitly remount the super block of that
>        * alias read-only we hence should be
>        * relatively safe regarding keeping the fs we
>        * can otherwise not see dirty. */
>       log_info("Remounting '%s' read-only with options '%s'.", m->path, options);
>       (void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
>     }
>
>     /* Skip / and /usr since we cannot unmount that
>      * anyway, since we are running from it. They have
>      * already been remounted ro. */
>     if (path_equal(m->path, "/")
> #ifndef HAVE_SPLIT_USR
>       || path_equal(m->path, "/usr")
> #endif
>     )
>       continue;
>
>     /* Trying to umount. We don't force here since we rely
>      * on busy NFS and FUSE file systems to return EBUSY
>      * until we closed everything on top of them. */
>     log_info("Unmounting %s.", m->path);
>     if (umount2(m->path, 0) == 0) {
>       if (changed)
>         *changed = true;
>
>       mount_point_free(head, m);
>     } else if (log_error) {
>       log_warning_errno(errno, "Could not unmount %s: %m", m->path);
>       n_failed++;
>     }
>   }
>
>   return n_failed;
> }
>
>
> So the short answer ultimately is... Systemd has a single umount
> function, which first does remount-ro, so it's actually remounting
> (nearly) everything read-only, then tries umount.
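
For illustration, the remount-then-unmount order described above can be sketched in a few lines of standalone C. This is only a minimal sketch, not systemd's code: strip_rw_ro() and remount_ro_then_umount() are hypothetical names, and the option filter is a simplified stand-in for fstab_filter_options() that only handles plain comma-separated option strings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper: return a newly allocated copy of a comma-separated
 * mount option string with the "rw" and "ro" entries dropped.
 * The caller frees the result. Returns NULL on allocation failure. */
char *strip_rw_ro(const char *options) {
    assert(options);

    char *copy = strdup(options);
    char *out = malloc(strlen(options) + 1);
    char *save = NULL;

    if (!copy || !out) {
        free(copy);
        free(out);
        return NULL;
    }

    out[0] = '\0';
    for (char *tok = strtok_r(copy, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save)) {
        if (strcmp(tok, "rw") == 0 || strcmp(tok, "ro") == 0)
            continue;
        if (out[0] != '\0')
            strcat(out, ",");
        strcat(out, tok);
    }

    free(copy);
    return out;
}

/* Hypothetical helper: try to remount path read-only, then attempt a
 * non-forced unmount. Returns the result of umount2(): 0 on success,
 * -1 with errno set otherwise. */
int remount_ro_then_umount(const char *path, const char *options) {
    char *filtered = strip_rw_ro(options);
    if (!filtered)
        return -1;

    /* The remount result is deliberately ignored, as in the quoted code. */
    (void) mount(NULL, path, NULL, MS_REMOUNT | MS_RDONLY, filtered);
    free(filtered);

    return umount2(path, 0);
}
```

Like the quoted function, it ignores a failed remount and does not force the unmount, so a busy filesystem still reports an error to the caller.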

> Meanwhile, (semi-)answering the elsewhere implied question of why only
> Linux needs the mount-ro service...  I'm no BSD expert, but in my
> wanderings I came across a remark that they didn't need it, because their
> kernel reboot/halt/poweroff routines have a built-in kernelspace
> sync-and-remount-ro routine for anything that can't be unmounted, which
> Linux lacks.  They obviously consider this a Linux deficiency, but while I've
> not come across the Linux reason for not doing it, an educated guess is
> that it's considered putting policy into the kernel, and that's
> considered a no-no, policy is userspace; the kernel simply enforces it as
> directed (which is why kernel 2.4's devfs was removed for 2.6, to be
> replaced with the userspace-based udev).  Additionally, not
> kernel-forcing the remount-ro bit does give developers a way to test results
> of an uncontrolled shutdown, say on a specific testing filesystem only,
> without exposing the rest of the system, which can still be shut down
> normally, to it.
>
> So on Linux userspace must do the final umounts and force-read-onlys,
> because unlike the BSDs, the Linux kernel doesn't have builtin routines
> that automatically force it, regardless of userspace.
>
> But as others have said, on Linux the remount-ro is _definitely_
> required, and "bad things _will_ happen" if it's not done.  (Just how bad
> depends on the filesystem and its mount options, and hardware, among
> other things.)
>
>
> Finally, one more thing to mention.  On systems with magic-srq in the
> kernel...
>
> echo 0x30 > /proc/sys/kernel/sysrq
>
> ... enables the sync (0x10) and remount-readonly (0x20) functions.  (Of
> course only do this at shutdown/reboot, as you don't want to disturb the
> user's configured srq defaults in normal runtime.)
>
> You can then force emergency sync (s) and remount-read-only (u) with...
>
> echo s > /proc/sysrq-trigger
> echo u > /proc/sysrq-trigger
>
> As that's kernel emergency priority, it should force-sync and force
> everything readonly (and quiesce mid-layer block devices such as md
> and dm), even if it would normally refuse to do so due to files open for
> writing.  You might consider something like that as a fallback, if normal
> mount-readonly fails.  Of course it won't work if magic-srq functionality
> isn't built into the kernel, but then you're no worse off than before,
> and are far better off on kernels where it's supported, so it's certainly
> worth considering. =:^)
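
A fallback along those lines might look roughly like the following in C. It is a sketch under assumptions, not code from OpenRC or systemd: write_string() and sysrq_sync_and_remount_ro() are hypothetical names, the two paths are passed in so the caller decides which files are touched, and it assumes a kernel built with magic SysRq support. It needs root to work on the real /proc files, and it does not restore the previous sysrq mask, so it only makes sense at the very end of shutdown.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a short string to an existing file with a
 * single write(2) call. Returns 0 on success, -1 on failure. */
int write_string(const char *path, const char *value) {
    size_t len = strlen(value);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, value, len);
    close(fd);

    return written == (ssize_t) len ? 0 : -1;
}

/* Hypothetical helper: enable the sync (0x10) and remount-read-only (0x20)
 * SysRq functions via mask_path, then trigger 's' and 'u' via trigger_path.
 * Returns 0 if all three writes succeeded, -1 otherwise. */
int sysrq_sync_and_remount_ro(const char *mask_path, const char *trigger_path) {
    assert(mask_path && trigger_path);

    if (write_string(mask_path, "0x30") < 0)
        return -1;
    if (write_string(trigger_path, "s") < 0)   /* emergency sync */
        return -1;
    return write_string(trigger_path, "u");    /* emergency remount read-only */
}
```

At shutdown this would be called as sysrq_sync_and_remount_ro("/proc/sys/kernel/sysrq", "/proc/sysrq-trigger"), and only after the normal remount attempt has failed.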

> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
