On 5/18/19 5:49 PM, Rich Freeman wrote:
> I'd be interested if there are other scripts people have put out
> there, but I agree that most of the container solutions on Linux
> are overly-complex.

Here's what I use for some networking, which probably qualifies as
extremely lightweight "containers".

Prerequisite: Create a place for the namespaces to anchor:

# Create the directories to contain the *NS mount points.
sudo mkdir -p /run/{mount,net,uts}ns

You can use any path that you want. — I do a lot with iproute2's
network namespaces (which is where this evolved from), which use
/run/netns/$NetNSname. So I used that as a pattern for the other types
of namespaces. Adjust as you want. — What I'm doing is interoperable
with iproute2's netns command.

Per "container": Create the "container's" mount points:

# Create the *NS mount points.
sudo touch /run/{mount,net,uts}ns/$ContainerName

Start the actual namespaces:

# Spawn the namespaces for $ContainerName.
unshare --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName /bin/true

Note: The namespaces don't die when true exits because they are
associated with a mount point.
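
The netns anchored this way shows up in iproute2's tooling, and
teardown is just detaching the bind mounts. Something like this should
work (an untested sketch; the namespaces go away once the last
reference to them is gone):

# Verify iproute2 sees the netns.
ip netns list

# Tear down: detach the *NS bind mounts and remove the anchor files.
sudo umount /run/{mount,net,uts}ns/$ContainerName
sudo rm /run/{mount,net,uts}ns/$ContainerName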

Tweak the namespaces:

# Set the $ContainerName UTS namespace's hostname.
nsenter --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName \
    /bin/hostname $ContainerName

I reuse this command, calling different binaries, any time I want to do
something in the "container". Calling /bin/bash (et al.) enters the
container.

I've created a wrapper script (nsenter.wrapper) that passes the proper
parameters to nsenter. I've then symlinked the container name to the
nsenter.wrapper script. This means that I can run "$ContainerName
$Command" or simply enter the container with $ContainerName. (The
script checks the number of parameters and assumes /bin/bash if no
command is specified.)
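
A minimal sketch of such a wrapper (a hypothetical reconstruction, not
the actual script):

#!/bin/sh
# nsenter.wrapper -- enter the namespaces named after the invoking link.
# Symlink a container name to this script; $0 then selects the container.
ContainerName=$(basename "$0")
# Assume an interactive shell if no command was given.
[ $# -eq 0 ] && set -- /bin/bash
exec nsenter --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName "$@"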

I think it's ultimately extremely trivial to have a "container" (a
glorified collection of namespaces) to do things I want with virtually
zero disk space. Ok, ok, maybe 1 or 2 kB for the script & links.

Note: Since I'm using the mount namespace, I can have a completely
different mount tree inside the "container" than I have outside the
container / on the host. I'm not currently doing that, but it's
possible to change things as desired.

> I personally use nspawn, which is actually pretty minimal, but it
> depends on systemd, which I'm sure many would argue is overly complex.
> :) However, if you are running systemd you can basically do a
> one-liner that requires zero setup to turn a chroot into a container.

As much as I might not like systemd, if you have it, and it reliably
does what you want, then I see no reason to /not/ use it. Just
acknowledge it as a dependency of your solution, which you have done.
So I think we're cool.

> On to the original questions about mounts:
>
> In general you can mount stuff in containers without issue. There are
> two ways to go about it. One is to mount something on the host and
> bind-mount it into the container, typically at launch time. The other
> is to give the container the necessary capabilities so that it can
> do its own mounting (typically containers are not given the necessary
> capabilities, so mounting will fail even as root inside the container).

Given that one of the uses of containers is security isolation (such as
it is), I feel like giving the container the ability to mount things is
less than a stellar idea. But to each his / her own.

> I believe the reason the wiki says to be careful with mounts has more
> to do with UID/GID mapping. As you are using nfs this is already an
> issue you're probably dealing with. You're probably aware that running
> nfs with multiple hosts with unsynchronized passwd/group files can
> be tricky, because linux (and unix in general) works with UIDs/GIDs,
> and not really directly with names,

That's true for NFS v1-3. But NFS v4 changes that. NFS v4 actually
uses user names & group names and has a daemon that runs on the client &
server to translate things as necessary.
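
(That's the rpc.idmapd / nfsidmap machinery. Both ends have to agree on
the idmapping domain; a minimal sketch, with a hypothetical domain:)

# /etc/idmapd.conf on both the client & the server:
[General]
Domain = example.com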

> so if you're doing something with one UID on one host and with a
> different UID on another host you might get unexpected permissions
> behavior.

Yep. You need to do /something/ to account for this. Be it manually
managing UIDs & GIDs across things, or using something like NFSv4's
translation mechanism.

> In a nutshell the same thing can happen with containers, or for
> that matter with chroots.

I mostly agree. However, user namespaces can nullify this.

I've not dabbled with user namespaces yet, but my understanding is that
processes can have completely different UIDs & GIDs inside the user
namespace than outside of it. For example, UID 0 / GID 0 inside a user
namespace can be mapped to UID 12345 / GID 23456 outside of the user
namespace. Refer to the nsenter / unshare man pages for more details.
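
(A quick untested sketch of that, using unshare's convenience option;
mappings other than your own UID/GID go in /proc/$pid/uid_map and
gid_map:)

# Become "root" in a new user namespace, mapped to the invoking user.
unshare --user --map-root-user /bin/bash
# Inside, id reports uid=0(root); files created here are still owned by
# the real (outer) UID when viewed from outside the namespace.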

> If you have identical passwd/group files it should be a non-issue.

Point of order: The files don't need to be identical. The UIDs & GIDs
need to be managed if you aren't using something like user namespaces.
So it's perfectly valid to have a text file that is used to coordinate
UIDs & GIDs somewhere and then use those in passwd/shadow &
group/gshadow files.

> However, if you want to do mapping with unprivileged containers
> you have to be careful with mounts as they might not get translated
> properly. Using completely different UIDs in a container is their
> suggested solution, which is fine as long as the actual container
> filesystem isn't shared with anything else.

I conceptually agree. However I think mount namespaces combined with
user namespaces muddy the water. Again, refer to the nsenter / unshare
man pages and what they refer to.

nsenter has an option for sharing something between mount namespaces. I
have no idea what it does, much less how it does it. I suspect that the
kernel mounts it once (maybe not visible from anywhere else) and then
bind-mounts it to multiple locations for visibility / access.

> That tends to be the case anyway when you're using container
> implementations that do a lot of fancy image management. If you're
> doing something very minimal and just using a path/chroot on the host
> as your container then you need to be mindful of your UIDs/GIDs if
> you go accessing anything from the host directly.

UID & GID management is important. /Something/ should be doing it.

> The other thing I'd be careful with is mounting physical devices in
> more than one place. Since you're actually sharing a kernel I suspect
> linux will "do the right thing" if you mount an ext4 on /dev/sda2 on
> two different containers, but I've never tried it (and again doing
> that requires giving containers access to even see sda2 because they
> probably won't see physical devices by default).

Seeing as how the containers are running under the same kernel, there is
no actual need for the file system to be mounted multiple times.
Instead the kernel would mount it and present it, much like a bind
mount, to multiple containers for access.

Think along the lines of opening and working with a file system as a
separate process from where it's presented for access. Conceptually not
that dissimilar to a hard link that has multiple representations of a
file in multiple locations on the same file system. (It's not a perfect
analogy, but I hope that makes sense.)
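
(In bind-mount terms, a sketch of one real mount presented in two
places; the device & paths are hypothetical:)

# One real ext4 mount on the host...
sudo mount /dev/sda2 /mnt/data
# ...presented in two containers' mount trees via bind mounts.
sudo mount --bind /mnt/data /srv/containerA/data
sudo mount --bind /mnt/data /srv/containerB/data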

> In a VM environment you definitely can't do this, because the VMs
> are completely isolated at the kernel level and having two different
> kernels having dirty buffers on the same physical device is going
> to kill any filesystem that isn't designed to be clustered.

Technically, you can usually get away with doing this, but the mounts
need to be read-only. Even then, I STRONGLY suggest that you NOT do
this with a non-cluster-aware file system.

I have colleagues that supported systems RO mounting an Ext file system
this way. It worked okay when it was used as a RO library. The problem
was when they made changes via the one with RW access. They needed to
unmount and remount all the RO clients to see the updates. It was not
graceful and we advised that they stop doing that. But it did work for
their needs. They used it akin to a big (~TB) CD-ROM.

> In a container environment the two containers aren't really isolated
> at the actual physical filesystem level since they share the kernel,

I think mount namespaces muddy this water. Yes, it's the same kernel,
but the containers don't necessarily have the same file systems exposed
to them.
185 |
|
186 |
> so I think you'd be fine but I'd really want to test or do some |
187 |
> research before relying on it. |

Yes, test.

But make sure you have a vague understanding of what's actually
happening behind the scenes. I find that tremendously helpful in
knowing what can and can't be done, as well as why.

> In any case, the more typical solution is to just mount everything on
> the host and then bind-mount it into the container. So, you could
> mount the nfs in /mnt and then bind-mount that into your container.
> There is really no performance hit and it should work fine without
> giving the container a bunch of capabilities.

I think there /is/ a performance hit. It's just so /minimal/ that it's
effectively non-existent. Every additional line of code in the path
that must be traversed does take CPU cycles.
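
(For illustration, the host-mount-plus-bind approach could look
something like this with nspawn; the export, mount point, and rootfs
paths are all hypothetical:)

# Mount the NFS export once, on the host...
sudo mount -t nfs4 server:/export /mnt/share
# ...then bind it into the container at launch, no extra capabilities needed.
sudo systemd-nspawn -D /var/lib/machines/mybox --bind=/mnt/share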