Gentoo Archives: gentoo-user

From: Grant Taylor <gtaylor@×××××××××××××××××××××.net>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Can I use containers?
Date: Sun, 19 May 2019 04:39:16
Message-Id: b5c64e90-495f-7123-d2d2-0caf6262e0a6@spamtrap.tnetconsulting.net
In Reply to: Re: [gentoo-user] Can I use containers? by Rich Freeman
On 5/18/19 5:49 PM, Rich Freeman wrote:
> I'd be interested if there are other scripts people have put out
> there, but I agree that most of the container solutions on Linux
> are overly-complex.

Here's what I use for some networking, which probably qualifies as
extremely lightweight "containers".
Prerequisite: Create a place for the namespaces to anchor:

# Create the directories to contain the *NS mount points.
sudo mkdir -p /run/{mount,net,uts}ns

You can use any path that you want. I do a lot with iproute2's
network namespaces (which is where this evolved from), which use
/run/netns/$NetNSname. So I used that as a pattern for the other types
of namespaces. Adjust as you want. What I'm doing is interoperable
with iproute2's netns command.

Per "container": Create the "container's" mount points:

# Create the *NS mount points.
sudo touch /run/{mount,net,uts}ns/$ContainerName

Start the actual namespaces:

# Spawn the container's namespaces.
unshare --mount=/run/mountns/$ContainerName \
   --net=/run/netns/$ContainerName \
   --uts=/run/utsns/$ContainerName /bin/true

Note: The namespaces don't die when true exits because they are
bind-mounted onto the mount points created above.

Tweak the namespaces:

# Set the container's hostname (in its UTS namespace).
nsenter --mount=/run/mountns/$ContainerName \
   --net=/run/netns/$ContainerName \
   --uts=/run/utsns/$ContainerName \
   /bin/hostname $ContainerName

I reuse this command, calling different binaries, any time I want to do
something in the "container". Calling /bin/bash (et al.) enters the
container.

I've created a wrapper script (nsenter.wrapper) that passes the proper
parameters to nsenter. I've then symlinked the container name to the
nsenter.wrapper script. This means that I can run "$ContainerName
$Command" or simply enter the container with $ContainerName. (The
script checks the number of parameters and assumes /bin/bash if no
command is specified.)
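
A minimal sketch of such a wrapper (assuming the symlink name doubles
as the container name, per the above) would be something like:

#!/bin/bash
# nsenter.wrapper -- enter the namespaces named after the symlink used
# to invoke this script.
ContainerName=$(basename "$0")

# Assume an interactive shell if no command is specified.
[ $# -eq 0 ] && set -- /bin/bash

exec nsenter --mount=/run/mountns/$ContainerName \
   --net=/run/netns/$ContainerName \
   --uts=/run/utsns/$ContainerName "$@"

Symlink it once per container, e.g. "ln -s nsenter.wrapper lab1", and
then "lab1 ip addr show" or just "lab1" does what you'd expect.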

I think it's ultimately extremely trivial to have a "container" (a
glorified collection of namespaces) do the things I want with virtually
zero disk space. Ok, ok, maybe 1 or 2 kB for the script & links.

Note: Since I'm using the mount namespace, I can have a completely
different mount tree inside the "container" than I have outside the
container / on the host. I'm not currently doing that, but it's
possible to change things as desired.

> I personally use nspawn, which is actually pretty minimal, but it
> depends on systemd, which I'm sure many would argue is overly complex.
> :) However, if you are running systemd you can basically do a
> one-liner that requires zero setup to turn a chroot into a container.

As much as I might not like systemd, if you have it, and it reliably
does what you want, then I see no reason to /not/ use it. Just
acknowledge it as a dependency of your solution, which you have done.
So I think we're cool.
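
For anyone following along, my understanding is that the one-liner is
roughly (assuming a chroot already exists at /path/to/chroot):

# Turn an existing chroot into a container; add -b to boot its init.
sudo systemd-nspawn -D /path/to/chroot

I'm not the nspawn user here, so treat that as a sketch, not gospel.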

> On to the original questions about mounts:
>
> In general you can mount stuff in containers without issue. There are
> two ways to go about it. One is to mount something on the host and
> bind-mount it into the container, typically at launch time. The other
> is to give the container the necessary capabilities so that it can
> do its own mounting (typically containers are not given the necessary
> capabilities, so mounting will fail even as root inside the container).

Given that one of the uses of containers is security isolation (such as
it is), I feel like giving the container the ability to mount things is
less than a stellar idea. But to each his / her own.

> I believe the reason the wiki says to be careful with mounts has more
> to do with UID/GID mapping. As you are using nfs this is already an
> issue you're probably dealing with. You're probably aware that running
> nfs with multiple hosts with unsynchronized passwd/group files can
> be tricky, because linux (and unix in general) works with UIDs/GIDs,
> and not really directly with names,

That's true for NFS v2 and v3. But NFS v4 changes that. NFS v4
actually uses user names & group names and has a daemon that runs on
the client & server to translate things as necessary.
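
The daemon in question is rpc.idmapd / nfsidmap, and it's configured
via /etc/idmapd.conf. As a sketch, the important part is that the NFSv4
domain matches on both the client and the server:

# /etc/idmapd.conf (on both ends; example.com is a placeholder)
[General]
Domain = example.com

If the domains don't match, names tend to map to nobody / nogroup.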

> so if you're doing something with one UID on one host and with a
> different UID on another host you might get unexpected permissions
> behavior.

Yep. You need to do /something/ to account for this. Be it manually
managing UIDs & GIDs across systems, or using something like NFSv4's
ID-mapping mechanism.

> In a nutshell the same thing can happen with containers, or for
> that matter with chroots.

I mostly agree. However, user namespaces can nullify this.

I've not dabbled with user namespaces yet, but my understanding is that
they can have completely different UIDs & GIDs inside the user namespace
than outside of it. It's my understanding that UID 0 / GID 0 inside a
user namespace can be mapped to UID 12345 / GID 23456 outside of the
user namespace. Refer to the nsenter / unshare man pages for more
details.
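
As a rough sketch of what that mapping looks like with unshare (run as
root, using the made-up IDs from above):

# Park a process in a new user namespace.
unshare --user /bin/sleep 600 &

# Map UID 0 / GID 0 inside it to UID 12345 / GID 23456 outside of it.
echo '0 12345 1' > /proc/$!/uid_map
echo '0 23456 1' > /proc/$!/gid_map

Anything running as root inside that namespace then shows up as UID
12345 / GID 23456 from the host's point of view.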

> If you have identical passwd/group files it should be a non-issue.

Point of order: The files don't need to be identical. The UIDs & GIDs
need to be managed if you aren't using something like user namespaces.
So it's perfectly valid to have a text file that is used to coordinate
UIDs & GIDs somewhere and then use those in passwd/shadow &
group/gshadow files.

> However, if you want to do mapping with unprivileged containers
> you have to be careful with mounts as they might not get translated
> properly. Using completely different UIDs in a container is their
> suggested solution, which is fine as long as the actual container
> filesystem isn't shared with anything else.

I conceptually agree. However, I think mount namespaces combined with
user namespaces muddy the water. Again, refer to the nsenter / unshare
man pages and the pages they reference.

nsenter has an option for sharing something between mount namespaces.
I have no idea what it does, much less how it does it. I suspect that
the kernel mounts it once (maybe not visible from anywhere else) and
then bind-mounts it to multiple locations for visibility / access.

> That tends to be the case anyway when you're using container
> implementations that do a lot of fancy image management. If you're
> doing something very minimal and just using a path/chroot on the host
> as your container then you need to be mindful of your UIDs/GIDs if
> you go accessing anything from the host directly.

UID & GID management is important. /Something/ should be doing it.

> The other thing I'd be careful with is mounting physical devices in
> more than one place. Since you're actually sharing a kernel I suspect
> linux will "do the right thing" if you mount an ext4 on /dev/sda2 on
> two different containers, but I've never tried it (and again doing
> that requires giving containers access to even see sda2 because they
> probably won't see physical devices by default).

Seeing as how the containers are running under the same kernel, there is
no actual need for the file system to be mounted multiple times.
Instead, the kernel would mount it once and present it, much like a
bind mount, to multiple containers for access.

Think along the lines of opening and working with a file system as a
separate step from where it's presented for access. Conceptually it's
not that dissimilar to a hard link, which gives the same file multiple
names in multiple locations on the same file system. (It's not a
perfect analogy, but I hope that makes sense.)
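
In bind mount terms, that's roughly (paths made up for illustration):

# Mount the file system once, on the host.
mount /dev/sda2 /mnt/data

# Present the same super block to two different container trees.
mount --bind /mnt/data /srv/container1/data
mount --bind /mnt/data /srv/container2/data

Both containers see the same file system state, via the one in-kernel
mount, without the device being "mounted twice" in any meaningful sense.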

> In a VM environment you definitely can't do this, because the VMs
> are completely isolated at the kernel level and having two different
> kernels having dirty buffers on the same physical device is going
> to kill any filesystem that isn't designed to be clustered.

Technically, you can usually get away with doing this, but the mounts
need to be read-only. Even then, I STRONGLY suggest that you NOT do
this to a non-cluster-aware file system.

I have colleagues who supported systems RO-mounting an ext file system
this way. It worked okay when it was used as a RO library. The problem
was when they made changes on the one host with RW access. They needed
to unmount and remount the file system on all the RO clients to see the
updates. It was not graceful and we advised that they stop doing that.
But it did work for their needs. They used it akin to a big (~TB)
CD-ROM.

> In a container environment the two containers aren't really isolated
> at the actual physical filesystem level since they share the kernel,

I think mount namespaces muddy this water. Yes, it's the same kernel,
but the containers don't necessarily have the same file systems exposed
to them.

> so I think you'd be fine but I'd really want to test or do some
> research before relying on it.

Yes, test.

But make sure you have at least a vague understanding of what's
actually happening behind the scenes. I find that tremendously helpful
in knowing what can and can't be done, as well as why.

> In any case, the more typical solution is to just mount everything on
> the host and then bind-mount it into the container. So, you could
> mount the nfs in /mnt and then bind-mount that into your container.
> There is really no performance hit and it should work fine without
> giving the container a bunch of capabilities.

I think there /is/ a performance hit. It's just so /minimal/ that it's
effectively non-existent. Every additional line of code in the path
that must be traversed does take CPU cycles.
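
For the original NFS question, that would look something like the
following (paths and names made up, and using nspawn since that's what
Rich runs):

# On the host:
mount -t nfs4 fileserver:/export /mnt/share

# Hand it to the container at launch time:
systemd-nspawn -D /srv/mycontainer --bind=/mnt/share:/mnt/share

The container just sees a directory tree; it never needs to know (or
have the capabilities to care) that NFS is involved.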