On 5/18/19 5:49 PM, Rich Freeman wrote:
> I'd be interested if there are other scripts people have put out
> there, but I agree that most of the container solutions on Linux
> are overly-complex.

Here's what I use for some networking, which probably qualifies as
extremely lightweight "containers".

Prerequisite: Create a place for the namespaces to anchor:

# Create the directories to contain the *NS mount points.
sudo mkdir -p /run/{mount,net,uts}ns

You can use any path that you want. — I do a lot with iproute2's
network namespaces (which is where this evolved from), which use
/run/netns/$NetNSname. So I used that as a pattern for the other types
of namespaces. Adjust as you want. — What I'm doing is interoperable
with iproute2's netns command.

Per "container": Create the "container's" mount points:

# Create the *NS mount points.
sudo touch /run/{mount,net,uts}ns/$ContainerName

Start the actual namespaces:

# Spawn the namespaces for $ContainerName.
unshare --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName /bin/true

Note: The namespaces don't die when true exits because they are
associated with a mount point.
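
The netns anchored this way shows up in iproute2's tooling, and
teardown is just detaching the bind mounts. Something like this should
work (an untested sketch; the namespaces go away once the last
reference to them is gone):

# Verify iproute2 sees the netns.
ip netns list

# Tear down: detach the *NS bind mounts and remove the anchor files.
sudo umount /run/{mount,net,uts}ns/$ContainerName
sudo rm /run/{mount,net,uts}ns/$ContainerName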

Tweak the namespaces:

# Set the $ContainerName UTS namespace's hostname.
nsenter --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName \
    /bin/hostname $ContainerName

I reuse this command, calling different binaries, any time I want to do
something in the "container". Calling /bin/bash (et al.) enters the
container.

I've created a wrapper script (nsenter.wrapper) that passes the proper
parameters to nsenter. I've then symlinked the container name to the
nsenter.wrapper script. This means that I can run "$ContainerName
$Command" or simply enter the container with $ContainerName. (The
script checks the number of parameters and assumes /bin/bash if no
command is specified.)
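
A minimal sketch of such a wrapper (a hypothetical reconstruction, not
the actual script):

#!/bin/sh
# nsenter.wrapper -- enter the namespaces named after the invoking link.
# Symlink a container name to this script; $0 then selects the container.
ContainerName=$(basename "$0")
# Assume an interactive shell if no command was given.
[ $# -eq 0 ] && set -- /bin/bash
exec nsenter --mount=/run/mountns/$ContainerName \
    --net=/run/netns/$ContainerName --uts=/run/utsns/$ContainerName "$@"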

I think it's ultimately extremely trivial to have a "container" (a
glorified collection of namespaces) to do things I want with virtually
zero disk space. Ok, ok, maybe 1 or 2 kB for the script & links.

Note: Since I'm using the mount namespace, I can have a completely
different mount tree inside the "container" than I have outside the
container / on the host. I'm not currently doing that, but it's
possible to change things as desired.

> I personally use nspawn, which is actually pretty minimal, but it
> depends on systemd, which I'm sure many would argue is overly complex.
> :) However, if you are running systemd you can basically do a
> one-liner that requires zero setup to turn a chroot into a container.

As much as I might not like systemd, if you have it, and it reliably
does what you want, then I see no reason to /not/ use it. Just
acknowledge it as a dependency of your solution, which you have done.
So I think we're cool.

> On to the original questions about mounts:
>
> In general you can mount stuff in containers without issue. There are
> two ways to go about it. One is to mount something on the host and
> bind-mount it into the container, typically at launch time. The other
> is to give the container the necessary capabilities so that it can
> do its own mounting (typically containers are not given the necessary
> capabilities, so mounting will fail even as root inside the container).

Given that one of the uses of containers is security isolation (such as
it is), I feel like giving the container the ability to mount things is
less than a stellar idea. But to each his / her own.

> I believe the reason the wiki says to be careful with mounts has more
> to do with UID/GID mapping. As you are using nfs this is already an
> issue you're probably dealing with. You're probably aware that running
> nfs with multiple hosts with unsynchronized passwd/group files can
> be tricky, because linux (and unix in general) works with UIDs/GIDs,
> and not really directly with names,

That's true for NFS v1-3. But NFS v4 changes that. NFS v4 actually
uses user names & group names and has a daemon that runs on the client &
server to translate things as necessary.
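
(That's the rpc.idmapd / nfsidmap machinery. Both ends have to agree on
the idmapping domain; a minimal sketch, with a hypothetical domain:)

# /etc/idmapd.conf on both the client & the server:
[General]
Domain = example.com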

> so if you're doing something with one UID on one host and with a
> different UID on another host you might get unexpected permissions
> behavior.

Yep. You need to do /something/ to account for this. Be it manually
managing UIDs & GIDs across things, or using something like NFSv4's
translation mechanism.

> In a nutshell the same thing can happen with containers, or for
> that matter with chroots.

I mostly agree. However, user namespaces can nullify this.

I've not dabbled with user namespaces yet, but my understanding is that
processes can have completely different UIDs & GIDs inside the user
namespace than outside of it. For example, UID 0 / GID 0 inside a user
namespace can be mapped to UID 12345 / GID 23456 outside of the user
namespace. Refer to the nsenter / unshare man pages for more details.
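
(A quick untested sketch of that, using unshare's convenience option;
mappings other than your own UID/GID go in /proc/$pid/uid_map and
gid_map:)

# Become "root" in a new user namespace, mapped to the invoking user.
unshare --user --map-root-user /bin/bash
# Inside, id reports uid=0(root); files created here are still owned by
# the real (outer) UID when viewed from outside the namespace.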

> If you have identical passwd/group files it should be a non-issue.

Point of order: The files don't need to be identical. The UIDs & GIDs
need to be managed if you aren't using something like user namespaces.
So it's perfectly valid to have a text file that is used to coordinate
UIDs & GIDs somewhere and then use those in passwd/shadow &
group/gshadow files.

> However, if you want to do mapping with unprivileged containers
> you have to be careful with mounts as they might not get translated
> properly. Using completely different UIDs in a container is their
> suggested solution, which is fine as long as the actual container
> filesystem isn't shared with anything else.

I conceptually agree. However I think mount namespaces combined with
user namespaces muddy the water. Again, refer to the nsenter / unshare
man pages and what they refer to.

nsenter has an option for sharing something between mount namespaces. I
have no idea what it does, much less how it does it. I suspect that the
kernel mounts it once (maybe not visible from anywhere else) and then
bind-mounts it to multiple locations for visibility / access.

> That tends to be the case anyway when you're using container
> implementations that do a lot of fancy image management. If you're
> doing something very minimal and just using a path/chroot on the host
> as your container then you need to be mindful of your UIDs/GIDs if
> you go accessing anything from the host directly.

UID & GID management is important. /Something/ should be doing it.

> The other thing I'd be careful with is mounting physical devices in
> more than one place. Since you're actually sharing a kernel I suspect
> linux will "do the right thing" if you mount an ext4 on /dev/sda2 on
> two different containers, but I've never tried it (and again doing
> that requires giving containers access to even see sda2 because they
> probably won't see physical devices by default).

Seeing as how the containers are running under the same kernel, there is
no actual need for the file system to be mounted multiple times.
Instead the kernel would mount it and present it, much like a bind
mount, to multiple containers for access.

Think along the lines of opening and working with a file system as a
separate process from where it's presented for access. Conceptually not
that dissimilar to a hard link that has multiple representations of a
file in multiple locations on the same file system. (It's not a perfect
analogy, but I hope that makes sense.)
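
(In bind-mount terms, a sketch of one real mount presented in two
places; the device & paths are hypothetical:)

# One real ext4 mount on the host...
sudo mount /dev/sda2 /mnt/data
# ...presented in two containers' mount trees via bind mounts.
sudo mount --bind /mnt/data /srv/containerA/data
sudo mount --bind /mnt/data /srv/containerB/data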

> In a VM environment you definitely can't do this, because the VMs
> are completely isolated at the kernel level and having two different
> kernels having dirty buffers on the same physical device is going
> to kill any filesystem that isn't designed to be clustered.

Technically, you can usually get away with doing this, but the mounts
need to be read-only. Even then, I STRONGLY suggest that you NOT do
this with a non-cluster-aware file system.

I have colleagues that supported systems RO mounting an Ext file system
this way. It worked okay when it was used as a RO library. The problem
was when they made changes via the one with RW access. They needed to
unmount and remount all the RO clients to see the updates. It was not
graceful and we advised that they stop doing that. But it did work for
their needs. They used it akin to a big (~TB) CD-ROM.

> In a container environment the two containers aren't really isolated
> at the actual physical filesystem level since they share the kernel,

I think mount namespaces muddy this water. Yes, it's the same kernel,
but the containers don't necessarily have the same file systems exposed
to them.
185 |
|
186 |
> so I think you'd be fine but I'd really want to test or do some |
187 |
> research before relying on it. |

Yes, test.

But make sure you have a vague understanding of what's actually
happening behind the scenes. I find that tremendously helpful in
knowing what can and can't be done, as well as why.

> In any case, the more typical solution is to just mount everything on
> the host and then bind-mount it into the container. So, you could
> mount the nfs in /mnt and then bind-mount that into your container.
> There is really no performance hit and it should work fine without
> giving the container a bunch of capabilities.

I think there /is/ a performance hit. It's just so /minimal/ that it's
effectively non-existent. Every additional line of code in the path
that must be traversed does take CPU cycles.
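
(For illustration, the host-mount-plus-bind approach could look
something like this with nspawn; the export, mount point, and rootfs
paths are all hypothetical:)

# Mount the NFS export once, on the host...
sudo mount -t nfs4 server:/export /mnt/share
# ...then bind it into the container at launch, no extra capabilities needed.
sudo systemd-nspawn -D /var/lib/machines/mybox --bind=/mnt/share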