On Mon, Jan 2, 2023 at 4:55 PM m1027 <m1027@××××××.net> wrote:
>
> Many thanks for your detailed thoughts, and for sharing your rich
> experience on this! See below:
>
> antarus:
>
> > On Mon, Jan 2, 2023 at 4:48 AM m1027 <m1027@××××××.net> wrote:
> > >
> > > Hi and happy new year.
> > >
> > > When we create apps on Gentoo they easily become incompatible with
> > > older Gentoo systems in production where unattended remote world
> > > updates are risky. This is due to new glibc, openssl-3 etc.
> >
> > I wrote a very long reply, but I've removed most of it: I basically
> > have a few questions, and then some comments:
> >
> > I don't quite grasp your problem statement, so I will repeat what I
> > think it is and you can confirm / deny.
> >
> > - Your devs build using Gentoo synced against some recent tree, they
> > have recent packages, and they build some software that you deploy
> > to prod.
>
> Yes.
>
> > - Your prod machines are running Gentoo synced against some recent
> > tree, but not upgraded (maybe only glsa-check runs), and so they are
> > running 'old' packages because you are afraid to update them[0]
>
> Well, we did sync (without updating packages) in the past, but today
> we even fear to sync against recent trees. Without going into
> details, as a rule of thumb, weekly or monthly sync + package updates
> work nearly perfectly. (It's cool to see what a good job emerge does
> on our own internal production systems.) Updating systems older than
> 12 months or so may, however, be a huge task. And too risky for
> remote production systems at customer sites.

The primary risk, I think, is that even if you ship your app in a
container, you still need somewhere to run the containers. Currently
that is a fleet of different hardware and Gentoo configurations, and
while containers certainly simplify your life there, they won't fix
all your problems. Now instead of worrying that upgrading your Gentoo
OS will break your app, you worry that it will break your container
runtime. That is likely a smaller surface area, but it is not zero.
I'm not saying don't use containers, just that there is no free lunch
here necessarily.

>
>
> > - Your software builds OK in dev, but when you deploy it in prod it
> > breaks, because prod is really old, and your dev builds are using
> > packages that are too new.
>
> Exactly.
>
>
> > My main feedback here is:
> > - Your "build" environment should be like prod. You said you didn't
> > want to build "developer VMs", but I am unsure why. For example, I
> > run Ubuntu and I do all my Gentoo development (admittedly very
> > little these days)
> > in a systemd-nspawn container, and I have a few shell scripts to
> > mount everything and set it up (so it has a tree snapshot, some git
> > repos, some writable space etc.)
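
(Since that's my own setup quoted above: the scripts boil down to
roughly the sketch below; the paths are simplified/hypothetical.)

    # boot a Gentoo build container from a snapshotted tree
    systemd-nspawn \
        --directory=/srv/gentoo-build-root \
        --bind-ro=/srv/snapshots/gentoo-2023-01:/var/db/repos/gentoo \
        --bind=/srv/workspace:/home/build \
        --boot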
>
> Okay, yes. That is way (1) I mentioned in my OP. It does work, but
> it has the drawbacks I mentioned: VMs and their maintenance pile up,
> per developer. And you never know when the moment has come to create
> a new VM. But yes, it seems to me one of the ways to go: *before*
> creating a production system you need to freeze portage, create dev
> VMs, and prevent updates on the VMs, too. (Freezing, i.e. not
> updating, has many disadvantages, of course.)

Oh sorry, I failed to understand that you were doing that already. I
agree it's challenging; if you don't have a great method to simplify
things here, it might not be a great avenue going forward:
- Trying to figure out when you can make a new VM.
- Trying to figure out when you can take a build and deploy it to a
customer safely.
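
One way to at least make the "freeze" explicit and reproducible is to
pin ::gentoo to a branch and disable syncing; a sketch, assuming the
tree is synced with sync-type = git:

    # pin the tree at its current state
    cd /var/db/repos/gentoo
    git checkout -b frozen-2023-01

    # then in /etc/portage/repos.conf/gentoo.conf set:
    #   auto-sync = no
    # so an accidental `emerge --sync` won't move the branch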

I've seen folks try to group customers in some way to reduce the
number of prod artifacts required, but if you cannot, it might not be
workable.

The benefit of containers here is that you can basically deploy your
app at whatever rate you want, and only the OS upgrades remain risky
(because they might break the container runtime.)
Depending on your business needs, it might be advantageous to go that
route.

>
>
> > - Your "prod" environment is too risky to upgrade, and you have
> > difficulty crafting builds that run in every prod environment. I
> > think this is fixable by making a build environment more like the
> > prod environment.
> > The challenge here is that if you have not done that (kept copies
> > of the ebuilds around, the distfiles, etc.) it can be challenging
> > to "recreate" the existing older prod environments.
> > But if you do the above thing (where devs build in a container)
> > and you can make that container like the prod environments, then
> > you can enable devs to build for the prod environment (in a
> > container on their local machine) and get the outcome you want.
>
> Not sure I got your point here. But yes, it comes down to what was
> said above.
>
>
> > - Understand that not upgrading prod is, to use a finance term,
> > like picking up pennies in front of a steamroller. It's a great
> > strategy, but eventually you will actually *need* to upgrade
> > something. Maybe for a critical security issue, maybe for a
> > feature. Having a build environment that matches prod is good
> > practice, and you should do it, but you should also really
> > schedule maintenance for these prod nodes to get them upgraded.
> > (For physical machines, I've often seen businesses just eat the
> > risk and assume the machine will physically fail before the
> > steamroller comes, but this is less true with virtualized
> > environments that have longer real lifetimes.)
>
> Yes, haha, I agree. And yes, I totally ignored backporting security
> fixes here, as well as the fact that we might *require* a dependency
> upgrade (e.g. to fix a known memory leak). I left that out for
> simplicity only.

Ah, my worry is that the easy parts are easy, and the edge cases are
what really make things intractable here.

>
>
> > > So, what we've thought of so far is:
> > >
> > > (1) Keeping outdated developer boxes around and compiling there.
> > > We would freeze portage against accidental emerge sync by
> > > creating a git branch in /var/db/repos/gentoo. This feels hacky
> > > and requires an increasing number of developer VMs. And
> > > sometimes we are hit by a silent incompatibility we were not
> > > aware of.
> >
> > In general, when you build binaries for some target, you should
> > build on that target when possible. To me, this is the crux of
> > your issue (that you do not) and one of the main causes of your
> > pain.
> > You will need to figure out a way to either:
> > - Upgrade the older environments to new packages.
> > - Build in copies of the older environments.
> >
> > I actually expect the second one to take 1-2 sprints (so like 1
> > engineer-month?)
> > - One sprint to make some scripts that make a new production
> > 'container'
> > - One sprint to integrate that container into your dev workflow,
> > so devs build in the container instead of wherever they build now.
> >
> > It might be more or less daunting depending on how many distinct
> > (unique?) prod environments you have (how many containers will you
> > actually need for good build coverage?), how experienced in Gentoo
> > your developers are, and how many artifacts from prod you have.
> > - A few crazy ideas:
> > - Snapshot an existing prod machine, strip it of machine-specific
> > bits, and use that as your container.
> > - Use quickpkg to generate a bunch of binpkgs from a prod machine,
> > and use those to bootstrap a container.
> > - Probably some other exciting ideas on the list ;)
>
> Thanks for the enthusiasm. ;-) Well:
>
> We cannot build (develop) on that exact target. Imagine hardware
> being sold to customers. They just want/need a software update of
> our app.
>
> And, unfortunately, we don't have hardware clones of all the
> different customers' hardware at our site to build and test on.

Ah, sorry, I meant mostly the software configuration here (my
apologies). It sounds like you are doing that already and are finding
that keeping numerous software configurations (VMs) around is too
costly.
If that is the case, it sounds like containers, flatpak, or snap
packages could be the way to go (the last one only if your prod
environment is systemd-compatible.)
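
If you do try bootstrapping such a container from an existing prod
box, the quickpkg idea from above is roughly this (a sketch; PKGDIR
defaults to /var/cache/binpkgs on current installs):

    # on the prod machine: snapshot every installed package as a binpkg
    quickpkg --include-config=y "*/*"

    # copy PKGDIR plus /var/lib/portage/world into a fresh stage3 or
    # container, then rebuild the prod package set from binpkgs only:
    emerge --usepkgonly --update --deep @world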

>
> So, we come back to the question of how to have a solid LTS-like
> OS / software stack onto which newly compiled developer apps can be
> distributed and just work. And all this in Gentoo. :-)
>
>
> > > (2) Using Ubuntu LTS for production and Gentoo for development
> > > is hit by subtle libjpeg incompatibilities and such.
> >
> > I would advise, if possible, to make dev and prod as similar as
> > possible[1]. I'd be curious what blockers you think there are to
> > this pattern.
> > Remember that "dev" is not "whatever your devs are using" but is
> > ideally some maintained environment, segmented from their
> > daily-driver computer (somehow).
>
> That again means VMs per "release" and per dev, right? See "way (1)"
> above.

At a previous job we built some scripts to build a VM per release,
but in our scheme we only had to build 9 VMs worst case (3 targets
and 3 OS tracks, so 9 total.) We shared the base VM images (per
release) with the entire development team of 9 people. It was
feasible with decent internet (100 Mbit); we had some shared storage
we put signed images on. But often you would only use one image to
test locally, then push to the CI pipeline that would test on the
other 8 images (because it was cheaper / whatever in the datacenter.)

I continue to agree with you: if you can't get your number of targets
down to that kind of range (10-20ish), it's probably not going to be
a great time for you.

>
>
> > > (3) Distributing apps as VMs or docker: Even those tools advance
> > > and become incompatible, right? And not suitable for smaller Arm
> > > devices.
> >
> > I think if your apps are small and self-contained and easily
> > rebuilt, your (3) and (4) can be workable.
> >
> > If you need 1000 dependencies at runtime, your containers are
> > going to be expensive to build, expensive to maintain, you are
> > going to have to build them often (for security issues), and it
> > can be challenging to support incremental builds and incremental
> > updates... You generally want a clearer problem statement to adopt
> > this pain. Two problem statements that might be worth it are
> > below ;)
> >
> > If you told me you had 100 different production environments, or
> > needed to support 12 different OSes, I'd tell you to use
> > containers (or similar).
> > If you told me you didn't control your production environment
> > (because users installed the software wherever), I'd tell you to
> > use containers (or similar).
> >
> > >
> > > (4) Flatpak: No experience, does it work well?
> >
> > Flatpak is conceptually similar to your (3). I know you are
> > basically asking "does it work" and the answer is "probably", but
> > see the other questions for (3). I suspect it's less about "does
> > it work" and more about "is some container deployment thing really
> > a great idea."
>
> Well, thanks for your comments on containers and flatpak. It's
> motivating to investigate this further.
>
> Admittedly, we've been sticking to natively built apps for reasons
> that might not be relevant these days. (Hardware-bound apps, bus
> systems etc., performance reasons on IoT-like devices, no real
> experience with lean containers yet, only QEMU.)

Depending on your app, you can get pretty lean containers. We have a
golang app (https://gitweb.gentoo.org/sites/soko.git/tree/Dockerfile)
whose Docker image is 39MB, but it mostly just has a large statically
compiled Go binary in it. We run gitlab-ce in a large container that
is 2.5GB, so the sizes can definitely get large if you are not
careful.
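
The lean pattern is basically "static binary plus nothing"; a sketch
(not the actual soko Dockerfile; the module path and Go version are
hypothetical):

    # build stage: produce a fully static binary
    FROM golang:1.19 AS build
    WORKDIR /src
    COPY . .
    RUN CGO_ENABLED=0 go build -o /app ./cmd/app

    # runtime stage: ship only the binary
    FROM scratch
    COPY --from=build /app /app
    ENTRYPOINT ["/app"]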

Another potential issue for containers is that the container shares a
kernel with the host, so if your host kernel is very old but you need
new kernel features (or syscalls), they may be missing on the host.
There are still some gotchas, then (but, as mentioned, probably fewer
than you experience with a full OS build.)
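
If that worries you, a cheap guard is to refuse to start on hosts
below a known-good kernel; a sketch (the "5.4" floor is
hypothetical):

    #!/bin/sh
    # entrypoint fragment: containers see the *host* kernel via uname
    need="5.4"
    have="$(uname -r)"
    # sort -V picks the smaller version; if that isn't $need, the
    # host kernel is too old
    if [ "$(printf '%s\n' "$need" "$have" | sort -V | head -n1)" != "$need" ]; then
        echo "host kernel $have is older than required $need" >&2
        exit 1
    fi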

Good luck!

-A

>
>
> > Peter's comment about basically running your own fork of
> > gentoo.git and sort of 'importing the updates' is workable. Google
> > did this for Debian testing (project Rodete)[2]. I can't say it's
> > a particularly cheap solution (significant automation and testing
> > required), but I think that, as long as you are keeping up (I
> > would advise never falling more than 365d behind time.now() in
> > your fork), it provides some benefits:
> > - You control when you take updates.
> > - You want to stay "close" to time.now() in the tree, since a
> > rolling distro is how things are tested.
> > - This buys you 365d or so to fix any problem you find.
> > - It nominally requires that you test against ::gentoo and
> > ::your-gentoo-fork, so you find problems in ::gentoo before they
> > are pulled into your fork, giving you a heads-up that you need to
> > put work in.
>
> I haven't replied to Peter yet, but yes, I'll have a look at what he
> added. Something tells me that distributing apps in a container
> might be the cheaper way for us. We'll see.
>
>
> > [0] FWIW this is basically what #gentoo-infra does on our boxes
> > and it's terrible and I would not recommend it to most people in
> > the modern era. Upgrade your stuff regularly.
> > [1] When I was at Google we had a hilarious outage because someone
> > switched login managers (gdm vs kdm) and kdm had a different
> > default umask somehow? Anyway, it resulted in a critical component
> > having the wrong permissions, and it caused a massive outage
> > (luckily we had sufficient redundancy that it was not
> > user-visible), but it was one of the scariest outages I had ever
> > seen. I was in charge of investigating (being on the dev OS team
> > at the time) and it was definitely very difficult to figure out
> > "what changed" to produce the bad build. We stopped building on
> > developer workstations soon after, FWIW.
> > [2] https://cloud.google.com/blog/topics/developers-practitioners/how-google-got-to-rolling-linux-releases-for-desktops
>
> Thanks for sharing! Very interesting insights.
>
> To sum up:
>
> You described interesting ways to create and control your own
> releases of Gentoo. So production and developer systems could be
> aligned on that. The effort required will vary.
>
> Another way is containers.
>