Gentoo Archives: gentoo-dev

From: Alec Warner <antarus@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] Gentoo LTS or: proper backward compatibility?
Date: Tue, 03 Jan 2023 03:00:03
Message-Id: CAAr7Pr-KF1YzQ=B7X5KJZ68Afkok4kB3Ukec_0BLPwEUxsGX=w@mail.gmail.com
In Reply to: Re: [gentoo-dev] Gentoo LTS or: proper backward compatibility? by m1027
On Mon, Jan 2, 2023 at 4:55 PM m1027 <m1027@××××××.net> wrote:
>
> Many thanks for your detailed thoughts and for sharing your rich
> experience on this! See below:
>
> antarus:
>
> > On Mon, Jan 2, 2023 at 4:48 AM m1027 <m1027@××××××.net> wrote:
> > >
> > > Hi and happy new year.
> > >
> > > When we create apps on Gentoo they easily become incompatible with
> > > older Gentoo systems in production where unattended remote world
> > > updates are risky. This is due to new glibc, openssl-3, etc.
> >
> > I wrote a very long reply, but I've removed most of it: I basically
> > have a few questions, and then some comments.
> >
> > I don't quite grasp your problem statement, so I will repeat what I
> > think it is and you can confirm / deny.
> >
> > - Your devs build using Gentoo synced against some recent tree, they
> > have recent packages, and they build some software that you deploy to
> > prod.
>
> Yes.
>
> > - Your prod machines are running Gentoo synced against some recent
> > tree, but not upgraded (maybe only glsa-check runs), and so they are
> > running 'old' packages because you are afraid to update them[0]
>
> Well, we did sync (without updating packages) in the past, but today we
> are even afraid to sync against recent trees. Without going into details,
> as a rule of thumb, weekly or monthly sync + package updates work
> nearly perfectly. (It's cool to see what a good job emerge does on our
> own internal production systems.) Updating systems older than 12
> months or so may, however, be a huge task. And it is too risky for our
> customers' remote production systems.

The primary risk, I think, is that even if you ship your app in a
container, you still need somewhere to run the containers. Currently
that is a fleet of different hardware and Gentoo configurations, and
while containers certainly simplify your life there, they won't fix
all your problems. Now, instead of worrying that upgrading your Gentoo
OS will break your app, you worry that it will break your container
runtime. That is likely a smaller surface area, but it is not zero.
I'm not saying don't use containers, just that there is no free lunch
here.

>
>
> > - Your software builds OK in dev, but when you deploy it in prod it
> > breaks, because prod is really old and your dev builds are using
> > packages that are too new.
>
> Exactly.
>
>
> > My main feedback here is:
> > - Your "build" environment should be like prod. You said you didn't
> > want to build "developer VMs" but I am unsure why. For example, I run
> > Ubuntu and I do all my Gentoo development (admittedly very little
> > these days)
> > in a systemd-nspawn container, and I have a few shell scripts to
> > mount everything and set it up (so it has a tree snapshot, some git
> > repos, some writable space, etc.)
>
> Okay, yes. That is way (1) I mentioned in my OP. It does work, but it
> has the mentioned drawbacks: VMs and maintenance pile up, and they do
> so per developer. And you never know when the moment has come to
> create a new VM. But yes, it seems to me one of the ways to go:
> *Before* creating a production system you need to freeze portage,
> create dev VMs, and prevent updates on the VMs, too. (Freezing, a.k.a.
> not updating, has many disadvantages, of course.)

Oh sorry, I failed to understand that you are already doing that. I
agree it's challenging; if you don't have a good method to simplify
things here, it might not be a great avenue going forward:
- Trying to figure out when you can make a new VM.
- Trying to figure out when you can take a build and deploy it to a
customer safely.

I've seen folks try to group customers in some way to reduce the
number of prod artifacts required, but if you cannot, it might be
hard to keep that number manageable.

The benefit of containers here is that you can basically deploy your
app at whatever rate you want, and only the OS upgrades remain risky
(because they might break the container runtime).
Depending on your business needs, it might be advantageous to go that route.
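
To make that concrete, here is a minimal sketch of what an app rollout
could look like once the app ships as a container image; the registry,
image name, and tag below are made up for illustration:

# on the prod box: fetch a specific, pinned app release
docker pull registry.example.com/ourapp:2023.01.0

# replace the running instance with the new release
docker stop ourapp && docker rm ourapp
docker run -d --name ourapp --restart=unless-stopped \
    registry.example.com/ourapp:2023.01.0

The host OS (and the container runtime) still needs its own upgrade
schedule; the point is only that the app's release cadence is decoupled
from it.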

>
>
> > - Your "prod" environment is too risky to upgrade, and you have
> > difficulty crafting builds that run in every prod environment. I think
> > this is fixable by making a build environment more like the prod
> > environment.
> > The challenge here is that if you have not kept the artifacts
> > around (copies of the ebuilds, the distfiles, etc.) it can be hard to
> > "recreate" the existing older prod environments.
> > But if you do the above thing (where devs build in a container)
> > and you can make that container like the prod environments, then you
> > can enable devs to build for the prod environment (in a container on
> > their local machine) and get the outcome you want.
>
> Not sure I got your point here. But yes, it comes down to what was
> said above.
>
>
> > - Understand that not upgrading prod is like, to use a finance term,
> > picking up pennies in front of a steamroller. It's a great strategy,
> > but eventually you will actually *need* to upgrade something. Maybe
> > for a critical security issue, maybe for a feature. Having a build
> > environment that matches prod is good practice, you should do it, but
> > you should also really schedule maintenance for these prod nodes to
> > get them upgraded. (For physical machines, I've often seen businesses
> > just eat the risk and assume the machine will physically fail before
> > the steamroller comes, but this is less true with virtualized
> > environments that have longer real lifetimes.)
>
> Yes, haha, I agree. And yes, I totally ignored backporting security
> fixes here, as well as the fact that we might *require* a dependency
> upgrade (e.g. to fix a known memory leak). I left that out
> for simplicity only.

Ahh, my worry is that the easy parts are easy and the edge cases are
what really make things intractable here.

>
>
> > > So, what we've thought of so far is:
> > >
> > > (1) Keeping outdated developer boxes around and compiling there. We
> > > would freeze portage against accidental emerge sync by creating a
> > > git branch in /var/db/repos/gentoo. This feels hacky and requires an
> > > increasing number of developer VMs. And sometimes we are hit by a
> > > silent incompatibility we were not aware of.
> >
> > In general, when you build binaries for some target, you should build
> > on that target when possible. To me, this is the crux of your issue
> > (that you do not) and one of the main causes of your pain.
> > You will need to figure out a way to either:
> > - Upgrade the older environments to new packages.
> > - Build in copies of the older environments.
> >
> > I actually expect the second one to take 1-2 sprints (so like 1 engineer month?)
> > - One sprint to make some scripts that make a new production 'container'
> > - One sprint to sort of integrate that container into your dev
> > workflow, so devs build in the container instead of whatever they build
> > in now.
> >
> > It might be more or less daunting depending on how many distinct
> > (unique?) prod environments you have (how many containers will you
> > actually need for good build coverage?), how experienced in Gentoo
> > your developers are, and how many artifacts from prod you have.
> > A few crazy ideas:
> >   - Snapshot an existing prod machine, strip it of machine-specific
> > bits, and use that as your container.
> >   - Use quickpkg to generate a bunch of binpkgs from a prod machine,
> > and use that to bootstrap a container.
> >   - Probably some other exciting ideas on the list ;)
>
> Thanks for the enthusiasm on it. ;-) Well:
>
> We cannot build (develop) on that exact target. Imagine hardware
> being sold to customers. They just want/need a software update of
> our app.
>
> And, unfortunately, we don't have hardware clones of all the
> different customers' hardware in house to build on, test, etc.

Ahh sorry, I meant mostly the software configuration here (my
apologies). It sounds like you are already doing that, as described
above, and are finding that keeping numerous software configurations
(VMs) around is too costly.
If that is the case, it sounds like containers, flatpak, or snap
packages could be the way to go (the last one only if your prod
environment is systemd compatible).
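
If you do experiment with the quickpkg idea quoted further up, a rough
sketch of the flow could look like the following; the host names and
paths are made up:

# 1) On a representative prod machine, snapshot every installed package
#    as a binary package (qlist comes from app-portage/portage-utils):
quickpkg --include-config=y $(qlist -IC)

# 2) Ship the binpkgs, /etc/portage, and the world file to the build host:
rsync -a /var/cache/binpkgs /etc/portage /var/lib/portage/world \
    builder:/srv/prod-snapshot/

# 3) On the build host, install from those binary packages only into a
#    fresh rootfs that will serve as the build chroot/container
#    (assumes the shipped world file has been copied to
#    /srv/prod-root/var/lib/portage/world and a profile is set):
PKGDIR=/srv/prod-snapshot/binpkgs emerge --root=/srv/prod-root \
    --usepkgonly @world

The result is a build environment whose library versions match what is
actually deployed, without rebuilding anything from source.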

>
> So, we come back to the question of how to have a solid LTS-like
> software OS / stack onto which newly compiled developer apps can be
> distributed and just work. And all this in Gentoo. :-)
>
>
> > > (2) Using Ubuntu LTS for production and Gentoo for development is
> > > hit by subtle libjpeg incompatibilities and such.
> >
> > I would advise, if possible, to make dev and prod as similar as
> > possible[1]. I'd be curious what blockers you think there are to this
> > pattern.
> > Remember that "dev" is not "whatever your devs are using" but is
> > ideally some maintained environment, segmented from their daily driver
> > computer (somehow).
>
> That is again VMs per "release" and per dev, right? See above "way
> (1)".

At a previous job we built some scripts to build a VM per release, but
in our scheme we only had to build 9 VMs worst case (3 targets and 3
OS tracks, so 9 total). We shared the base VM images (per release)
with the entire development team of 9 people. It was feasible with
decent internet (100 Mbit). We had some shared storage we put signed
images on. But often you would only use 1 image to test locally, then
push to the CI pipeline, which would test on the other 8 images
(because it was cheaper / whatever in the datacenter).

I continue to agree with you that if you can't get your number of
targets down to that kind of range (10-20ish), it's probably not
going to be a great time for you.
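
For what it's worth, those per-release build environments don't have to
be full VMs. A rough sketch of how one could be assembled with a stage3
and systemd-nspawn, in the spirit of the setup I described above; every
path and file name here is made up:

# one-time: unpack a stage3 that roughly matches the prod generation
mkdir -p /srv/build-2023.01
tar xpf stage3-amd64-example.tar.xz -C /srv/build-2023.01

# enter the build environment: bind a frozen ::gentoo snapshot read-only,
# plus the source checkout and a shared distfiles cache
systemd-nspawn -D /srv/build-2023.01 \
    --bind-ro=/srv/tree-snapshots/2023.01:/var/db/repos/gentoo \
    --bind=/home/dev/src:/src \
    --bind=/srv/distfiles:/var/cache/distfiles \
    /bin/bash

The same directory tree can be tarred up and shared with the team, and
CI can cover the remaining targets, much like the 9-VM scheme above.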


>
>
> > > (3) Distributing apps as VMs or docker: Even those tools advance and
> > > become incompatible, right? And they are not suitable for smaller ARM
> > > devices.
> >
> > I think if your apps are small and self-contained and easily rebuilt,
> > your (3) and (4) can be workable.
> >
> > If you need 1000 dependencies at runtime, your containers are going to
> > be expensive to build, expensive to maintain, you are gonna have to
> > build them often (for security issues), it can be challenging to
> > support incremental builds and incremental updates...you generally
> > want a clearer problem statement to adopt this pain. Two problem
> > statements that might be worth it are below ;)
> >
> > If you told me you had 100 different production environments, or
> > needed to support 12 different OSes, I'd tell you to use containers
> > (or similar).
> > If you told me you didn't control your production environment (because
> > users installed the software wherever), I'd tell you to use containers
> > (or similar).
> >
> > >
> > > (4) Flatpak: No experience, does it work well?
> >
> > Flatpak is conceptually similar to your (3). I know you are basically
> > asking "does it work" and the answer is "probably", but see the other
> > questions for (3). I suspect it's less about "does it work" and more
> > about "is some container deployment thing really a great idea."
>
> Well, thanks for your comments on containers and flatpak. It's
> motivating to investigate that further.
>
> Admittedly, we've been sticking to natively built apps for reasons
> that might not be relevant these days. (Hardware-bound apps, bus
> systems etc., performance reasons on IoT-like devices, no real
> experience with lean containers yet, only QEMU.)

Depending on your app, you can get pretty lean containers. We have a
Go app (https://gitweb.gentoo.org/sites/soko.git/tree/Dockerfile)
whose docker image is 39MB, but it mostly just contains a large,
statically compiled Go binary. We run gitlab-ce in a large container
that is 2.5GB, so the sizes can definitely get large if you are not
careful.
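
As a rough illustration of why such an image stays small (this is not
the soko Dockerfile linked above, just the generic pattern with made-up
names):

# build a fully static binary so the runtime image needs no libc
CGO_ENABLED=0 go build -o ourapp ./cmd/ourapp

# package it on top of an (almost) empty base image
cat > Dockerfile <<'EOF'
FROM scratch
COPY ourapp /ourapp
ENTRYPOINT ["/ourapp"]
EOF
docker build -t registry.example.com/ourapp:2023.01.0 .

With glibc-linked or interpreted apps you start from a small base image
instead of scratch, and the size grows accordingly.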

Another potential issue with containers is that they share a kernel
with the host, so if your host kernel is very old but you need newer
kernel features (or syscalls), those may be missing. There are still
some gotchas here (but, as mentioned, probably fewer than you
experience with a full OS upgrade).
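
You can see the shared kernel directly; assuming docker and some small
image are available, the kernel version reported inside a container is
the host's:

# kernel version as seen inside the container...
docker run --rm alpine uname -r
# ...matches the kernel the host is running
uname -r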

Good Luck!

-A

>
>
> > Peter's comment about basically running your own fork of gentoo.git
> > and sort of 'importing the updates' is workable. Google did this for
> > Debian testing (called project Rodete)[2]. I can't say it's a
> > particularly cheap solution (significant automation and testing
> > required), but as long as you are keeping up (I would advise
> > never falling more than 365d behind time.now() in your fork) I
> > think it provides some benefits:
> > - You control when you take updates.
> > - You want to stay "close" to time.now() in the tree, since a
> > rolling distro is how things are tested.
> > - This buys you 365d or so to fix any problem you find.
> > - It nominally requires that you test against ::gentoo and
> > ::your-gentoo-fork, so you find problems in ::gentoo before they are
> > pulled into your fork, giving you a heads-up that you need to put
> > work in.
>
> I haven't commented on Peter's mail yet, but yes, I'll have a look at
> what he added. Something tells me that distributing apps in a container
> might be the cheaper way for us. We'll see.
>
>
> > [0] FWIW this is basically what #gentoo-infra does on our boxes and
> > it's terrible and I would not recommend it to most people in the
> > modern era. Upgrade your stuff regularly.
> > [1] When I was at Google we had a hilarious outage because someone
> > switched login managers (gdm vs kdm) and kdm had a different default
> > umask somehow? Anyway, it resulted in a critical component having the
> > wrong permissions, and it caused a massive outage (luckily we had
> > sufficient redundancy that it was not user visible), but it was one of
> > the scariest outages I had ever seen. I was in charge of investigating
> > (being on the dev OS team at the time) and it was definitely very
> > difficult to figure out "what changed" to produce the bad build. We
> > stopped building on developer workstations soon after, FWIW.
> > [2] https://cloud.google.com/blog/topics/developers-practitioners/how-google-got-to-rolling-linux-releases-for-desktops
>
> Thanks for sharing! Very interesting insights.
>
> To sum up:
>
> You described interesting ways to create and control your own releases
> of Gentoo, so production and developer systems could be aligned on
> that. The effort required depends.
>
> Another way is containers.
>
>