On Mon, Jan 2, 2023 at 4:48 AM m1027 <m1027@××××××.net> wrote:
>
> Hi and happy new year.
>
> When we create apps on Gentoo they easily become incompatible with
> older Gentoo systems in production where unattended remote world
> updates are risky. This is due to new glibc, openssl-3, etc.

I wrote a very long reply, but I've removed most of it: I basically
have a few questions, and then some comments.

I don't quite grasp your problem statement, so I will repeat what I
think it is and you can confirm / deny:
- Your devs build on gentoo synced against some recent tree, they
have recent packages, and they build some software that you deploy to
prod.
- Your prod machines are running gentoo synced against some recent
tree, but not upgraded (maybe only glsa-check runs), and so they are
running 'old' packages because you are afraid to update them[0].
- Your software builds OK in dev, but when you deploy it to prod it
breaks, because prod is really old and your dev builds depend on
packages that are too new.

My main feedback here is:
- Your "build" environment should be like prod. You said you didn't
want to build "developer VMs", but I am unsure why. For example, I
run Ubuntu and I do all my gentoo development (admittedly very little
these days) in a systemd-nspawn container, and I have a few shell
scripts to mount everything and set it up (so it has a tree snapshot,
some git repos, some writable space, etc.).
- Your "prod" environment is too risky to upgrade, and you have
difficulty crafting builds that run in every prod environment. I
think this is fixable by making a build environment more like the
prod environment.
The challenge here is that if you have not done that (kept copies of
the ebuilds, the distfiles, etc. around) it can be challenging to
"recreate" the existing older prod environments.
But if you do the above thing (where devs build in a container) and
you can make that container like the prod environments, then you can
enable devs to build for the prod environment (in a container on
their local machine) and get the outcome you want.
- Understand that not upgrading prod is, to use a finance term, like
picking up pennies in front of a steamroller. It's a great strategy,
but eventually you will actually *need* to upgrade something. Maybe
for a critical security issue, maybe for a feature. Having a build
environment that matches prod is good practice, and you should do it,
but you should also really schedule maintenance for these prod nodes
to get them upgraded. (For physical machines, I've often seen
businesses just eat the risk and assume the machine will physically
fail before the steamroller comes, but this is less true with
virtualized environments that have longer real lifetimes.)
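
To make the "build in a prod-like container" idea concrete, a minimal
sketch might look like the following. All paths, snapshot names, and
the package atom are illustrative assumptions, not my actual scripts:

```shell
set -euo pipefail

# Illustrative paths: a rootfs cloned from a prod image, plus a frozen
# ::gentoo tree snapshot matching what prod was built against.
PROD_ROOT=/srv/containers/prod-2022-06
TREE_SNAPSHOT=/srv/snapshots/gentoo-20220601

# Boot a throwaway build container that looks like prod: read-only
# frozen tree, shared distfiles cache, writable source directory.
sudo systemd-nspawn \
    --directory="$PROD_ROOT" \
    --bind-ro="$TREE_SNAPSHOT":/var/db/repos/gentoo \
    --bind=/var/cache/distfiles:/var/cache/distfiles \
    --bind="$HOME/src":/build \
    /bin/bash -lc 'cd /build && emerge --oneshot --buildpkg my-app'
```

Because the tree is bind-mounted read-only, an accidental `emerge
--sync` inside the container cannot drag the build environment ahead
of prod.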

>
> So, what we've thought of so far is:
>
> (1) Keeping outdated developer boxes around and compiling there. We
> would freeze portage against accidental emerge sync by creating a
> git branch in /var/db/repos/gentoo. This feels hacky and requires an
> increasing number of developer VMs. And sometimes we are hit by a
> silent incompatibility we were not aware of.

In general, when you build binaries for some target, you should build
on that target when possible. To me, this is the crux of your issue
(that you do not) and one of the main causes of your pain.
You will need to figure out a way to either:
- Upgrade the older environments to new packages.
- Build in copies of the older environments.

I actually expect the second one to take 1-2 sprints (so roughly one
engineer-month?):
- One sprint to write some scripts that make a new production
'container'.
- One sprint to integrate that container into your dev workflow, so
devs build in the container instead of whatever they build in now.

It might be more or less daunting depending on how many distinct
(unique?) prod environments you have (how many containers will you
actually need for good build coverage?), how experienced in Gentoo
your developers are, and how many artifacts from prod you have.
- A few crazy ideas:
  - Snapshot an existing prod machine, strip it of machine-specific
bits, and use that as your container.
  - Use quickpkg to generate a bunch of binary packages from a prod
machine, and use those to bootstrap a container.
  - Probably some other exciting ideas on the list ;)
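
The quickpkg bootstrap idea could be sketched roughly like this
(hostnames and staging paths are made up for illustration; qlist
comes from app-portage/portage-utils):

```shell
set -euo pipefail

# --- on the prod machine ---
# Turn every installed package into a binary package under
# /var/cache/binpkgs, including its current config files.
quickpkg --include-config=y $(qlist -IC)

# Ship the binpkgs and the world file to the dev side
# ('devbox' is a hypothetical host).
tar -C /var/cache -cf binpkgs.tar binpkgs
scp binpkgs.tar /var/lib/portage/world devbox:/srv/prod-clone/

# --- on the dev side, inside a fresh stage3 chroot/container ---
# Reinstall the same package set purely from those binary packages,
# never touching a newer tree.
PKGDIR=/srv/prod-clone/binpkgs \
    emerge --usepkgonly $(cat /srv/prod-clone/world)
```

Note that quickpkg rebuilds packages from the live filesystem, so
anything mutated on prod after install ends up in the binpkg; it is a
bootstrap shortcut, not a pristine rebuild.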

>
> (2) Using Ubuntu LTS for production and Gentoo for development is
> hit by subtle libjpeg incompatibilities and such.

I would advise, if possible, making dev and prod as similar as
possible[1]. I'd be curious what blockers you think there are to this
pattern.
Remember that "dev" is not "whatever your devs are using" but is
ideally some maintained environment, segmented from their daily-driver
computer (somehow).

>
> (3) Distributing apps as VMs or docker: Even those tools advance and
> become incompatible, right? And they are not suitable for smaller
> Arm devices.

I think if your apps are small and self-contained and easily rebuilt,
your (3) and (4) can be workable.

If you need 1000 dependencies at runtime, your containers are going
to be expensive to build and expensive to maintain, you are going to
have to rebuild them often (for security issues), and it can be
challenging to support incremental builds and incremental updates...
you generally want a clearer problem statement before adopting this
pain. Two problem statements that might make it worth it are below ;)

If you told me you had 100 different production environments, or
needed to support 12 different OSes, I'd tell you to use containers
(or similar).
If you told me you didn't control your production environment
(because users installed the software wherever), I'd tell you to use
containers (or similar).

>
> (4) Flatpak: No experience, does it work well?

Flatpak is conceptually similar to your (3). I know you are basically
asking "does it work" and the answer is "probably", but see the other
questions for (3). I suspect it's less about "does it work" and more
about "is some container deployment thing really a great idea."

>
> (5) Inventing a full-fledged OTA Gentoo OS updater and distributing
> that together with the apps... Nah.

This sounds like a very expensive solution that is likely rife with
very exciting security problems, fwiw.

>
> Hm... Comments welcome.
>


Peter's comment about basically running your own fork of gentoo.git
and sort of 'importing the updates' is workable. Google did this for
Debian testing (project Rodete)[2]. I can't say it's a particularly
cheap solution (significant automation and testing required), but as
long as you are keeping up (I would advise never falling more than
365d behind time.now() in your fork) I think it provides some
benefits:
- You control when you take updates.
- You want to stay "close" to time.now() in the tree, since a
rolling distro is how things are tested.
- This buys you 365d or so to fix any problem you find.
- It nominally requires that you test against ::gentoo and
::your-gentoo-fork, so you find problems in ::gentoo before they are
pulled into your fork, giving you a heads-up that you need to put
work in.
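
Mechanically, such a fork can be cheap to start; the expensive part is
the testing around it. A rough sketch (the 'ourlab' branch name, the
90-day cutoff, and the sync-uri are assumptions; anongit.gentoo.org is
the official git mirror of the tree):

```shell
set -euo pipefail

# One-time: fork the tree and create your own branch.
git clone https://anongit.gentoo.org/git/repo/gentoo.git /srv/gentoo-fork
cd /srv/gentoo-fork
git checkout -b ourlab-stable

# Periodically: import upstream up to a chosen cutoff (here 90 days,
# comfortably under the 365d ceiling), then test before rolling prod.
git fetch origin
CUTOFF=$(git rev-list -n1 --before="90 days ago" origin/master)
git merge --no-edit "$CUTOFF"

# Prod machines then sync from the fork instead of upstream, e.g. in
# /etc/portage/repos.conf (sync-uri is a placeholder):
# [gentoo]
# location  = /var/db/repos/gentoo
# sync-type = git
# sync-uri  = https://git.example.com/ourlab/gentoo.git
```

Pinning imports to a date like this is what gives you the "you choose
when to take updates" property described above.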

[0] FWIW this is basically what #gentoo-infra does on our boxes, and
it's terrible and I would not recommend it to most people in the
modern era. Upgrade your stuff regularly.
[1] When I was at Google we had a hilarious outage because someone
switched login managers (gdm vs kdm) and kdm had a different default
umask somehow? Anyway, it resulted in a critical component having the
wrong permissions, which caused a massive outage (luckily we had
sufficient redundancy that it was not user-visible), but it was one
of the scariest outages I had ever seen. I was in charge of
investigating (being on the dev OS team at the time) and it was
definitely very difficult to figure out "what changed" to produce the
bad build. We stopped building on developer workstations soon after,
FWIW.
[2] https://cloud.google.com/blog/topics/developers-practitioners/how-google-got-to-rolling-linux-releases-for-desktops

> Thanks
>
>