1 |
On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote: |
2 |
> On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote: |
3 |
> > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote: |
4 |
> > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry <sid@××××.us> napisał(a): |
5 |
> > > > Hello, |
6 |
> > > > |
7 |
> > > > I have some runnable pseudocode outlining a faster tree verification |
8 |
> > > > algorithm. |
9 |
> > > > Before I create patches I'd like to see if there is any guidance on |
10 |
> > > > making the |
11 |
> > > > changes as unobtrusive as possible. If the radical change in algorithm |
12 |
> > > > is |
13 |
> > > > acceptable I can work on adding the changes. |
14 |
> > > > |
15 |
> > > > Instead of composing any kind of structured data out of the portage |
16 |
> > > > tree my |
17 |
> > > > algorithm just lists all files and then optionally batches them out to |
18 |
> > > > threads. |
19 |
> > > > There is a noticeable speedup by eliding the tree traversal operations |
20 |
> > > > which |
21 |
> > > > can be seen when running the algorithm with a single thread and |
22 |
> > > > comparing it to |
23 |
> > > > the current algorithm in gemato (which should still be discussed |
24 |
> > > > here?). |
25 |
> > > |
26 |
> > > Without reading the code: does your algorithm correctly detect extraneous files? |
27 |
> > > |
28 |
> > |
29 |
> > Yes and no. |
30 |
> > |
31 |
> > I am not sure why this is necessary. If the file does not appear in a manifest it is |
32 |
> > ignored. It makes the most sense to me to put the burden of not including |
33 |
> > untracked files on the publisher. If the user puts an untracked file into the tree it |
34 |
> > will be ignored to no consequence; the authored files don't refer to it, after all. |
35 |
> |
36 |
> This is necessary because a malicious third party can MITM you an rsync |
37 |
> tree with extraneous files (say, -r1 baselayout ebuild) that do horrible |
38 |
> things on your system. If you don't reject files not in Manifest, you |
39 |
> open a huge security hole. |
40 |
> |
41 |
|
42 |
Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement the |
43 |
checks in detail, but will still need to spend some time looking for the best place |
44 |
to insert the code. |
45 |
|
46 |
I think it best to address this from two fronts. On one hand rejecting extra files |
47 |
seems to have immediate benefit but the larger issue is portage exposing |
48 |
untracked potentially malicious files to the user. |
49 |
|
50 |
Has anything like a verity loopback filesystem been explored? It might reduce |
51 |
duplication of work. |
52 |
|
53 |
> > But it would be easy enough to build a second list of all files and compare it to |
54 |
> > the list of files built from the manifests. If there are extras an error can be |
55 |
> > generated. This is actually the first test I did on my manifest parsing code. I tried |
56 |
> > to see if my tracked files roughly matched the total files in tree. That can be |
57 |
> > repurposed for this check. |
58 |
> > |
59 |
> > > > Some simple tests like counting all objects traversed and verified |
60 |
> > > > returns the |
61 |
> > > > same(ish). Once it is put into portage it could be tested in detail. |
62 |
> > > > |
63 |
> > > > There is also my partial attempt at removing the brittle interface to |
64 |
> > > > GnuPG |
65 |
> > > > (it's not as if the current code is badly designed, just that parsing |
66 |
> > > > the |
67 |
> > > > output of GnuPG directly is likely not the best idea). |
68 |
> > > |
69 |
> > > The 'brittle interface' is well-defined machine-readable output. |
70 |
> > > |
71 |
> > |
72 |
> > Ok. I was aware there was a machine interface, but the classes that manipulate |
73 |
> > a temporary GPG home seemed like not the best solution. I guess that is all |
74 |
> > due to GPG assuming everything is in ~/.gnupg and keeping its state as a |
75 |
> > directory structure. |
76 |
> |
77 |
> A temporary home directory guarantees that user configuration does not |
78 |
> affect the verification result. |
79 |
> |
80 |
|
81 |
Yes, I know why it is there. The temporary construction of the directory is what |
82 |
stood out to me as messy but I guess there is no way around it. |
83 |
|
84 |
> > |
85 |
> > > > Needs gemato, dnspython, and requests. Slightly better than random code |
86 |
> > > > because |
87 |
> > > > I took inspiration from the existing gemato classes. |
88 |
> > > |
89 |
> > > The code makes a lot of brittle assumptions about the structure. The |
90 |
> > > GLEP was specifically designed to avoid that and let us adjust the |
91 |
> > > structure in the future to meet our needs. |
92 |
> > > |
93 |
> > |
94 |
> > These same assumptions are built into the code that operates on the |
95 |
> > tree structure. If the GLEP were changed the existing code would also |
96 |
> > potentially need changing. This code just uses the structure in a different |
97 |
> > way. |
98 |
> > |
99 |
> |
100 |
> The code that predates the GLEP, yes. It will eventually be changed to |
101 |
> be more flexible, especially when we can assume that we start removing |
102 |
> backwards compatibility. |
103 |
> |
104 |
|
105 |
My comment was more on the general nature of the code having to track |
106 |
what is intended. Even if parsing code (involving the filesystem?) was |
107 |
generated from a specification now the specification is "code" and still may |
108 |
not reflect what is intended. |
109 |
|
110 |
Taking liberties with the programmatic representation of the manifests |
111 |
seems appropriate. I also don't see anything in the GLEP that dictates an |
112 |
in-memory representation (and such a thing would probably be |
113 |
inappropriate). |