Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
Date: Mon, 20 Nov 2017 18:42:44
Message-Id: 1511203353.21924.1.camel@gentoo.org
In Reply to: [gentoo-dev] [RFC] GLEP 74 post-Council review update by "Michał Górny"
1 On Thu, 16 Nov 2017 at 11:19 +0100, Michał Górny
2 wrote:
3 > Hi, everyone.
4 >
5 > Here's the updated version of GLEP 74 taking into consideration
6 > the points made during the Council pre-review.
7 >
8 > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
9 > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
10 >
11
12 New changes:
13
14 9d819c9 glep-0074: Disallow filenames containing whitespace
15 4124b2f glep-0074: Explicitly specify UTF-8 encoding
16 7f9bd9f glep-0074: Include suggestions from Daniel Campbell
17
18
19 ---
20 GLEP: 74
21 Title: Full-tree verification using Manifest files
22 Author: Michał Górny <mgorny@g.o>,
23 Robin Hugh Johnson <robbat2@g.o>,
24 Ulrich Müller <ulm@g.o>
25 Type: Standards Track
26 Status: Draft
27 Version: 1
28 Created: 2017-10-21
29 Last-Modified: 2017-11-16
30 Post-History: 2017-10-26, 2017-11-16
31 Content-Type: text/x-rst
32 Requires: 59, 61
33 Replaces: 44, 58, 60
34 ---
35
36 Abstract
37 ========
38
39 This GLEP extends the Manifest file format to cover full-tree file
40 integrity and authenticity checks. The format aims to be future-proof,
41 efficient and provide means of backwards compatibility.
42
43
44 Motivation
45 ==========
46
47 The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
48 means of verifying the integrity of distfiles and package files
49 in Gentoo. Combined with OpenPGP signatures, they provide means to
50 ensure the authenticity of the covered files. However, as noted
51 in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
52 authenticity verification as they do not cover any files outside
53 the package directory. In particular, they provide multiple ways
54 for a third party to inject malicious code into the ebuild environment.
55
56 Historically, the topic of providing authenticity coverage for the whole
57 repository has been mentioned multiple times. The most noteworthy efforts
58 are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
59 They were accepted by the Council in 2010 but have never been
60 implemented. When potential implementation work started in 2017, a new
61 discussion about the specification arose. It prompted the creation
62 of a competing GLEP that would provide a redesigned alternative to
63 the old GLEPs.
64
65 This specification is designed with the following goals in mind:
66
67 1. It should provide means to ensure the authenticity of the complete
68 repository, including preventing the injection of additional files.
69
70 2. The format should be universal enough to work both for the Gentoo
71 repository and third-party repositories of different characteristics.
72
73 3. The Manifest files should be verifiable stand-alone, that is without
74 knowing any details about the underlying repository format.
75
76
77 Specification
78 =============
79
80 Manifest file format
81 --------------------
82
83 This specification reuses and extends the Manifest file format defined
84 in GLEP 44 [#GLEP44]_. For the purpose of it, the *file type* field is
85 repurposed as a generic *tag* that could also indicate additional
86 (non-checksum) metadata. Appropriately, those tags can be followed by
87 other space-separated values.
88
89 Unless specified otherwise, the paths used in the Manifest files
90 are relative to the directory containing the Manifest file. The paths
91 must not reference the parent directory (``..``).
92
93 The Manifest files use UTF-8 encoding.
94
95
96 Manifest file locations and nesting
97 -----------------------------------
98
99 The ``Manifest`` file located in the root directory of the repository
100 is called top-level Manifest, and it is used to perform the full-tree
101 verification. In order to verify the authenticity, it must be signed
102 using OpenPGP, using the armored cleartext format.
103
104 The top-level Manifest may reference sub-Manifests contained
105 in subdirectories of the repository. The sub-Manifests are traditionally
106 named ``Manifest``; however, the implementation must support arbitrary
107 names, including the possibility of multiple (split) Manifests
108 for a single directory. The sub-Manifest can only cover the files inside
109 the directory tree where it resides.
110
111 The sub-Manifest can also be signed using OpenPGP armored cleartext
112 format. However, the signature verification can be omitted since it
113 already is covered by the signed top-level Manifest.
114
115
116 Directory tree coverage
117 -----------------------
118
119 The specification provides three ways of skipping Manifest verification
120 of specific files and directories (recursively):
121
122 1. explicit ``IGNORE`` entries in Manifest files,
123
124 2. injected ignore paths via package manager configuration,
125
126 3. using names starting with a dot (``.``) which are always skipped.
127
128 All files that are not ignored must be covered by at least one
129 of the Manifests.
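
The three skipping rules above can be sketched as a single predicate. This
is only an illustration: the function name ``is_ignored`` and its parameters
are hypothetical, and it assumes that all ignore paths have already been made
relative to the top-level directory.

```python
from pathlib import PurePosixPath


def is_ignored(path, ignore_entries, injected=()):
    """Return True if a repository-relative path is skipped from verification.

    path           -- POSIX-style path relative to the top-level Manifest
    ignore_entries -- paths named by IGNORE entries (assumed to be already
                      relative to the top-level directory)
    injected       -- extra ignore paths from package manager configuration
    """
    parts = PurePosixPath(path).parts
    # Rule 3: any path component starting with a dot is always skipped.
    if any(part.startswith(".") for part in parts):
        return True
    # Rules 1 and 2: an ignore path covers itself and everything below it.
    for entry in list(ignore_entries) + list(injected):
        entry_parts = PurePosixPath(entry).parts
        if entry_parts and parts[:len(entry_parts)] == entry_parts:
            return True
    return False
```

Comparing whole path components rather than string prefixes keeps an entry
such as ``distfiles`` from accidentally matching ``distfiles2``.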
130
131 A single file may be matched by multiple identical or equivalent
132 Manifest entries, if and only if the entries have the same semantics,
133 specify the same size and the checksums common to both entries match.
134 It is an error for a single file to be matched by multiple entries
135 of different semantics, file size or checksum values. It is an error
136 to specify another entry for a file matching ``IGNORE``, or one of its
137 subdirectories.
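
A minimal sketch of the duplicate-entry rule follows. The dictionary layout
and the name ``entries_compatible`` are assumptions made for this example,
as is the reading that ``DATA`` and the deprecated file tags count as having
the same semantics.

```python
# Tags read here as semantically equivalent to DATA (an assumption).
EQUIVALENT_FILE_TAGS = {"DATA", "EBUILD", "MISC", "AUX"}


def entries_compatible(a, b):
    """Check whether two entries matching the same file may coexist.

    Each entry is a dict: {"tag": str, "size": int, "checksums": {name: hex}}.
    """
    same_semantics = a["tag"] == b["tag"] or (
        a["tag"] in EQUIVALENT_FILE_TAGS and b["tag"] in EQUIVALENT_FILE_TAGS
    )
    if not same_semantics or a["size"] != b["size"]:
        return False
    # Only the checksum algorithms present in both entries are compared.
    common = a["checksums"].keys() & b["checksums"].keys()
    return all(a["checksums"][name] == b["checksums"][name] for name in common)
```

Two entries that share no checksum algorithm are accepted by this sketch as
long as the tag semantics and the size agree.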
138
139 The file entries (except for ``IGNORE``) can be specified for regular
140 files only. Symbolic links are followed when opening files
141 and traversing directories. It is an error to specify an entry for
142 a different file type. If the tree contains files of other types
143 that are not otherwise ignored, they need to be covered by an explicit
144 ``IGNORE``.
145
146 All the local (non-``DIST``) files covered by a Manifest tree must
147 reside on the same filesystem. It is an error to specify entries
148 applying to files on another filesystem. If files or directories that
149 are not otherwise ignored reside on a different filesystem, or symbolic
150 links point to targets on a different filesystem, they must
151 be explicitly excluded via ``IGNORE``.
152
153 All paths specified in the Manifest file must consist of characters
154 corresponding to valid UTF-8 code points excluding the NULL character
155 (``U+0000``) and characters classified as whitespace in the current
156 version of the Unicode standard [#UNICODE]_. It is an error to use
157 Manifest files in directories containing files whose names contain
158 the disallowed characters.
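
The path restrictions could be checked roughly as shown below. The function
name is hypothetical, and ``str.isspace()`` is used only as an approximation:
it follows the Unicode data bundled with the Python interpreter and rejects
a few control characters beyond the Unicode whitespace class.

```python
def is_valid_manifest_path(path):
    """Approximate check of the path rules for Manifest entries."""
    if not path:
        return False
    for ch in path:
        # NULL and whitespace characters are disallowed in paths.
        if ch == "\x00" or ch.isspace():
            return False
    # Paths must not reference the parent directory.
    return ".." not in path.split("/")
```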
159
160
161 File verification
162 -----------------
163
164 When verifying a file against the Manifest, the following rules are
165 used:
166
167 1. If the file is covered directly or indirectly by an entry
168 of the ``IGNORE`` type, the verification always succeeds.
169
170 2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
171 ``MISC``, ``EBUILD`` or ``AUX`` type:
172
173 a. if the file is not present, then the verification fails,
174
175 b. if the file is present but has a different size or one
176 of the checksums does not match, the verification fails,
177
178 c. otherwise, the verification succeeds.
179
180 3. If the file is present but not listed in Manifest, the verification
181 fails.
182
183 Unless specified otherwise, the package manager must not allow using
184 any files for which the verification failed. The package manager may
185 reject any package or even the whole repository if it may refer to files
186 for which the verification failed.
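
Rule 2 above might be implemented along these lines. This is a sketch, not
a reference implementation: ``verify_file`` is a hypothetical name, and the
checksum names are assumed to have been translated to Python ``hashlib``
names beforehand.

```python
import hashlib
import os


def verify_file(path, size, checksums):
    """Return True if the file matches the expected size and checksums.

    checksums maps hashlib algorithm names (e.g. "sha256") to hex digests.
    """
    if not os.path.isfile(path):        # rule 2a: missing or not a regular file
        return False
    if os.path.getsize(path) != size:   # rule 2b: cheap size comparison first
        return False
    hashers = {name: hashlib.new(name) for name in checksums}
    with open(path, "rb") as f:
        # Read in chunks so that all digests are computed in a single pass.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return all(
        hashers[name].hexdigest() == checksums[name].lower()
        for name in checksums
    )
```

``os.path.isfile()`` follows symbolic links, which matches the rule that
symbolic links are resolved implicitly.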
187
188
189 Timestamp verification
190 ----------------------
191
192 The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
193 for attacks against tree update distribution. If such an entry
194 is present, it should be updated every time at least one
195 of the Manifests changes. Every unique timestamp value must correspond
196 to a single tree state.
197
198 During the verification process, the client should compare the timestamp
199 against the update time obtained from a local clock or a trusted time
200 source. If the comparison result indicates that the Manifest at the time
201 of receiving was already significantly outdated, the client should
202 either fail the verification or require manual confirmation from the user.
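
As an illustration, the staleness check could look as follows. It uses the
``strftime()`` format given for the ``TIMESTAMP`` tag; the seven-day
threshold and the function names are arbitrary assumptions, since the
specification does not define what counts as significantly outdated.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now=None, max_age=timedelta(days=7)):
    """Return True if the timestamp is older than max_age (hypothetical limit)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - parse_timestamp(value) > max_age
```

A client would pass the time of the sync, taken from a local clock or
a trusted time source, as ``now``.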
203
204 Furthermore, the Manifest provider may employ additional methods
205 of distributing the timestamps of recently generated Manifests
206 using a secure channel from a trusted source for exact comparison.
207 The exact details of such a solution are outside the scope of this
208 specification.
209
210 ``TIMESTAMP`` entries may also be present in sub-Manifests. Those
211 timestamps must not be newer than the timestamp of the top-level
212 Manifest (if present). This specification does not define any specific
213 use for them.
214
215
216 Modern Manifest tags
217 --------------------
218
219 The Manifest files can specify the following tags:
220
221 ``TIMESTAMP <iso8601>``
222 Specifies a timestamp of when the Manifest file was last updated.
223 The timestamp must be a valid second-precision ISO8601 extended format
224 combined date and time in UTC timezone, i.e. using the following
225 ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``. Optional.
226 The package manager can use it to detect an outdated repository
227 checkout as described in `Timestamp verification`_.
228
229 ``MANIFEST <path> <size> <checksums>...``
230 Specifies a sub-Manifest. The sub-Manifest must be verified like
231 a regular file. If the verification succeeds, the entries from
232 the sub-Manifest are included for verification as described
233 in `Manifest file locations and nesting`_.
234
235 ``IGNORE <path>``
236 Ignores a subdirectory or file from Manifest checks. If the specified
237 path is present, it and its contents are omitted from the Manifest
238 verification (always pass). *Path* must be a plain file or directory
239 path without a trailing slash, and must not contain wildcards.
240
241 ``DATA <path> <size> <checksums>...``
242 Specifies a regular file subject to Manifest verification. The file
243 is required to pass verification. Used for all files that do not match
244 any other type.
245
246 ``DIST <filename> <size> <checksums>...``
247 Specifies a distfile entry used to verify files fetched as part
248 of ``SRC_URI``. The filename must match the filename used to store
249 the fetched file as specified in the PMS [#PMS-FETCH]_. The package
250 manager must reject the fetched file if it fails verification.
251 ``DIST`` entries apply to all packages below the Manifest file
252 specifying them.
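
The tag formats listed above can be parsed with a few lines of code. The
sketch below assumes that any OpenPGP armor has already been stripped, and
it chooses to reject unknown tags, which the specification does not require.
The name ``parse_manifest_line`` and the returned dictionary layout are
hypothetical.

```python
CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST"}


def parse_manifest_line(line):
    """Parse one Manifest line into a dict, or return None for a blank line."""
    fields = line.split()
    if not fields:
        return None
    tag, args = fields[0], fields[1:]
    if tag == "TIMESTAMP":
        if len(args) != 1:
            raise ValueError("TIMESTAMP takes exactly one value")
        return {"tag": tag, "timestamp": args[0]}
    if tag == "IGNORE":
        if len(args) != 1:
            raise ValueError("IGNORE takes exactly one path")
        return {"tag": tag, "path": args[0]}
    if tag in CHECKSUM_TAGS:
        # <path> <size> followed by (algorithm, digest) pairs.
        if len(args) < 2 or len(args) % 2 != 0:
            raise ValueError("malformed %s entry" % tag)
        pairs = args[2:]
        return {
            "tag": tag,
            "path": args[0],
            "size": int(args[1]),
            "checksums": dict(zip(pairs[0::2], pairs[1::2])),
        }
    raise ValueError("unknown tag: %s" % tag)
```

The deprecated ``EBUILD``, ``MISC`` and ``AUX`` tags share the layout of
``DATA`` and could be handled by adding them to ``CHECKSUM_TAGS``.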
253
254
255 Deprecated Manifest tags
256 ------------------------
257
258 For backwards compatibility, the following tags are additionally
259 allowed at the package directory level:
260
261 ``EBUILD <filename> <size> <checksums>...``
262 Equivalent to the ``DATA`` type.
263
264 ``MISC <path> <size> <checksums>...``
265 Equivalent to the ``DATA`` type. Historically indicated that
266 the package manager may ignore a verification failure if operating
267 in non-strict mode. However, that behavior is deprecated.
268
269 ``AUX <filename> <size> <checksums>...``
270 Equivalent to the ``DATA`` type, except that the filename is relative
271 to ``files/`` subdirectory.
272
273
274 Algorithm for full-tree verification
275 ------------------------------------
276
277 In order to perform full-tree verification, the following algorithm
278 can be used:
279
280 1. Collect all files present in the repository into *present* set.
281
282 2. Start at the top-level Manifest file. Verify its OpenPGP signature.
283 Optionally verify the ``TIMESTAMP`` entry if present as specified
284 in `Timestamp verification`_. Remove the top-level Manifest
285 from the *present* set.
286
287 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
288 files according to `file verification`_ section, and include their
289 entries in the current Manifest entry list (using paths relative
290 to directories containing the Manifests).
291
292 4. Process all ``IGNORE`` entries. Remove any paths matching them
293 from the *present* set.
294
295 5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
296 and ``AUX`` entries into the *covered* set.
297
298 6. Verify the entries in *covered* set for incompatible duplicates
299 and collisions with ignored files as explained in `Directory tree
300 coverage`_.
301
302 7. Verify all the files in the union of the *present* and *covered*
303 sets, according to `file verification`_ section.
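
The steps above can be sketched in Python as shown below. This is
a simplified illustration under several assumptions: the OpenPGP signature
and ``TIMESTAMP`` checks of step 2 are left out, compressed sub-Manifests
and the single-filesystem rule are not handled, checksum names are mapped to
``hashlib`` by lowercasing, and all function names are hypothetical.

```python
import hashlib
import os

FILE_TAGS = {"DATA", "MISC", "EBUILD", "AUX"}


def file_matches(path, size, checksums):
    """Compare the size and every listed checksum of a regular file."""
    if not os.path.isfile(path) or os.path.getsize(path) != size:
        return False
    with open(path, "rb") as f:
        data = f.read()
    return all(hashlib.new(name.lower(), data).hexdigest() == digest
               for name, digest in checksums.items())


def collect(root, manifest_rel, ignores, covered):
    """Steps 3-5: gather IGNORE and file entries, recursing into sub-Manifests."""
    base = os.path.dirname(manifest_rel)
    with open(os.path.join(root, manifest_rel), encoding="utf-8") as f:
        lines = [line.split() for line in f if line.split()]
    for fields in lines:
        tag = fields[0]
        if tag == "IGNORE":
            ignores.add(os.path.normpath(os.path.join(base, fields[1])))
        elif tag == "MANIFEST" or tag in FILE_TAGS:
            prefix = "files" if tag == "AUX" else ""
            rel = os.path.normpath(os.path.join(base, prefix, fields[1]))
            entry = (int(fields[2]), dict(zip(fields[3::2], fields[4::2])))
            covered.setdefault(rel, []).append(entry)
            # Only trust a sub-Manifest that itself passes verification.
            if tag == "MANIFEST" and file_matches(os.path.join(root, rel), *entry):
                collect(root, rel, ignores, covered)


def under_ignore(rel, ignores):
    return any(rel == i or rel.startswith(i + os.sep) for i in ignores)


def verify_tree(root, top="Manifest"):
    """Return a sorted list of repository-relative paths failing verification."""
    present = set()                                   # step 1
    for dirpath, _dirs, names in os.walk(root, followlinks=True):
        for name in names:
            present.add(os.path.relpath(os.path.join(dirpath, name), root))
    present.discard(top)                              # step 2 (signature omitted)
    ignores, covered = set(), {}
    collect(root, top, ignores, covered)              # steps 3 and 5
    present = {p for p in present                     # step 4, plus dotfiles
               if not under_ignore(p, ignores)
               and not any(c.startswith(".") for c in p.split(os.sep))}
    failed = set()
    for rel in present | set(covered):                # steps 6 and 7
        entries = covered.get(rel)
        if not entries or under_ignore(rel, ignores):
            failed.add(rel)   # unlisted file, or entry colliding with IGNORE
        elif not all(file_matches(os.path.join(root, rel), size, sums)
                     for size, sums in entries):
            failed.add(rel)   # mismatch; incompatible duplicates also end up here
    return sorted(failed)
```

Checking a file against every entry that lists it means that duplicates with
conflicting sizes or checksums can never all pass, so step 6 needs no
separate pass in this sketch. A sub-Manifest that fails verification is not
read, which leaves the files below it unlisted and therefore reported.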
304
305
306 Algorithm for finding parent Manifests
307 --------------------------------------
308
309 In order to find the top-level Manifest from the current directory
310 the following algorithm can be used:
311
312 1. Store the current directory as *original* and the device ID
313 of the containing filesystem (``st_dev``) as *startdev*.
314
315 2. If the device ID of the containing filesystem (``st_dev``)
316 of the current directory differs from *startdev*, stop.
317
318 3. If the current directory contains a ``Manifest`` file:
319
320 a. If an ``IGNORE`` entry in the ``Manifest`` file covers
321 the *original* directory (or one of the parent directories), stop.
322
323 b. Otherwise, store the current directory as *last_found*.
324
325 4. If the current directory is the root system directory (``/``), stop.
326
327 5. Otherwise, enter the parent directory and jump to step 2.
328
329 Once the algorithm stops, *last_found* will contain the relevant
330 top-level Manifest. If *last_found* is null, then the directory tree
331 does not contain any valid top-level Manifest candidates and one should
332 be created in the *original* directory.
333
334 Once the top-level Manifest is found, its ``MANIFEST`` entries should
335 be used to find any sub-Manifests below the top-level Manifest,
336 up to and including the *original* directory. Note that those
337 sub-Manifests can use different filenames than ``Manifest``.
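
A possible rendering of the upward search is shown below. It covers only the
search for the top-level Manifest, not the subsequent retracing of
sub-Manifests, and it assumes an unsigned, uncompressed ``Manifest`` in each
candidate directory. The function names are hypothetical.

```python
import os


def manifest_ignores(manifest_path, target_rel):
    """True if an IGNORE entry covers target_rel or one of its parents."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                ignored = os.path.normpath(fields[1])
                if (target_rel == ignored
                        or target_rel.startswith(ignored + os.sep)):
                    return True
    return False


def find_top_level_manifest(start):
    """Return the directory holding the top-level Manifest, or None."""
    original = os.path.abspath(start)
    startdev = os.stat(original).st_dev               # step 1
    current, last_found = original, None
    while True:
        if os.stat(current).st_dev != startdev:       # step 2
            break
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):                 # step 3
            rel = os.path.relpath(original, current)
            if manifest_ignores(candidate, rel):      # step 3a
                break
            last_found = current                      # step 3b
        parent = os.path.dirname(current)
        if parent == current:                         # step 4: reached the root
            break
        current = parent                              # step 5
    return last_found
```

A ``None`` result corresponds to the case where no valid top-level Manifest
candidate exists and one should be created in the *original* directory.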
338
339
340 Checksum algorithms
341 -------------------
342
343 This section is informational only. Specifying the exact set
344 of supported algorithms is outside the scope of this specification.
345
346 The algorithm names reserved at the time of writing are:
347
348 - ``MD5`` [#MD5]_,
349 - ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
350 - ``SHA1`` [#SHS]_,
351 - ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
352 - ``WHIRLPOOL`` [#WHIRLPOOL]_,
353 - ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
354 - ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
355 - ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
356 [#STREEBOG]_.
357
358 The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
359 It is recommended that any new hashes are named after the Python
360 ``hashlib`` module algorithm names, transformed into uppercase.
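
The naming recommendation makes the translation to ``hashlib`` mostly
mechanical, as the sketch below illustrates. The helper names are
hypothetical. ``RMD160`` is mapped explicitly because it predates the
convention, and algorithms that the local ``hashlib`` build does not provide
are skipped rather than treated as an error.

```python
import hashlib

# Historical names that do not follow the "uppercased hashlib name" rule.
SPECIAL_NAMES = {"RMD160": "ripemd160"}


def manifest_name_to_hashlib(name):
    """Translate a Manifest algorithm name into a hashlib algorithm name."""
    return SPECIAL_NAMES.get(name, name.lower())


def compute_checksums(data, names):
    """Return {Manifest name: hex digest} for every algorithm available here."""
    result = {}
    for name in names:
        try:
            hasher = hashlib.new(manifest_name_to_hashlib(name), data)
        except ValueError:
            # Not supported by this Python/OpenSSL build; skip it.
            continue
        result[name] = hasher.hexdigest()
    return result
```

Whether names such as ``WHIRLPOOL`` or ``STREEBOG256`` resolve depends on
the cryptographic backend in use, so a real tool would need a fallback
implementation for them.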
361
362
363 Manifest compression
364 --------------------
365
366 The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
367 This section merely addresses interoperability issues between Manifest
368 compression and this specification.
369
370 The compressed Manifest files are required to be suffixed for their
371 compression algorithm. This suffix should be used to recognize
372 the compression and decompress Manifests transparently. The exact list
373 of algorithms and their corresponding suffixes are outside the scope
374 of this specification.
375
376 The top-level Manifest file must not be compressed. The OpenPGP
377 signature covers the uncompressed text and would itself be compressed,
378 so the data would have to be decompressed without any prior verification.
379 This could expose users e.g. to zip bombs or exploits of decompressor
380 vulnerabilities.
381
382 Whenever this specification refers to sub-Manifests, they can use any
383 names but are also required to use a specific compression suffix.
384 The ``MANIFEST`` entries are required to specify the full name including
385 compression suffix, and the verification is performed on the compressed
386 file.
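
Transparent decompression by suffix could be sketched as follows. The suffix
table is an assumption made for illustration, since the actual list of
algorithms is outside the scope of this specification, and ``open_manifest``
is a hypothetical name.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; the real list is defined elsewhere.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_manifest(path):
    """Open a possibly compressed sub-Manifest as UTF-8 text."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

The size and checksums from the ``MANIFEST`` entry still have to be checked
against the compressed bytes on disk before the file is opened this way.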
387
388 The specification permits uncompressed Manifests to exist alongside
389 their compressed counterparts, and multiple compressed formats
390 to coexist. If that is the case, the files must have the same
391 uncompressed content and the implementation is free to choose either
392 of the files using the same base name.
393
394
395 Combining multiple Manifest trees (informational)
396 -------------------------------------------------
397
398 This specification permits nesting multiple hierarchical Manifest trees.
399 In this layout, the specific directories of the Manifest tree can
400 be verified both as a part of another top-level Manifest,
401 and as an independent Manifest tree (when obtained without the parent
402 directory).
403
404 For this to work, the sub-Manifest file in the directory must also
405 satisfy the requirements for the top-level Manifest file. That is:
406
407 - it must be named ``Manifest`` and not compressed,
408
409 - it must cover all the files in this directory and its subdirectories
410 (i.e. no files from the directory tree can be covered by parent
411 Manifest),
412
413 - if authenticity verification is desired, it must be OpenPGP-signed.
414
415 It should be noted that if such a directory is a subdirectory of a valid
416 Manifest tree, the sub-Manifest needs to be valid according
417 to the top-level Manifest and the OpenPGP signature is disregarded
418 as detailed in `Manifest file locations and nesting`_. The top-level
419 behavior is exhibited only when the directory is obtained without parent
420 directories.
421
422
423 An example Manifest file (informational)
424 ----------------------------------------
425
426 An example top-level Manifest file for the Gentoo repository would have
427 the following content::
428
429 TIMESTAMP 2017-10-30T10:11:12Z
430 IGNORE distfiles
431 IGNORE local
432 IGNORE lost+found
433 IGNORE packages
434 MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
435 ...
436 MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
437 ...
438
439 An example modern Manifest (disregarding backwards compatibility)
440 for a package directory would have the following content::
441
442 DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
443 DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
444 DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
445 DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
446 DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
447 DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
448 DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
449
450
451 Rationale
452 =========
453
454 Stand-alone format
455 ------------------
456
457 The first question that needed to be asked before proceeding with
458 the design was whether the Manifest file format was supposed to be
459 stand-alone, or tightly bound to the repository format.
460
461 The stand-alone format has been selected because of its three
462 advantages:
463
464 1. It is more future-proof. If an incompatible change to the repository
465 format is introduced, only developers need to upgrade the tools
466 they use to generate the Manifests. The tools used to verify
467 the updated Manifests will continue to work.
468
469 2. It is more flexible and universal. With a dedicated tool,
470 the Manifest files can be used to sign and verify arbitrary file
471 sets.
472
473 3. It keeps the verification tool simpler. In particular, we can easily
474 write an independent verification tool that could work on any
475 distribution without needing to depend on a package manager
476 implementation or rewrite parts of it.
477
478 Designing a stand-alone format requires that the Manifest carries enough
479 information to perform the verification following all the rules specific
480 to the Gentoo repository.
481
482
483 Tree design
484 -----------
485
486 The second important point of the design was determining whether
487 the Manifest files should be structured hierarchically, or independent.
488 Both options have their advantages.
489
490 In the hierarchical model, each sub-Manifest file is covered by a higher
491 level Manifest. As a result, only the top-level Manifest has to be
492 OpenPGP-signed, and subsequent Manifests need to be only verified by
493 checksum stored in the parent Manifest. This has the following
494 implications:
495
496 - Verifying any set of files in the repository requires using checksums
497 from the most relevant Manifests and the parent Manifests.
498
499 - The OpenPGP signature of the top-level Manifest needs to be verified
500 only once per process.
501
502 - Altering any set of files requires updating the relevant Manifests,
503 and their parent Manifests up to the top-level Manifest, and signing
504 the last one.
505
506 - As a result, the top-level Manifest changes on every commit,
507 and various middle-level Manifests change (and need to be transferred)
508 frequently.
509
510 In the independent model, each sub-Manifest file is independent
511 of the parent Manifests. As a result, each of them needs to be signed
512 and verified independently. However, the parent Manifests still need
513 to list sub-Manifests (albeit without verification data) in order
514 to detect removal or replacement of subdirectories. This has
515 the following implications:
516
517 - Verifying any set of files in the repository requires using checksums
518 and verifying signatures of the most relevant Manifest files.
519
520 - Altering any set of files requires updating the relevant Manifests
521 and signing them again.
522
523 - Parent Manifests are updated only when Manifests are added or removed
524 from subdirectories. As a result, they change infrequently.
525
526 While both models have their advantages, the hierarchical model was
527 selected because it reduces the number of OpenPGP operations
528 (which are comparatively costly) to the minimum.
529
530
531 Tree layout restrictions
532 ------------------------
533
534 The algorithm is meant to work primarily with ebuild repositories which
535 normally contain only files and directories. Directories provide
536 no useful metadata for verification, and specifying special entries
537 for additional file types is purposeless. Therefore, the specification
538 is restricted to dealing with regular files.
539
540 The Gentoo repository does not use symbolic links. Some Gentoo
541 repositories do, however. To provide a simple solution for dealing with
542 symlinks without having to take care to implement special handling for
543 them, the common behavior of implicitly resolving them is used.
544 Therefore, symbolic links to files are stored as if they were regular
545 files, and symbolic links to directories are followed as if they were
546 regular directories.
547
548 Dotfiles are implicitly ignored as that is a common notion used
549 in software written for POSIX systems. All other filenames require
550 explicit ``IGNORE`` lines.
551
552 An ability to inject additional ignore entries is provided to account
553 for site configuration affecting the repository tree -- placing
554 additional files in it, skipping some of the categories from syncing.
555 This configuration can extend beyond the limits of this GLEP,
556 e.g. by allowing wildcards or regular expressions.
557
558 The algorithm is restricted to work on a single filesystem. This is
559 mostly relevant when scanning for the top-level Manifest -- we do not want
560 to cross filesystem boundaries then. However, to ensure consistent
561 bidirectional behavior we also need to forbid crossing them when
562 descending the tree.
563
564 The directories and files on different filesystems need to be ignored
565 explicitly as implicitly skipping them would cause confusion.
566 In particular, tools might then claim that a file does not exist when
567 it clearly does because it was skipped due to filesystem boundaries.
568
569
570 Filename character set restriction
571 ----------------------------------
572
573 The valid set of filename characters for the Gentoo repository
574 is restricted by the devmanual 'File Naming Rules' section
575 [#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
576 names are not restricted explicitly -- however, the PMS dependency
577 specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
578 filenames containing whitespace.
579
580 This specification aims to avoid arbitrary restrictions. For this
581 reason, the filename characters are only restricted by excluding two
582 technically problematic groups:
583
584 1. The NULL character (``U+0000``) is normally used to indicate the end
585 of a null-terminated string. Its use could therefore break programs
586 written using C. Furthermore, it is not allowed in any known
587 filesystem.
588
589 2. The whitespace characters are used to separate Manifest fields. While
590 technically it would be enough to restrict space (``U+0020``)
591 character that is normally used as the separator, all whitespace
592 characters are forbidden to avoid confusion and implementation
593 errors.
594
595 While the specification could be extended to allow such filenames
596 by using some form of escaping, there is currently no apparent need
597 for such a feature.
598
599 Historically, Portage attempted to overcome the whitespace limitation
600 by attempting to locate the size field and take everything before it
601 as filename. This was terribly fragile and even if it worked, it would
602 solve the problem only partially.
603
604 Since the same restrictions apply to ``IGNORE`` rules, it is currently
605 not possible to either list or ignore files whose names contain whitespace
606 characters. Therefore, the presence of such files is forbidden entirely.
607
608
609 File verification model
610 -----------------------
611
612 The verification model aims to provide full coverage against different
613 forms of attack. In particular, three different kinds of manipulation
614 are considered:
615
616 1. Alteration of the file content.
617
618 2. Removal of a file.
619
620 3. Addition of a new file.
621
622 In order to protect against all three, the system requires that all
623 files in the repository are listed in Manifests and verified against
624 them.
625
626 As a special case, ignores are allowed to account for directories
627 that are not part of the repository but were traditionally placed inside
628 it. Those directories were ``distfiles``, ``local`` and ``packages``. It
629 could be also used to ignore VCS directories such as ``CVS``.
630
631
632 Non-strict Manifest verification
633 --------------------------------
634
635 Originally the Manifest2 format provided a special ``MISC`` tag that
636 was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
637 indicated that the Manifest verification failures could be ignored for
638 those files unless the package manager was working in strict mode.
639
640 The first versions of this specification continued the use of this tag.
641 However, after a long debate it was decided to deprecate it along with
642 the non-strict behavior, and require all files to strictly match.
643
644 Two arguments were mentioned for the usefulness of a ``MISC`` type:
645
646 1. being able to reduce the checkout size by stripping unnecessary
647 files out, and
648
649 2. being able to update automatically generated files locally
650 without causing unnecessary verification failures.
651
652 However, the usefulness of ``MISC`` in both cases is doubtful.
653
654 The cases for stripping unnecessary files mostly focused around space
655 savings. For this purpose, stripping ``metadata.xml`` and similar files
656 has little value. It is much more common for users to strip whole
657 packages or categories. The ``MISC`` type is not suitable for that,
658 and so a dedicated package manager mechanism needs to be developed
659 instead. The same mechanism can also handle files that historically used
660 the ``MISC`` type. As an example, the package manager may choose
661 to generate both the rsync exclusion list and Manifest ignore list
662 using a single source list.
663
664 The cases for autogenerated files involve such cache files
665 as ``use.local.desc``. However, we cannot include ``md5-cache`` there
666 due to security concerns, which results in inconsistent cache handling.
667 Furthermore, the tools were historically modified to provide stable
668 output, which means that their content cannot change without
669 a non-``MISC`` content being changed first. This practically defeats
670 the purpose of using ``MISC``.
671
672 Finally, the non-strict mode could be used as a means of attack.
673 Allowing a missing or modified documentation file could be used
674 to spread misinformation, resulting in bad decisions made by the user.
675 A modified file could also be used, e.g. to exploit vulnerabilities
676 of an XML parser.
677
678
679 Timestamp field
680 ---------------
681
682 The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
683 to include a generation timestamp in the Manifest. A similar feature
684 was originally proposed in GLEP 58 [#GLEP58]_.
685
686 A malicious third party may use the principles of exclusion or replay
687 [#C08]_ to deny an update to clients, while at the same time recording
688 the identity of clients to attack. The timestamp field can be used to
689 detect that.
690
691 In order to provide more complete protection, the Gentoo Infrastructure
692 should provide an ability to obtain the timestamps of all Manifests
693 from a recent timeframe over a secure channel from a trusted source
694 for comparison.
695
696 Strictly speaking, this information is already provided by the various
697 ``metadata/timestamp*`` files that are already present. However,
698 including the value in the Manifest itself has little cost
699 and provides the ability to perform the verification stand-alone.
700
701 Furthermore, some of the timestamp files are added very late
702 in the distribution process, past the Manifest generation phase. Those
703 files will most likely receive ``IGNORE`` entries and therefore
704 be unsafe to use.
705
706 The specification permits additional timestamps in sub-Manifest files
707 for local use. A generic testing tool should ignore them.
708
709
710 New vs deprecated tags
711 ----------------------
712
713 Out of the four types defined by Manifest2, only one is reused
714 and the remaining three are replaced by a single, universal ``DATA``
715 type.
716
717 The ``DIST`` tag is reused since the specification does not change
718 anything with regard to distfile handling.
719
720 The ``EBUILD`` tag could potentially be reused for generic file
721 verification data. However, it would be confusing if all the different
722 data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
723 type was introduced as a replacement.
724
725 The ``MISC`` tag and the relevant non-strict mode have been removed
726 as being of little value, as detailed in the `Non-strict Manifest
727 verification`_ section.
728
729 The ``AUX`` tag is deprecated as it is redundant with ``DATA``, and has
730 the limiting property of an implicit ``files/`` path prefix.
731
732
733 Finding top-level Manifest
734 --------------------------
735
736 The development of a reference implementation for this GLEP has brought
737 the following problem: how to find all the relevant Manifests when
738 the Manifest tool is run inside a subdirectory of the repository?
739
740 One of the options would be to provide a bi-directional linking
741 of Manifests via a ``PARENT`` tag. However, that would not solve
742 the problem when a new Manifest file is being created.
743
744 Instead, an algorithm for iterating over parent directories is proposed.
745 Since there is no obligatory explicit indicator for the top-level
746 Manifest, the algorithm assumes that the top-level Manifest
747 is the highest ``Manifest`` in the directory hierarchy that can cover
748 the current directory. This generally makes sense since the Manifest
749 files are required to provide coverage for all subdirectories, so all
750 Manifests starting from that one need to be updated.
751
752 If independent Manifest trees are nested in the directory structure,
753 then an ``IGNORE`` entry needs to be used to separate them.
754
755 Since sub-Manifests can use any filenames, the Manifest finding
756 algorithm must not short-cut the procedure by storing all ``Manifest``
757 files along the parent directories. Instead, it needs to retrace
758 the relevant sub-Manifest files along ``MANIFEST`` entries
759 in the top-level Manifest.
760
761
762 Injecting ChangeLogs into the checkout
763 --------------------------------------
764
765 One of the problems considered in the new Manifest format was injecting
766 historical and autogenerated ChangeLogs into the repository. We normally
767 don't include those files, to reduce the checkout size. However, some
768 users have shown interest in them and Infra is working on providing them
769 via an additional rsync module.
770
771 If such files were injected into the repository, they would cause
772 verification failures of Manifests. To account for this, Infra could
773 provide ``IGNORE`` entries to allow them to exist.
774
775
776 Splitting distfile checksums from file checksums
777 ------------------------------------------------
778
779 Another problem with the current Manifest format is that the checksums
780 for fetched files are combined with checksums for local files
781 in a single file inside the package directory. It has been specifically
782 pointed out that:
783
784 - since distfiles are sometimes reused across different packages,
785 the repeating checksums are redundant [#DIST]_.
786
787 - mirror admins were interested in the possibility of verifying all
788 the distfiles with a single tool.
789
790 This specification does not provide a clean solution to this problem.
791 It technically permits moving ``DIST`` entries to higher-level Manifests
792 but the usefulness of such a solution is doubtful.
793
794 However, for the second problem we will probably deliver a dedicated
795 tool working with this Manifest format.
796
797
798 Hash algorithms
799 ---------------
800
801 While maintaining a consistent supported hash set is important
802 for interoperability, it is not a good fit for the generic layout
803 of this GLEP. Furthermore, it would require updating the GLEP
804 in the future every time the used algorithms change.
805
806 Instead, the specification focuses on listing the currently used
807 algorithm names for interoperability, and sets a recommendation
808 for consistent naming of algorithms in the future. The Python
809 ``hashlib`` module is used as a reference since it is used
810 as the provider of hash functions for most of the Python software,
811 including Portage and PkgCore.
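As an illustration of using ``hashlib`` as the naming reference, the sketch below computes one Manifest-style entry line. It assumes the Manifest2-style layout of tag, path, size and then name/value pairs; the helper ``manifest_entry``, the ``HASHLIB_NAMES`` table and the default hash set are hypothetical choices made for this example, and only a subset of algorithms is mapped.

```python
import hashlib
import os

# Manifest hash name -> hashlib algorithm name (illustrative subset only).
HASHLIB_NAMES = {
    "SHA1": "sha1",
    "SHA256": "sha256",
    "SHA512": "sha512",
    "BLAKE2B": "blake2b",
}


def manifest_entry(tag, path, hashes=("BLAKE2B", "SHA512"), rel_to="."):
    """Build a '<tag> <path> <size> <NAME> <hex>...' line for one file."""
    hashers = {name: hashlib.new(HASHLIB_NAMES[name]) for name in hashes}
    size = 0
    with open(os.path.join(rel_to, path), "rb") as f:
        # Read in chunks so that all hashes are computed in a single pass.
        for chunk in iter(lambda: f.read(65536), b""):
            size += len(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
    fields = [tag, path, str(size)]
    for name in hashes:
        fields += [name, hashers[name].hexdigest()]
    return " ".join(fields)
```

Keeping the name mapping in a single table means that a future change of the hash set touches only that table.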
812
813 The basic rules for changing hash algorithms are defined in GLEP 59
814 [#GLEP59]_. The implementations can focus only on those algorithms
815 that are actually used or planned on being used. It may be feasible
816 to devise a new GLEP that specifies the currently used hashes (or update
817 GLEP 59 accordingly).
818
819
820 Manifest compression
821 --------------------
822
823 The support for Manifest compression is introduced with minimal changes
824 to the file format. The ``MANIFEST`` entries are required to provide
825 the real (compressed) file path for compatibility with other file
826 entries and to avoid confusion.
827
828 The compression of the top-level Manifest file has been prohibited
829 as the specification currently does not provide any means of verifying
830 the file prior to decompression. If the top-level Manifest is
831 compressed, tooling will have to unpack the file before being able
832 to verify the contents. This makes it possible for a malicious third
833 party to attack the system by providing a compressed Manifest that
834 exposes decompressor vulnerabilities, or a zip bomb.
835
836 The OpenPGP cleartext signature covers the contents of the Manifest,
837 and is therefore compressed along with them. The possibility of using
838 a detached signature has been considered but it was rejected as
839 unnecessary complexity for minor gain.
840
841 Technically, a similar result could be effected via moving all the data
842 into a compressed sub-Manifest in the top directory (e.g.
843 ``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
844 in a signed, uncompressed top-level Manifest.
845
846 The existence of additional entries for uncompressed Manifest checksums
847 was debated. However, plain entries for the uncompressed file would
848 be confusing if only the compressed file existed, and conflicting
849 if both uncompressed and compressed variants existed. Furthermore,
850 it has been pointed out that ``DIST`` entries do not have an uncompressed
851 variant either.
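Because ``MANIFEST`` entries describe the compressed file itself, a tool can check a sub-Manifest before handing any data to the decompressor. A minimal sketch follows, assuming gzip compression and a single SHA512 checksum; the function name and parameters are hypothetical, and the expected values would come from the ``MANIFEST`` entry in the already verified parent Manifest.

```python
import gzip
import hashlib


def read_compressed_sub_manifest(path, expected_size, expected_sha512):
    """Verify the compressed bytes first, then decompress and decode."""
    with open(path, "rb") as f:
        raw = f.read()
    # Both checks run on the compressed data, before any decompression.
    if len(raw) != expected_size:
        raise ValueError("size mismatch for %s" % path)
    if hashlib.sha512(raw).hexdigest() != expected_sha512:
        raise ValueError("checksum mismatch for %s" % path)
    return gzip.decompress(raw).decode("utf-8")
```

With this ordering the decompressor never sees untrusted input. That guarantee is what an unverified compressed top-level Manifest could not offer.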
852
853
854 Performance considerations
855 --------------------------
856
857 Performing a full-tree verification on every sync raises some
858 performance concerns for end-user systems. The initial testing has shown
859 that a cold-cache verification on a btrfs file system can take around
860 4 minutes, with the process being mostly I/O bound. On the other hand,
861 it can be expected that the verification will be performed directly
862 after syncing, taking advantage of a warm filesystem cache.
863
864 To improve speed on I/O and/or CPU-constrained systems even further,
865 the algorithms can be easily extended to perform incremental
866 verification. Given that rsync does not preserve mtimes by default,
867 the tool can take advantage of mtime and Manifest comparisons to recheck
868 only the parts of the repository that have changed.
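One possible shape of such an incremental check is sketched below. It is an assumption-laden illustration rather than a description of any existing tool: the function ``files_to_recheck`` and the idea of a saved state keyed by relative path are hypothetical, and the state is assumed to be recorded after the last successful full verification.

```python
import os


def files_to_recheck(repo_dir, previous_state):
    """Compare (mtime, size) of every file against a saved state.

    Returns (changed, removed, new_state); only 'changed' paths need
    their checksums recomputed.
    """
    changed, new_state = [], {}
    for root, dirs, files in os.walk(repo_dir):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            rel = os.path.relpath(full, repo_dir)
            st = os.lstat(full)
            new_state[rel] = (st.st_mtime_ns, st.st_size)
            if previous_state.get(rel) != new_state[rel]:
                changed.append(rel)
    removed = sorted(set(previous_state) - set(new_state))
    return changed, removed, new_state
```

A tool would run the full verification once, store ``new_state``, and on later syncs hash only the paths reported as changed, together with any Manifest files that were themselves modified.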
869
870 Furthermore, the package manager implementations can restrict checking
871 only to the parts of the repository that are actually being used.
872
873
874 Backwards Compatibility
875 =======================
876
877 This GLEP provides optional means of preserving backwards compatibility.
878 To preserve the backwards compatibility, the following needs to hold
879 for the ``Manifest`` file in every package directory:
880
881 - all files must be covered by the single ``Manifest`` file,
882
883 - all distfiles used by the package must be included,
884
885 - all files inside the ``files/`` subdirectory need to use
886 the ``AUX`` tag (rather than ``DATA``),
887
888 - all ``.ebuild`` files need to use the ``EBUILD`` tag,
889
890 - the ``metadata.xml`` and ``ChangeLog`` files need to use
891 the ``MISC`` tag,
892
893 - the Manifest can be signed to provide authenticity verification,
894
895 - an uncompressed Manifest must always exist, and a compressed Manifest
896 of identical content may be present.
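The tag rules in the list above can be expressed as a small mapping. The sketch below is illustrative only: ``compat_entry`` is a hypothetical helper, paths are assumed to be relative to the package directory, and falling back to ``DATA`` for any other file is an assumption of this example rather than something the list prescribes.

```python
def compat_entry(path):
    """Return (tag, entry_path) for a file inside a package directory."""
    if path.startswith("files/"):
        # AUX paths carry an implicit files/ prefix, so strip it here.
        return ("AUX", path[len("files/"):])
    if path.endswith(".ebuild"):
        return ("EBUILD", path)
    if path in ("metadata.xml", "ChangeLog"):
        return ("MISC", path)
    # Assumption: anything else uses the new generic tag.
    return ("DATA", path)
```

For example, ``compat_entry("files/fix.patch")`` yields ``("AUX", "fix.patch")``, while ``compat_entry("foo-1.0.ebuild")`` yields ``("EBUILD", "foo-1.0.ebuild")``.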
897
898 Once the backwards compatibility is no longer a concern, the above
899 no longer needs to hold and the deprecated tags can be removed.
900
901
902 Reference Implementation
903 ========================
904
905 The reference implementation for this GLEP is being developed
906 as the gemato project [#GEMATO]_.
907
908
909 Credits
910 =======
911
912 Thanks to all the people whose contributions were invaluable
913 to the creation of this GLEP. This includes but is not limited to:
914
915 - Robin Hugh Johnson,
916 - Ulrich Müller.
917
918 Additionally, thanks to Robin Hugh Johnson for the original
919 MetaManifest GLEP series which served both as inspiration and source
920 of many concepts used in this GLEP. Recursively, also thanks to all
921 the people who contributed to the original GLEPs.
922
923
924 References
925 ==========
926
927 .. [#GLEP44] GLEP 44: Manifest2 format
928 (https://www.gentoo.org/glep/glep-0044.html)
929
930 .. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
931 - Overview
932 (https://www.gentoo.org/glep/glep-0057.html)
933
934 .. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
935 - Infrastructure to User distribution - MetaManifest
936 (https://www.gentoo.org/glep/glep-0058.html)
937
938 .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
939 (https://www.gentoo.org/glep/glep-0059.html)
940
941 .. [#GLEP60] GLEP 60: Manifest2 filetypes
942 (https://www.gentoo.org/glep/glep-0060.html)
943
944 .. [#GLEP61] GLEP 61: Manifest2 compression
945 (https://www.gentoo.org/glep/glep-0061.html)
946
947 .. [#UNICODE] The Unicode standard
948 (https://unicode.org/versions/latest/)
949
950 .. [#PMS-FETCH] Package Manager Specification: Dependency Specification
951 Format - SRC_URI
952 (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)
953
954 .. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
955 (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)
956
957 .. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
958 (https://www.ietf.org/rfc/rfc1321.txt)
959
960 .. [#RIPEMD160] The hash function RIPEMD-160
961 (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)
962
963 .. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
964 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)
965
966 .. [#WHIRLPOOL] The WHIRLPOOL Hash Function
967 (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)
968
969 .. [#BLAKE2] BLAKE2 -- fast secure hashing
970 (https://blake2.net/)
971
972 .. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
973 and Extendable-Output Functions
974 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)
975
976 .. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
977 (https://www.streebog.net/)
978
979 .. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
980 (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)
981
982 .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
983 at the time of writing are duplicates, representing 2 MiB
984 out of 25 MiB of DIST entries altogether.
985
986 .. [#GEMATO] gemato: Gentoo Manifest Tool
987 (https://github.com/mgorny/gemato/)
988
989 Copyright
990 =========
991 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
992 Unported License. To view a copy of this license, visit
993 http://creativecommons.org/licenses/by-sa/3.0/.
994
995 --
996 Best regards,
997 Michał Górny
