Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4]
Date: Wed, 22 Nov 2017 16:54:30
Message-Id: 1511369657.8591.1.camel@gentoo.org
In Reply to: [gentoo-dev] [RFC] GLEP 74 post-Council review update by "Michał Górny"
1 On Thu, 16 Nov 2017 at 11:19 +0100, Michał Górny
2 wrote:
3 > Hi, everyone.
4 >
5 > Here's the updated version of GLEP 74 taking into consideration
6 > the points made during the Council pre-review.
7 >
8 > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
9 > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
10 >
11 > Changes:
12 >
13
14 b3964b6 glep-0074: Recommend escaping control characters, suggested by
15 ulm
16 11f19f9 glep-0074: Provide encoding for disallowed characters
17 da2aace glep-0074: Clarify ignoring directories
18
19
20 ---
21 GLEP: 74
22 Title: Full-tree verification using Manifest files
23 Author: Michał Górny <mgorny@g.o>,
24 Robin Hugh Johnson <robbat2@g.o>,
25 Ulrich Müller <ulm@g.o>
26 Type: Standards Track
27 Status: Draft
28 Version: 1
29 Created: 2017-10-21
30 Last-Modified: 2017-11-16
31 Post-History: 2017-10-26, 2017-11-16
32 Content-Type: text/x-rst
33 Requires: 59, 61
34 Replaces: 44, 58, 60
35 ---
36
37 Abstract
38 ========
39
40 This GLEP extends the Manifest file format to cover full-tree file
41 integrity and authenticity checks. The format aims to be future-proof,
42 efficient and provide means of backwards compatibility.
43
44
45 Motivation
46 ==========
47
48 The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
49 means of verifying the integrity of distfiles and package files
50 in Gentoo. Combined with OpenPGP signatures, they provide means to
51 ensure the authenticity of the covered files. However, as noted
52 in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
53 authenticity verification as they do not cover any files outside
54 the package directory. In particular, they provide multiple ways
55 for a third party to inject malicious code into the ebuild environment.
56
57 Historically, the topic of providing authenticity coverage for the whole
58 repository has been mentioned multiple times. The most noteworthy efforts
59 are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
60 They were accepted by the Council in 2010 but have never been
61 implemented. When potential implementation work started in 2017, a new
62 discussion about the specification arose. It prompted the creation
63 of a competing GLEP that would provide a redesigned alternative to
64 the old GLEPs.
65
66 This specification is designed with the following goals in mind:
67
68 1. It should provide means to ensure the authenticity of the complete
69 repository, including preventing the injection of additional files.
70
71 2. The format should be universal enough to work both for the Gentoo
72 repository and third-party repositories of different characteristics.
73
74 3. The Manifest files should be verifiable stand-alone, that is without
75 knowing any details about the underlying repository format.
76
77
78 Specification
79 =============
80
81 Manifest file format
82 --------------------
83
84 This specification reuses and extends the Manifest file format defined
85 in GLEP 44 [#GLEP44]_. For the purpose of it, the *file type* field is
86 repurposed as a generic *tag* that could also indicate additional
87 (non-checksum) metadata. Appropriately, those tags can be followed by
88 other space-separated values.
89
90 Unless specified otherwise, the paths used in the Manifest files
91 are relative to the directory containing the Manifest file. The paths
92 must not reference the parent directory (``..``). Forward slash (``/``)
93 is used as path component separator.
94
95 The Manifest files use UTF-8 encoding.
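As a minimal sketch of the format rules above, a single Manifest line can be split into its tag and the values that follow, and a path can be checked for parent-directory references. The names ``Entry``, ``parse_line`` and ``check_path`` are hypothetical and not part of this specification; escape decoding and per-tag field validation are deliberately left out.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    tag: str                 # e.g. "DATA", "IGNORE", "TIMESTAMP"
    fields: Tuple[str, ...]  # remaining whitespace-separated values


def parse_line(line: str) -> Entry:
    """Split one Manifest line into its tag and the following values."""
    parts = line.split()  # splits on any run of whitespace
    if not parts:
        raise ValueError("empty Manifest line")
    return Entry(parts[0], tuple(parts[1:]))


def check_path(path: str) -> None:
    """Reject paths that reference the parent directory ('..')."""
    if ".." in path.split("/"):
        raise ValueError(f"path must not reference '..': {path!r}")
```

For instance, ``parse_line("IGNORE distfiles")`` yields the tag ``IGNORE`` with the single field ``distfiles``.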
96
97
98 Manifest file locations and nesting
99 -----------------------------------
100
101 The ``Manifest`` file located in the root directory of the repository
102 is called top-level Manifest, and it is used to perform the full-tree
103 verification. In order to verify the authenticity, it must be signed
104 using OpenPGP, using the armored cleartext format.
105
106 The top-level Manifest may reference sub-Manifests contained
107 in subdirectories of the repository. The sub-Manifests are traditionally
108 named ``Manifest``; however, the implementation must support arbitrary
109 names, including the possibility of multiple (split) Manifests
110 for a single directory. The sub-Manifest can only cover the files inside
111 the directory tree where it resides.
112
113 The sub-Manifest can also be signed using OpenPGP armored cleartext
114 format. However, the signature verification can be omitted since it
115 already is covered by the signed top-level Manifest.
116
117
118 Directory tree coverage
119 -----------------------
120
121 The specification provides three ways of skipping Manifest verification
122 of specific files and directories (recursively):
123
124 1. explicit ``IGNORE`` entries in Manifest files,
125
126 2. injected ignore paths via package manager configuration,
127
128 3. using names starting with a dot (``.``) which are always skipped.
129
130 All files that are not ignored must be covered by at least one
131 of the Manifests.
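The three skipping rules can be sketched as a single predicate. This is an illustration only: ``is_ignored`` is a hypothetical helper, and ``ignore_paths`` is assumed to already hold the union of ``IGNORE`` entries and any paths injected via package manager configuration, expressed relative to the same directory as ``path``.

```python
def is_ignored(path: str, ignore_paths: set) -> bool:
    """Return True if a relative path is skipped from Manifest verification."""
    components = path.split("/")
    # Names starting with a dot are always skipped, at any depth.
    if any(c.startswith(".") for c in components):
        return True
    # An ignore entry covers the path itself and everything below it,
    # so every leading prefix of the path is checked.
    for i in range(1, len(components) + 1):
        if "/".join(components[:i]) in ignore_paths:
            return True
    return False
```

Matching is done on whole path components, so an entry for ``distfiles`` does not affect a sibling directory such as ``distfiles-extra``.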
132
133 A single file may be matched by multiple identical or equivalent
134 Manifest entries, if and only if the entries have the same semantics,
135 specify the same size and the checksums common to both entries match.
136 It is an error for a single file to be matched by multiple entries
137 of different semantics, file size or checksum values. It is an error
138 to specify another entry for a file that matches ``IGNORE``, or that
139 is located inside an ignored directory.
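The compatibility rule for duplicate entries can be sketched as follows. The dict layout (keys ``tag``, ``size``, ``checksums``) is a hypothetical in-memory representation, and treating ``DATA``, ``EBUILD``, ``MISC`` and ``AUX`` as having the same semantics is an assumption of this sketch based on the deprecated tags being described as equivalent to ``DATA``.

```python
# Tags assumed to share the same semantics for this sketch.
EQUIVALENT_TAGS = {"DATA", "EBUILD", "MISC", "AUX"}


def entries_compatible(a: dict, b: dict) -> bool:
    """Check whether two entries matching the same file may coexist."""
    same_semantics = a["tag"] == b["tag"] or (
        a["tag"] in EQUIVALENT_TAGS and b["tag"] in EQUIVALENT_TAGS
    )
    if not same_semantics or a["size"] != b["size"]:
        return False
    # Only the algorithms present in both entries need to agree.
    common = a["checksums"].keys() & b["checksums"].keys()
    return all(a["checksums"][k] == b["checksums"][k] for k in common)
```

Two entries that share no checksum algorithm at all are accepted by this check as long as the tag semantics and size agree, since there is nothing to contradict.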
140
141 The file entries (except for ``IGNORE``) can be specified for regular
142 files only. Symbolic links are followed when opening files
143 and traversing directories. It is an error to specify an entry for
144 a different file type. If the tree contains files of other types
145 that are not otherwise ignored, they need to be covered by an explicit
146 ``IGNORE``.
147
148 All the local (non-``DIST``) files covered by a Manifest tree must
149 reside on the same filesystem. It is an error to specify entries
150 applying to files on another filesystem. If files or directories that
151 are not otherwise ignored reside on a different filesystem, or symbolic
152 links point to targets on a different filesystem, they must
153 be explicitly excluded via ``IGNORE``.
154
155
156 Path and filename encoding
157 --------------------------
158
159 The path fields in the Manifest file must consist of characters
160 corresponding to valid UTF-8 code points excluding the NULL character
161 (``U+0000``), the backwards slash (``\``) and characters classified
162 as whitespace in the current version of the Unicode standard
163 [#UNICODE]_.
164
165 Any of the excluded characters that are present in a path must be encoded
166 using one of the following escape sequences:
167
168 - characters in the ``U+0000`` to ``U+007F`` range can be encoded
169 as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
170 character code,
171
172 - characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
173 as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
174 character code,
175
176 - characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
177 where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
178 code.
179
180 It is invalid for backwards slash to be used in any other context,
181 and a backwards slash present in filename must be encoded. Backwards
182 slash used as path component separator should be replaced by forward
183 slash instead.
184
185 The encoding can be used for other characters as well. In particular,
186 escaping control characters is recommended to ensure that the file
187 works correctly in text editors.
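The escape sequences above can be sketched as a pair of helper functions. ``encode_path`` and ``decode_path`` are hypothetical names. The encoder uses Python's ``str.isspace()`` as an approximation of the Unicode whitespace class and additionally escapes ASCII control characters, which the text above permits; the shortest applicable form is chosen.

```python
import re

# A valid escape, or (last alternative) a backslash in any other context.
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))|(\\)"
)


def encode_path(path: str) -> str:
    """Escape NULL, backslash, whitespace and ASCII control characters."""
    out = []
    for ch in path:
        code = ord(ch)
        if ch == "\\" or ch.isspace() or code < 0x20 or code == 0x7F:
            if code <= 0x7F:
                out.append(f"\\x{code:02X}")
            elif code <= 0xFFFF:
                out.append(f"\\u{code:04X}")
            else:
                out.append(f"\\U{code:08X}")
        else:
            out.append(ch)
    return "".join(out)


def decode_path(field: str) -> str:
    """Decode escapes; reject backslashes used in any other context."""
    def repl(match):
        if match.group(4):
            raise ValueError("backslash outside of an escape sequence")
        digits = match.group(1) or match.group(2) or match.group(3)
        code = int(digits, 16)
        if match.group(1) and code > 0x7F:
            raise ValueError("\\xHH is limited to the 00..7F range")
        return chr(code)

    return _ESCAPE.sub(repl, field)
```

With these helpers a filename ``my file.txt`` would be written as ``my\x20file.txt`` in the path field.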
188
189
190 File verification
191 -----------------
192
193 When verifying a file against the Manifest, the following rules are
194 used:
195
196 1. If the file is covered directly or indirectly by an entry
197 of the ``IGNORE`` type, the verification always succeeds.
198
199 2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
200 ``MISC``, ``EBUILD`` or ``AUX`` type:
201
202 a. if the file is not present, then the verification fails,
203
204 b. if the file is present but has a different size or one
205 of the checksums does not match, the verification fails,
206
207 c. otherwise, the verification succeeds.
208
209 3. If the file is present but not listed in Manifest, the verification
210 fails.
211
212 Unless specified otherwise, the package manager must not allow using
213 any files for which the verification failed. The package manager may
214 reject any package or even the whole repository if it may refer to files
215 for which the verification failed.
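Rule 2 above can be sketched as a function checking one file against the data from its entry. This is a minimal sketch: ``verify_file`` is a hypothetical name, and ``checksums`` is assumed to map Python ``hashlib`` algorithm names (for example ``sha256``) to hexadecimal digests rather than Manifest algorithm names.

```python
import hashlib
import os


def verify_file(path: str, size: int, checksums: dict) -> bool:
    """Return True if the file exists and matches size and all checksums."""
    if not os.path.isfile(path):           # 2a: missing file fails
        return False
    if os.path.getsize(path) != size:      # 2b: size mismatch fails
        return False
    hashers = {name: hashlib.new(name) for name in checksums}
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    # 2b/2c: every listed checksum has to match.
    return all(
        hashers[name].hexdigest() == expected.lower()
        for name, expected in checksums.items()
    )
```

Checking the size first is a cheap way to reject modified files before any hashing is done.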
216
217
218 Timestamp verification
219 ----------------------
220
221 The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
222 for attacks against tree update distribution. If such an entry
223 is present, it should be updated every time at least one
224 of the Manifests changes. Every unique timestamp value must correspond
225 to a single tree state.
226
227 During the verification process, the client should compare the timestamp
228 against the update time obtained from a local clock or a trusted time
229 source. If the comparison result indicates that the Manifest at the time
230 of receiving was already significantly outdated, the client should
231 either fail the verification or require manual confirmation from
232 the user.
233
234 Furthermore, the Manifest provider may employ additional methods
235 of distributing the timestamps of recently generated Manifests
236 using a secure channel from a trusted source for exact comparison.
237 The exact details of such a solution are outside the scope of this
238 specification.
239
240 ``TIMESTAMP`` entries may also be present in sub-Manifests. Those
241 timestamps must not be newer than the timestamp of the top-level
242 Manifest (if present). This specification does not define any specific
243 use for them.
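A client-side staleness check could look like the sketch below. It parses the value using the ``strftime()`` format that this document gives for the ``TIMESTAMP`` tag. The one-day ``max_age`` default is purely an assumption for illustration, since the text above does not define what counts as significantly outdated.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse a TIMESTAMP value into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value: str, now: datetime,
                max_age: timedelta = timedelta(days=1)) -> bool:
    """Return True if the Manifest was older than max_age when received."""
    return now - parse_timestamp(value) > max_age
```

A caller would typically pass ``datetime.now(timezone.utc)`` or a time obtained from a trusted source as ``now``, and either fail or ask the user for confirmation when the function returns true.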
244
245
246 Modern Manifest tags
247 --------------------
248
249 The Manifest files can specify the following tags:
250
251 ``TIMESTAMP <iso8601>``
252 Specifies a timestamp of when the Manifest file was last updated.
253 The timestamp must be a valid second-precision ISO 8601 extended
254 format combined date and time in UTC timezone, i.e. using
255 the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
256 Optional. The package manager can use it to detect an outdated
257 repository checkout as described in `Timestamp verification`_.
258
259 ``MANIFEST <path> <size> <checksums>...``
260 Specifies a sub-Manifest. The sub-Manifest must be verified like
261 a regular file. If the verification succeeds, the entries from
262 the sub-Manifest are included for verification as described
263 in `Manifest file locations and nesting`_.
264
265 ``IGNORE <path>``
266 Ignores a subdirectory or file from Manifest checks. If the specified
267 path is present, it and its contents are omitted from the Manifest
268 verification (always pass). *Path* must be a plain file or directory
269 path without a trailing slash. Wildcards are not supported
270 and wildcard characters are interpreted literally.
271
272 ``DATA <path> <size> <checksums>...``
273 Specifies a regular file subject to Manifest verification. The file
274 is required to pass verification. Used for all files that do not match
275 any other type.
276
277 ``DIST <filename> <size> <checksums>...``
278 Specifies a distfile entry used to verify files fetched as part
279 of ``SRC_URI``. The filename must match the filename used to store
280 the fetched file as specified in the PMS [#PMS-FETCH]_. The package
281 manager must reject the fetched file if it fails verification.
282 ``DIST`` entries apply to all packages below the Manifest file
283 specifying them.
284
285
286 Deprecated Manifest tags
287 ------------------------
288
289 For backwards compatibility, the following tags are additionally
290 allowed at the package directory level:
291
292 ``EBUILD <filename> <size> <checksums>...``
293 Equivalent to the ``DATA`` type.
294
295 ``MISC <path> <size> <checksums>...``
296 Equivalent to the ``DATA`` type. Historically indicated that
297 the package manager may ignore a verification failure if operating
298 in non-strict mode. However, that behavior is deprecated.
299
300 ``AUX <filename> <size> <checksums>...``
301 Equivalent to the ``DATA`` type, except that the filename is relative
302 to the ``files/`` subdirectory.
303
304
305 Algorithm for full-tree verification
306 ------------------------------------
307
308 In order to perform full-tree verification, the following algorithm
309 can be used:
310
311 1. Collect all files present in the repository into *present* set.
312
313 2. Start at the top-level Manifest file. Verify its OpenPGP signature.
314 Optionally verify the ``TIMESTAMP`` entry if present as specified
315 in `timestamp verification`_. Remove the top-level Manifest
316 from the *present* set.
317
318 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
319 files according to the `file verification`_ section, and include
320 their entries in the current Manifest entry list (using paths
321 relative to directories containing the Manifests).
322
323 4. Process all ``IGNORE`` entries. Remove any paths matching them
324 from the *present* set.
325
326 5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
327 and ``AUX`` entries into the *covered* set.
328
329 6. Verify the entries in the *covered* set for incompatible duplicates
330 and collisions with ignored files as explained in `Directory tree
331 coverage`_.
332
333 7. Verify all the files in the union of the *present* and *covered*
334 sets, according to the `file verification`_ section.
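The set handling in steps 4 to 7 can be sketched on in-memory data as shown below. This is a simplified illustration, not a complete implementation: signature checking, sub-Manifest loading and duplicate detection are omitted, all paths are assumed to be already relative to the repository root with dotfiles and the top-level Manifest removed, and ``check_tree`` and ``verify`` are hypothetical names. Counting ``MANIFEST`` entries as covering the sub-Manifest files themselves is a choice made for this sketch.

```python
def check_tree(present: set, entries: list, verify) -> list:
    """Return a sorted list of paths that fail full-tree verification.

    present -- set of file paths found in the repository (step 1)
    entries -- list of (tag, path) tuples merged from all Manifests
    verify  -- callable(path) -> bool doing the per-file check
    """
    ignored = {path for tag, path in entries if tag == "IGNORE"}

    def is_ignored(path):
        parts = path.split("/")
        return any("/".join(parts[:i]) in ignored
                   for i in range(1, len(parts) + 1))

    # Step 4: drop everything matched by IGNORE entries.
    present = {p for p in present if not is_ignored(p)}
    # Step 5: collect files covered by checksum entries.
    covered = {path for tag, path in entries
               if tag in ("MANIFEST", "DATA", "MISC", "EBUILD", "AUX")}
    # Step 6 (partial): an entry for an ignored path is an error.
    failures = {p for p in covered if is_ignored(p)}
    # Step 7: files lacking an entry fail, as do files failing the check.
    for path in (present | covered) - failures:
        if path not in covered or not verify(path):
            failures.add(path)
    return sorted(failures)
```

Taking the union of both sets in the last step is what catches added files (present but not covered) as well as removed files (covered but not present).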
335
336
337 Algorithm for finding parent Manifests
338 --------------------------------------
339
340 In order to find the top-level Manifest from the current directory
341 the following algorithm can be used:
342
343 1. Store the current directory as *original* and the device ID
344 of the containing filesystem (``st_dev``) as *startdev*.
345
346 2. If the device ID of the containing filesystem (``st_dev``)
347 of the current directory is different from *startdev*, stop.
348
349 3. If the current directory contains a ``Manifest`` file:
350
351 a. If an ``IGNORE`` entry in the ``Manifest`` file covers
352 the *original* directory (or one of the parent directories), stop.
353
354 b. Otherwise, store the current directory as *last_found*.
355
356 4. If the current directory is the root system directory (``/``), stop.
357
358 5. Otherwise, enter the parent directory and jump to step 2.
359
360 Once the algorithm stops, *last_found* will contain the relevant
361 top-level Manifest. If *last_found* is null, then the directory tree
362 does not contain any valid top-level Manifest candidates and one should
363 be created in the *original* directory.
364
365 Once the top-level Manifest is found, its ``MANIFEST`` entries should
366 be used to find any sub-Manifests below the top-level Manifest,
367 up to and including the *original* directory. Note that those
368 sub-Manifests can use different filenames than ``Manifest``.
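The upward search can be sketched in Python as follows. The function names are hypothetical, escape sequences in ``IGNORE`` paths are not decoded, and only files named exactly ``Manifest`` are considered, as in the steps above.

```python
import os


def _ignored_paths(manifest_path: str) -> set:
    """Collect raw IGNORE paths from a Manifest file."""
    ignored = set()
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                ignored.add(fields[1])
    return ignored


def find_top_level(start: str):
    """Return the directory holding the top-level Manifest, or None."""
    original = os.path.realpath(start)
    startdev = os.stat(original).st_dev              # step 1
    current = original
    last_found = None
    while True:
        if os.stat(current).st_dev != startdev:      # step 2
            break
        manifest = os.path.join(current, "Manifest")
        if os.path.isfile(manifest):                 # step 3
            rel = os.path.relpath(original, current)
            parts = [] if rel == "." else rel.split(os.sep)
            prefixes = {"/".join(parts[:i])
                        for i in range(1, len(parts) + 1)}
            if prefixes & _ignored_paths(manifest):  # step 3a
                break
            last_found = current                     # step 3b
        parent = os.path.dirname(current)
        if parent == current:                        # step 4: reached root
            break
        current = parent                             # step 5
    return last_found
```

A ``None`` result corresponds to the case where no valid top-level Manifest candidate exists and one would be created in the starting directory.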
369
370
371 Checksum algorithms
372 -------------------
373
374 This section is informational only. Specifying the exact set
375 of supported algorithms is outside the scope of this specification.
376
377 The algorithm names reserved at the time of writing are:
378
379 - ``MD5`` [#MD5]_,
380 - ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
381 - ``SHA1`` [#SHS]_,
382 - ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
383 - ``WHIRLPOOL`` [#WHIRLPOOL]_,
384 - ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
385 - ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
386 - ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
387 [#STREEBOG]_.
388
389 The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
390 It is recommended that any new hashes are named after the Python
391 ``hashlib`` module algorithm names, transformed into uppercase.
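Under the naming recommendation above, a tool could derive the ``hashlib`` name by lowercasing the Manifest name. The sketch below builds the ``<size> <checksums>...`` part of an entry that way. ``manifest_checksums`` is a hypothetical helper, and the lowercase mapping is only assumed to hold for names that follow the recommendation (such as ``SHA256``, ``SHA512``, ``BLAKE2B`` or ``SHA3_256``); older names such as ``RMD160`` would need an explicit mapping.

```python
import hashlib


def manifest_checksums(data: bytes, names=("SHA256", "SHA512")) -> str:
    """Return the '<size> <checksums>...' portion of an entry for data."""
    fields = [str(len(data))]
    for name in names:
        # Assumes the Manifest name is the hashlib name in uppercase.
        digest = hashlib.new(name.lower(), data).hexdigest()
        fields += [name, digest]
    return " ".join(fields)
```

Prefixing the result with a tag and path, for example ``DATA metadata.xml``, gives a complete entry line.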
392
393
394 Manifest compression
395 --------------------
396
397 The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
398 This section merely addresses interoperability issues between Manifest
399 compression and this specification.
400
401 The compressed Manifest files are required to be suffixed for their
402 compression algorithm. This suffix should be used to recognize
403 the compression and decompress Manifests transparently. The exact list
404 of algorithms and their corresponding suffixes are outside the scope
405 of this specification.
406
407 The top-level Manifest file must not be compressed. Since the OpenPGP
408 signature covers the uncompressed text and is compressed itself,
409 the data would have to be decompressed without any prior verification.
410 This could expose users e.g. to zip bombs or exploits on decompressor
411 vulnerabilities.
412
413 Whenever this specification refers to sub-Manifests, they can use any
414 names but are also required to use a specific compression suffix.
415 The ``MANIFEST`` entries are required to specify the full name including
416 compression suffix, and the verification is performed on the compressed
417 file.
418
419 The specification permits uncompressed Manifests to exist alongside
420 their compressed counterparts, and multiple compressed formats
421 to coexist. If that is the case, the files must have the same
422 uncompressed content and the implementation is free to choose either
423 of the files using the same base name.
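Transparent decompression by suffix might be sketched as below. The suffix table is hypothetical, since the list of algorithms and suffixes is outside the scope of this specification; ``.gz`` is used only because it appears in the example Manifest later in this document. The checksum verification of the compressed file is assumed to have been done before this function is called.

```python
import gzip

# Hypothetical suffix table; the real list is defined elsewhere.
OPENERS = {".gz": gzip.open}


def read_manifest(path: str) -> str:
    """Return the text of a Manifest, decompressing based on its suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            with opener(path, "rt", encoding="utf-8") as f:
                return f.read()
    # No known compression suffix: treat as plain UTF-8 text.
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Keeping the mapping in one table makes it straightforward to add further suffixes once they are defined.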
424
425
426 Combining multiple Manifest trees (informational)
427 -------------------------------------------------
428
429 This specification permits nesting multiple hierarchical Manifest trees.
430 In this layout, the specific directories of the Manifest tree can
431 be verified both as a part of another top-level Manifest,
432 and as an independent Manifest tree (when obtained without the parent
433 directory).
434
435 For this to work, the sub-Manifest file in the directory must also
436 satisfy the requirements for the top-level Manifest file. That is:
437
438 - it must be named ``Manifest`` and not compressed,
439
440 - it must cover all the files in this directory and its subdirectories
441 (i.e. no files from the directory tree can be covered by parent
442 Manifest),
443
444 - if authenticity verification is desired, it must be OpenPGP-signed.
445
446 It should be noted that if such a directory is a subdirectory of a valid
447 Manifest tree, the sub-Manifest needs to be valid according
448 to the top-level Manifest and the OpenPGP signature is disregarded
449 as detailed in `Manifest file locations and nesting`_. The top-level
450 behavior is exhibited only when the directory is obtained without parent
451 directories.
452
453
454 An example Manifest file (informational)
455 ----------------------------------------
456
457 An example top-level Manifest file for the Gentoo repository would have
458 the following content::
459
460 TIMESTAMP 2017-10-30T10:11:12Z
461 IGNORE distfiles
462 IGNORE local
463 IGNORE lost+found
464 IGNORE packages
465 MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
466 ...
467 MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
468 ...
469
470 An example modern Manifest (disregarding backwards compatibility)
471 for a package directory would have the following content::
472
473 DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
474 DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
475 DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
476 DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
477 DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
478 DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
479 DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
480
481
482 Rationale
483 =========
484
485 Stand-alone format
486 ------------------
487
488 The first question that needed to be asked before proceeding with
489 the design was whether the Manifest file format was supposed to be
490 stand-alone, or tightly bound to the repository format.
491
492 The stand-alone format has been selected because of its three
493 advantages:
494
495 1. It is more future-proof. If an incompatible change to the repository
496 format is introduced, only developers need to upgrade the tools
497 they use to generate the Manifests. The tools used to verify
498 the updated Manifests will continue to work.
499
500 2. It is more flexible and universal. With a dedicated tool,
501 the Manifest files can be used to sign and verify arbitrary file
502 sets.
503
504 3. It keeps the verification tool simpler. In particular, we can easily
505 write an independent verification tool that could work on any
506 distribution without needing to depend on a package manager
507 implementation or rewrite parts of it.
508
509 Designing a stand-alone format requires that the Manifest carries enough
510 information to perform the verification following all the rules specific
511 to the Gentoo repository.
512
513
514 Tree design
515 -----------
516
517 The second important point of the design was determining whether
518 the Manifest files should be structured hierarchically or independently.
519 Both options have their advantages.
520
521 In the hierarchical model, each sub-Manifest file is covered by a higher
522 level Manifest. As a result, only the top-level Manifest has to be
523 OpenPGP-signed, and subsequent Manifests need to be only verified by
524 checksum stored in the parent Manifest. This has the following
525 implications:
526
527 - Verifying any set of files in the repository requires using checksums
528 from the most relevant Manifests and the parent Manifests.
529
530 - The OpenPGP signature of the top-level Manifest needs to be verified
531 only once per process.
532
533 - Altering any set of files requires updating the relevant Manifests,
534 and their parent Manifests up to the top-level Manifest, and signing
535 the last one.
536
537 - As a result, the top-level Manifest changes on every commit,
538 and various middle-level Manifests change (and need to be transferred)
539 frequently.
540
541 In the independent model, each sub-Manifest file is independent
542 of the parent Manifests. As a result, each of them needs to be signed
543 and verified independently. However, the parent Manifests still need
544 to list sub-Manifests (albeit without verification data) in order
545 to detect removal or replacement of subdirectories. This has
546 the following implications:
547
548 - Verifying any set of files in the repository requires using checksums
549 and verifying signatures of the most relevant Manifest files.
550
551 - Altering any set of files requires updating the relevant Manifests
552 and signing them again.
553
554 - Parent Manifests are updated only when Manifests are added or removed
555 from subdirectories. As a result, they change infrequently.
556
557 While both models have their advantages, the hierarchical model was
558 selected because it reduces the number of OpenPGP operations
559 (which are comparatively costly) to the minimum.
560
561
562 Tree layout restrictions
563 ------------------------
564
565 The algorithm is meant to work primarily with ebuild repositories which
566 normally contain only files and directories. Directories provide
567 no useful metadata for verification, and specifying special entries
568 for additional file types is purposeless. Therefore, the specification
569 is restricted to dealing with regular files.
570
571 The Gentoo repository does not use symbolic links. Some Gentoo
572 repositories do, however. To provide a simple solution for dealing with
573 symlinks without having to take care to implement special handling for
574 them, the common behavior of implicitly resolving them is used.
575 Therefore, symbolic links to files are stored as if they were regular
576 files, and symbolic links to directories are followed as if they were
577 regular directories.
578
579 Dotfiles are implicitly ignored as that is a common notion used
580 in software written for POSIX systems. All other filenames require
581 explicit ``IGNORE`` lines.
582
583 An ability to inject additional ignore entries is provided to account
584 for site configuration affecting the repository tree -- placing
585 additional files in it, skipping some of the categories from syncing.
586 This configuration can extend beyond the limits of this GLEP,
587 e.g. by allowing wildcards or regular expressions.
588
589 The algorithm is restricted to work on a single filesystem. This is
590 mostly relevant when scanning for top-level Manifest -- we do not want
591 to cross filesystem boundaries then. However, to ensure consistent
592 bidirectional behavior we also need to forbid crossing them when
593 traversing down the tree.
594
595 The directories and files on different filesystems need to be ignored
596 explicitly as implicitly skipping them would cause confusion.
597 In particular, tools might then claim that a file does not exist when
598 it clearly does because it was skipped due to filesystem boundaries.
599
600
601 Filename character set restriction
602 ----------------------------------
603
604 The valid set of filename characters for the Gentoo repository
605 is restricted by the devmanual 'File Naming Rules' section
606 [#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
607 names are not restricted explicitly -- however, the PMS dependency
608 specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
609 filenames containing whitespace.
610
611 This specification aims to avoid arbitrary restrictions. For this
612 reason, filename characters are only restricted by excluding three
613 technically problematic groups:
614
615 1. The NULL character (``U+0000``) is normally used to indicate the end
616 of a null-terminated string. Its use could therefore break programs
617 written using C. Furthermore, it is not allowed in any known
618 filesystem.
619
620 2. The backwards slash character (``\``) is used as path separator
621 on Windows systems, so it's extremely unlikely to be used in real
622 filenames. For this reason it is used to implement character
623 encoding with minimal risk of breaking backwards compatibility.
624
625 3. Whitespace characters are used to separate Manifest fields
626 and entries. While technically it would be enough to restrict space
627 (``U+0020``) character that is normally used as the separator
628 and newline (``U+000A``) character that is used to separate lines,
629 all whitespace characters are forbidden to avoid confusion
630 and implementation errors.
631
632 Historically, Portage attempted to overcome the whitespace limitation
633 by attempting to locate the size field and take everything before it
634 as filename. This was terribly fragile and even if it worked, it would
635 solve the problem only partially.
636
637 The character encoding method provides means to overcome the character
638 restrictions to extend the tool usability beyond immediate Gentoo uses.
639 The backslash escape form based on Python unicode strings is used
640 since it can encode all characters within the Unicode range, the syntax
641 is familiar to many programmers and the backwards slash character
642 is extremely unlikely to appear in real filenames.
643
644 Syntax is limited to the minimum necessary to implement the encoding.
645 Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
646 complexity, and to reduce the risk of shell users using backslash
647 to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
648 range to avoid ambiguity of higher values which might be interpreted
649 either as UCS-2 code points or part of a UTF-8 encoded character.
650
651 Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
652 UTF-8 string to simplify the implementation. In particular, it makes it
653 possible to process the Manifest file as UTF-8 encoded text without
654 having to perform additional UTF-8 decoding (and verification)
655 of the escaped data.
656
657 URL-encoding was considered as an alternative. However, it could collide
658 with ``DIST`` entries that are implicitly named after the URL filename
659 part where URL-encoding is pretty common.
660
661
662 File verification model
663 -----------------------
664
665 The verification model aims to provide full coverage against different
666 forms of attack. In particular, three different kinds of manipulation
667 are considered:
668
669 1. Alteration of the file content.
670
671 2. Removal of a file.
672
673 3. Addition of a new file.
674
675 In order to protect against all three, the system requires that all
676 files in the repository are listed in Manifests and verified against
677 them.
678
679 As a special case, ignores are allowed to account for directories
680 that are not part of the repository but were traditionally placed inside
681 it. Those directories were ``distfiles``, ``local`` and ``packages``. It
682 could also be used to ignore VCS directories such as ``CVS``.
683
684
685 Non-strict Manifest verification
686 --------------------------------
687
688 Originally the Manifest2 format provided a special ``MISC`` tag that
689 was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
690 indicated that the Manifest verification failures could be ignored for
691 those files unless the package manager was working in strict mode.
692
693 The first versions of this specification continued the use of this tag.
694 However, after a long debate it was decided to deprecate it along with
695 the non-strict behavior, and require all files to strictly match.
696
697 Two arguments were mentioned for the usefulness of a ``MISC`` type:
698
699 1. being able to reduce the checkout size by stripping unnecessary
700 files out, and
701
702 2. being able to update automatically generated files locally
703 without causing unnecessary verification failures.
704
705 However, the usefulness of ``MISC`` in both cases is doubtful.
706
707 The cases for stripping unnecessary files mostly focused around space
708 savings. For this purpose, stripping ``metadata.xml`` and similar files
709 has little value. It is much more common for users to strip whole
710 packages or categories. The ``MISC`` type is not suitable for that,
711 and so a dedicated package manager mechanism needs to be developed
712 instead. The same mechanism can also handle files that historically used
713 the ``MISC`` type. As an example, the package manager may choose
714 to generate both the rsync exclusion list and Manifest ignore list
715 using a single source list.
716
717 The cases for autogenerated files involve such cache files
718 as ``use.local.desc``. However, we cannot include ``md5-cache`` there
719 due to security concerns which results in inconsistent cache handling.
720 Furthermore, the tools were historically modified to provide stable
721 output which means that their content cannot change without
722 a non-``MISC`` content being changed first. This practically defeats
723 the purpose of using ``MISC``.
724
725 Finally, the non-strict mode could be used as means to an attack.
726 Allowing missing or modified documentation files could be used
727 to spread misinformation, resulting in bad decisions made by the user.
728 A modified file could also be used, e.g. to exploit vulnerabilities
729 of an XML parser.
730
731
732 Timestamp field
733 ---------------
734
735 The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
736 to include a generation timestamp in the Manifest. A similar feature
737 was originally proposed in GLEP 58 [#GLEP58]_.
738
739 A malicious third party may use the principles of exclusion or replay
740 [#C08]_ to deny an update to clients, while at the same time recording
741 the identity of clients to attack. The timestamp field can be used to
742 detect that.
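
As an illustration of how a client might use the field, here is a minimal sketch that extracts the ``TIMESTAMP`` value and flags a Manifest that is older than a chosen limit. It assumes the value is written as an ISO 8601 UTC time such as ``2017-11-16T10:19:00Z``; the function names, the one-day default and the choice to treat a missing timestamp as stale are hypothetical policy choices, not part of the specification.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(manifest_text):
    """Return the TIMESTAMP value as an aware UTC datetime, or None if absent."""
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "TIMESTAMP":
            # Assumed format: YYYY-MM-DDTHH:MM:SSZ (UTC).
            parsed = datetime.strptime(fields[1], "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


def is_stale(manifest_text, now, max_age=timedelta(days=1)):
    """True if the timestamp is missing or older than max_age (assumed policy)."""
    ts = parse_timestamp(manifest_text)
    return ts is None or now - ts > max_age
```

A stale result does not prove an attack; it only signals that the local copy has not been refreshed within the expected window.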
743
744 In order to provide more complete protection, the Gentoo Infrastructure
745 should provide an ability to obtain the timestamps of all Manifests
746 from a recent timeframe over a secure channel from a trusted source
747 for comparison.
748
749 Strictly speaking, this information is provided by the various
750 ``metadata/timestamp*`` files that are already present. However,
751 including the value in the Manifest itself has little cost
752 and provides the ability to perform the verification stand-alone.
753
754 Furthermore, some of the timestamp files are added very late
755 in the distribution process, past the Manifest generation phase. Those
756 files will most likely receive ``IGNORE`` entries and therefore
757 be unsafe to use.
758
759 The specification permits additional timestamps in sub-Manifest files
760 for local use. A generic testing tool should ignore them.
761
762
763 New vs deprecated tags
764 ----------------------
765
766 Out of the four types defined by Manifest2, only one is reused
767 and the remaining three are replaced by a single, universal ``DATA``
768 type.
769
770 The ``DIST`` tag is reused since the specification does not change
771 anything with regard to distfile handling.
772
773 The ``EBUILD`` tag could potentially be reused for generic file
774 verification data. However, it would be confusing if all the different
775 data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
776 type was introduced as a replacement.
777
778 The ``MISC`` tag and the related non-strict mode have been removed
779 as being of little value, as detailed in the `Non-strict Manifest
780 verification`_ section.
781
782 The ``AUX`` tag is deprecated as it is redundant with ``DATA``, and has
783 the limiting property of an implicit ``files/`` path prefix.
784
785
786 Finding top-level Manifest
787 --------------------------
788
789 The development of a reference implementation for this GLEP has brought
790 the following problem: how to find all the relevant Manifests when
791 the Manifest tool is run inside a subdirectory of the repository?
792
793 One of the options would be to provide a bi-directional linking
794 of Manifests via a ``PARENT`` tag. However, that would not solve
795 the problem when a new Manifest file is being created.
796
797 Instead, an algorithm for iterating over parent directories is proposed.
798 Since there is no obligatory explicit indicator for the top-level
799 Manifest, the algorithm assumes that the top-level Manifest
800 is the highest ``Manifest`` in the directory hierarchy that can cover
801 the current directory. This generally makes sense since the Manifest
802 files are required to provide coverage for all subdirectories, so all
803 Manifests starting from that one need to be updated.
804
805 If independent Manifest trees are nested in the directory structure,
806 then an ``IGNORE`` entry needs to be used to separate them.
807
808 Since sub-Manifests can use any filenames, the Manifest finding
809 algorithm must not short-cut the procedure by storing all ``Manifest``
810 files along the parent directories. Instead, it needs to retrace
811 the relevant sub-Manifest files along ``MANIFEST`` entries
812 in the top-level Manifest.
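
The parent-directory iteration can be sketched roughly as follows. This is a simplified illustration, not the reference implementation: it assumes ``IGNORE`` entries have the form ``IGNORE <path>``, it does not retrace sub-Manifests via ``MANIFEST`` entries, and the helper names are hypothetical.

```python
import os


def ignores(manifest_path, rel_path):
    """True if the Manifest has an IGNORE entry covering rel_path."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                ignored = fields[1].rstrip("/")
                if rel_path == ignored or rel_path.startswith(ignored + "/"):
                    return True
    return False


def find_top_level_manifest(start_dir):
    """Walk up from start_dir and return the highest covering Manifest, or None."""
    start = os.path.abspath(start_dir)
    current = start
    found = None
    while True:
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):
            rel = os.path.relpath(start, current)
            if rel != "." and ignores(candidate, rel.replace(os.sep, "/")):
                break  # an independent Manifest tree starts below this one
            found = candidate
        parent = os.path.dirname(current)
        if parent == current:
            break  # reached the filesystem root
        current = parent
    return found
```

The walk keeps going after the first hit because the highest covering ``Manifest`` wins; it stops only at the filesystem root or at a parent Manifest that explicitly ignores the starting directory.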
813
814
815 Injecting ChangeLogs into the checkout
816 --------------------------------------
817
818 One of the problems considered in the new Manifest format was injecting
819 historical and autogenerated ChangeLogs into the repository. We normally
820 don't include those files, to reduce the checkout size. However, some
821 users have shown interest in them and Infra is working on providing them
822 via an additional rsync module.
823
824 If such files were injected into the repository, they would cause
825 verification failures of Manifests. To account for this, Infra could
826 provide ``IGNORE`` entries to allow them to exist.
827
828
829 Splitting distfile checksums from file checksums
830 ------------------------------------------------
831
832 Another problem with the current Manifest format is that the checksums
833 for fetched files are combined with checksums for local files
834 in a single file inside the package directory. It has been specifically
835 pointed out that:
836
837 - since distfiles are sometimes reused across different packages,
838 the repeating checksums are redundant [#DIST]_.
839
840 - mirror admins were interested in the possibility of verifying all
841 the distfiles with a single tool.
842
843 This specification does not provide a clean solution to this problem.
844 It technically permits moving ``DIST`` entries to higher-level Manifests
845 but the usefulness of such a solution is doubtful.
846
847 However, for the second problem we will probably deliver a dedicated
848 tool working with this Manifest format.
849
850
851 Hash algorithms
852 ---------------
853
854 While maintaining a consistent supported hash set is important
855 for interoperability, it is not a good fit for the generic layout
856 of this GLEP. Furthermore, it would require updating the GLEP
857 in the future every time the used algorithms change.
858
859 Instead, the specification focuses on listing the currently used
860 algorithm names for interoperability, and sets a recommendation
861 for consistent naming of algorithms in the future. The Python
862 ``hashlib`` module is used as a reference since it is used
863 as the provider of hash functions for most of the Python software,
864 including Portage and PkgCore.
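
To illustrate the naming convention, the sketch below derives the ``hashlib`` name by lowercasing the Manifest hash name. This is an assumption that holds for names such as ``SHA512`` and ``BLAKE2B`` used here; other names may need an explicit mapping. The function name is hypothetical.

```python
import hashlib


def manifest_hashes(data, names=("SHA512", "BLAKE2B")):
    """Return {Manifest hash name: hex digest} for the given bytes."""
    # Assumption: the hashlib name is the lowercased Manifest name.
    return {name: hashlib.new(name.lower(), data).hexdigest() for name in names}
```

Calling ``manifest_hashes(b"...")`` on a file's contents yields the values that would follow the corresponding hash names in an entry.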
865
866 The basic rules for changing hash algorithms are defined in GLEP 59
867 [#GLEP59]_. The implementations can focus only on those algorithms
868 that are actually used or planned on being used. It may be feasible
869 to devise a new GLEP that specifies the currently used hashes (or update
870 GLEP 59 accordingly).
871
872
873 Manifest compression
874 --------------------
875
876 The support for Manifest compression is introduced with minimal changes
877 to the file format. The ``MANIFEST`` entries are required to provide
878 the real (compressed) file path for compatibility with other file
879 entries and to avoid confusion.
880
881 Compression of the top-level Manifest file has been prohibited
882 as the specification currently does not provide any means of verifying
883 the file prior to decompression. If the top-level Manifest is
884 compressed, tooling will have to unpack the file before being able
885 to verify the contents. This makes it possible for a malicious third
886 party to attack the system by providing a compressed Manifest that
887 exposes decompressor vulnerabilities, or a zip bomb.
888
889 The OpenPGP cleartext signature covers the contents of the Manifest,
890 and is therefore compressed along with them. The possibility of using
891 a detached signature has been considered but it was rejected as
892 unnecessary complexity for minor gain.
893
894 Technically, a similar result could be effected via moving all the data
895 into a compressed sub-Manifest in the top directory (e.g.
896 ``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
897 in a signed, uncompressed top-level Manifest.
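
The ordering this layout permits (verify first, decompress second) might look like the sketch below. It assumes the size and SHA512 value have already been read from the ``MANIFEST`` entry in the verified top-level Manifest, and that the sub-Manifest is gzip-compressed; the function name and error handling are hypothetical.

```python
import gzip
import hashlib


def load_compressed_manifest(compressed, expected_size, expected_sha512):
    """Check the compressed bytes against the MANIFEST entry, then decompress."""
    if len(compressed) != expected_size:
        raise ValueError("size mismatch")
    if hashlib.sha512(compressed).hexdigest() != expected_sha512:
        raise ValueError("checksum mismatch")
    # Only data that matched the signed entry reaches the decompressor.
    return gzip.decompress(compressed).decode("utf-8")
```

Because the checksum covers the compressed file, a tampered archive is rejected before any decompression code runs.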
898
899 The existence of additional entries for uncompressed Manifest checksums
900 was debated. However, plain entries for the uncompressed file would
901 be confusing if only the compressed file existed, and conflicting
902 if both uncompressed and compressed variants existed. Furthermore,
903 it has been pointed out that ``DIST`` entries do not have
904 an uncompressed variant either.
905
906
907 Performance considerations
908 --------------------------
909
910 Performing a full-tree verification on every sync raises some
911 performance concerns for end-user systems. The initial testing has shown
912 that a cold-cache verification on a btrfs file system can take around
913 4 minutes, with the process being mostly I/O bound. On the other hand,
914 it can be expected that the verification will be performed directly
915 after syncing, taking advantage of a warm filesystem cache.
916
917 To improve speed on I/O- and/or CPU-constrained systems even further,
918 the algorithms can be easily extended to perform incremental
919 verification. Given that rsync does not preserve mtimes by default,
920 the tool can take advantage of mtime and Manifest comparisons to recheck
921 only the parts of the repository that have changed.
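
The mtime half of such an incremental check could be sketched as below. It is only an illustration under stated assumptions: ``last_verified`` is a hypothetical epoch timestamp recorded by the tool after the previous successful run, and the sketch does not detect removed files or perform the Manifest comparison itself.

```python
import os


def files_changed_since(root, last_verified):
    """Yield files under root whose mtime is newer than last_verified."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime > last_verified:
                yield path
```

Only the yielded paths, plus any entries whose Manifest data changed, would then need to be rehashed.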
922
923 Furthermore, the package manager implementations can restrict checking
924 only to the parts of the repository that are actually being used.
925
926
927 Backwards Compatibility
928 =======================
929
930 This GLEP provides optional means of preserving backwards compatibility.
931 To preserve the backwards compatibility, the following needs to hold
932 for the ``Manifest`` file in every package directory:
933
934 - all files must be covered by the single ``Manifest`` file,
935
936 - all distfiles used by the package must be included,
937
938 - all files inside the ``files/`` subdirectory need to use
939 the ``AUX`` tag (rather than ``DATA``),
940
941 - all ``.ebuild`` files need to use the ``EBUILD`` tag,
942
943 - the ``metadata.xml`` and ``ChangeLog`` files need to use
944 the ``MISC`` tag,
945
946 - the Manifest can be signed to provide authenticity verification,
947
948 - an uncompressed Manifest must always exist, and a compressed Manifest
949 of identical content may be present.
950
951 Once the backwards compatibility is no longer a concern, the above
952 no longer needs to hold and the deprecated tags can be removed.
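
The tag rules listed above can be summarised in a small sketch. The function name is hypothetical, paths are taken relative to the package directory, and the fallback to ``DATA`` for any other file is an assumption made for illustration.

```python
def compat_entry(rel_path):
    """Return (tag, path) for a file relative to the package directory."""
    if rel_path.startswith("files/"):
        # AUX paths carry an implicit files/ prefix, so it is stripped here.
        return "AUX", rel_path[len("files/"):]
    if rel_path.endswith(".ebuild"):
        return "EBUILD", rel_path
    if rel_path in ("metadata.xml", "ChangeLog"):
        return "MISC", rel_path
    return "DATA", rel_path  # assumed fallback, not part of the list above
```

For example, ``compat_entry("files/fix.patch")`` yields the ``AUX`` tag with the path ``fix.patch``.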
953
954
955 Reference Implementation
956 ========================
957
958 The reference implementation for this GLEP is being developed
959 as the gemato project [#GEMATO]_.
960
961
962 Credits
963 =======
964
965 Thanks to all the people whose contributions were invaluable
966 to the creation of this GLEP. This includes but is not limited to:
967
968 - Robin Hugh Johnson,
969 - Ulrich Müller.
970
971 Additionally, thanks to Robin Hugh Johnson for the original
972 MetaManifest GLEP series which served both as inspiration and source
973 of many concepts used in this GLEP. Recursively, also thanks to all
974 the people who contributed to the original GLEPs.
975
976
977 References
978 ==========
979
980 .. [#GLEP44] GLEP 44: Manifest2 format
981 (https://www.gentoo.org/glep/glep-0044.html)
982
983 .. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
984 - Overview
985 (https://www.gentoo.org/glep/glep-0057.html)
986
987 .. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
988 - Infrastructure to User distribution - MetaManifest
989 (https://www.gentoo.org/glep/glep-0058.html)
990
991 .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
992 (https://www.gentoo.org/glep/glep-0059.html)
993
994 .. [#GLEP60] GLEP 60: Manifest2 filetypes
995 (https://www.gentoo.org/glep/glep-0060.html)
996
997 .. [#GLEP61] GLEP 61: Manifest2 compression
998 (https://www.gentoo.org/glep/glep-0061.html)
999
1000 .. [#UNICODE] The Unicode standard
1001 (https://unicode.org/versions/latest/)
1002
1003 .. [#PMS-FETCH] Package Manager Specification: Dependency Specification
1004 Format - SRC_URI
1005 (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)
1006
1007 .. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
1008 (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)
1009
1010 .. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
1011 (https://www.ietf.org/rfc/rfc1321.txt)
1012
1013 .. [#RIPEMD160] The hash function RIPEMD-160
1014 (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)
1015
1016 .. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
1017 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)
1018
1019 .. [#WHIRLPOOL] The WHIRLPOOL Hash Function
1020 (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)
1021
1022 .. [#BLAKE2] BLAKE2 -- fast secure hashing
1023 (https://blake2.net/)
1024
1025 .. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
1026 and Extendable-Output Functions
1027 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)
1028
1029 .. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
1030 (https://www.streebog.net/)
1031
1032 .. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
1033 (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)
1034
1035 .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
1036 at the time of writing are duplicate, representing 2 MiB
1037 out of 25 MiB of DIST entries altogether.
1038
1039 .. [#GEMATO] gemato: Gentoo Manifest Tool
1040 (https://github.com/mgorny/gemato/)
1041
1042
1043 Copyright
1044 =========
1045 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
1046 Unported License. To view a copy of this license, visit
1047 http://creativecommons.org/licenses/by-sa/3.0/.
1048
1049 --
1050 Best regards,
1051 Michał Górny
