Gentoo Archives: gentoo-dev

From: "Michał Górny" <mgorny@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5]
Date: Thu, 23 Nov 2017 20:54:10
Message-Id: 1511470437.15571.1.camel@gentoo.org
In Reply to: [gentoo-dev] [RFC] GLEP 74 post-Council review update by "Michał Górny"
1 On Thu, 16.11.2017 at 11:19 +0100, Michał Górny
2 wrote:
3 > Hi, everyone.
4 >
5 > Here's the updated version of GLEP 74 taking into consideration
6 > the points made during the Council pre-review.
7 >
8 > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
9 > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
10 >
11 > Changes:
12
13 27c2a9e glep-0074: Grammar corrections from Ulrich Müller
14 d39f865 glep-0074: Make extended filename encoding optional
15 ed111f8 glep-0074: Always exclude control characters
16
17 ---
18 GLEP: 74
19 Title: Full-tree verification using Manifest files
20 Author: Michał Górny <mgorny@g.o>,
21 Robin Hugh Johnson <robbat2@g.o>,
22 Ulrich Müller <ulm@g.o>
23 Type: Standards Track
24 Status: Draft
25 Version: 1
26 Created: 2017-10-21
27 Last-Modified: 2017-11-23
28 Post-History: 2017-10-26, 2017-11-16
29 Content-Type: text/x-rst
30 Requires: 59, 61
31 Replaces: 44, 58, 60
32 ---
33
34 Abstract
35 ========
36
37 This GLEP extends the Manifest file format to cover full-tree file
38 integrity and authenticity checks. The format aims to be future-proof
39 and efficient, and to provide means of backwards compatibility.
40
41
42 Motivation
43 ==========
44
45 The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
46 means of verifying the integrity of distfiles and package files
47 in Gentoo. Combined with OpenPGP signatures, they provide means to
48 ensure the authenticity of the covered files. However, as noted
49 in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
50 authenticity verification as they do not cover any files outside
51 the package directory. In particular, they provide multiple ways
52 for a third party to inject malicious code into the ebuild environment.
53
54 Historically, the topic of providing authenticity coverage for the whole
55 repository has been mentioned multiple times. The most noteworthy efforts
56 are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
57 They were accepted by the Council in 2010 but have never been
58 implemented. When potential implementation work started in 2017, a new
59 discussion about the specification arose. It prompted the creation
60 of a competing GLEP that would provide a redesigned alternative to
61 the old GLEPs.
62
63 This specification is designed with the following goals in mind:
64
65 1. It should provide means to ensure the authenticity of the complete
66 repository, including preventing the injection of additional files.
67
68 2. The format should be universal enough to work both for the Gentoo
69 repository and third-party repositories of different characteristics.
70
71 3. The Manifest files should be verifiable stand-alone, that is without
72 knowing any details about the underlying repository format.
73
74
75 Specification
76 =============
77
78 Manifest file format
79 --------------------
80
81 This specification reuses and extends the Manifest file format defined
82 in GLEP 44 [#GLEP44]_. For this purpose, the *file type* field is
83 repurposed as a generic *tag* that could also indicate additional
84 (non-checksum) metadata. Appropriately, those tags can be followed by
85 other space-separated values.
86
87 Unless specified otherwise, the paths used in the Manifest files
88 are relative to the directory containing the Manifest file. The paths
89 must not reference the parent directory (``..``). Forward slash (``/``)
90 is used as path component separator.
91
92 The Manifest files use UTF-8 encoding.
93
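The line format described above can be sketched as a small parser. This is a minimal illustration rather than a reference implementation: the ``Entry`` container, ``check_path()`` and ``parse_line()`` are hypothetical names, and the handling of empty lines is an assumption. A caller would read the file with ``open(path, encoding="utf-8")`` and pass each line in turn.

```python
from dataclasses import dataclass, field

# Tags whose values are "<path> <size> <checksums>..." in this GLEP.
CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST", "EBUILD", "MISC", "AUX"}


@dataclass
class Entry:  # hypothetical container, not defined by the GLEP
    tag: str
    path: str = ""
    size: int = 0
    checksums: dict = field(default_factory=dict)
    values: list = field(default_factory=list)


def check_path(path):
    """Reject paths that reference the parent directory ('..')."""
    if ".." in path.split("/"):
        raise ValueError(f"path references parent directory: {path!r}")
    return path


def parse_line(line):
    """Parse one Manifest line; return None for an empty line (assumption)."""
    fields = line.split()
    if not fields:
        return None
    tag, values = fields[0], fields[1:]
    if tag == "IGNORE":
        if len(values) != 1:
            raise ValueError(f"malformed IGNORE entry: {line!r}")
        return Entry(tag, path=check_path(values[0]))
    if tag in CHECKSUM_TAGS:
        # Expect a path, a size, then (algorithm, digest) pairs.
        if len(values) < 2 or len(values) % 2 != 0:
            raise ValueError(f"malformed {tag} entry: {line!r}")
        names, digests = values[2::2], values[3::2]
        return Entry(tag, check_path(values[0]), int(values[1]),
                     dict(zip(names, digests)))
    # Other tags (e.g. TIMESTAMP) keep their raw space-separated values.
    return Entry(tag, values=values)
```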
94
95 Manifest file locations and nesting
96 -----------------------------------
97
98 The ``Manifest`` file located in the root directory of the repository
99 is called the top-level Manifest, and it is used to perform the full-tree
100 verification. In order to verify the authenticity, it must be signed
101 using OpenPGP, using the armored cleartext format.
102
103 The top-level Manifest may reference sub-Manifests contained
104 in subdirectories of the repository. The sub-Manifests are traditionally
105 named ``Manifest``; however, the implementation must support arbitrary
106 names, including the possibility of multiple (split) Manifests
107 for a single directory. The sub-Manifest can only cover the files inside
108 the directory tree where it resides.
109
110 The sub-Manifest can also be signed using OpenPGP armored cleartext
111 format. However, the signature verification can be omitted since it
112 already is covered by the signed top-level Manifest.
113
114
115 Directory tree coverage
116 -----------------------
117
118 The specification provides three ways of skipping Manifest verification
119 of specific files and directories (recursively):
120
121 1. explicit ``IGNORE`` entries in Manifest files,
122
123 2. injected ignore paths via package manager configuration,
124
125 3. using names starting with a dot (``.``) which are always skipped.
126
127 All files that are not ignored must be covered by at least one
128 of the Manifests.
129
130 A single file may be matched by multiple identical or equivalent
131 Manifest entries, if and only if the entries have the same semantics,
132 specify the same size and the checksums common to both entries match.
133 It is an error for a single file to be matched by multiple entries
134 of different semantics, file size or checksum values. It is an error
135 to specify another entry for a file that matches ``IGNORE``, or that
136 is located inside an ignored directory.
137
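The rule for multiple entries matching a single file can be sketched as a pairwise check. This is only an illustration: entries are modeled as hypothetical ``(tag, size, checksums)`` tuples, and "same semantics" is interpreted here as the tags being equal once the deprecated ``EBUILD``, ``MISC`` and ``AUX`` tags are mapped to ``DATA``, which is an assumption based on their description as equivalent to the ``DATA`` type.

```python
# Tags treated as equivalent to DATA (assumption, see the lead-in).
DATA_LIKE = {"DATA", "EBUILD", "MISC", "AUX"}


def semantics(tag):
    """Map deprecated tags onto DATA; keep all other tags unchanged."""
    return "DATA" if tag in DATA_LIKE else tag


def entries_compatible(a, b):
    """Return True if two entries matching the same file may coexist."""
    tag_a, size_a, sums_a = a
    tag_b, size_b, sums_b = b
    if semantics(tag_a) != semantics(tag_b):
        return False
    if size_a != size_b:
        return False
    # Only the checksum algorithms common to both entries are compared.
    common = sums_a.keys() & sums_b.keys()
    return all(sums_a[name] == sums_b[name] for name in common)
```

An ``IGNORE`` entry never shares semantics with a checksum entry, so the same check also reports the case of another entry being specified for an ignored file.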
138 The file entries (except for ``IGNORE``) can be specified for regular
139 files only. Symbolic links are followed when opening files
140 and traversing directories. It is an error to specify an entry for
141 a different file type. If the tree contains files of other types
142 that are not otherwise ignored, they need to be covered by an explicit
143 ``IGNORE``.
144
145 All the local (non-``DIST``) files covered by a Manifest tree must
146 reside on the same filesystem. It is an error to specify entries
147 applying to files on another filesystem. If files or directories that
148 are not otherwise ignored reside on a different filesystem, or symbolic
149 links point to targets on a different filesystem, they must
150 be explicitly excluded via ``IGNORE``.
151
152
153 Path and filename encoding
154 --------------------------
155
156 The path fields in the Manifest file must consist of characters
157 corresponding to valid UTF-8 code points excluding the backwards slash
158 (``\``) and characters classified as control characters or as whitespace
159 in the current version of the Unicode standard [#UNICODE]_.
160
161 The implementation can optionally support extended filename encoding
162 to support paths containing the excluded characters. If it does not,
163 the implementation must reject directories containing any files using
164 non-compliant names, as well as Manifest files listing such filenames.
165
166 If encoding is supported, then all of the excluded characters that
167 are present in paths must be encoded using one of the following escape
168 sequences:
169
170 - characters in the ``U+0000`` to ``U+007F`` range can be encoded
171 as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
172 character code,
173
174 - characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
175 as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
176 character code,
177
178 - characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
179 where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
180 code.
181
182 It is invalid for the backwards slash to be used in any other context,
183 and a backwards slash present in filename must be encoded. A backwards
184 slash used as a path component separator should be replaced by a forward
185 slash instead.
186
187 The encoding can be used for other characters as well. In particular,
188 escaping non-printable characters might be desirable.
189
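A minimal sketch of the escape sequences described above follows. The function names are hypothetical. Two points are assumptions rather than requirements of this text: hexadecimal digits are accepted in either case when decoding, and Python's ``str.isspace()`` together with the Unicode ``Cc`` category is used as an approximation of the whitespace and control character classes.

```python
import re
import unicodedata

# \xHH, \uHHHH and \UHHHHHHHH; any other use of a backslash is invalid.
_ESCAPE = re.compile(
    r"\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))")


def decode_path(field):
    """Decode the escape sequences in a Manifest path field."""
    out = []
    pos = 0
    while pos < len(field):
        if field[pos] != "\\":
            out.append(field[pos])
            pos += 1
            continue
        match = _ESCAPE.match(field, pos)
        if match is None:
            raise ValueError(f"invalid backslash use in {field!r}")
        code = int(match.group(1) or match.group(2) or match.group(3), 16)
        if match.group(1) is not None and code > 0x7F:
            raise ValueError("\\xHH is limited to the U+0000..U+007F range")
        out.append(chr(code))  # raises ValueError beyond U+10FFFF
        pos = match.end()
    return "".join(out)


def encode_path(path):
    """Encode backslashes, whitespace and control characters in a path."""
    out = []
    for ch in path:
        if ch == "\\" or ch.isspace() or unicodedata.category(ch) == "Cc":
            code = ord(ch)
            if code <= 0x7F:
                out.append(f"\\x{code:02X}")
            elif code <= 0xFFFF:
                out.append(f"\\u{code:04X}")
            else:
                out.append(f"\\U{code:08X}")
        else:
            out.append(ch)
    return "".join(out)
```

The encoder always picks the shortest form that can represent the character, although the text permits the longer forms as well.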
190
191 File verification
192 -----------------
193
194 When verifying a file against the Manifest, the following rules are
195 used:
196
197 1. If the file is covered directly or indirectly by an entry
198 of the ``IGNORE`` type, the verification always succeeds.
199
200 2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
201 ``MISC``, ``EBUILD`` or ``AUX`` type:
202
203 a. if the file is not present, then the verification fails,
204
205 b. if the file is present but has a different size or one
206 of the checksums does not match, the verification fails,
207
208 c. otherwise, the verification succeeds.
209
210 3. If the file is present but not listed in Manifest, the verification
211 fails.
212
213 Unless specified otherwise, the package manager must not allow using
214 any files for which the verification failed. The package manager may
215 reject any package or even the whole repository if it may refer to files
216 for which the verification failed.
217
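The three rules above can be sketched for a single file as follows. This is an illustration under stated assumptions: ``verify_file()`` and the ``(size, checksums)`` entry shape are hypothetical, the ``HASHLIB_NAMES`` table covers only a few of the reserved algorithm names, and an unknown algorithm name simply raises ``KeyError`` because this text does not say how to treat it.

```python
import hashlib
import os

# Partial, illustrative mapping of Manifest hash names to hashlib names.
HASHLIB_NAMES = {"SHA256": "sha256", "SHA512": "sha512", "BLAKE2B": "blake2b"}


def verify_file(path, entry, ignored=False):
    """Verify one file.

    entry is None when no Manifest lists the file, otherwise a
    (size, {algorithm: hexdigest}) tuple.
    """
    if ignored:
        return True  # rule 1: ignored files always pass
    if entry is None:
        # Rule 3: a file that is present but not listed fails.
        return not os.path.exists(path)
    size, checksums = entry
    if not os.path.isfile(path):  # follows symbolic links
        return False  # rule 2a: the file is missing
    if os.path.getsize(path) != size:
        return False  # rule 2b: size mismatch
    with open(path, "rb") as f:
        data = f.read()
    for name, expected in checksums.items():
        digest = hashlib.new(HASHLIB_NAMES[name], data).hexdigest()
        if digest != expected.lower():
            return False  # rule 2b: checksum mismatch
    return True  # rule 2c
```

Reading the whole file into memory keeps the sketch short; a real tool would more likely hash the file in chunks.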
218
219 Timestamp verification
220 ----------------------
221
222 The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
223 for attacks against tree update distribution. If such an entry
224 is present, it should be updated every time at least one
225 of the Manifests changes. Every unique timestamp value must correspond
226 to a single tree state.
227
228 During the verification process, the client should compare the timestamp
229 against the update time obtained from a local clock or a trusted time
230 source. If the comparison result indicates that the Manifest at the time
231 of receiving was already significantly outdated, the client should
232 either fail the verification or require manual confirmation from
233 the user.
234
235 Furthermore, the Manifest provider may employ additional methods
236 of distributing the timestamps of recently generated Manifests
237 using a secure channel from a trusted source for exact comparison.
238 The exact details of such a solution are outside the scope of this
239 specification.
240
241 ``TIMESTAMP`` entries may also be present in sub-Manifests. Those
242 timestamps must not be newer than the timestamp of the top-level
243 Manifest (if present). This specification does not define any specific
244 use for them.
245
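A minimal sketch of the timestamp comparison, using the ``strftime()`` format string given for the ``TIMESTAMP`` tag. The function names and the ``max_age`` threshold are assumptions: this text only speaks of a Manifest being "significantly outdated" and does not define a limit.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now, max_age=timedelta(days=1)):
    """Return True if the Manifest was older than max_age at time 'now'.

    The one-day default is an arbitrary placeholder, not a recommendation.
    """
    return now - parse_timestamp(value) > max_age
```

Here ``now`` would be the update time obtained from the local clock or a trusted time source, for example ``datetime.now(timezone.utc)``.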
246
247 Modern Manifest tags
248 --------------------
249
250 The Manifest files can specify the following tags:
251
252 ``TIMESTAMP <iso8601>``
253 Specifies a timestamp of when the Manifest file was last updated.
254 The timestamp must be a valid second-precision ISO 8601 extended
255 format combined date and time in UTC timezone, i.e. using
256 the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
257 Optional. The package manager can use it to detect an outdated
258 repository checkout as described in `Timestamp verification`_.
259
260 ``MANIFEST <path> <size> <checksums>...``
261 Specifies a sub-Manifest. The sub-Manifest must be verified like
262 a regular file. If the verification succeeds, the entries from
263 the sub-Manifest are included for verification as described
264 in `Manifest file locations and nesting`_.
265
266 ``IGNORE <path>``
267 Ignores a subdirectory or file from Manifest checks. If the specified
268 path is present, it and its contents are omitted from the Manifest
269 verification (always pass). *Path* must be a plain file or directory
270 path without a trailing slash. Wildcards are not supported
271 and wildcard characters are interpreted literally.
272
273 ``DATA <path> <size> <checksums>...``
274 Specifies a regular file subject to Manifest verification. The file
275 is required to pass verification. Used for all files that do not match
276 any other type.
277
278 ``DIST <filename> <size> <checksums>...``
279 Specifies a distfile entry used to verify files fetched as part
280 of ``SRC_URI``. The filename must match the filename used to store
281 the fetched file as specified in the PMS [#PMS-FETCH]_. The package
282 manager must reject the fetched file if it fails verification.
283 ``DIST`` entries apply to all packages below the Manifest file
284 specifying them.
285
286
287 Deprecated Manifest tags
288 ------------------------
289
290 For backwards compatibility, the following tags are additionally
291 allowed at the package directory level:
292
293 ``EBUILD <filename> <size> <checksums>...``
294 Equivalent to the ``DATA`` type.
295
296 ``MISC <path> <size> <checksums>...``
297 Equivalent to the ``DATA`` type. Historically indicated that
298 the package manager may ignore a verification failure if operating
299 in non-strict mode. However, that behavior is deprecated.
300
301 ``AUX <filename> <size> <checksums>...``
302 Equivalent to the ``DATA`` type, except that the filename is relative
303 to the ``files/`` subdirectory.
304
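Since the deprecated tags are described as equivalent to ``DATA``, an implementation could map them onto ``DATA`` entries before any further processing. A minimal sketch, with ``normalize_entry()`` being a hypothetical helper:

```python
def normalize_entry(tag, path):
    """Map a deprecated tag and its path onto an equivalent DATA entry."""
    if tag == "AUX":
        # AUX filenames are relative to the files/ subdirectory.
        return "DATA", "files/" + path
    if tag in ("EBUILD", "MISC"):
        return "DATA", path
    return tag, path
```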
305
306 Algorithm for full-tree verification
307 ------------------------------------
308
309 In order to perform full-tree verification, the following algorithm
310 can be used:
311
312 1. Collect all files present in the repository into the *present* set.
313
314 2. Start at the top-level Manifest file. Verify its OpenPGP signature.
315 Optionally verify the ``TIMESTAMP`` entry if present as specified
316 in `timestamp verification`_. Remove the top-level Manifest
317 from the *present* set.
318
319 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
320 files according to the `file verification`_ section, and include
321 their entries in the current Manifest entry list (using paths
322 relative to directories containing the Manifests).
323
324 4. Process all ``IGNORE`` entries. Remove any paths matching them
325 from the *present* set.
326
327 5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
328 and ``AUX`` entries into the *covered* set.
329
330 6. Verify the entries in the *covered* set for incompatible duplicates
331 and collisions with ignored files as explained in `Directory tree
332 coverage`_.
333
334 7. Verify all the files in the union of the *present* and *covered*
335 sets, according to the `file verification`_ section.
336
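The set handling in the steps above can be sketched as follows. Several simplifications are assumed: signature and timestamp checks (step 2) and sub-Manifest loading (step 3) have already happened, so ``entries`` is a flat list of hypothetical ``(tag, path)`` pairs with paths relative to the repository root; ``verify`` is a caller-supplied function implementing the per-file checks; files named by ``MANIFEST`` entries count as listed; and the check for incompatible duplicates in step 6 is left out.

```python
DATA_TAGS = {"DATA", "MISC", "EBUILD", "AUX"}


def full_tree_failures(present, entries, verify):
    """Return the set of paths that fail full-tree verification."""
    ignores = {path for tag, path in entries if tag == "IGNORE"}

    def ignored(path):
        # A path is ignored if it or any parent directory has an IGNORE.
        parts = path.split("/")
        return any("/".join(parts[:i]) in ignores
                   for i in range(1, len(parts) + 1))

    def hidden(path):
        # Names starting with a dot are always skipped.
        return any(part.startswith(".") for part in path.split("/"))

    # Steps 1, 2 and 4: drop the top-level Manifest and skipped paths.
    present = {p for p in present
               if p != "Manifest" and not ignored(p) and not hidden(p)}
    # Step 5: files covered by checksum entries.
    covered = {path for tag, path in entries if tag in DATA_TAGS}
    listed = covered | {path for tag, path in entries if tag == "MANIFEST"}
    # Step 6 (collisions only): entries for ignored files are errors.
    failures = {p for p in covered if ignored(p)}
    # Step 7: everything present or covered must be listed and must verify.
    for path in (present | covered) - failures:
        if path not in listed or not verify(path):
            failures.add(path)
    return failures
```

An empty result means the tree passed. A file that is listed but missing is reported through ``verify``, while a file that is present but unlisted is reported by the membership test.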
337
338 Algorithm for finding parent Manifests
339 --------------------------------------
340
341 In order to find the top-level Manifest from the current directory
342 the following algorithm can be used:
343
344 1. Store the current directory as *original* and the device ID
345 of the containing filesystem (``st_dev``) as *startdev*.
346
347 2. If the device ID of the containing filesystem (``st_dev``)
348 of the current directory is different than *startdev*, stop.
349
350 3. If the current directory contains a ``Manifest`` file:
351
352 a. If an ``IGNORE`` entry in the ``Manifest`` file covers
353 the *original* directory (or one of the parent directories), stop.
354
355 b. Otherwise, store the current directory as *last_found*.
356
357 4. If the current directory is the root system directory (``/``), stop.
358
359 5. Otherwise, enter the parent directory and jump to step 2.
360
361 Once the algorithm stops, *last_found* will contain the relevant
362 top-level Manifest. If *last_found* is null, then the directory tree
363 does not contain any valid top-level Manifest candidates and one should
364 be created in the *original* directory.
365
366 Once the top-level Manifest is found, its ``MANIFEST`` entries should
367 be used to find any sub-Manifests below the top-level Manifest,
368 up to and including the *original* directory. Note that those
369 sub-Manifests can use different filenames than ``Manifest``.
370
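The upward search can be sketched with ``os.stat()``. This is an illustration only: ``find_top_level()`` is a hypothetical name, and the ``is_ignored`` hook stands in for step 3a, since deciding whether an ``IGNORE`` entry covers the *original* directory requires parsing the Manifest found at that level.

```python
import os


def find_top_level(start, is_ignored=lambda manifest_dir, original: False):
    """Return the directory of the top-level Manifest, or None.

    is_ignored(manifest_dir, original) is a hypothetical hook that reports
    whether the Manifest in manifest_dir has an IGNORE entry covering the
    original directory or one of its parents.
    """
    original = os.path.abspath(start)
    startdev = os.stat(original).st_dev  # step 1
    current = original
    last_found = None
    while True:
        if os.stat(current).st_dev != startdev:  # step 2
            break
        if os.path.isfile(os.path.join(current, "Manifest")):  # step 3
            if is_ignored(current, original):
                break
            last_found = current
        parent = os.path.dirname(current)
        if parent == current:  # step 4: reached the root directory
            break
        current = parent  # step 5
    return last_found
```

A ``None`` result corresponds to *last_found* being null, in which case a new top-level Manifest would be created in the *original* directory.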
371
372 Checksum algorithms
373 -------------------
374
375 This section is informational only. Specifying the exact set
376 of supported algorithms is outside the scope of this specification.
377
378 The algorithm names reserved at the time of writing are:
379
380 - ``MD5`` [#MD5]_,
381 - ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
382 - ``SHA1`` [#SHS]_,
383 - ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
384 - ``WHIRLPOOL`` [#WHIRLPOOL]_,
385 - ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
386 - ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
387 - ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
388 [#STREEBOG]_.
389
390 The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
391 It is recommended that any new hashes are named after the Python
392 ``hashlib`` module algorithm names, transformed into uppercase.
393
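The naming recommendation for new hashes can be illustrated with ``hashlib`` directly. The helper names below are hypothetical. Note that the rule only applies to new hashes: some reserved names, such as ``RMD160``, do not follow it.

```python
import hashlib


def manifest_name(hashlib_name):
    """Derive a Manifest hash name from a hashlib algorithm name."""
    return hashlib_name.upper()


def data_entry(path, data, algorithms=("sha256", "sha512")):
    """Build a DATA line for the given file content (bytes)."""
    fields = ["DATA", path, str(len(data))]
    for name in algorithms:
        fields.append(manifest_name(name))
        fields.append(hashlib.new(name, data).hexdigest())
    return " ".join(fields)
```

For instance, ``manifest_name("sha3_256")`` yields ``SHA3_256`` and ``manifest_name("blake2b")`` yields ``BLAKE2B``, both of which match the reserved names listed above.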
394
395 Manifest compression
396 --------------------
397
398 The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
399 This section merely addresses interoperability issues between Manifest
400 compression and this specification.
401
402 The compressed Manifest files are required to be suffixed for their
403 compression algorithm. This suffix should be used to recognize
404 the compression and decompress Manifests transparently. The exact list
405 of algorithms and their corresponding suffixes are outside the scope
406 of this specification.
407
408 The top-level Manifest file must not be compressed. Since the OpenPGP
409 signature covers the uncompressed text and would itself be compressed,
410 the data would have to be decompressed without any prior verification.
411 This could expose users e.g. to zip bombs or exploits on decompressor
412 vulnerabilities.
413
414 Whenever this specification refers to sub-Manifests, they can use any
415 names but are also required to use a specific compression suffix.
416 The ``MANIFEST`` entries are required to specify the full name including
417 compression suffix, and the verification is performed on the compressed
418 file.
419
420 The specification permits uncompressed Manifests to exist alongside
421 their compressed counterparts, and multiple compressed formats
422 to coexist. If that is the case, the files must have the same
423 uncompressed content and the implementation is free to choose either
424 of the files using the same base name.
425
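Transparent decompression by suffix can be sketched as follows. The suffix table is purely an assumption for illustration, since the list of algorithms and suffixes is outside the scope of this specification. The checksum from the ``MANIFEST`` entry is expected to have been verified against the compressed file before this function is called.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; the real list is defined elsewhere.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_manifest(path):
    """Open a possibly compressed sub-Manifest as UTF-8 text."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```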
426
427 Combining multiple Manifest trees (informational)
428 -------------------------------------------------
429
430 This specification permits nesting multiple hierarchical Manifest trees.
431 In this layout, the specific directories of the Manifest tree can
432 be verified both as a part of another top-level Manifest,
433 and as an independent Manifest tree (when obtained without the parent
434 directory).
435
436 For this to work, the sub-Manifest file in the directory must also
437 satisfy the requirements for the top-level Manifest file. That is:
438
439 - it must be named ``Manifest`` and not compressed,
440
441 - it must cover all the files in this directory and its subdirectories
442 (i.e. no files from the directory tree can be covered by parent
443 Manifest),
444
445 - if authenticity verification is desired, it must be OpenPGP-signed.
446
447 It should be noted that if such a directory is a subdirectory of a valid
448 Manifest tree, the sub-Manifest needs to be valid according
449 to the top-level Manifest and the OpenPGP signature is disregarded
450 as detailed in `Manifest file locations and nesting`_. The top-level
451 behavior is exhibited only when the directory is obtained without parent
452 directories.
453
454
455 An example Manifest file (informational)
456 ----------------------------------------
457
458 An example top-level Manifest file for the Gentoo repository would have
459 the following content::
460
461 TIMESTAMP 2017-10-30T10:11:12Z
462 IGNORE distfiles
463 IGNORE local
464 IGNORE lost+found
465 IGNORE packages
466 MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
467 ...
468 MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
469 ...
470
471 An example modern Manifest (disregarding backwards compatibility)
472 for a package directory would have the following content::
473
474 DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
475 DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
476 DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
477 DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
478 DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
479 DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
480 DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
481
482
483 Rationale
484 =========
485
486 Stand-alone format
487 ------------------
488
489 The first question that needed to be asked before proceeding with
490 the design was whether the Manifest file format was supposed to be
491 stand-alone, or tightly bound to the repository format.
492
493 The stand-alone format has been selected because of its three
494 advantages:
495
496 1. It is more future-proof. If an incompatible change to the repository
497 format is introduced, only developers need to upgrade the tools
498 they use to generate the Manifests. The tools used to verify
499 the updated Manifests will continue to work.
500
501 2. It is more flexible and universal. With a dedicated tool,
502 the Manifest files can be used to sign and verify arbitrary file
503 sets.
504
505 3. It keeps the verification tool simpler. In particular, we can easily
506 write an independent verification tool that could work on any
507 distribution without needing to depend on a package manager
508 implementation or rewrite parts of it.
509
510 Designing a stand-alone format requires that the Manifest carries enough
511 information to perform the verification following all the rules specific
512 to the Gentoo repository.
513
514
515 Tree design
516 -----------
517
518 The second important point of the design was determining whether
519 the Manifest files should be structured hierarchically, or independent.
520 Both options have their advantages.
521
522 In the hierarchical model, each sub-Manifest file is covered by a higher
523 level Manifest. As a result, only the top-level Manifest has to be
524 OpenPGP-signed, and subsequent Manifests need to be only verified by
525 checksum stored in the parent Manifest. This has the following
526 implications:
527
528 - Verifying any set of files in the repository requires using checksums
529 from the most relevant Manifests and the parent Manifests.
530
531 - The OpenPGP signature of the top-level Manifest needs to be verified
532 only once per process.
533
534 - Altering any set of files requires updating the relevant Manifests,
535 and their parent Manifests up to the top-level Manifest, and signing
536 the last one.
537
538 - As a result, the top-level Manifest changes on every commit,
539 and various middle-level Manifests change (and need to be transferred)
540 frequently.
541
542 In the independent model, each sub-Manifest file is independent
543 of the parent Manifests. As a result, each of them needs to be signed
544 and verified independently. However, the parent Manifests still need
545 to list sub-Manifests (albeit without verification data) in order
546 to detect removal or replacement of subdirectories. This has
547 the following implications:
548
549 - Verifying any set of files in the repository requires using checksums
550 and verifying signatures of the most relevant Manifest files.
551
552 - Altering any set of files requires updating the relevant Manifests
553 and signing them again.
554
555 - Parent Manifests are updated only when Manifests are added or removed
556 from subdirectories. As a result, they change infrequently.
557
558 While both models have their advantages, the hierarchical model was
559 selected because it reduces the number of OpenPGP operations
560 (which are comparatively costly) to the minimum.
561
562
563 Tree layout restrictions
564 ------------------------
565
566 The algorithm is meant to work primarily with ebuild repositories which
567 normally contain only files and directories. Directories provide
568 no useful metadata for verification, and specifying special entries
569 for additional file types is purposeless. Therefore, the specification
570 is restricted to dealing with regular files.
571
572 The Gentoo repository does not use symbolic links. Some Gentoo
573 repositories do, however. To provide a simple solution for dealing with
574 symlinks without having to take care to implement special handling for
575 them, the common behavior of implicitly resolving them is used.
576 Therefore, symbolic links to files are stored as if they were regular
577 files, and symbolic links to directories are followed as if they were
578 regular directories.
579
580 Dotfiles are implicitly ignored as that is a common notion used
581 in software written for POSIX systems. All other filenames require
582 explicit ``IGNORE`` lines.
583
584 An ability to inject additional ignore entries is provided to account
585 for site configuration affecting the repository tree -- placing
586 additional files in it, skipping some of the categories from syncing.
587 This configuration can extend beyond the limits of this GLEP,
588 e.g. by allowing wildcards or regular expressions.
589
590 The algorithm is restricted to work on a single filesystem. This is
591 mostly relevant when scanning for top-level Manifest -- we do not want
592 to cross filesystem boundaries then. However, to ensure consistent
593 bidirectional behavior we also need to ban them when operating down
594 the tree.
595
596 The directories and files on different filesystems need to be ignored
597 explicitly as implicitly skipping them would cause confusion.
598 In particular, tools might then claim that a file does not exist when
599 it clearly does because it was skipped due to filesystem boundaries.
600
601
602 Filename character set restriction
603 ----------------------------------
604
605 The valid set of filename characters for the Gentoo repository
606 is restricted by the devmanual 'File Naming Rules' section
607 [#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
608 names are not restricted explicitly -- however, the PMS dependency
609 specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
610 filenames containing whitespace.
611
612 This specification aims to avoid arbitrary restrictions. For this
613 reason, filename characters are only restricted by excluding three
614 technically problematic groups:
615
616 1. The backwards slash character (``\``) is used as path separator
617 on Windows systems, so it's extremely unlikely to be used in real
618 filenames. For this reason it is used to implement character
619 encoding with minimal risk of breaking backwards compatibility.
620
621 2. The control characters can trigger special behavior in various
622 programs and prevent them from recognizing text files. In particular,
623 the NULL character (``U+0000``) is normally used to indicate the end
624 of a null-terminated string. Its use could therefore break
625 implementations written in the C language. Other control characters
626 could trigger various formatting routines, garbling text output.
627
628 3. Whitespace characters are used to separate Manifest fields
629 and entries. While technically it would be enough to restrict the space
630 character (``U+0020``) that is normally used as the separator
631 and the newline character (``U+000A``) that is used to separate lines,
632 all whitespace characters are forbidden to avoid confusion
633 and implementation errors.
634
635 Historically, Portage attempted to overcome the whitespace limitation
636 by attempting to locate the size field and take everything before it
637 as filename. This was terribly fragile and even if it worked, it would
638 solve the problem only partially.
639
640 To preserve compatibility with the current implementations and given
641 that none of the listed characters are allowed for the foreseeable
642 Gentoo uses, extended encoding support is optional. If such support
643 is not provided, the implementation must unconditionally reject any
644 such files. Ignoring them implicitly would be confusing, and it is
645 not possible to use them in explicit ``IGNORE`` entries.
646
647 The character encoding method provides means to overcome the character
648 restrictions to extend the tool usability beyond immediate Gentoo uses.
649 The backslash escape form based on Python unicode strings is used
650 since it can encode all characters within the Unicode range, the syntax
651 is familiar to many programmers and the backwards slash character
652 is extremely unlikely to appear in real filenames.
653
654 Syntax is limited to the minimum necessary to implement the encoding.
655 Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
656 complexity, and to reduce the risk of shell users using backslash
657 to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
658 range to avoid ambiguity of higher values which might be interpreted
659 either as UCS-2 code points or part of a UTF-8 encoded character.
660
661 Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
662 UTF-8 string to simplify the implementation. In particular, it makes it
663 possible to process the Manifest file as UTF-8 encoded text without
664 having to perform additional UTF-8 decoding (and verification)
665 of the escaped data.
666
667 URL-encoding was considered as an alternative. However, it could collide
668 with ``DIST`` entries that are implicitly named after the URL filename
669 part where URL-encoding is pretty common.
670
671
672 File verification model
673 -----------------------
674
675 The verification model aims to provide full coverage against different
676 forms of attack. In particular, three different kinds of manipulation
677 are considered:
678
679 1. Alteration of the file content.
680
681 2. Removal of a file.
682
683 3. Addition of a new file.
684
685 In order to protect against all three, the system requires that all
686 files in the repository are listed in Manifests and verified against
687 them.
688
689 As a special case, ignores are allowed to account for directories
690 that are not part of the repository but were traditionally placed inside
691 it. Those directories were ``distfiles``, ``local`` and ``packages``. It
692 could also be used to ignore VCS directories such as ``CVS``.
693
694
695 Non-strict Manifest verification
696 --------------------------------
697
698 Originally the Manifest2 format provided a special ``MISC`` tag that
699 was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
700 indicated that the Manifest verification failures could be ignored for
701 those files unless the package manager was working in strict mode.
702
703 The first versions of this specification continued the use of this tag.
704 However, after a long debate it was decided to deprecate it along with
705 the non-strict behavior, and require all files to strictly match.
706
707 Two arguments were mentioned for the usefulness of a ``MISC`` type:
708
709 1. being able to reduce the checkout size by stripping unnecessary
710 files out, and
711
712 2. being able to update automatically generated files locally
713 without causing unnecessary verification failures.
714
715 However, the usefulness of ``MISC`` in both cases is doubtful.
716
717 The cases for stripping unnecessary files mostly focused around space
718 savings. For this purpose, stripping ``metadata.xml`` and similar files
719 has little value. It is much more common for users to strip whole
720 packages or categories. The ``MISC`` type is not suitable for that,
721 and so a dedicated package manager mechanism needs to be developed
722 instead. The same mechanism can also handle files that historically used
723 the ``MISC`` type. As an example, the package manager may choose
724 to generate both the rsync exclusion list and Manifest ignore list
725 using a single source list.
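
The single-source-list idea above can be illustrated with a minimal sketch. It assumes that the list holds paths relative to the repository root, that ``IGNORE`` entries take the form ``IGNORE <path>``, and that a leading slash anchors an rsync exclude pattern at the transfer root. The function name ``build_exclusions`` is hypothetical and not part of any package manager.

```python
def build_exclusions(paths):
    """Derive rsync excludes and Manifest IGNORE lines from one path list.

    `paths` are assumed to be relative to the repository root.
    """
    # Normalize: drop surrounding slashes, remove duplicates, sort for stable output.
    cleaned = sorted({p.strip("/") for p in paths if p.strip("/")})
    # rsync: a leading slash anchors the pattern at the root of the transfer.
    rsync_excludes = ["/" + p for p in cleaned]
    # Manifest: one IGNORE entry per excluded path (assumed entry layout).
    manifest_ignores = ["IGNORE " + p for p in cleaned]
    return rsync_excludes, manifest_ignores


excludes, ignores = build_exclusions(["distfiles", "packages/", "local"])
```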
726
727 The cases for autogenerated files involve such cache files
728 as ``use.local.desc``. However, we cannot include ``md5-cache`` there
729 due to security concerns, which results in inconsistent cache handling.
730 Furthermore, the tools were historically modified to provide stable
731 output, which means that the content of these files cannot change without
732 non-``MISC`` content being changed first. This practically defeats
733 the purpose of using ``MISC``.
734
735 Finally, the non-strict mode could be used as a means of attack.
736 Allowing missing or modified documentation files could be used
737 to spread misinformation, resulting in bad decisions made by the user.
738 A modified file could also be used, e.g. to exploit vulnerabilities
739 of an XML parser.
740
741
742 Timestamp field
743 ---------------
744
745 The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
746 to include a generation timestamp in the Manifest. A similar feature
747 was originally proposed in GLEP 58 [#GLEP58]_.
748
749 A malicious third party may use the principles of exclusion or replay
750 [#C08]_ to deny an update to clients, while at the same time recording
751 the identity of clients to attack. The timestamp field can be used to
752 detect that.
753
754 In order to provide more complete protection, the Gentoo Infrastructure
755 should provide an ability to obtain the timestamps of all Manifests
756 from a recent timeframe over a secure channel from a trusted source
757 for comparison.
758
759 Strictly speaking, this information is provided by the various
760 ``metadata/timestamp*`` files that are already present. However,
761 including the value in the Manifest itself has little cost
762 and provides the ability to perform the verification stand-alone.
763
764 Furthermore, some of the timestamp files are added very late
765 in the distribution process, past the Manifest generation phase. Those
766 files will most likely receive ``IGNORE`` entries and therefore
767 be unsafe to use.
768
769 The specification permits additional timestamps in sub-Manifest files
770 for local use. A generic testing tool should ignore them.
771
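A stand-alone freshness check based on the ``TIMESTAMP`` entry could look roughly like the sketch below. It assumes the entry has the form ``TIMESTAMP yyyy-mm-ddThh:mm:ssZ`` in UTC. The function names and the one-day threshold are illustrative choices, not requirements of this GLEP.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(manifest_text):
    """Return the TIMESTAMP value as an aware UTC datetime, or None if absent."""
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "TIMESTAMP":
            # Assumed format: ISO 8601, UTC, e.g. 2017-11-23T20:00:00Z
            stamp = datetime.strptime(fields[1], "%Y-%m-%dT%H:%M:%SZ")
            return stamp.replace(tzinfo=timezone.utc)
    return None


def is_stale(manifest_text, now, max_age=timedelta(days=1)):
    """True if the Manifest has no timestamp or the timestamp is older than max_age."""
    stamp = parse_timestamp(manifest_text)
    return stamp is None or now - stamp > max_age


sample = "TIMESTAMP 2017-11-23T20:00:00Z\nIGNORE distfiles\n"
now = datetime(2017, 11, 24, 12, 0, 0, tzinfo=timezone.utc)
stale = is_stale(sample, now)
```

A client that keeps seeing the same old timestamp after repeated syncs would have reason to suspect a replay or a frozen mirror.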
772
773 New vs deprecated tags
774 ----------------------
775
776 Out of the four types defined by Manifest2, only one is reused
777 and the remaining three are replaced by a single, universal ``DATA``
778 type.
779
780 The ``DIST`` tag is reused since the specification does not change
781 anything with regard to distfile handling.
782
783 The ``EBUILD`` tag could potentially be reused for generic file
784 verification data. However, it would be confusing if all the different
785 data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
786 type was introduced as a replacement.
787
788 The ``MISC`` tag and the related non-strict mode have been removed
789 as being of little value, as detailed in the `Non-strict Manifest
790 verification`_ section.
791
792 The ``AUX`` tag is deprecated as it is redundant with ``DATA``, and has
793 the limiting property of an implicit ``files/`` path prefix.
794
795
796 Finding top-level Manifest
797 --------------------------
798
799 The development of a reference implementation for this GLEP has brought
800 the following problem: how to find all the relevant Manifests when
801 the Manifest tool is run inside a subdirectory of the repository?
802
803 One of the options would be to provide a bi-directional linking
804 of Manifests via a ``PARENT`` tag. However, that would not solve
805 the problem when a new Manifest file is being created.
806
807 Instead, an algorithm for iterating over parent directories is proposed.
808 Since there is no obligatory explicit indicator for the top-level
809 Manifest, the algorithm assumes that the top-level Manifest
810 is the highest ``Manifest`` in the directory hierarchy that can cover
811 the current directory. This generally makes sense since the Manifest
812 files are required to provide coverage for all subdirectories, so all
813 Manifests starting from that one need to be updated.
814
815 If independent Manifest trees are nested in the directory structure,
816 then an ``IGNORE`` entry needs to be used to separate them.
817
818 Since sub-Manifests can use any filenames, the Manifest finding
819 algorithm must not short-cut the procedure by storing all ``Manifest``
820 files along the parent directories. Instead, it needs to retrace
821 the relevant sub-Manifest files along ``MANIFEST`` entries
822 in the top-level Manifest.
823
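The parent-directory iteration described above can be sketched as follows. This is a simplified illustration, not the reference implementation: it ignores ``IGNORE`` entries that separate nested trees and does not retrace sub-Manifests via ``MANIFEST`` entries. The ``stop_dir`` parameter is a hypothetical addition that bounds the search.

```python
import os
import tempfile


def find_top_level_manifest(start_dir, stop_dir=None):
    """Return the path of the highest ``Manifest`` at or above start_dir, or None."""
    current = os.path.abspath(start_dir)
    stop = os.path.abspath(stop_dir) if stop_dir is not None else None
    found = None
    while True:
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):
            found = candidate  # keep climbing: a higher Manifest may exist
        parent = os.path.dirname(current)
        if current == stop or parent == current:  # search bound or filesystem root
            return found
        current = parent


# Usage: a throwaway tree with a top-level Manifest and a package Manifest.
demo_root = tempfile.mkdtemp()
pkg_dir = os.path.join(demo_root, "dev-libs", "foo")
os.makedirs(pkg_dir)
for directory in (demo_root, pkg_dir):
    with open(os.path.join(directory, "Manifest"), "w"):
        pass
top = find_top_level_manifest(pkg_dir, stop_dir=demo_root)
```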
824
825 Injecting ChangeLogs into the checkout
826 --------------------------------------
827
828 One of the problems considered in the new Manifest format was injecting
829 historical and autogenerated ChangeLogs into the repository. We normally
830 don't include those files, to reduce the checkout size. However, some
831 users have shown interest in them and Infra is working on providing them
832 via an additional rsync module.
833
834 If such files were injected into the repository, they would cause
835 verification failures of Manifests. To account for this, Infra could
836 provide ``IGNORE`` entries to allow them to exist.
837
838
839 Splitting distfile checksums from file checksums
840 ------------------------------------------------
841
842 Another problem with the current Manifest format is that the checksums
843 for fetched files are combined with checksums for local files
844 in a single file inside the package directory. It has been specifically
845 pointed out that:
846
847 - since distfiles are sometimes reused across different packages,
848 the repeating checksums are redundant [#DIST]_.
849
850 - mirror admins were interested in the possibility of verifying all
851 the distfiles with a single tool.
852
853 This specification does not provide a clean solution to this problem.
854 It technically permits moving ``DIST`` entries to higher-level Manifests
855 but the usefulness of such a solution is doubtful.
856
857 However, for the second problem we will probably deliver a dedicated
858 tool working with this Manifest format.
859
860
861 Hash algorithms
862 ---------------
863
864 While maintaining a consistent supported hash set is important
865 for interoperability, it is not a good fit for the generic layout
866 of this GLEP. Furthermore, it would require updating the GLEP
867 in the future every time the used algorithms change.
868
869 Instead, the specification focuses on listing the currently used
870 algorithm names for interoperability, and sets a recommendation
871 for consistent naming of algorithms in the future. The Python
872 ``hashlib`` module is used as a reference since it is used
873 as the provider of hash functions for most of the Python software,
874 including Portage and PkgCore.
875
876 The basic rules for changing hash algorithms are defined in GLEP 59
877 [#GLEP59]_. The implementations can focus only on those algorithms
878 that are actually used or planned to be used. It may be feasible
879 to devise a new GLEP that specifies the currently used hashes (or update
880 GLEP 59 accordingly).
881
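As an illustration of using ``hashlib`` as the naming reference, the sketch below builds a single entry line from in-memory data. The mapping table, the helper name ``manifest_entry`` and the choice of ``BLAKE2B`` plus ``SHA512`` are assumptions made for the example. The entry layout (tag, path, size, then name and value pairs) is likewise assumed here.

```python
import hashlib

# Assumed mapping from Manifest hash names to hashlib algorithm names.
HASHLIB_NAMES = {"BLAKE2B": "blake2b", "SHA512": "sha512"}


def manifest_entry(tag, path, data, hashes=("BLAKE2B", "SHA512")):
    """Build one entry line: tag, path, size, then NAME VALUE checksum pairs."""
    fields = [tag, path, str(len(data))]
    for name in hashes:
        digest = hashlib.new(HASHLIB_NAMES[name], data).hexdigest()
        fields += [name, digest]
    return " ".join(fields)


entry = manifest_entry("DATA", "eclass/example.eclass", b"")
```

Keeping the mapping in one table means that adopting a new algorithm only requires adding a row, provided ``hashlib`` supports it.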
882
883 Manifest compression
884 --------------------
885
886 The support for Manifest compression is introduced with minimal changes
887 to the file format. The ``MANIFEST`` entries are required to provide
888 the real (compressed) file path for compatibility with other file
889 entries and to avoid confusion.
890
891 The compression of the top-level Manifest file has been prohibited
892 as the specification currently does not provide any means of verifying
893 the file prior to decompression. If the top-level Manifest is
894 compressed, tooling will have to unpack the file before being able
895 to verify the contents. This makes it possible for a malicious third
896 party to attack the system by providing a compressed Manifest that
897 exposes decompressor vulnerabilities, or a zip bomb.
898
899 The OpenPGP cleartext signature covers the contents of the Manifest,
900 and is therefore compressed along with them. The possibility of using
901 a detached signature has been considered but it was rejected as
902 unnecessary complexity for minor gain.
903
904 Technically, a similar result could be effected via moving all the data
905 into a compressed sub-Manifest in the top directory (e.g.
906 ``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
907 in a signed, uncompressed top-level Manifest.
908
909 The existence of additional entries for uncompressed Manifest checksums
910 was debated. However, plain entries for the uncompressed file would
911 be confusing if only the compressed file existed, and conflicting
912 if both uncompressed and compressed variants existed. Furthermore,
913 it has been pointed out that ``DIST`` entries do not have
914 an uncompressed variant either.
915
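The following sketch illustrates the point that a ``MANIFEST`` entry describes the real, compressed file: both the size and the checksum are computed over the compressed bytes. The helper name, the use of gzip and of ``SHA512`` alone are assumptions made for brevity.

```python
import gzip
import hashlib


def compressed_manifest_entry(path, manifest_text):
    """Return (compressed bytes, MANIFEST entry line) for a gzip sub-Manifest."""
    compressed = gzip.compress(manifest_text.encode("utf-8"))
    # Size and checksum cover the compressed file, not the original text.
    digest = hashlib.sha512(compressed).hexdigest()
    entry = "MANIFEST {} {} SHA512 {}".format(path, len(compressed), digest)
    return compressed, entry


blob, entry = compressed_manifest_entry("Manifest.sub.gz", "IGNORE distfiles\n")
```

A verifier following this layout can check the entry against the file on disk before passing anything to the decompressor.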
916
917 Performance considerations
918 --------------------------
919
920 Performing a full-tree verification on every sync raises some
921 performance concerns for end-user systems. The initial testing has shown
922 that a cold-cache verification on a btrfs file system can take around
923 4 minutes, with the process being mostly I/O bound. On the other hand,
924 it can be expected that the verification will be performed directly
925 after syncing, taking advantage of a warm filesystem cache.
926
927 To improve speed on I/O- and/or CPU-constrained systems even further,
928 the algorithms can be easily extended to perform incremental
929 verification. Given that rsync does not preserve mtimes by default,
930 the tool can take advantage of mtime and Manifest comparisons to recheck
931 only the parts of the repository that have changed.
932
933 Furthermore, the package manager implementations can restrict checking
934 only to the parts of the repository that are actually being used.
935
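The mtime half of such an incremental check might look like the sketch below. It assumes the tool stores the mtimes seen during the previous run. The function name and cache layout are hypothetical. A real tool would also need to compare Manifest entries and handle removed files.

```python
import os
import tempfile


def files_to_recheck(root, previous_mtimes):
    """Return (changed relative paths, current mtimes) for files under root.

    previous_mtimes maps relative paths to st_mtime_ns from the last run.
    """
    changed, current = [], {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            current[rel] = os.stat(full).st_mtime_ns
            if previous_mtimes.get(rel) != current[rel]:
                changed.append(rel)  # new file or mtime differs: verify again
    return changed, current


# Usage: the first run checks everything, the second only the touched file.
demo = tempfile.mkdtemp()
for name in ("a.ebuild", "b.ebuild"):
    with open(os.path.join(demo, name), "w") as handle:
        handle.write("x")
first, seen = files_to_recheck(demo, {})
os.utime(os.path.join(demo, "b.ebuild"), ns=(1, 1))  # simulate an updated file
second, _ = files_to_recheck(demo, seen)
```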
936
937 Backwards Compatibility
938 =======================
939
940 This GLEP provides optional means of preserving backwards compatibility.
941 To preserve the backwards compatibility, the following needs to hold
942 for the ``Manifest`` file in every package directory:
943
944 - all files must be covered by the single ``Manifest`` file,
945
946 - all distfiles used by the package must be included,
947
948 - all files inside the ``files/`` subdirectory need to use
949 the ``AUX`` tag (rather than ``DATA``),
950
951 - all ``.ebuild`` files need to use the ``EBUILD`` tag,
952
953 - the ``metadata.xml`` and ``ChangeLog`` files need to use
954 the ``MISC`` tag,
955
956 - the Manifest can be signed to provide authenticity verification,
957
958 - an uncompressed Manifest must always exist, and a compressed Manifest
959 of identical content may be present.
960
961 Once the backwards compatibility is no longer a concern, the above
962 no longer needs to hold and the deprecated tags can be removed.
963
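The tag selection rules listed above can be sketched as a small helper. The function name is hypothetical, paths are assumed to be relative to the package directory, and the ``DATA`` fallback for any other file is an assumption of this sketch.

```python
def compat_tag_and_path(path):
    """Pick a backwards-compatible tag and entry path for a package-directory file."""
    if path.startswith("files/"):
        # AUX entries carry an implicit files/ prefix, so it is stripped here.
        return "AUX", path[len("files/"):]
    if path.endswith(".ebuild"):
        return "EBUILD", path
    if path in ("metadata.xml", "ChangeLog"):
        return "MISC", path
    return "DATA", path  # fallback assumed for anything else


examples = [compat_tag_and_path(p) for p in
            ("files/fix.patch", "foo-1.0.ebuild", "metadata.xml")]
```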
964
965 Reference Implementation
966 ========================
967
968 The reference implementation for this GLEP is being developed
969 as the gemato project [#GEMATO]_.
970
971
972 Credits
973 =======
974
975 Thanks to all the people whose contributions were invaluable
976 to the creation of this GLEP. This includes but is not limited to:
977
978 - Robin Hugh Johnson,
979 - Ulrich Müller.
980
981 Additionally, thanks to Robin Hugh Johnson for the original
982 MetaManifest GLEP series which served both as inspiration and source
983 of many concepts used in this GLEP. Recursively, also thanks to all
984 the people who contributed to the original GLEPs.
985
986
987 References
988 ==========
989
990 .. [#GLEP44] GLEP 44: Manifest2 format
991 (https://www.gentoo.org/glep/glep-0044.html)
992
993 .. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
994 - Overview
995 (https://www.gentoo.org/glep/glep-0057.html)
996
997 .. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
998 - Infrastructure to User distribution - MetaManifest
999 (https://www.gentoo.org/glep/glep-0058.html)
1000
1001 .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
1002 (https://www.gentoo.org/glep/glep-0059.html)
1003
1004 .. [#GLEP60] GLEP 60: Manifest2 filetypes
1005 (https://www.gentoo.org/glep/glep-0060.html)
1006
1007 .. [#GLEP61] GLEP 61: Manifest2 compression
1008 (https://www.gentoo.org/glep/glep-0061.html)
1009
1010 .. [#UNICODE] The Unicode standard
1011 (https://unicode.org/versions/latest/)
1012
1013 .. [#PMS-FETCH] Package Manager Specification: Dependency Specification
1014 Format - SRC_URI
1015 (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)
1016
1017 .. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
1018 (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)
1019
1020 .. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
1021 (https://www.ietf.org/rfc/rfc1321.txt)
1022
1023 .. [#RIPEMD160] The hash function RIPEMD-160
1024 (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)
1025
1026 .. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
1027 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)
1028
1029 .. [#WHIRLPOOL] The WHIRLPOOL Hash Function
1030 (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)
1031
1032 .. [#BLAKE2] BLAKE2 -- fast secure hashing
1033 (https://blake2.net/)
1034
1035 .. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
1036 and Extendable-Output Functions
1037 (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)
1038
1039 .. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
1040 (https://www.streebog.net/)
1041
1042 .. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
1043 (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)
1044
1045 .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
1046 at the time of writing are duplicate, representing 2 MiB
1047 out of 25 MiB of DIST entries altogether.
1048
1049 .. [#GEMATO] gemato: Gentoo Manifest Tool
1050 (https://github.com/mgorny/gemato/)
1051
1052
1053 Copyright
1054 =========
1055 This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
1056 Unported License. To view a copy of this license, visit
1057 http://creativecommons.org/licenses/by-sa/3.0/.
1058
1059 --
1060 Best regards,
1061 Michał Górny
