Gentoo Archives: gentoo-commits

From: "Michał Górny" <mgorny@g.o>
To: gentoo-commits@l.g.o
Subject: [gentoo-commits] data/glep:master commit in: /
Date: Sat, 25 Nov 2017 20:49:51
Message-Id: 1511642957.b00ade7b6467a3ae066d66f6e4ce71fb10309710.mgorny@gentoo
commit: b00ade7b6467a3ae066d66f6e4ce71fb10309710
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Nov 22 11:40:34 2017 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Sat Nov 25 20:49:17 2017 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=b00ade7b

glep-0074: Provide encoding for disallowed characters

 glep-0074.rst | 75 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 56 insertions(+), 19 deletions(-)

diff --git a/glep-0074.rst b/glep-0074.rst
index b0daa05..3dc6730 100644
--- a/glep-0074.rst
+++ b/glep-0074.rst
@@ -70,7 +70,8 @@ other space-separated values.
 
 Unless specified otherwise, the paths used in the Manifest files
 are relative to the directory containing the Manifest file. The paths
-must not reference the parent directory (``..``).
+must not reference the parent directory (``..``). Forward slash (``/``)
+is used as path component separator.
 
 The Manifest files use UTF-8 encoding.
 
@@ -132,13 +133,35 @@ are not otherwise ignored reside on a different filesystem, or symbolic
 links point to targets on a different filesystem, they must
 be explicitly excluded via ``IGNORE``.
 
-All paths specified in the Manifest file must consist of characters
+
+Path and filename encoding
+--------------------------
+
+The path fields in the Manifest file must consist of characters
 corresponding to valid UTF-8 code points excluding the NULL character
 (``U+0000``), the backwards slash (``\``) and characters classified
 as whitespace in the current version of the Unicode standard
-[#UNICODE]_. It is an error to use Manifest files in directories
-containing files whose names contain the disallowed characters.
-The forward slash (``/``) must be used as path separator.
+[#UNICODE]_.
+
+Any of the excluded characters that are present in path must be encoded
+using one of the following escape sequences:
+
+- characters in the ``U+0000`` to ``U+007F`` range can be encoded
+  as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
+  as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
+  where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
+  code.
+
+It is invalid for backwards slash to be used in any other context,
+and a backwards slash present in filename must be encoded. Backwards
+slash used as path component separator should be replaced by forward
+slash instead.
 
 
 File verification
@@ -563,7 +586,7 @@ specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
 filenames containing whitespace.
 
 This specification aims to avoid arbitrary restrictions. For this
-reason, filename characters are only restricted by excluding two
+reason, filename characters are only restricted by excluding three
 technically problematic groups:
 
 1. The NULL character (``U+0000``) is normally used to indicate the end
@@ -571,12 +594,10 @@ technically problematic groups:
    written using C. Furthermore, it is not allowed in any known
    filesystem.
 
-2. The backwards slash character (``\``) is frequently used as an escape
-   character, in particular in the languages derived from C and in shell
-   script. Furthermore, it is used as path separator on Windows systems.
-   It is forbidden to avoid implementation mistakes (in particular,
-   attempting to use it to escape whitespace or as path separator
-   on Windows) but also reserved for possible future extension.
+2. The backwards slash character (``\``) is used as path separator
+   on Windows systems, so it's extremely unlikely to be used in real
+   filenames. For this reason it is used to implement character
+   encoding with minimal risk of breaking backwards compatibility.
 
 3. Whitespace characters are used to separate Manifest fields
    and entries. While technically it would be enough to restrict space
@@ -585,18 +606,34 @@ technically problematic groups:
    all whitespace characters are forbidden to avoid confusion
    and implementation errors.
 
-While the specification could be extended to allow such filenames
-by using some form of escaping, there is currently no apparent need
-for such a feature.
-
 Historically, Portage attempted to overcome the whitespace limitation
 by attempting to locate the size field and take everything before it
 as filename. This was terribly fragile and even if it worked, it would
 solve the problem only partially.
 
-Since the same restrictions apply to ``IGNORE`` rules, it is currently
-not possible to either list or ignore the file using whitespace
-characters. Therefore, the presence of such files is forbidden entirely.
+The character encoding method provides means to overcome the character
+restrictions to extend the tool usability beyond immediate Gentoo uses.
+The backslash escape form based on Python unicode strings is used
+since it can encode all characters within the Unicode range, the syntax
+is familiar to many programmers and the backwards slash character
+is extremely unlikely to appear in real filenames.
+
+Syntax is limited to the minimum necessary to implement the encoding.
+Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
+complexity, and to reduce the risk of shell users using backslash
+to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
+range to avoid ambiguity of higher values which might be interpreted
+either as UCS-2 code points or part of a UTF-8 encoded character.
+
+Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
+UTF-8 string to simplify the implementation. In particular, it makes it
+possible to process the Manifest file as UTF-8 encoded text without
+having to perform additional UTF-8 decoding (and verification)
+of the escaped data.
+
+URL-encoding was considered as an alternative. However, it could collide
+with ``DIST`` entries that are implicitly named after the URL filename
+part where URL-encoding is pretty common.
 
 
 File verification model
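
The escape rules added in the "Path and filename encoding" hunk above can be sketched roughly as follows. This is only an illustration and is not part of the commit or of any existing tool: the names ``encode_path`` and ``decode_path`` are hypothetical, Python's ``str.isspace()`` is assumed to be a close enough stand-in for "whitespace per the Unicode standard", and the choice to emit uppercase hex digits while accepting either case is an assumption, since the text above does not specify it.

```python
import re

# Only the three escape forms from the text are recognized.
# \xHH is limited to 00..7F, as the rationale in the diff requires.
_ESCAPE_RE = re.compile(
    r"\\(?:x([0-7][0-9A-Fa-f])|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))"
)


def _is_excluded(ch):
    # Assumption: str.isspace() approximates the Unicode whitespace set.
    return ch == "\0" or ch == "\\" or ch.isspace()


def encode_path(path):
    """Escape NULL, backslash and whitespace in a Manifest path field."""
    out = []
    for ch in path:
        if not _is_excluded(ch):
            out.append(ch)
            continue
        cp = ord(ch)
        if cp <= 0x7F:
            out.append("\\x{:02X}".format(cp))
        elif cp <= 0xFFFF:
            out.append("\\u{:04X}".format(cp))
        else:
            out.append("\\U{:08X}".format(cp))
    return "".join(out)


def decode_path(field):
    """Reverse encode_path(); reject any other use of backslash."""
    out = []
    pos = 0
    while pos < len(field):
        ch = field[pos]
        if ch != "\\":
            if _is_excluded(ch):
                raise ValueError("unescaped excluded character at offset {}".format(pos))
            out.append(ch)
            pos += 1
            continue
        match = _ESCAPE_RE.match(field, pos)
        if match is None:
            # Covers shorthand forms such as \t or \\ and \x80..\xFF.
            raise ValueError("invalid escape at offset {}".format(pos))
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        # chr() itself raises ValueError for values above U+10FFFF.
        out.append(chr(int(hex_digits, 16)))
        pos = match.end()
    return "".join(out)
```

With these helpers, a file named ``my file.txt`` would be written to the path field as ``my\x20file.txt``, and a field containing ``\t`` or a bare backslash would be rejected when read back. The encoder always picks the shortest applicable form, although the text allows the longer forms for the same character.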