commit: b00ade7b6467a3ae066d66f6e4ce71fb10309710
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Nov 22 11:40:34 2017 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Sat Nov 25 20:49:17 2017 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=b00ade7b

glep-0074: Provide encoding for disallowed characters

 glep-0074.rst | 75 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 56 insertions(+), 19 deletions(-)

diff --git a/glep-0074.rst b/glep-0074.rst
index b0daa05..3dc6730 100644
--- a/glep-0074.rst
+++ b/glep-0074.rst
@@ -70,7 +70,8 @@ other space-separated values.

 Unless specified otherwise, the paths used in the Manifest files
 are relative to the directory containing the Manifest file. The paths
-must not reference the parent directory (``..``).
+must not reference the parent directory (``..``). Forward slash (``/``)
+is used as path component separator.

 The Manifest files use UTF-8 encoding.

@@ -132,13 +133,35 @@ are not otherwise ignored reside on a different filesystem, or symbolic
 links point to targets on a different filesystem, they must
 be explicitly excluded via ``IGNORE``.

-All paths specified in the Manifest file must consist of characters
+
+Path and filename encoding
+--------------------------
+
+The path fields in the Manifest file must consist of characters
 corresponding to valid UTF-8 code points excluding the NULL character
 (``U+0000``), the backwards slash (``\``) and characters classified
 as whitespace in the current version of the Unicode standard
-[#UNICODE]_. It is an error to use Manifest files in directories
-containing files whose names contain the disallowed characters.
-The forward slash (``/``) must be used as path separator.
+[#UNICODE]_.
+
+Any of the excluded characters that are present in path must be encoded
+using one of the following escape sequences:
+
+- characters in the ``U+0000`` to ``U+007F`` range can be encoded
+  as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
+  as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
+  where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
+  code.
+
+It is invalid for backwards slash to be used in any other context,
+and a backwards slash present in filename must be encoded. Backwards
+slash used as path component separator should be replaced by forward
+slash instead.


 File verification
@@ -563,7 +586,7 @@ specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
 filenames containing whitespace.

 This specification aims to avoid arbitrary restrictions. For this
-reason, filename characters are only restricted by excluding two
+reason, filename characters are only restricted by excluding three
 technically problematic groups:

 1. The NULL character (``U+0000``) is normally used to indicate the end
@@ -571,12 +594,10 @@ technically problematic groups:
    written using C. Furthermore, it is not allowed in any known
    filesystem.

-2. The backwards slash character (``\``) is frequently used as an escape
-   character, in particular in the languages derived from C and in shell
-   script. Furthermore, it is used as path separator on Windows systems.
-   It is forbidden to avoid implementation mistakes (in particular,
-   attempting to use it to escape whitespace or as path separator
-   on Windows) but also reserved for possible future extension.
+2. The backwards slash character (``\``) is used as path separator
+   on Windows systems, so it's extremely unlikely to be used in real
+   filenames. For this reason it is used to implement character
+   encoding with minimal risk of breaking backwards compatibility.

 3. Whitespace characters are used to separate Manifest fields
    and entries. While technically it would be enough to restrict space
@@ -585,18 +606,34 @@ technically problematic groups:
    all whitespace characters are forbidden to avoid confusion
    and implementation errors.

-While the specification could be extended to allow such filenames
-by using some form of escaping, there is currently no apparent need
-for such a feature.
-
 Historically, Portage attempted to overcome the whitespace limitation
 by attempting to locate the size field and take everything before it
 as filename. This was terribly fragile and even if it worked, it would
 solve the problem only partially.

-Since the same restrictions apply to ``IGNORE`` rules, it is currently
-not possible to either list or ignore the file using whitespace
-characters. Therefore, the presence of such files is forbidden entirely.
+The character encoding method provides means to overcome the character
+restrictions to extend the tool usability beyond immediate Gentoo uses.
+The backslash escape form based on Python unicode strings is used
+since it can encode all characters within the Unicode range, the syntax
+is familiar to many programmers and the backwards slash character
+is extremely unlikely to appear in real filenames.
+
+Syntax is limited to the minimum necessary to implement the encoding.
+Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
+complexity, and to reduce the risk of shell users using backslash
+to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
+range to avoid ambiguity of higher values which might be interpreted
+either as UCS-2 code points or part of a UTF-8 encoded character.
+
+Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
+UTF-8 string to simplify the implementation. In particular, it makes it
+possible to process the Manifest file as UTF-8 encoded text without
+having to perform additional UTF-8 decoding (and verification)
+of the escaped data.
+
+URL-encoding was considered as an alternative. However, it could collide
+with ``DIST`` entries that are implicitly named after the URL filename
+part where URL-encoding is pretty common.


 File verification model