On Tue, 28 Feb 2006 11:58:03 +0100
Patrick Lauer <patrick@g.o> wrote:

> During that discussion we realized that having utf-8 not enabled by
> default and no utf8 fonts available by default causes lots of
> recompilation and reconfiguration.
>
> Enabling the unicode useflag in the profiles should help our
> international users and should not cause any problems. Are there any
> known bugs / problems this would trigger? Any reasons against that?

Enabling support for UTF-8 should be fine, but I'd like to sound a note
of caution about using a UTF-8 locale as a system-wide setting. Since
UTF-8 contains "holes" in the representation (i.e. some sequences of
8-bit values are invalid), unexpected results can ensue when something
is asked to parse such invalid data.

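To make the "holes" concrete, here is a minimal Python sketch (not taken
from the bug report) that checks a few byte strings for well-formedness.
The helper name is_valid_utf8 and the sample inputs are hypothetical,
chosen only for illustration; the check relies on Python's strict UTF-8
decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample inputs, chosen for illustration only.
samples = {
    "ASCII text": b"cafe",
    "UTF-8 e-acute": b"caf\xc3\xa9",
    "Latin-1 e-acute": b"caf\xe9",    # 0xE9 opens a 3-byte sequence that never completes
    "lone continuation": b"\x80",     # continuation byte with no lead byte
    "overlong NUL": b"\xc0\x80",      # 0xC0 is never a valid lead byte
    "never-valid byte": b"\xff",      # 0xFF cannot appear anywhere in UTF-8
}

for label, raw in samples.items():
    print(f"{label:20} {raw!r:16} valid={is_valid_utf8(raw)}")
```

The Latin-1 case is probably the one most likely to show up in practice:
a file written under an 8-bit locale and then processed under a UTF-8
locale.
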
For an example, see bug #125375 - it turns out that invalid sequences
do not match '.' in sed regular expressions (sed-4.1.4). The other GNU
tools probably behave similarly. Up to a point this is in line with the
Unicode standard, which says, "When a process interprets a code unit
sequence which purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error condition, and
shall not interpret such sequences as characters." (chapter 3, section
3.2, rule C12a). This clearly means that the invalid bytes cannot match
"." (or anything else for that matter). However, sed should either
generate an error, filter the illegal bytes out of its input, or replace
them with a marker (replacement character) - instead it leaves the
non-conformant bytes alone.

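The behaviour described above, and the three alternatives, can be
sketched in Python. This is only a rough analogue under stated
assumptions, not sed's actual implementation: UTF8_CHAR is a
hand-written byte pattern for one well-formed UTF-8 character,
replace_each_char is a hypothetical helper standing in for something
like sed 's/./X/g', and the sample input is made up.

```python
import re

# One well-formed UTF-8 character, expressed as a pattern over raw bytes.
# Unlike '.' in sed, the first alternative also matches a newline byte.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"
)


def replace_each_char(data: bytes, marker: bytes = b"X") -> bytes:
    """Replace every well-formed character; bytes forming no character stay as-is."""
    return UTF8_CHAR.sub(marker, data)


# Hypothetical input: Latin-1 "naive" with i-diaeresis (0xEF), invalid as UTF-8.
raw = b"na\xefve"

# What the report describes: the stray byte matches nothing and is left alone.
print(replace_each_char(raw))  # b'XX\xefXX'

# Alternative 1: generate an error.
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    print("error:", exc)

# Alternative 2: filter the illegal bytes out of the input.
print(ascii(raw.decode("utf-8", errors="ignore")))   # 'nave'

# Alternative 3: replace them with the replacement character U+FFFD.
print(ascii(raw.decode("utf-8", errors="replace")))  # 'na\ufffdve'
```

Which of the three is appropriate depends on the tool; silently
filtering bytes, for instance, changes the data without any trace.
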
--
Kevin F. Quinn