Gentoo Archives: gentoo-dev

From: "Kevin F. Quinn (Gentoo)" <kevquinn@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] enable UTF8 per default?
Date: Thu, 09 Mar 2006 20:22:07
Message-Id: 20060309212511.2b92a73d@c1358217.kevquinn.com
In Reply to: [gentoo-dev] enable UTF8 per default? by Patrick Lauer
1 On Tue, 28 Feb 2006 11:58:03 +0100
2 Patrick Lauer <patrick@g.o> wrote:
3
4 > During that discussion we realized that having utf-8 not enabled by
5 > default and no utf8 fonts available by default causes lots of
6 > recompilation and reconfiguration.
7 >
8 > Enabling the unicode useflag in the profiles should help our
9 > international users and should not cause any problems. Are there any
10 > known bugs / problems this would trigger? Any reasons against that?
11
12 Enabling support for UTF-8 should be fine, but I'd like to sound a note
13 of caution about using a UTF-8 locale as a system-wide setting. Since
14 UTF-8 contains "holes" in the representation (i.e. some sequences of
15 8-bit values are invalid), unexpected results can ensue when something
16 is asked to parse such invalid data.
17
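To make the "holes" concrete, here is a minimal sketch, assuming Python 3
and its built-in strict UTF-8 decoder. The helper name is_valid_utf8 and
the sample byte strings are illustrative only and do not come from the
bug report.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Illustrative samples (assumptions, not taken from the bug report).
samples = {
    "well-formed two-byte character": b"caf\xc3\xa9",
    "byte 0xFF never occurs in UTF-8": b"\xff",
    "continuation byte with no lead byte": b"\x80",
    "overlong encoding of NUL": b"\xc0\x80",
    "truncated three-byte sequence": b"\xe2\x82",
}

for description, raw in samples.items():
    print(f"{description}: {raw!r} valid={is_valid_utf8(raw)}")
```

Only the first sample decodes; the other four are byte sequences that a
tool running in a UTF-8 locale may be handed but cannot interpret as
characters.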
18 For an example, see bug #125375: it turns out that invalid sequences
19 do not match '.' in sed regular expressions (sed-4.1.4). The other GNU
20 tools probably behave similarly. Up to a point this is in line with the
21 Unicode standard, which says, "When a process interprets a code unit
22 sequence which purports to be in a Unicode character encoding form, it
23 shall treat ill-formed code unit sequences as an error condition, and
24 shall not interpret such sequences as characters." (chapter 3, para 2,
25 rule C12a). This clearly means that the invalid bytes cannot match "."
26 (or anything else, for that matter). However, sed should either
27 generate an error, filter the illegal bytes out of its input, or
28 replace them with a marker (replacement character); instead it leaves
29 the non-conformant bytes alone.
30
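The three conformant options (error, filter, replace) can be sketched
with Python 3's codec error handlers. This is only an illustration of
the policies, not a description of how sed works internally. The
function name handle_ill_formed and the policy labels are hypothetical.

```python
def handle_ill_formed(data: bytes, policy: str) -> str:
    """Decode data as UTF-8, applying one policy to ill-formed bytes.

    The policy labels are hypothetical names for the three options:
      "error"  -> raise UnicodeDecodeError
      "filter" -> silently drop the ill-formed bytes
      "mark"   -> substitute U+FFFD REPLACEMENT CHARACTER
    """
    handlers = {"error": "strict", "filter": "ignore", "mark": "replace"}
    return data.decode("utf-8", errors=handlers[policy])


# Illustrative input: ASCII text with one byte that is invalid in UTF-8.
sample = b"ab\xffcd"

print(repr(handle_ill_formed(sample, "filter")))
print(repr(handle_ill_formed(sample, "mark")))

try:
    handle_ill_formed(sample, "error")
except UnicodeDecodeError as exc:
    print("error policy raised:", exc.reason)
```

In each case the caller can tell that something was wrong with the input
or gets well-formed output back, which is not true when the ill-formed
bytes are passed through untouched.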
31 --
32 Kevin F. Quinn
