Gentoo Archives: gentoo-dev

From:	"Kevin F. Quinn (Gentoo)" <kevquinn@g.o>
To:	gentoo-dev@l.g.o
Subject:	Re: [gentoo-dev] enable UTF8 per default?
Date:	Thu, 09 Mar 2006 20:22:07
Message-Id:	`20060309212511.2b92a73d@c1358217.kevquinn.com`
In Reply to:	[gentoo-dev] enable UTF8 per default? by Patrick Lauer

On Tue, 28 Feb 2006 11:58:03 +0100
Patrick Lauer <patrick@g.o> wrote:

> During that discussion we realized that having utf-8 not enabled by
> default and no utf8 fonts available by default causes lots of
> recompilation and reconfiguration.
>
> Enabling the unicode useflag in the profiles should help our
> international users and should not cause any problems. Are there any
> known bugs / problems this would trigger? Any reasons against that?

Enabling support for utf-8 should be fine, but I'd like to sound a note
of caution about using a utf-8 locale as a system-wide setting. Since
UTF-8 contains "holes" in the representation (i.e. some sequences of
8-bit values are invalid), when something is asked to parse such
invalid data, unexpected results can ensue.
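
To make the "holes" concrete, here is a minimal Python sketch (not
taken from the bug report) that checks a few byte sequences for
well-formedness. The helper name is_valid_utf8 and the sample bytes are
my own choices for illustration; the only library behaviour relied on
is that bytes.decode("utf-8") raises UnicodeDecodeError on ill-formed
input.

```python
# Byte sequences chosen for illustration; all but the last are ill-formed UTF-8.
SAMPLES = {
    "lone continuation byte": b"\x80",
    "truncated 2-byte sequence": b"\xc3",
    "overlong encoding of '/'": b"\xc0\xaf",
    "byte never used in UTF-8": b"\xff",
    "valid: e-acute": b"\xc3\xa9",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8, else False."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for label, raw in SAMPLES.items():
    print(f"{label:28} {raw!r:12} valid={is_valid_utf8(raw)}")
```

Only the last sample is reported as valid. In a single-byte locale
every one of these inputs would simply be a string of characters, which
is why the problem only shows up once a utf-8 locale is in use.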

For an example, see bug #125375 - it turns out that invalid sequences
do not match '.' in sed regular expressions (sed-4.1.4). The other GNU
tools probably behave similarly. Up to a point this is in line with the
Unicode Standard, which says, "When a process interprets a code unit
sequence which purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error condition, and
shall not interpret such sequences as characters." (chapter 3, section
3.2, conformance clause C12a). This clearly means that the invalid
bytes cannot match "." (or anything else for that matter). However, sed
should either generate an error, filter the illegal bytes out of its
input, or replace them with a marker (replacement character) - instead
it leaves the non-conformant bytes alone.
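
The three behaviours I'd consider acceptable can be sketched in Python
as follows. This is only an illustration of the options, not of how sed
is implemented; the function names and the sample input are made up for
the example. It also shows byte-wise matching, which is roughly what
you get from a single-byte locale such as C, where '.' matches every
byte.

```python
import re

RAW = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8


def handle_strict(raw: bytes) -> str:
    """Option 1: treat ill-formed input as an error."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return f"error at byte offset {exc.start}"


def handle_filter(raw: bytes) -> str:
    """Option 2: drop the ill-formed bytes from the input."""
    return raw.decode("utf-8", errors="ignore")


def handle_replace(raw: bytes) -> str:
    """Option 3: substitute U+FFFD REPLACEMENT CHARACTER."""
    return raw.decode("utf-8", errors="replace")


def bytewise_dot(raw: bytes) -> bytes:
    """Byte-oriented matching: '.' matches each byte, valid UTF-8 or not."""
    return re.sub(rb".", b"X", raw)


print(handle_strict(RAW))         # error at byte offset 2
print(handle_filter(RAW))         # abcd
print(ascii(handle_replace(RAW)))  # 'ab\ufffdcd'
print(bytewise_dot(RAW))          # b'XXXXX'
```

What sed-4.1.4 does in a utf-8 locale corresponds to none of the first
three: the bad byte is neither reported, dropped, nor replaced, it just
passes through unmatched.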

--
Kevin F. Quinn
