Gentoo Archives: gentoo-dev

From:	Alastair Tse <liquidx@g.o>
To:	gentoo-dev@g.o
Subject:	Re: [gentoo-dev] python-2.3.2 testing required
Date:	Thu, 13 Nov 2003 09:38:33
Message-Id:	`1068716306.25166.33.camel@huggins.eng.cam.ac.uk`
In Reply to:	Re: [gentoo-dev] python-2.3.2 testing required by Paul de Vrieze

1	On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:
2
3	> Isn't it true that it is possible to encode the few 4-byte characters
4	> into a number of 2-byte sequences? I think that is more than enough for
5	> most cases (who needs to read/write CJK anyway ;-) )
6
7	According to my understanding of UCS, that doesn't seem to be the case. I
8	believe that UCS defines the number assigned to each Unicode character,
9	whereas UTF is the encoding of that number into octets for
10	representation on a computer.
11
12	As an example, a UCS4 character on one of the higher planes (where the
13	extra CJK characters are placed) needs 4 octets in UTF-8; the original
14	scheme allows up to 6 for the full 31-bit UCS4 space. UTF-8, -16 and -32
15	can all represent those characters. UCS2 does no "chaining", so it is
16	limited to 16-bit values (at most 65535). UCS2 is a subset of UCS4[1].
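
To make the octet counts concrete, here is a minimal sketch, assuming a modern Python 3 interpreter (not the 2.3 series under discussion). The sample character U+20000 and the helper names `encoded_lengths` and `utf16_units` are picked purely for illustration. It also shows that UTF-16, unlike UCS2, does chain two 16-bit units for such a character.

```python
# Illustrative sample: U+20000 is a CJK ideograph outside the 16-bit range.
ch = "\U00020000"


def encoded_lengths(text):
    """Return the number of octets text needs in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The -be variants are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


def utf16_units(text):
    """Return the 16-bit code units that make up text in UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(encoded_lengths(ch))
print([hex(unit) for unit in utf16_units(ch)])
```

The two units printed on the last line form a surrogate pair, which is the kind of "chaining" that plain UCS2 lacks.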
17
18	Of course, I've left out a lot of details, such as the fact that UCS2
19	doesn't actually have 64K chars, etc.
20
21	That said, most Linux machines that have wchar support have wchar_t
22	defined as 4 bytes (int). So anything with wchar support probably
23	already uses 4 bytes. Maybe someone who has used wchar support can
24	comment on this.
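
One quick way to check the wchar_t width on a given machine is sketched below, assuming a modern Python 3 with the standard `ctypes` module (which was not part of Python 2.3). The helper name `wchar_size` is made up for illustration.

```python
import ctypes
import sys


def wchar_size():
    """Return the size in bytes of the platform's C wchar_t."""
    return ctypes.sizeof(ctypes.c_wchar)


# Expected to be 4 with glibc on Linux; some other platforms use 2.
print("sizeof(wchar_t) =", wchar_size())

# Largest code point a Python string can hold. On the old 2.x builds this
# depended on whether the interpreter was compiled for UCS2 or UCS4.
print("sys.maxunicode =", hex(sys.maxunicode))
```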
25
26	Cheers,
27
28	[1]
29	http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode
30
31	--
32	Alastair 'liquidx' Tse
33	>> Gentoo Developer
34	>> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/
