Gentoo Archives: gentoo-dev

From: Alastair Tse <liquidx@g.o>
To: gentoo-dev@g.o
Subject: Re: [gentoo-dev] python-2.3.2 testing required
Date: Thu, 13 Nov 2003 09:38:33
Message-Id: 1068716306.25166.33.camel@huggins.eng.cam.ac.uk
In Reply to: Re: [gentoo-dev] python-2.3.2 testing required by Paul de Vrieze
1 On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:
2
3 > Isn't it true that it is possible to encode the few 4 byte characters
4 > into a number of 2 byte sequences? I think that is more than enough for
5 > most cases (who needs to read/write cjk anyway ;-) )
6
7 According to my understanding of UCS, that doesn't seem to be the case. I
8 believe that UCS defines the character set, i.e. the number (code point)
9 assigned to each character, whereas UTF is the encoding of that number
10 into octets for storage or transmission.
11
12 As an example, for a UCS4 character on one of the higher planes (such as
13 the one where the extra CJK characters are placed), UTF-8 requires 4
14 octets. UTF-8, UTF-16 and UTF-32 can all represent characters beyond 16
15 bits. UCS2 does not do any "chaining", so it can only have, at most, 16
16 bit characters (values up to 65535). UCS2 is a subset of UCS4[1].
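To make the sizes concrete, here is a minimal sketch in present-day Python 3 (an assumption; the thread itself concerns Python 2.3). The character U+20000, the first code point of CJK Unified Ideographs Extension B, is picked only as an illustrative example of a character outside the 16-bit range.

```python
# Illustrative character (an assumption for this sketch): U+20000,
# the first code point of CJK Unified Ideographs Extension B (plane 2).
ch = chr(0x20000)

# Encode the same code point with the three UTF encodings.
# The "-be" variants are used so that no byte order mark is added.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # becomes a surrogate pair
utf32 = ch.encode("utf-32-be")

# A plain 16-bit UCS2 unit can only hold values up to 0xFFFF.
fits_in_ucs2 = ord(ch) <= 0xFFFF

print(len(utf8), len(utf16), len(utf32), fits_in_ucs2)
```

On a current Python 3 this should print `4 4 4 False`: each encoding needs four octets for this character, and a single 16-bit unit cannot hold it.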
17
18 Of course, I've left out a lot of details, such as the fact that UCS2
19 doesn't actually have 64K chars, etc.
20
21 With that said, most Linux machines that have wchar support have wchar
22 defined as 4 bytes (int). So anything with wchar support probably
23 already uses 4 bytes. Maybe someone who has used wchar support can
24 comment on this.
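One way to check the wchar size on a given machine, sketched here with Python's standard ctypes module (again assuming a present-day Python 3, not the 2.3 release under discussion). The result depends on the platform's C library, so no particular value is guaranteed.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

print("wchar_t is", wchar_size, "bytes on this platform")
```

Typical glibc-based Linux systems are expected to report 4 here, while some other platforms report 2.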
25
26 Cheers,
27
28 [1]
29 http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode
30
31 --
32 Alastair 'liquidx' Tse
33 >> Gentoo Developer
34 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/

Attachments

File name MIME type
signature.asc application/pgp-signature