On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:

> Isn't it true that it is possible to encode the few 4-byte characters
> into a number of 2-byte sequences? I think that is more than enough for
> most cases (who needs to read/write CJK anyway ;-) )

As far as I understand UCS, that doesn't seem to be the case. I believe
that UCS defines the abstract representation of a Unicode character (its
code point), whereas a UTF is the encoding of that character into octets
for representation on a computer.

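As a small illustration of that distinction, here is a minimal sketch
assuming Python 3. The euro sign is only an arbitrary example character;
the point is that the code point is a single number, while each UTF turns
it into a different sequence of octets.

```python
# A code point is just a number; an encoding turns it into octets.
ch = "\u20ac"  # EURO SIGN, chosen only as an example

code_point = ord(ch)                   # the UCS value of the character
utf8_octets = ch.encode("utf-8")       # the same character as UTF-8 octets
utf16_octets = ch.encode("utf-16-be")  # and as big-endian UTF-16 octets

print(hex(code_point))     # 0x20ac
print(utf8_octets.hex())   # e282ac
print(utf16_octets.hex())  # 20ac
```
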
As an example, for a UCS4 character on one of the higher planes (the one
where the extra CJK characters are placed), UTF-8 would require 4 octets
(and up to 6 for the very largest UCS4 values). UTF-8, UTF-16 and UTF-32
are all able to represent every Unicode character, including those
outside the 16-bit range. UCS2 does not do any "chaining", so it can only
hold characters of at most 16 bits (i.e. code points up to 65535). UCS2
is a subset of UCS4[1].

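To make this concrete, here is a rough sketch, again assuming Python 3.
U+20000 is picked as an example because it is the first code point of
plane 2, where the extra CJK characters live. All three UTFs can encode
it (UTF-16 does so by chaining two 16-bit units), while the value itself
is too large for a single 16-bit UCS2 unit.

```python
ch = "\U00020000"  # first code point of plane 2 (extra CJK characters)

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")

print(len(utf8), utf8.hex())    # 4 f0a08080
print(len(utf16), utf16.hex())  # 4 d840dc00 (two 16-bit units)
print(len(utf32), utf32.hex())  # 4 00020000

# UCS2 has no chaining, so only values up to 0xFFFF fit.
fits_in_ucs2 = ord(ch) <= 0xFFFF
print(fits_in_ucs2)  # False
```
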
Of course, I've left out a lot of details, like the fact that UCS2
doesn't actually have 64K chars, etc.

With that said, most Linux machines that have wchar_t support have
wchar_t defined as 4 bytes (int). So anything with wchar_t support
probably already uses 4 bytes. Maybe someone who has used wchar_t support
can comment on this.

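One quick way to check this on a given machine, assuming Python 3 with
the standard ctypes module is available, is to ask for the size of the C
wchar_t type. The result depends on the platform's C library, so treat
the sketch below as a probe rather than a guarantee.

```python
import ctypes

# ctypes.c_wchar maps to the platform's C wchar_t type.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Typically 4 on Linux with glibc; some other platforms use 2.
print(wchar_size)
```
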
Cheers,

[1]
http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode

--
Alastair 'liquidx' Tse
>> Gentoo Developer
>> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/