On 30/12/2020 16:35, Andreas K. Huettel wrote:
>> I don't know if this has improved over the years, but my initial
>> experience with unicode was rather negative. The fact that text
>> files were twice as large wasn't a major problem in itself. The
>> real showstopper was that importing text files into spreadsheets
>> and text-editors and word processors failed miserably.
>>
>> I looked at a unicode text file with a binary viewer. It turns out
>> that a simple text string like "1234" was actually...
>> "1" binary-zero "2" binary-zero "3" binary-zero "4" binary zero, etc.
>
> That's (as someone has already pointed out) UTF-16, which is the default for
> some Windows tools (but understood in Linux too). (Even UTF-32 exists, where
> all characters are 4 bytes wide, but I've never seen it in the wild.)
>
> UTF-8 is normally used on Linux (and ASCII chars look exactly the same there);
> even for "long characters" outside the ASCII range, spreadsheets and word
> processors should not be a problem anymore.
>
Following up on my previous answer, you need to separate in your mind
Unicode the character set, and UTF-x the encoding. When Unicode was
introduced MS - in accordance with the thinking of the time - believed
the future was a 16-bit char, which can store 65 thousand characters.
(Note that, BY DEFINITION, the high bit of the full 32-bit code *must*
be zero - just like the high bit in standard ASCII - which is why the
complete code space is 31 bits, not 32.)

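To see the difference between the two encodings concretely, here is a
quick Python 3 sketch; the UTF-16 bytes are exactly what was seen in
the binary viewer above:

    >>> "1234".encode("utf-16-le")   # one 16-bit unit per character
    b'1\x002\x003\x004\x00'
    >>> "1234".encode("utf-8")       # identical to plain ASCII
    b'1234'
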
So MS and Windows use UTF-16 as their encoding. Unix LATER went down
the route of UTF-8, which can only encode about 2 thousand characters
in two bytes (code points up to U+07FF; the rest of the 16-bit range
takes three), but because most (western) text does encode successfully
in one byte per character it is actually a major saving in network
operations such as email, the web etc, which is where Unix has
traditionally been very strong.

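A quick sketch of how the UTF-8 length grows with the code point
(again plain Python 3, nothing more assumed):

    >>> len("A".encode("utf-8"))     # U+0041, ASCII: 1 byte
    1
    >>> len("é".encode("utf-8"))     # U+00E9, up to U+07FF: 2 bytes
    2
    >>> len("€".encode("utf-8"))     # U+20AC, rest of the BMP: 3 bytes
    3
    >>> len("😀".encode("utf-8"))    # U+1F600, beyond the BMP: 4 bytes
    4
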
But UTF-16 works very well for MS, because they are primarily desktop,
and under UTF-16 very few characters need more than one 16-bit unit (a
surrogate pair). That reduces pressure on the CPU, which is the
limiting resource on a desktop.

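The same point in Python: a BMP character is a single 16-bit unit,
while anything beyond the BMP costs a surrogate pair:

    >>> "A".encode("utf-16-le")      # BMP: one 16-bit unit
    b'A\x00'
    >>> "😀".encode("utf-16-le")     # beyond the BMP: pair D83D DE00
    b'=\xd8\x00\xde'
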
And lastly, very importantly, given that AT PRESENT all characters can
be encoded in 31 bits (Unicode nowadays actually caps the code space
at U+10FFFF), UTF-32 the encoding is equivalent to Unicode the
character set: each stored 32-bit value is simply the code point
itself. But should we ever need more than 2 billion characters, there
is nothing stopping us rolling out characters encoded in two 32-bit
units, and with them a hypothetical UTF-64.

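That equivalence is easy to see in Python - the value UTF-32 stores is
just the code point, padded out to 32 bits:

    >>> hex(ord("€"))                # the code point...
    '0x20ac'
    >>> "€".encode("utf-32-le")      # ...stored verbatim in 32 bits
    b'\xac \x00\x00'
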
43 |
Cheers, |
44 |
Wol |