On 30/12/2020 16:35, Andreas K. Huettel wrote:
>> I don't know if this has improved over the years, but my initial
>> experience with unicode was rather negative. The fact that text
>> files were twice as large wasn't a major problem in itself. The
>> real showstopper was that importing text files into spreadsheets
>> and text-editors and word processors failed miserably.
>>
>> I looked at a unicode text file with a binary viewer. It turns out
>> that a simple text string like "1234" was actually...
>> "1" binary-zero "2" binary-zero "3" binary-zero "4" binary zero, etc.
>
> That's (as someone has already pointed out) UTF-16, which is the default for
> some Windows tools (but understood in Linux too). (Even UTF-32 exists, where
> all characters are 4 bytes wide, but I've never seen it in the wild.)
>
> UTF-8 is normally used on Linux (and ASCII chars look exactly the same there);
> even for "long characters" outside the ASCII range, spreadsheets and word
> processors should not be a problem anymore.
>
Following up on my previous answer, you need to separate in your mind
Unicode the character set, and UTF-x the encoding. When Unicode was
introduced MS - in accordance with the thinking of the time - believed
the future was a 16-bit char, which can store 65 thousand characters.
(Note that, BY DEFINITION, the high bit of the full 32-bit code *must*
be zero - just like the high bit in standard ASCII - which is why the
complete code space is 31 bits, not 32.)

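To see the difference between the two encodings concretely, here is a
quick Python 3 sketch; the UTF-16 bytes are exactly what was seen in
the binary viewer above:

    >>> "1234".encode("utf-16-le")   # one 16-bit unit per character
    b'1\x002\x003\x004\x00'
    >>> "1234".encode("utf-8")       # identical to plain ASCII
    b'1234'
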
So MS and Windows use UTF-16 as their encoding. Unix LATER went down
the route of UTF-8, which can only encode about 2 thousand characters
in two bytes (code points up to U+07FF; the rest of the 16-bit range
takes three), but because most (western) text does encode successfully
in one byte per character it is actually a major saving in network
operations such as email, the web etc, which is where Unix has
traditionally been very strong.

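A quick sketch of how the UTF-8 length grows with the code point
(again plain Python 3, nothing more assumed):

    >>> len("A".encode("utf-8"))     # U+0041, ASCII: 1 byte
    1
    >>> len("é".encode("utf-8"))     # U+00E9, up to U+07FF: 2 bytes
    2
    >>> len("€".encode("utf-8"))     # U+20AC, rest of the BMP: 3 bytes
    3
    >>> len("😀".encode("utf-8"))    # U+1F600, beyond the BMP: 4 bytes
    4
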
But UTF-16 works very well for MS, because they are primarily desktop,
and under UTF-16 very few characters need more than one 16-bit unit (a
surrogate pair). That reduces pressure on the CPU, which is the
limiting resource on a desktop.

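The same point in Python: a BMP character is a single 16-bit unit,
while anything beyond the BMP costs a surrogate pair:

    >>> "A".encode("utf-16-le")      # BMP: one 16-bit unit
    b'A\x00'
    >>> "😀".encode("utf-16-le")     # beyond the BMP: pair D83D DE00
    b'=\xd8\x00\xde'
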
And lastly, very importantly, given that AT PRESENT all characters can
be encoded in 31 bits (Unicode nowadays actually caps the code space
at U+10FFFF), UTF-32 the encoding is equivalent to Unicode the
character set: each stored 32-bit value is simply the code point
itself. But should we ever need more than 2 billion characters, there
is nothing stopping us rolling out characters encoded in two 32-bit
units, and with them a hypothetical UTF-64.

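That equivalence is easy to see in Python - the value UTF-32 stores is
just the code point, padded out to 32 bits:

    >>> hex(ord("€"))                # the code point...
    '0x20ac'
    >>> "€".encode("utf-32-le")      # ...stored verbatim in 32 bits
    b'\xac \x00\x00'
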
43 |
Cheers, |
44 |
Wol |