Gentoo Archives: gentoo-user

From: felix@×××××××.com
To: gentoo-user@l.g.o
Subject: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Date: Thu, 03 Dec 2009 19:20:38
Message-Id: 20091203192003.GA1702@crowfix.com
1 I have a project which requires normalizing names, and by that, I mean
2 converting to lower case etc, whatever eliminates redundancies. I
3 know Unicode has a different "normalize" meaning, but for my purposes,
4 that has already been done. Maybe I should call it standardization or
5 make up a new cromulent word.
6
7 By which I really mean I am confused by a lot of advice I have gotten
8 from USAians who get by with the good old 7 bit ASCII character set on
9 a daily basis, whether it be written in Unicode or not.
10
11 One of the puzzles to me is all the accented chars. Umlauts, etc. I
12 am not trying to convert names for permanent purposes but for internal
13 comparison. In Germany there is a district "Busingen", with an umlauted
14 'u'. Is it reasonable to consider it the same word whether with or
15 without the umlauted u? French has the cedilla and acute and grave
16 accents. Spanish has the tilde n. Scandinavian languages (all?
17 some?) have the o with a slash.
18
19 Or put another way, I don't know much about German, French, Spanish,
20 etc keyboards. Do your keyboards have any of the extra keys, all of
21 them? Are German keyboards and French and Spanish keyboards as
22 restricted to their own languages as US keyboards are? If you have to
23 hit two or three keys to keep the umlauts, accents, and tildes, do you
24 get lazy sometimes and type the base character by itself? Is it even
25 considered the base character, or is it considered lazy and sloppy,
26 much as I get complaints about typing "thru" because "through" is too
27 much trouble?
28
29 I need something the equivalent of the C function strcasecmp() which
30 not only ignores case, but all other differences without distinction,
31 whatever they may be. If leaving off umlauts horrifies academics and
32 purists but is what people do in the real world, I want to take that
33 into consideration, so that if one person uses the umlaut and another
34 doesn't, it won't generate two separate entries. But if leaving off
35 the umlaut or accent is a distinct place name, then I can't do that --
36 but if real world people do that and live with the confusion, then I
37 guess I have to make a different choice.
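
To make the kind of comparison I have in mind concrete, here is a rough sketch in Python, not a settled design. It assumes that simply dropping diacritics is acceptable, which is exactly the point I am unsure about. The names fold_name, same_name, and EXTRA_FOLDS are made up for illustration; the only library calls are from Python's standard unicodedata module and str methods. Letters such as the slashed o have no Unicode decomposition, so the small hand-written table for them is an assumption too.

```python
import unicodedata

# Hypothetical table: letters with no Unicode decomposition need explicit folds.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O"})


def fold_name(name: str) -> str:
    """Return a comparison key: no diacritics, case-insensitive.

    Assumption: treating accented and unaccented spellings as equal
    is acceptable for internal comparison only.
    """
    # Apply the hand-written folds first (e.g. slashed o).
    translated = name.translate(EXTRA_FOLDS)
    # Decompose so that 'ü' becomes 'u' followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", translated)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() (it also maps 'ß' to 'ss').
    return stripped.casefold()


def same_name(a: str, b: str) -> bool:
    """Rough analogue of strcasecmp(a, b) == 0 that also ignores accents."""
    return fold_name(a) == fold_name(b)
```

With this, same_name("BÜSINGEN", "Busingen") is true. The folded key would only be used for lookups; the name as originally typed would still be stored and displayed. Note that this does not handle the German convention of writing "ue" for an umlauted u, which would need a separate rule.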
38
39 Yes, I am something of an ignorant American. I know some Japanese,
40 French, and Spanish, but not the details of everyday usage. I'd like
41 to learn.
42
43 --
44 ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
45 Felix Finch: scarecrow repairman & rocket surgeon / felix@×××××××.com
46 GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
47 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
