Gentoo Archives: gentoo-user

From: "J. Roeleveld" <joost@××××××××.org>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Date: Tue, 15 Dec 2009 18:01:41
Message-Id: 200912151705.51917.joost@antarean.org
In Reply to: [gentoo-user] [OT] Need advice from people who use non-ascii all day long by felix@crowfix.com
On Thursday 03 December 2009 20:20:03 felix@×××××××.com wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies. I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done. Maybe I should call it standardization or
> make up a new cromulent word.
>
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
>
> One of the puzzles to me is all the accented chars. Umlauts, etc. I
> am not trying to convert names for permanent purposes but for internal
> comparison. In Germany is a district "Busingen", with an umlauted
> 'u'. Is it reasonable to consider it the same word whether with or
> without the umlauted u? French has the cedilla and acute and grave
> accents. Spanish has the tilde n. Scandinavian languages (all?
> some?) have the o with a slash.
>
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards. Do your keyboards have any of the extra keys, all of
> them? Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are? If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself? Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
>
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be. If leaving off umlauts horrifies academics and
> purists but is what people do in the real world, I want to take that
> into consideration, so that if one person uses the umlaut and another
> doesn't, it won't generate two separate entries. But if leaving off
> the umlaut or accent is a distinct place name, then I can't do that --
> but if real world people do that and live with the confusion, then I
> guess I have to make a different choice.
>
> Yes, I am something of an ignorant American. I know some Japanese,
> French, and Spanish, but not the details of everyday usage. I'd like
> to learn.
>
Hi Felix,

Apart from what was already mentioned, you might also want to consider the
following:

1) Even though people try to spell names correctly, non-natives still make
mistakes. Natives frown on these mistakes, but they are a part of life.

2) Names of cities can change over time. For example, "New York" used to be
called "Nieuw Amsterdam" (or "New Amsterdam" in English).

3) Some cities have multiple valid spellings in the same language:
The Hague = "Den Haag" or "'s Gravenhage" (yes, the apostrophe before the "s"
is part of the second version of the name).
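
To make points 2 and 3 concrete, here is a rough Python sketch of an alias
table sitting on top of a crude comparison key. The names fold, ALIASES and
canonical are made up for illustration, and the choice of which spelling
counts as the "canonical" one is arbitrary. Whether dropping accents is
acceptable at all is exactly the open question in this thread, so treat the
folding step as one possible choice, not a recommendation. Note that it only
removes marks that Unicode can split off from the base letter; the
Scandinavian "o with a slash" has no such decomposition and stays as it is.

```python
import unicodedata
from typing import Optional


def fold(name: str) -> str:
    """Crude comparison key: case-folded, combining accents dropped,
    punctuation treated as a space, runs of whitespace collapsed."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    chars = []
    for c in decomposed:
        if unicodedata.combining(c):
            continue              # drop the accent, keep the base letter
        chars.append(c if c.isalnum() else " ")
    return " ".join("".join(chars).split())


# Hypothetical alias table built from the examples above: every known
# spelling (old, foreign, alternative) points at one chosen name.
ALIASES = {
    fold("Den Haag"): "Den Haag",
    fold("'s Gravenhage"): "Den Haag",
    fold("The Hague"): "Den Haag",
    fold("New York"): "New York",
    fold("Nieuw Amsterdam"): "New York",
    fold("New Amsterdam"): "New York",
}


def canonical(name: str) -> Optional[str]:
    """Return the chosen name for a known spelling, or None if unknown."""
    return ALIASES.get(fold(name))
```

An unknown spelling returns None instead of guessing, so misspellings as in
point 1 show up as misses that someone can review and add to the table.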

An easier option might be to match on postcodes, which should be unique
within a country. If you put the country's international abbreviation in
front, like so: "NL-1234 AA" or "D-12345", you have a single field to check,
and you can then link the actual city name to the postcode area.

Disclaimer: I have no clue whether these 2 postcodes actually exist.
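
As a minimal sketch of that idea in Python: the function postcode_key, the
dictionary city_by_key and the city names are all invented for illustration,
and the postcodes are the same made-up ones as above. The sketch also assumes
that spaces and letter case inside a postcode carry no meaning, which you
would have to check per country.

```python
import re


def postcode_key(country: str, postcode: str) -> str:
    """Build one comparison key such as 'NL-1234AA' from a country
    abbreviation and a postcode, ignoring spacing and letter case."""
    compact = re.sub(r"[\s-]+", "", postcode).upper()
    return f"{country.strip().upper()}-{compact}"


# Hypothetical lookup table: the key identifies the area, and the city
# name is only attached to it for display.
city_by_key = {
    postcode_key("NL", "1234 AA"): "Example city (NL)",
    postcode_key("D", "12345"): "Example city (D)",
}


def city_for(country: str, postcode: str):
    """Look up the city linked to a postcode area, or None if unknown."""
    return city_by_key.get(postcode_key(country, postcode))
```

With this, "nl" plus "1234aa" and "NL" plus "1234 AA" end up in the same
entry, however the city name itself was typed.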

--
Joost