Gentoo Archives: gentoo-user

From:	"J. Roeleveld" <joost@××××××××.org>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Date:	Tue, 15 Dec 2009 18:01:41
Message-Id:	`200912151705.51917.joost@antarean.org`
In Reply to:	[gentoo-user] [OT] Need advice from people who use non-ascii all day long by felix@crowfix.com

1	On Thursday 03 December 2009 20:20:03 felix@×××××××.com wrote:
2	> I have a project which requires normalizing names, and by that, I mean
3	> converting to lower case etc, whatever eliminates redundancies. I
4	> know Unicode has a different "normalize" meaning, but for my purposes,
5	> that has already been done. Maybe I should call it standardization or
6	> make up a new cromulent word.
7	>
8	> By which I really mean I am confused by a lot of advice I have gotten
9	> from USAians who get by with the good old 7 bit ASCII character set on
10	> a daily basis, whether it be written in Unicode or not.
11	>
12	> One of the puzzles to me is all the accented chars. Umlauts, etc. I
13	> am not trying to convert names for permanent purposes but for internal
14	> comparison. In Germany is a district "Busingen", with an umlauted
15	> 'u'. Is it reasonable to consider it the same word whether with or
16	> without the umlauted u? French has the cedilla and acute and grave
17	> accents. Spanish has the tilde n. Scandinavian languages (all?
18	> some?) have the o with a slash.
19	>
20	> Or put another way, I don't know much about German, French, Spanish,
21	> etc keyboards. Do your keyboards have any of the extra keys, all of
22	> them? Are German keyboards and French and Spanish keyboards as
23	> restricted to their own languages as US keyboards are? If you have to
24	> hit two or three keys to keep the umlauts, accents, and tildes, do you
25	> get lazy sometimes and type the base character by itself? Is it even
26	> considered the base character, or is it considered lazy and sloppy,
27	> much as I get complaints about typing "thru" because "through" is too
28	> much trouble?
29	>
30	> I need something the equivalent of the C function strcasecmp() which
31	> not only ignores case, but all other differences without distinction,
32	> whatever they may be. If leaving off umlauts horrifies academics and
33	> purists but is what people do in the real world, I want to take that
34	> into consideration, so that if one person uses the umlaut and another
35	> doesn't, it won't generate two separate entries. But if leaving off
36	> the umlaut or accent is a distinct place name, then I can't do that --
37	> but if real world people do that and live with the confusion, then I
38	> guess I have to make a different choice.
39	>
40	> Yes, I am something of an ignorant American. I know some Japanese,
41	> French, and Spanish, but not the details of everyday usage. I'd like
42	> to learn.
43	>
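The strcasecmp()-like comparison asked for in the quoted message could be sketched roughly as below. This is only a minimal sketch in Python using the standard unicodedata module; the names fold_name and same_name are made up for illustration, and whether folding accents away is acceptable at all is exactly the open question of this thread.

```python
import unicodedata


def fold_name(name: str) -> str:
    """Return a comparison key: accents stripped, case folded (illustrative helper)."""
    # NFKD splits "ü" into "u" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); it also maps "ß" to "ss".
    return stripped.casefold()


def same_name(a: str, b: str) -> bool:
    """Compare two names while ignoring case and combining accents."""
    return fold_name(a) == fold_name(b)
```

With this, same_name("Büsingen", "Busingen") is true. Letters that are not a base character plus an accent, such as the Scandinavian "ø", are left alone, and the common German transcription of "ü" as "ue" is not handled either, so both would need an explicit mapping if they should match.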
44	Hi Felix,
45
46	Apart from what was already mentioned, you might want to also consider the
47	following:
48
49	1) Even though people tend to try to do it correctly, non-natives can still
50	make mistakes with the names. These mistakes are frowned upon by the natives,
51	but are a part of life.
52
53	2) Names of cities can change with time, example:
54	"New York" used to be called "Nieuw Amsterdam" (or "New Amsterdam" in English)
55
56	3) Some cities have multiple valid spellings in the same language:
57	The Hague = "Den Haag" or "'s-Gravenhage" (yes, the apostrophe before the "s"
58	is part of the second version of the name)
59
60	An easier option might be to filter on the postcodes. These should be unique,
61	and if you put the country's international abbreviation in front, like so:
62	"NL-1234 AA" or "D-12345", you have a single field to check and can then link
63	the actual city name to the postcode area.
64
65	Disclaimer: I have no clue if these 2 postcodes actually exist
66
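The postcode idea could look roughly like the sketch below. It is only an illustration in Python: the function name postcode_key and the lookup table are hypothetical, and the postcode is the made-up one from above, not real data.

```python
def postcode_key(country: str, postcode: str) -> str:
    """Build a single comparison key from a country code and a postcode."""
    # Remove all whitespace and upper-case, so "1234 aa" and "1234AA" match.
    compact = "".join(postcode.split()).upper()
    return f"{country.strip().upper()}-{compact}"


# Hypothetical table linking one postcode area to every spelling seen for it.
# The postcode is a placeholder and may not exist.
names_by_postcode = {
    postcode_key("NL", "1234 AA"): {"Den Haag", "'s-Gravenhage"},
}


def known_names(country: str, postcode: str) -> set:
    """Return the spellings recorded for a postcode area (empty set if unknown)."""
    return names_by_postcode.get(postcode_key(country, postcode), set())
```

Here postcode_key("nl", "1234 aa") gives "NL-1234AA", so the key drops the space that is normally written inside a Dutch postcode. Two entries are then treated as the same place when their keys match, whichever spelling of the city name was typed.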
67	--
68	Joost
