On Thursday 03 December 2009 20:20:03 felix@×××××××.com wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies. I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done. Maybe I should call it standardization or
> make up a new cromulent word.
>
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
>
> One of the puzzles to me is all the accented chars. Umlauts, etc. I
> am not trying to convert names for permanent purposes but for internal
> comparison. In Germany there is a district "Busingen", with an umlauted
> 'u'. Is it reasonable to consider it the same word whether with or
> without the umlauted u? French has the cedilla and acute and grave
> accents. Spanish has the tilde n. Scandinavian languages (all?
> some?) have the o with a slash.
>
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards. Do your keyboards have any of the extra keys, all of
> them? Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are? If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself? Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
>
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be. If leaving off umlauts horrifies academics and
> purists but is what people do in the real world, I want to take that
> into consideration, so that if one person uses the umlaut and another
> doesn't, it won't generate two separate entries. But if leaving off
> the umlaut or accent yields a distinct place name, then I can't do that --
> but if real world people do that and live with the confusion, then I
> guess I have to make a different choice.
>
> Yes, I am something of an ignorant American. I know some Japanese,
> French, and Spanish, but not the details of everyday usage. I'd like
> to learn.
>
Hi Felix,

Apart from what was already mentioned, you might also want to consider the
following:

1) Even though people tend to try to do it correctly, non-natives can still
make mistakes with the names. These mistakes are frowned upon by the natives,
but are a part of life.

2) Names of cities can change with time, for example:
"New York" used to be called "Nieuw Amsterdam" (or "New Amsterdam" in English)

3) Some cities have multiple valid spellings in the same language:
The Hague = "Den Haag" or "'s-Gravenhage" (yes, the apostrophe before the "s"
is part of the second version of the name)
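As a rough illustration of points 2 and 3, here is a minimal Python sketch of an alias table that maps every known spelling to one canonical name. The names fold(), register(), canonical_name() and ALIASES are made up for this example. Whether dropping accents is acceptable for your data is exactly the open question, so treat the folding rule as an assumption, not a recommendation.

```python
import unicodedata

# Hypothetical alias table: folded spelling -> canonical name.
ALIASES = {}

def fold(name):
    """Build a comparison key: ignore case, accents and punctuation.

    Assumption: dropping accents is acceptable for matching. Letters
    with no decomposition (for example the o with a slash) are kept
    as they are, and the German "ue" for "u-umlaut" convention is
    not handled.
    """
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    chars = []
    for c in decomposed:
        if unicodedata.combining(c):
            continue  # drop combining accents such as the umlaut dots
        chars.append(c if c.isalnum() else " ")  # punctuation -> space
    return " ".join("".join(chars).split())  # collapse runs of spaces

def register(canonical, *spellings):
    """Record the canonical name and all alternative spellings."""
    for spelling in (canonical, *spellings):
        ALIASES[fold(spelling)] = canonical

def canonical_name(name):
    """Return the canonical name, or None if the spelling is unknown."""
    return ALIASES.get(fold(name))

register("Den Haag", "'s-Gravenhage", "'s Gravenhage", "The Hague")
register("New York", "Nieuw Amsterdam", "New Amsterdam")
```

With this table, canonical_name("DEN HAAG") and canonical_name("'s gravenhage") both give "Den Haag", while an unregistered name gives None, so it can be flagged for manual review instead of silently creating a second entry.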

An easier option might be to filter on the postcodes. These should be unique,
and if you put the country's international abbreviation in front of each one,
like so: "NL-1234 AA" or "D-12345", you have a single field to check and can
then link the actual city name to the postcode area.
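A minimal Python sketch of that idea, assuming the country prefix and the postcode arrive as separate strings. The function name postcode_key() is made up for this example. Stripping the internal space (so "1234 AA" and "1234aa" compare equal) is my own assumption, and the postcodes and the city name are placeholders only.

```python
def postcode_key(country, postcode):
    """Combine country prefix and postcode into one comparison key."""
    # Assumption: spacing and case inside a postcode carry no meaning.
    compact = "".join(postcode.split()).upper()
    return f"{country.strip().upper()}-{compact}"

# Hypothetical lookup table: one key per postcode area.
cities = {}
cities[postcode_key("NL", "1234 AA")] = "placeholder city name"

# Differently typed versions of the same postcode hit the same entry.
same_entry = cities.get(postcode_key("nl", "1234aa"))
```

The city name is then only stored as a label for the key, so spelling variants of the name no longer create separate entries.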

Disclaimer: I have no clue whether these 2 postcodes actually exist.

--
Joost