I have a project which requires normalizing names, and by that, I mean
converting to lower case etc, whatever eliminates redundancies. I
know Unicode has a different "normalize" meaning, but for my purposes,
that has already been done. Maybe I should call it standardization or
make up a new cromulent word.

By which I really mean I am confused by a lot of advice I have gotten
from USAians who get by with the good old 7-bit ASCII character set on
a daily basis, whether it be written in Unicode or not.

One of the puzzles to me is all the accented chars. Umlauts, etc. I
am not trying to convert names for permanent purposes but for internal
comparison. In Germany there is a district "Büsingen", with an umlauted
'u'. Is it reasonable to consider it the same word whether with or
without the umlauted u? French has the cedilla and acute and grave
accents. Spanish has the tilde n. Scandinavian languages (all?
some?) have the o with a slash.

Or put another way, I don't know much about German, French, Spanish,
etc. keyboards. Do your keyboards have any of the extra keys, all of
them? Are German keyboards and French and Spanish keyboards as
restricted to their own languages as US keyboards are? If you have to
hit two or three keys to keep the umlauts, accents, and tildes, do you
get lazy sometimes and type the base character by itself? Is it even
considered the base character, or is it considered lazy and sloppy,
much as I get complaints about typing "thru" because "through" is too
much trouble?

I need the equivalent of the C function strcasecmp(), but one which
ignores not only case but all other differences without distinction,
whatever they may be. If leaving off umlauts horrifies academics and
purists but is what people do in the real world, I want to take that
into consideration, so that if one person uses the umlaut and another
doesn't, it won't generate two separate entries. But if leaving off
the umlaut or accent yields a distinct place name, then I can't do that --
but if real world people do that and live with the confusion, then I
guess I have to make a different choice.
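
To make the idea concrete, here is a rough Python sketch of the kind of
comparison meant above. It is only an illustration under assumptions, not
a recommendation: it uses Python's standard unicodedata module, and the
names EXTRA_FOLDS, comparison_key, and same_name are made up for this
example. It decomposes each character (NFKD), drops the combining marks,
and case-folds what is left. Letters such as the slashed o have no Unicode
decomposition, so a few are mapped by hand; that table is an assumption
and far from complete. Whether folding this aggressively is acceptable
for real place names is exactly the open question.

```python
import unicodedata

# Hypothetical hand-written folds for letters that NFKD does not decompose.
# This table is an assumption for illustration and is far from complete.
EXTRA_FOLDS = str.maketrans({
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ł": "l", "Ł": "L",
})


def comparison_key(name: str) -> str:
    """Return a key for internal comparison only, never for display."""
    # Apply the hand-written folds first, then split accented letters
    # into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name.translate(EXTRA_FOLDS))
    # Drop the combining marks (umlaut dots, acute, grave, cedilla, tilde).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); for example it turns
    # the German sharp s into "ss".
    return stripped.casefold()


def same_name(a: str, b: str) -> bool:
    """Rough analogue of strcasecmp(a, b) == 0 that also ignores accents."""
    return comparison_key(a) == comparison_key(b)
```

With this, same_name("Büsingen", "BUSINGEN") is true, and "Tromsø" compares
equal to "Tromso" only because of the hand-written table. Two places whose
names differ only by an accent would collapse into one entry, so the key
is suitable only if that confusion is acceptable.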

Yes, I am something of an ignorant American. I know some Japanese,
French, and Spanish, but not the details of everyday usage. I'd like
to learn.

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix@×××××××.com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o