On 7/27/05, Moshe Kaminsky <kaminsky@××××××××××××.il> wrote:
> Hi,
> * Fernando Canizo <conan@××××××××××.ar> [27/07/05 14:14]:
<snip> |
> > I investigate what was in the archives, so i saved a copy (using 'C'
> > command from mutt) of the first message (the one i receive from me)
> > and file says: 'UTF-8 Unicode mail text', check what's inside with
> > hexedit and see that LATIN SMALL LETTER A WITH ACUTE is encoded with
> > this hex: C3 A1 (which is not 00 E1 from unicode chart from
> > http://www.unicode.org/charts/)
>
> I think this is just the way these characters are represented in utf-8.
|
Yes, it is.

00E1 hex is '00000000 11100001' in binary.
|
When this value is encoded as UTF-8 it is stored in two bytes, as is
every code point from U+0080 through U+07FF.
|
The last byte will begin with '10' followed by the last 6 bits of data.

'10 100001' binary or 'A1' in hex.
|
The first byte will begin with '110' to indicate that it is a two-byte
character, followed by the remaining 5 significant bits.

'110 00011' binary or 'C3' hex.
|
This is correct. |
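
The same two bytes can be reproduced with a few lines of Python. This is
only a sketch: the function name encode_two_byte is made up for
illustration, and it handles just the two-byte range (U+0080 through
U+07FF), not UTF-8 in general.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles two-byte code points")
    first = 0b11000000 | (code_point >> 6)        # '110' + upper 5 bits
    last = 0b10000000 | (code_point & 0b111111)   # '10' + lower 6 bits
    return bytes([first, last])


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)


manual = encode_two_byte(0x00E1)
builtin = "\u00e1".encode("utf-8")   # Python's own encoder, for comparison

print(hex_bytes(manual))   # C3 A1
print(hex_bytes(builtin))  # C3 A1
```

The hand-built bytes match what the built-in encoder produces, so C3 A1
in the saved message is what one would expect for U+00E1.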
|
The problem seems to be that mutt(?) takes this UTF-8 encoded data
and encodes it as UTF-8 again, as if the data were two 8-bit characters.
|
'C3' then becomes 'C3 83' and 'A1' becomes 'C2 A1' |
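
The double encoding can be reproduced in Python as well. I am assuming
here that the misreading treats each byte as one ISO-8859-1 (Latin-1)
character, which is one way of reading "two 8-bit characters"; whether
mutt or something else in the chain actually does this is not
established. The helper hex_bytes is made up for illustration.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)


original = "\u00e1"                  # LATIN SMALL LETTER A WITH ACUTE
once = original.encode("utf-8")      # the correct UTF-8 bytes

# Assumption: the bytes are misread as Latin-1, one character per byte,
# and the resulting two characters are then encoded as UTF-8 again.
misread = once.decode("latin-1")
twice = misread.encode("utf-8")

print(hex_bytes(once))    # C3 A1
print(hex_bytes(twice))   # C3 83 C2 A1

# Under the same assumption, reversing the steps recovers the original.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

If the archived message shows C3 83 C2 A1 where the character should be,
that would be consistent with this kind of double encoding.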
|
/Andreas |
|
-- |
gentoo-user@g.o mailing list |