Gentoo Archives: gentoo-dev

From: Wiktor Wandachowicz <siryes@×××××.com>
To: gentoo-dev@l.g.o
Subject: [gentoo-dev] UTF-8 encoding and file format of manuals
Date: Thu, 01 Jun 2006 14:48:23
Message-Id: loom.20060601T142548-650@post.gmane.org
1 Respectful Gentoo developers,
2
3 I would like to ask what you think about UTF-8 encoded manual pages.
4 I mean files like ls.1.gz, which are used by the venerable "man" program.
5 I recently looked into the problem, and before submitting any
6 patches/proposals to Gentoo bugzilla I'd like to hear your opinions.
7
8 Disclaimer: for daily use I have LANG="pl_PL.UTF-8" and LC_ALL="pl_PL.UTF-8",
9 but the original issue is of a more universal nature.
10
11 Back on subject. ISO-8859-* 8-bit encodings are fine and most localized
12 manuals use them. However, there are some examples where UTF-8 manuals are
13 installed as well. For example, the newest portage uses "linguas_pl" like this:
14
15 $ emerge -pv portage
16 [ebuild R ] sys-apps/portage-2.1_rc3-r3 USE="-build -doc" LINGUAS="pl"
17
18 As a result, translated manual pages are added to the system. The problem
19 is that they use UTF-8 encoding. Having both man-pages-pl and this version
20 of portage installed gives unexpected results: "man ls" prints all
21 the letters with the correct encoding, but "man emerge" does not. On the other
22 hand, if "man" is configured to display UTF-8 encoded manuals correctly,
23 all the other manuals print garbled characters instead of the desired output.
24
25 I wrote a simple script [1] which checks all installed Polish manuals
26 using the "file" program. For the "pl" locale it currently produces about 70kB
27 of text, and for the default locale about 458kB. After grepping for all
28 occurrences of "UTF" I found that only the newest portage's manuals
29 are in UTF-8 ("pl"), plus: flow.1, gnome-keyring-manager.1, ImageMagick.1,
30 Encode::Unicode::UTF7.3pm (though I think those are false positives).
31
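For illustration only, a check of this kind could be sketched in Python as below. This is not the checkman script from [1], which relies on "file"; it is a rough stand-in that decodes the page bytes directly. The function names and the "8-bit" label are made up for the example, and only plain and gzip-compressed pages are handled.

```python
import gzip
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or '8-bit' for the raw bytes of a page."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: most likely a legacy encoding such as ISO-8859-2.
        return "8-bit"


def read_page(path: Path) -> bytes:
    """Read a manual page, transparently decompressing *.gz files."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return handle.read()


def scan(man_dir: Path) -> dict[str, list[Path]]:
    """Group every file below man_dir by its detected encoding class."""
    result: dict[str, list[Path]] = {"ascii": [], "utf-8": [], "8-bit": []}
    for path in sorted(man_dir.rglob("*")):
        if path.is_file():
            result[classify_encoding(read_page(path))].append(path)
    return result
```

Calling scan(Path("/usr/share/man/pl")) would then list the pages per class; that path is an assumption about where the Polish pages are installed. Pure-ASCII pages are reported separately because they display correctly under either setup.
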
32 While it's easy to contact the Polish translators of portage's manuals so
33 they can correct them, the problem will have to be solved sooner or later.
34 UTF-8 encoded manuals will probably become more common, and some
35 general solution is needed.
36
37 After some discussion on the Polish forum [2] I've learnt about groff
38 deficiencies with UTF-8 handling. However, a wrapper exists [3] that helps
39 somewhat in that matter. But it also requires that all manuals be unified
40 wrt. encoding: *all* ISO-8859-* or *all* UTF-8, no compromise.
41 So I don't know what course to take.
42
43 Summing up:
44 * UTF-8 manuals: good or bad?
45 * how to handle mixed encodings of manuals?
46 * should man and/or groff handle UTF-8 better?
47 * should an eclass function be created to aid in correcting the encoding
48 of manual pages while installing them?
49
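To make the last question more concrete, here is a rough Python sketch of the conversion step such a helper might perform at install time. It is not an existing eclass function; normalize_page and its parameters are hypothetical, and ISO-8859-2 as the legacy encoding is an assumption that fits Polish pages only.

```python
def normalize_page(data: bytes,
                   legacy: str = "iso-8859-2",
                   target: str = "utf-8") -> bytes:
    """Re-encode the raw bytes of a manual page into the target encoding.

    The input is treated as UTF-8 if it decodes cleanly, otherwise as the
    given legacy encoding. Raises UnicodeEncodeError if the page contains
    characters that the target encoding cannot represent.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 uses `legacy`.
        text = data.decode(legacy)
    return text.encode(target)
```

The same function covers both directions: with the defaults it brings legacy pages to UTF-8, and with target="iso-8859-2" it would bring UTF-8 pages such as the portage ones back in line with man-pages-pl. The UTF-8 test is a heuristic, so a legacy page that happens to be valid UTF-8 would be misread.
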
50 Any constructive comments are more than welcome!
51
52 Best regards,
53 Wiktor Wandachowicz
54 (SirYes)
55
56 [1] http://ics.p.lodz.pl/~wiktorw/gentoo/checkman
57 [2] http://forums.gentoo.org/viewtopic-p-3352287.html
58 [3] http://hoth.amu.edu.pl/~d_szeluga/groff-utf8.tar.bz2
59
60
61 --
62 gentoo-dev@g.o mailing list