Gentoo Archives: gentoo-dev

From: Ulrich Mueller <ulm@g.o>
To: gentoo-dev@l.g.o
Subject: [gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N)
Date: Fri, 10 Jun 2016 09:29:55
Message-Id: 22362.34949.883069.715248@a1i15.kph.uni-mainz.de
In Reply to: Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N by "Chí-Thanh Christopher Nguyễn"
>>>>> On Tue, 7 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:

>> 4. According to Gettext documentation, "'@VARIANT' can denote any
>> kind of characteristics that is not already implied by the language
>> LL and the country CC." (So IIUC the BCP-47 variant "valencia"
>> would become "@valencia".)

> This I think is wrong and collides with POSIX.
> POSIX modifiers are not allowed for LANG or LC_ALL in
> POSIX.1-2008[1]. Section 8.2 says you can have at most one modifier
> field to "select a specific instance of localization data within a
> single category", which I don't think applies because it is its own
> locale, not an instance of an existing one. Furthermore (but that
> doesn't apply in our use case), POSIX spec lists the example
> LC_COLLATE=De_DE@dict
> So what if you want Catalan Valencian with dictionary order? Or if
> someone hypothetically came up with a different script?

>> I haven't found any mention or usage of ISO 3166-2 region
>> subdivisions in the context of locale. Can you provide any
>> references for this?

> As I wrote before, it is not used. But I think it is the only
> spec-compliant way to marry POSIX locales with Catalan Valencian.
> BCP-47 does it in a more natural way.

So, trying to summarise: We cannot follow strict POSIX syntax, so our
two choices are either to stick to Gettext LL_CC@VARIANT syntax or
to change to BCP 47.

Using BCP 47 would have some advantages:
- It is a well-defined standard [1] and tools for validation of
  language tags exist, e.g. [2].
- The L10N USE_EXPAND could follow usual USE flag syntax, as BCP 47
  tags contain neither underscores (which are supposed to be reserved
  as USE_EXPAND separators) nor @ signs (which PMS explicitly
  mentions as an exception for LINGUAS).
- Gettext's @VARIANT is ill-defined and conflates different
  characteristics like script and variant. There is no further
  subdivision within @VARIANT, which leads to locale names like
  sr@ijekavianlatin. Also, different upstreams use different
  conventions, like @latin and @Latn for the Latin script.
- For the vast majority of languages, identifiers are either identical
  ("de" -> "de") or they can be converted by simple shell substitution
  ("pt-BR" -> "pt_BR").
- IIUC, L10N is primarily intended to control things like additional
  language bundles of packages. Some upstreams like libreoffice
  already use BCP 47 for these.

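To illustrate the point about subdivision, here is a minimal Python
sketch that splits a tag into its subtags. It assumes a simplified
subset of the BCP 47 grammar (language, optional script, optional
region, variants); the names split_langtag and _LANGTAG are made up
for this example and are not part of any existing tool.

```python
import re

# Simplified subset of the BCP 47 "langtag" grammar. Extended language
# subtags, extensions, private-use and grandfathered tags are NOT
# handled here; this is an illustration, not a validator.
_LANGTAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def split_langtag(tag):
    """Split a tag into language, script, region and variant subtags.

    Raises ValueError for anything outside the simplified grammar,
    which includes every tag containing "_" or "@".
    """
    m = _LANGTAG.match(tag)
    if m is None:
        raise ValueError("not a (simplified) BCP 47 tag: %r" % tag)
    variants = [v for v in m.group("variants").split("-") if v]
    return {
        "language": m.group("language"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": variants,
    }
```

With this, "sr-Latn" yields a script subtag and "ca-ES-valencia" a
region plus a variant, each in its own field, whereas Gettext-style
names such as "sr@latin" or "pt_BR" are rejected because of the @ sign
and the underscore.
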
On the other hand, there will be some cost:
- If BCP 47 tags containing a script or a variant should be used to
  generate LINGUAS, they will require explicit mapping. (OTOH, such
  mapping will also be needed if we stick to Gettext syntax but unify
  variants like "sr@latin" and "sr@Latn".)
- Different syntax for LINGUAS and L10N might be confusing to users,
  so additional documentation will be needed.

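A rough Python sketch of how LINGUAS values might be generated from
L10N tags: plain substitution for language and language-region tags,
and a lookup table for anything carrying a script or variant. The
function name and the table entries are hypothetical and only
illustrate the idea; no actual mapping has been decided.

```python
import re

# Hypothetical explicit mapping for tags that carry a script or a
# variant. The entries are illustrative assumptions only.
EXPLICIT_MAP = {
    "sr-Latn": "sr@latin",
    "ca-valencia": "ca@valencia",
}

# Tags of the form "ll" or "ll-CC" can be converted mechanically.
_SIMPLE = re.compile(r"^[a-z]{2,3}(?:-[A-Z]{2})?$")


def l10n_to_linguas(tag):
    """Return the Gettext-style name for a BCP 47 tag.

    Raises KeyError if the tag has a script or variant subtag and no
    explicit mapping entry exists for it.
    """
    if tag in EXPLICIT_MAP:
        return EXPLICIT_MAP[tag]
    if _SIMPLE.match(tag):
        # "de" stays "de", "pt-BR" becomes "pt_BR".
        return tag.replace("-", "_")
    raise KeyError("no explicit mapping for %r" % tag)
```

Tags without an entry in the table and outside the simple ll or ll-CC
form raise an error instead of being guessed, so any missing mapping
shows up immediately.
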
Comments?

Ulrich

[1] https://tools.ietf.org/html/bcp47
[2] http://schneegans.de/lv/
