On 19/02/2012 01:00, James Cloos wrote:
>>>>>> "KM" == Kerin Millar<kerframil@×××××.com> writes:
>
> KM> Arch also used to define LC_COLLATE="C" by default, probably to
> KM> mitigate unpredictable behaviour in some applications, but have
> KM> since dropped this additional variable so they must have deemed it
> KM> no longer necessary.
>
> Without LC_COLLATE="C" things like [a-z]* get a false-positive match
> on files like Makefile.
|
Indeed, character classes are a potential minefield. Incidentally, I
just tested Ubuntu and Arch with only LANG set to a UTF-8 locale:-
|
$ echo Makefile | sed -re 's/[a-z]//g' # collation rules ignored
M

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile
|
In neither case are the collation rules being obeyed. In Gentoo, however:-
|
$ echo Makefile | sed -re 's/[a-z]//g' # collation rules obeyed

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile
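
When reproducing results like these, it helps to confirm which collation
locale the tools actually see, since LC_ALL overrides LC_COLLATE, which in
turn overrides LANG. A minimal sketch of that precedence follows. The
function name effective_collate is made up for illustration, and the
fallback to "C" when nothing is set is an assumption (the real default is
left to the implementation).

```shell
# Hypothetical helper: report which locale governs collation.
# Empty values are treated the same as unset ones.
effective_collate() {
    # Falling back to "C" is an assumption; the actual default is
    # implementation-defined.
    echo "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

If this prints something other than the value of LANG, one of the LC_*
variables is set somewhere and the comparison between distributions is not
like for like.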
|
Obeying the collation rules is ostensibly the correct thing to do but,
until everyone starts using named character classes such as [[:lower:]]
(which will never happen), it's not safe. The thing that worries me here
is the inconsistency in Gentoo: sed obeys the collation rules while grep
does not. LC_COLLATE="C" is sufficient to work around the issue but the
above makes me wonder why we still need it.
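
The two ways of getting predictable behaviour can be compared directly.
The following is only a sketch: it forces the C locale with LC_ALL rather
than LC_COLLATE, so that the result does not depend on whatever else is set
in the environment, and the variable names are arbitrary.

```shell
# Range expression, with the C locale forced for this one command.
ranged=$(echo Makefile | LC_ALL=C sed -re 's/[a-z]//g')

# Named character class; in the C locale it covers the same a-z set.
named=$(echo Makefile | LC_ALL=C sed -re 's/[[:lower:]]//g')

echo "$ranged"
echo "$named"
```

In the C locale both forms strip only the lowercase letters and leave the
leading M. The named class states that intent explicitly, whereas the range
depends on how the active locale orders characters.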
|
--Kerin