All,

TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
ideally) for our XML validation needs. It is more expressive and more
readable.

Most people who work with XML know that DTDs are not a great solution
for validation. Their expressive power is very limited; there are a few
examples of this in our metadata.dtd [1]. For a few years now, I've
wanted to see if we could replace metadata.dtd with something in
RELAX NG, which is a more modern XML schema language; it's an ISO
standard with an emphasis on readability both for humans and for tools
(by using a rigorous formalism). Some arguments in favor of RELAX NG
(and some counter-arguments) are enumerated on Tim Bray's weblog [2].
As an example, I've created a compact syntax schema for metadata that
can validate all metadata.xml files currently in the tree [3].

Some arguments against, with my responses:

- Not enough tool support for RELAX NG: I'd be curious to hear what
  tools you want to use. At least libxml2 supports RELAX NG natively.
  The Python lxml library uses that support to provide pretty simple
  RELAX NG validation. libxml2 does not have native compact syntax
  support, but I maintain a simple library called rnc2rng [4] that is
  used transparently by lxml if installed. rnc2rng also comes with an
  rnc2rng command-line script to do the conversion.

- Performance: in a quick test with lxml (backed by libxml2), RELAX NG
  validation takes about as long as DTD validation (roughly 7% more
  wall-clock time). Testing with ~19000 metadata.xml files in the tree,
  with DTD (best of 3):

    real 0m2.861s
    user 0m2.560s
    sys  0m0.296s

  With RNC (best of 3):

    real 0m3.058s
    user 0m2.688s
    sys  0m0.364s
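
To make the lxml route and the best-of-3 timing concrete, here is a
minimal sketch. It is not the script used for the numbers above. The
schema and sample documents are made up for illustration (the real
schema is [3]), and load_schema, validate_all and best_of are
hypothetical helper names. It uses the RELAX NG XML syntax so that it
runs with lxml alone, without rnc2rng.

```python
import time

from lxml import etree

# Hypothetical, heavily simplified schema in RELAX NG XML syntax.
# It is NOT the real metadata schema; it only illustrates the API.
RNG_SCHEMA = b"""\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="pkgmetadata">
      <zeroOrMore>
        <element name="maintainer">
          <element name="email"><text/></element>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
"""

# Made-up sample documents: the second one lacks the required <email>.
SAMPLE_DOCS = [
    b"<pkgmetadata><maintainer><email>dev@example.org</email>"
    b"</maintainer></pkgmetadata>",
    b"<pkgmetadata><maintainer/></pkgmetadata>",
]


def load_schema(rng_bytes):
    """Compile a RELAX NG (XML syntax) schema with lxml/libxml2."""
    return etree.RelaxNG(etree.fromstring(rng_bytes))


def validate_all(schema, documents):
    """Return (index, error text) for every document that fails."""
    failures = []
    for index, raw in enumerate(documents):
        doc = etree.fromstring(raw)
        if not schema.validate(doc):
            # error_log describes the most recent validation run.
            failures.append((index, str(schema.error_log)))
    return failures


def best_of(func, runs=3):
    """Call func `runs` times and return the fastest wall-clock time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    schema = load_schema(RNG_SCHEMA)
    failures = validate_all(schema, SAMPLE_DOCS)
    print("invalid documents:", [index for index, _ in failures])
    elapsed = best_of(lambda: validate_all(schema, SAMPLE_DOCS))
    print("best of 3: %.6fs" % elapsed)
```

The same best_of helper could time a DTD validator (lxml exposes one as
etree.DTD) for a like-for-like comparison on the real tree.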

We could probably easily maintain an XML Schema shadow schema if
that's really desired, but I would be in favor of making RELAX NG our
main schema language. I can easily do the work to update repoman for
this (I've already refactored the metadata code in repoman). What
else would need to be updated?

Comments?

Cheers,

Dirkjan

[1] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.dtd
[2] https://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
[3] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.rnc
[4] https://github.com/djc/rnc2rng