Gentoo Archives: gentoo-dev

From: Dirkjan Ochtman <djc@g.o>
To: Gentoo Development <gentoo-dev@l.g.o>
Subject: [gentoo-dev] New schema language for metadata validation?
Date: Tue, 26 Jan 2016 19:55:09
Message-Id: CAKmKYaBFGtwjqEB8a-8q0EH4MR8Mh-CXt9zvD5gJQ6LWmxpCiQ@mail.gmail.com
1 All,
2
3 TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
4 ideally) for our XML validation needs. It is more expressive and more
5 readable.
6
7 Most people who know anything about XML stuff know that DTDs are not
8 that great a solution for validation. Their expressive power is very
9 limited; there are a few examples of this in our metadata.dtd [1].
10 For a few years now, I've wanted to see if we could replace
11 metadata.dtd with something in RELAX NG, which is a more modern XML
12 schema language; it's an ISO standard with an emphasis on readability
13 both for humans and for tools (by using a rigorous formalism). Some
14 arguments in favor of RELAX NG (and some counter-arguments) are
15 enumerated on Tim Bray's weblog [2]. I've created a compact syntax
16 schema for metadata that can validate all metadata.xml files currently
17 in the tree, as an example [3].
18
19 Some arguments against:
20
21 - Not enough tool support for RELAX NG: I'd be curious to hear what
22 tools you want to use. At least libxml2 supports RELAX NG natively.
23 The Python lxml library uses that support to provide pretty simple
24 RELAX NG validation. libxml2 does not have native compact syntax
25 support, but I maintain a simple library called rnc2rng [4] that is
26 used transparently by lxml if installed. rnc2rng also comes with a
27 rnc2rng command-line script to do the conversion.
28
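To make the tooling point concrete, here is a minimal sketch of RELAX NG validation through lxml. The schema below is a hypothetical, heavily simplified stand-in written in RELAX NG XML syntax; it is not the metadata.rnc from [3], and the helper names (make_validator, check) are made up for illustration. With rnc2rng installed, lxml should also be able to load compact syntax directly (for example via etree.RelaxNG.from_rnc_string), but that path is not exercised here.

```python
from lxml import etree

# Hypothetical, simplified schema in RELAX NG XML syntax.
# It is NOT the real metadata.rnc; it only illustrates the lxml API.
RNG_SCHEMA = b"""\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="pkgmetadata">
      <zeroOrMore>
        <element name="maintainer">
          <element name="email"><text/></element>
          <optional>
            <element name="name"><text/></element>
          </optional>
        </element>
      </zeroOrMore>
      <optional>
        <element name="longdescription"><text/></element>
      </optional>
    </element>
  </start>
</grammar>
"""


def make_validator(schema_bytes=RNG_SCHEMA):
    """Build an lxml RelaxNG validator from a schema in XML syntax."""
    return etree.RelaxNG(etree.fromstring(schema_bytes))


def check(validator, xml_bytes):
    """Return (is_valid, error_messages) for one XML document."""
    doc = etree.fromstring(xml_bytes)
    ok = validator.validate(doc)
    # error_log holds the messages from the most recent validation run
    errors = [str(entry.message) for entry in validator.error_log]
    return ok, errors


if __name__ == "__main__":
    validator = make_validator()
    good = b"<pkgmetadata><maintainer><email>dev@example.org</email></maintainer></pkgmetadata>"
    bad = b"<pkgmetadata><maintainer><name>No Email</name></maintainer></pkgmetadata>"
    for label, xml in (("good", good), ("bad", bad)):
        ok, errors = check(validator, xml)
        print(label, ok, errors)
```

Building the validator once and reusing it across files is the natural shape for a tree-wide check, since schema compilation is then paid only once.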
29 - Performance: in a quick test with lxml (backed by libxml2), RELAX NG
30 validation takes about as long as DTD validation (roughly 7% more
31 wall-clock time). Testing with ~19000 metadata.xml files in the tree,
31 with DTD (best of 3):
32
33 real 0m2.861s
34 user 0m2.560s
35 sys 0m0.296s
36
37 With RNC (best of 3):
38
39 real 0m3.058s
40 user 0m2.688s
41 sys 0m0.364s
42
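For anyone who wants the difference as a percentage, the two timing blocks above can be compared with a few lines of Python. This is only arithmetic on the numbers quoted in this mail; the helper names (parse_time, overhead) are made up for illustration.

```python
def parse_time(value):
    """Convert a time(1)-style string such as '0m2.861s' to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def overhead(baseline, other):
    """Relative slowdown of `other` versus `baseline`, as a fraction."""
    return (other - baseline) / baseline


# Figures copied from the DTD and RNC runs above (best of 3 each)
DTD = {"real": "0m2.861s", "user": "0m2.560s", "sys": "0m0.296s"}
RNC = {"real": "0m3.058s", "user": "0m2.688s", "sys": "0m0.364s"}

if __name__ == "__main__":
    for key in ("real", "user", "sys"):
        pct = 100 * overhead(parse_time(DTD[key]), parse_time(RNC[key]))
        print(f"{key}: {pct:+.1f}%")
```

The wall-clock ("real") difference comes out at roughly 7%, which for ~19000 files is about 0.2 seconds in total.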
43 We could probably easily maintain an XML Schema shadow schema if
44 that's really desired, but I would be in favor of making RELAX NG our
45 main schema language. I can easily do the work to update repoman for
46 this (I've already refactored the metadata code in repoman). What
47 other stuff would need to be updated?
48
49 Comments?
50
51 Cheers,
52
53 Dirkjan
54
55 [1] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.dtd
56 [2] https://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
57 [3] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.rnc
58 [4] https://github.com/djc/rnc2rng