The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, consumes a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 \
    | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K .
$ du -sh .
556K .

During normal operations, portage has to read each of these 35+
files per package individually.
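
For anyone who wants to repeat the measurement above without ls/awk/du,
here is a minimal Python sketch. The function name vdb_dir_usage is
made up for illustration, and it assumes a Unix-like system where
st_blocks is reported in 512-byte units.

```python
import os


def vdb_dir_usage(pkg_dir):
    """Return (file_count, apparent_bytes, allocated_bytes) for one directory.

    Only regular files directly inside pkg_dir are counted.
    """
    count = apparent = allocated = 0
    for entry in os.scandir(pkg_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        count += 1
        apparent += st.st_size
        # Assumption: st_blocks counts 512-byte units (true on Linux).
        allocated += st.st_blocks * 512
    return count, apparent, allocated


if __name__ == "__main__":
    import sys

    n, apparent, allocated = vdb_dir_usage(sys.argv[1])
    print(f"{n} files, {apparent} apparent bytes, {allocated} allocated bytes")
```

The gap between the two byte counts is the per-file block overhead.
Some file systems store tiny files inline, so the allocated figure can
vary from one machine to another.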

I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file in a commonly used format should:

- speed up VDB access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain text for ease of manipulation, so I have not
considered anything like sqlite.
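
To give a rough idea of the combining step, here is a minimal sketch.
It is not the attached program; the names load_vdb_dir and
to_compact_json are made up, and skipping files that are not valid
UTF-8 (such as environment.bz2) is an assumption made to keep the
output plain text.

```python
import json
from pathlib import Path


def load_vdb_dir(pkg_dir):
    """Read every text file in a VDB package directory into one dict."""
    data = {}
    for path in sorted(Path(pkg_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            data[path.name] = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Assumption: binary files (e.g. environment.bz2) are left out.
            continue
    return data


def to_compact_json(data):
    """Serialize without extra whitespace, with keys in a stable order."""
    return json.dumps(data, separators=(",", ":"), sort_keys=True)


if __name__ == "__main__":
    import sys

    print(to_compact_json(load_vdb_dir(sys.argv[1])))
```

File contents are stored verbatim here, trailing newlines included. A
real format would probably parse some variables into lists or nested
structures instead of keeping raw strings.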

I expected to prefer toml, but I actually find it rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 \
    | grep -v '\(environment.bz2\|repository\|\.ebuild\)' \
    | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller because it does not contain large
amounts of duplicated path strings in CONTENTS (which is 375320 bytes
by itself, or 89% of the total size).
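
To illustrate how the duplicated path strings can go away, here is a
sketch that folds CONTENTS entries into a nested tree so each
directory name is stored once. The tree layout is an assumption and
may differ from what the attached program emits. The sketch handles
only "dir", "obj" and "sym" lines and silently ignores any other entry
type.

```python
import json


def parse_contents(text):
    """Yield (kind, path, meta) tuples from CONTENTS-style text."""
    for line in text.splitlines():
        if not line:
            continue
        kind, rest = line.split(" ", 1)
        if kind == "dir":
            yield kind, rest, []
        elif kind == "obj":
            # Split from the right so paths containing spaces survive.
            path, md5, mtime = rest.rsplit(" ", 2)
            yield kind, path, [md5, int(mtime)]
        elif kind == "sym":
            link, mtime = rest.rsplit(" ", 1)
            path, target = link.split(" -> ", 1)
            yield kind, path, [target, int(mtime)]
        # Other entry types are ignored in this sketch.


def build_tree(entries):
    """Nest entries by path component: dirs are dicts, files are lists."""
    root = {}
    for kind, path, meta in entries:
        parts = path.strip("/").split("/")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if kind == "dir":
            node.setdefault(parts[-1], {})
        else:
            node[parts[-1]] = [kind] + meta
    return root


SAMPLE = """\
dir /usr
dir /usr/bin
obj /usr/bin/emerge d41d8cd98f00b204e9800998ecf8427e 1650000000
sym /usr/bin/ebuild -> ../lib/portage/bin/ebuild 1650000000
"""

if __name__ == "__main__":
    tree = build_tree(parse_contents(SAMPLE))
    print(json.dumps(tree, separators=(",", ":")))
```

In the sample, "usr" appears four times in the raw text but only once
in the serialized tree. The saving grows with the number of files
sharing a directory.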

I recommend json as the best choice because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not; toml parsing
  will be added in Python 3.11)
- every programming language has multiple json parsers
  -- lots of effort has been spent making them extremely fast.

I think we would need a significant transition period. I would include
support for the new format in Portage, and ship a tool with Portage to
switch back and forth between the old and new formats on disk. Maybe
after a year, drop the old-format support code from Portage?
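
As a rough picture of what dual-format support and a conversion tool
could look like, here is a sketch. Everything in it is hypothetical:
the file name vdb.json, the function names, and the choice to skip
non-UTF-8 files are assumptions, not existing Portage behaviour.

```python
import json
from pathlib import Path

NEW_FORMAT_NAME = "vdb.json"  # hypothetical name, not defined by Portage


def read_entry(pkg_dir):
    """Read a package entry, preferring the combined file if it exists."""
    pkg_dir = Path(pkg_dir)
    combined = pkg_dir / NEW_FORMAT_NAME
    if combined.is_file():
        return json.loads(combined.read_text(encoding="utf-8"))
    data = {}
    for path in pkg_dir.iterdir():
        if not path.is_file():
            continue
        try:
            data[path.name] = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # assumption: binary files stay as separate files
    return data


def migrate_to_new(pkg_dir):
    """Fold the per-variable text files into one JSON file."""
    pkg_dir = Path(pkg_dir)
    combined = pkg_dir / NEW_FORMAT_NAME
    if combined.is_file():
        return  # already in the new format
    data = read_entry(pkg_dir)
    combined.write_text(json.dumps(data, sort_keys=True), encoding="utf-8")
    for name in data:
        (pkg_dir / name).unlink()


def migrate_to_old(pkg_dir):
    """Expand the JSON file back into one file per variable."""
    pkg_dir = Path(pkg_dir)
    combined = pkg_dir / NEW_FORMAT_NAME
    if not combined.is_file():
        return  # already in the old format
    data = json.loads(combined.read_text(encoding="utf-8"))
    for name, value in data.items():
        (pkg_dir / name).write_text(value, encoding="utf-8")
    combined.unlink()
```

A real tool would also need atomic writes and locking so that an
interrupted conversion cannot leave a package entry half migrated.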

Thoughts?