Gentoo Archives: gentoo-portage-dev

From: Matt Turner <mattst88@g.o>
To: gentoo-portage-dev@l.g.o
Cc: Tim Harder <radhermit@g.o>
Subject: [gentoo-portage-dev] Changing the VDB format
Date: Mon, 14 Mar 2022 01:06:39
Message-Id: CAEdQ38HAnJjOU7shtS8x8Z-WcFou08Nsbc6WTMZ-9pLSnWWnHw@mail.gmail.com
The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, consumes a full 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K .
$ du -sh .
556K .

During normal operations, portage has to read each of these 35+
files per package individually.

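The gap between the apparent size and the on-disk size above can also be
measured per package from Python. This is only a rough sketch: the
function name vdb_entry_sizes is made up for illustration, and it relies
on st_blocks, which is only available on Unix-like systems.

```python
from pathlib import Path


def vdb_entry_sizes(entry_dir):
    """Return (file_count, apparent_bytes, allocated_bytes) for one directory.

    apparent_bytes sums st_size, the logical length of each file.
    allocated_bytes sums st_blocks * 512, the space reserved on disk.
    """
    count = apparent = allocated = 0
    for path in Path(entry_dir).iterdir():
        if not path.is_file():
            continue
        st = path.stat()
        count += 1
        apparent += st.st_size
        # st_blocks is counted in 512-byte units on Linux, independent of
        # the file system's own block size.
        allocated += st.st_blocks * 512
    return count, apparent, allocated
```

Calling vdb_entry_sizes("/var/db/pkg/sys-apps/portage-3.0.30-r1/") should
give numbers in the same ballpark as the ls and du output above, though
du also counts the directory itself.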
I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file in such a format should:

- speed up VDB access
- reduce disk usage
- allow external tools to access VDB data more easily

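To make the idea concrete, here is a minimal sketch of collapsing one
package's per-variable files into a single JSON document. It is not the
attached vdb.py; the function names are hypothetical, and it assumes
that every file which decodes as UTF-8 is a variable and that anything
else (such as environment.bz2) can be skipped.

```python
import json
from pathlib import Path


def load_vdb_entry(entry_dir):
    """Read every text file in entry_dir into a {filename: contents} dict."""
    data = {}
    for path in sorted(Path(entry_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Assumption: binary files are left out of the combined document.
            continue
        # The trailing newline is a storage detail of the old format.
        data[path.name] = text.rstrip("\n")
    return data


def dump_vdb_entry(entry_dir):
    """Serialize one package entry as compact JSON."""
    return json.dumps(load_vdb_entry(entry_dir), sort_keys=True,
                      separators=(",", ":"))
```

A reader then needs one open() and one parse per package instead of one
open() per variable.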
I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain text for ease of manipulation, so I have not
considered anything like sqlite.

I expected to prefer toml, but I actually find it rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller because it does not contain large
amounts of duplicated path strings in CONTENTS (which is 375320 bytes
by itself, or 89% of the total size).

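One way the duplicated path prefixes can disappear is by storing
CONTENTS as a nested tree, so each directory name appears once. The
sketch below is an illustration only and not necessarily the layout
vdb.py emits. It assumes the usual CONTENTS line shapes ("dir PATH",
"obj PATH MD5 MTIME", "sym PATH -> TARGET MTIME") and ignores any other
entry types.

```python
import json


def parse_contents(text):
    """Yield (kind, path, extra_fields) for each recognized CONTENTS line."""
    for line in text.splitlines():
        if not line:
            continue
        kind, rest = line.split(" ", 1)
        if kind == "dir":
            yield kind, rest, []
        elif kind == "obj":
            # Split from the right so paths containing spaces survive.
            path, md5, mtime = rest.rsplit(" ", 2)
            yield kind, path, [md5, int(mtime)]
        elif kind == "sym":
            link, target_and_mtime = rest.split(" -> ", 1)
            target, mtime = target_and_mtime.rsplit(" ", 1)
            yield kind, link, [target, int(mtime)]


def contents_tree(text):
    """Build a nested dict keyed by path component; leaves are lists."""
    root = {}
    for kind, path, extra in parse_contents(text):
        parts = path.strip("/").split("/")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if kind == "dir":
            node.setdefault(parts[-1], {})
        else:
            node[parts[-1]] = [kind, *extra]
    return root


def contents_json(text):
    """Serialize the tree compactly, as it might be stored on disk."""
    return json.dumps(contents_tree(text), separators=(",", ":"))
```

For a package like portage, with thousands of files under the same few
site-packages directories, each shared prefix is then written once
rather than once per file.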
I recommend json because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not, and toml
  parsing will only arrive in Python 3.11)
- Nearly every programming language has multiple json parsers, and a
  lot of effort has been spent making them extremely fast

I think we would need a significant transition period. I would add
support for the new format to Portage, and ship a tool with Portage to
convert between the old and new formats on disk. Maybe after a year,
drop the code that supports the old format from Portage?

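During such a transition, a reader could prefer the combined file and
fall back to the old layout when it is absent. A minimal sketch follows;
the file name vdb.json and the function read_vdb_var are placeholders I
made up, not anything Portage provides today.

```python
import json
from pathlib import Path

# Hypothetical name for the combined file; nothing has been decided.
NEW_FORMAT_NAME = "vdb.json"


def read_vdb_var(entry_dir, name):
    """Return one variable for a package, or None if it is not recorded."""
    entry = Path(entry_dir)
    combined = entry / NEW_FORMAT_NAME
    if combined.is_file():
        # New format: one JSON object holding every variable.
        with combined.open(encoding="utf-8") as f:
            return json.load(f).get(name)
    legacy = entry / name
    if legacy.is_file():
        # Old format: one file per variable, newline-terminated.
        return legacy.read_text(encoding="utf-8").rstrip("\n")
    return None
```

Once the old format is dropped, the fallback branch is the only code
that has to go.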
Thoughts?

Attachments

File name MIME type
vdb.py text/x-python

Replies

Subject Author
Re: [gentoo-portage-dev] Changing the VDB format Fabian Groffen <grobian@g.o>
Re: [gentoo-portage-dev] Changing the VDB format James Cloos <cloos@×××××××.com>
Re: [gentoo-portage-dev] Changing the VDB format Joshua Kinard <kumba@g.o>