Gentoo Archives: gentoo-portage-dev

From: Joshua Kinard <kumba@g.o>
To: gentoo-portage-dev@l.g.o
Subject: Re: [gentoo-portage-dev] Changing the VDB format
Date: Mon, 11 Apr 2022 19:02:55
Message-Id: d544280d-e05c-a370-d2e6-9b6689570ce7@gentoo.org
In Reply to: [gentoo-portage-dev] Changing the VDB format by Matt Turner
1 On 3/13/2022 21:06, Matt Turner wrote:
2 > The VDB uses a one-file-per-variable format. This has some
3 > inefficiencies, with many file systems. For example the 'EAPI' file
4 > that contains a single character will consume a 4K block on disk.
5 >
6 > $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
7 > $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
8 > print sum }'
9 > 418517
10 > $ du -sh --apparent-size .
11 > 413K .
12 > $ du -sh .
13 > 556K .
14 >
15 > During normal operations, portage has to read each of these 35+
16 > files/package individually.
17 >
18 > I suggest that we change the VDB format to a commonly used format that
19 > can be quickly read by portage and any other tools. Combining these
20 > 35+ files into a single file with a commonly used format should:
21 >
22 > - speed up vdb access
23 > - improve disk usage
24 > - allow external tools to access VDB data more easily
25 >
26 > I've attached a program that prints the VDB contents of a specified
27 > package in different formats: json, toml, and yaml (and also Python
28 > PrettyPrinter, just because). I think it's important to keep the VDB
29 > format as plain-text for ease of manipulation, so I have not
30 > considered anything like sqlite.
31 >
32 > I expected to prefer toml, but I actually find it to be rather gross looking.
33
34 Agreed, the toml output is rather "cluttered" looking.
35
36
37 > I recommend json and think it is the best choice because:
38 >
39 > - json provides the smallest on-disk footprint
40 > - json is part of Python's standard library (yaml is not; it needs the
41 > third-party PyYAML, and toml parsing will be in Python 3.11)
42 > - Every programming language has multiple json parsers
43 > -- lots of effort has been spent making them extremely fast.
44 >
45 > I think we would have a significant time period for the transition. I
46 > think I would include support for the new format in Portage, and ship
47 > a tool with portage to switch back and forth between old and new
48 > formats on-disk. Maybe after a year, drop the code from Portage to
49 > support the old format?
50 >
51 > Thoughts?
52
53 I think json is the best format for storing the data on-disk. It is a data
54 serialization format, meant to convert data from an in-memory representation
55 to a storable on-disk form and back again, so this is a perfect use
56 for it.
57
58 That said, I actually do like the yaml output as well, but I think the
59 better use-case for that would be in the form of a secondary tool that maybe
60 could be a part of portage-utils' 'q' applets (qpkg, qfile, qlist, etc.) to read
61 the JSON-formatted VDB data and export it in yaml for review. Something
62 like 'qvdb --yaml sys-libs/glibc-2.35-r2' to dump the VDB data to stdout
63 (and maybe do other tasks, but that's a discussion for another thread).
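A rough sketch of what such an export could look like, assuming the JSON VDB entry is a flat mapping of variable name to string value (the thread does not pin the layout down). The function name to_yaml is made up for illustration, and the emitter is deliberately minimal: it writes every key and value as a double-quoted scalar, which is valid YAML because JSON string syntax is accepted there, rather than pulling in PyYAML.

```python
import json
import sys


def to_yaml(vdb: dict) -> str:
    """Render a flat {name: value} mapping as simple YAML (hypothetical helper)."""
    lines = []
    for key in sorted(vdb):
        # json.dumps() yields a double-quoted, escaped string, which YAML
        # also accepts as a double-quoted scalar.
        lines.append(f"{json.dumps(key)}: {json.dumps(str(vdb[key]))}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python qvdb_sketch.py glibc.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        sys.stdout.write(to_yaml(json.load(fh)))
```

A real tool would more likely use a proper YAML library and block scalars for long values such as CONTENTS; this only shows the read-JSON, print-YAML shape.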
64
65 As far as support for the old format goes, I think one year is too short.
66 Two years is preferable, though I would not be totally opposed to as long as
67 three years. Adoption could probably be helped by turning this vdb.py
68 script into something more functional that can actually walk the current VDB
69 and convert it to the new chosen format and write that out to an alternate
70 location that a user could then transplant into /var/db/pkg after verifying it.
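As a minimal sketch of that conversion step, assuming the usual <category>/<package-version>/ layout with one file per variable, and assuming one JSON file per package as the target (neither vdb.py's internals nor the final layout are settled in this thread). The names read_package and convert_vdb are hypothetical. Binary entries such as environment.bz2 are simply skipped here; a real tool would have to decide how to carry them over.

```python
import json
from pathlib import Path


def read_package(pkg_dir: Path) -> dict:
    """Collect one package's one-file-per-variable entries into a dict."""
    data = {}
    for entry in sorted(pkg_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            data[entry.name] = entry.read_text(encoding="utf-8").rstrip("\n")
        except UnicodeDecodeError:
            # Assumption: non-text entries are left out of this sketch.
            continue
    return data


def convert_vdb(src_root: Path, dst_root: Path) -> int:
    """Write <dst_root>/<category>/<package>.json for each package; return the count."""
    count = 0
    for category in sorted(p for p in src_root.iterdir() if p.is_dir()):
        for pkg_dir in sorted(p for p in category.iterdir() if p.is_dir()):
            out = dst_root / category.name / (pkg_dir.name + ".json")
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(
                json.dumps(read_package(pkg_dir), sort_keys=True),
                encoding="utf-8",
            )
            count += 1
    return count
```

Something like convert_vdb(Path("/var/db/pkg"), Path("/var/tmp/vdb-new")) would leave the live VDB untouched, so the result can be inspected before anything is transplanted.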
71
72 One other thought -- I think there should be a tuning knob in make.conf to
73 enable compression of the VDB's new format or not. The specific compression
74 format I leave up for debate (I'd say go with zstd, though), but if I am
75 running on a filesystem that supports native compression (e.g., ZFS), I'd
76 want to turn VDB compression off and let ZFS handle that at the filesystem
77 level. But on another system with, say, XFS, I'd want to turn that on to get
78 some benefits, especially on older hardware that's going to be more I/O bound.
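To make the knob concrete, here is a sketch of a writer and reader that honour a boolean setting. Everything named here is an assumption: no such make.conf variable exists today, the function names are invented, and gzip stands in for zstd only because it ships with Python's standard library on all current versions.

```python
import gzip
import json
from pathlib import Path


def write_vdb_entry(base: Path, data: dict, compress: bool) -> Path:
    """Write data as <base>.json or <base>.json.gz, depending on the knob."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    if compress:
        # gzip is a stand-in for zstd in this sketch.
        path = base.with_name(base.name + ".json.gz")
        path.write_bytes(gzip.compress(payload))
    else:
        path = base.with_name(base.name + ".json")
        path.write_bytes(payload)
    return path


def read_vdb_entry(path: Path) -> dict:
    """Read an entry back, decompressing when the suffix says it is compressed."""
    raw = path.read_bytes()
    if path.suffix == ".gz":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Keying the reader off the file suffix rather than the setting means a VDB with mixed compressed and uncompressed entries stays readable after the knob is flipped.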
79
80 E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~344 KiB:
81
82 # ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
83 # ls -lh --block-size=1 glibc.json
84 -rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json
85
86 # zstd glibc.json
87 glibc.json : 21.70% ( 344 KiB => 74.7 KiB, glibc.json.zst)
88
89 # ls -lh --block-size=1 glibc.json.zst
90 -rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst
91
92 (this is on a tmpfs mount)
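For anyone checking the numbers, the byte counts from ls line up with what zstd reports. A quick calculation using only the two sizes shown above:

```python
original = 352479    # bytes, glibc.json
compressed = 76498   # bytes, glibc.json.zst
KIB = 1024

size_line = f"{original / KIB:.1f} KiB -> {compressed / KIB:.1f} KiB"
ratio_line = f"ratio: {compressed / original:.2%}"

print(size_line)   # 344.2 KiB -> 74.7 KiB
print(ratio_line)  # ratio: 21.70%
```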
93
94 --
95 Joshua Kinard
96 Gentoo/MIPS
97 kumba@g.o
98 rsa6144/5C63F4E3F5C6C943 2015-04-27
99 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
100
101 "The past tempts us, the present confuses us, the future frightens us. And
102 our lives slip away, moment by moment, lost in that vast, terrible in-between."
103
104 --Emperor Turhan, Centauri Republic
