On 3/13/2022 21:06, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
>
> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
> print sum }'
> 418517
> $ du -sh --apparent-size .
> 413K .
> $ du -sh .
> 556K .
>
> During normal operations, portage has to read each of these 35+
> files/package individually.
>
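
For anyone who wants to reproduce that apparent-vs-allocated comparison
without ls/awk/du, here is a minimal Python sketch. The function name
dir_usage is made up for illustration, and it assumes a POSIX system where
st_blocks is reported in 512-byte units. It only counts regular files
directly inside the directory, so the totals may differ slightly from what
du prints (du also counts the directory itself).

```python
import os


def dir_usage(path):
    """Return (apparent_bytes, allocated_bytes, file_count) for the
    regular files directly inside `path` (no recursion)."""
    apparent = 0
    allocated = 0
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            # st_size is the logical length of the file in bytes.
            apparent += st.st_size
            # st_blocks counts 512-byte units actually allocated (POSIX).
            allocated += st.st_blocks * 512
            count += 1
    return apparent, allocated, count
```

Calling dir_usage("/var/db/pkg/sys-apps/portage-3.0.30-r1") on a real system
should show the same kind of gap between the two byte counts, with the exact
numbers depending on the filesystem.
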
> I suggest that we change the VDB format to a commonly used format that
> can be quickly read by portage and any other tools. Combining these
> 35+ files into a single file with a commonly used format should:
>
> - speed up vdb access
> - improve disk usage
> - allow external tools to access VDB data more easily
>
> I've attached a program that prints the VDB contents of a specified
> package in different formats: json, toml, and yaml (and also Python
> PrettyPrinter, just because). I think it's important to keep the VDB
> format as plain-text for ease of manipulation, so I have not
> considered anything like sqlite.
>
> I expected to prefer toml, but I actually find it to be rather gross looking.

Agreed, the toml output is rather "cluttered" looking.

> I recommend json and think it is the best choice because:
>
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
> be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.
>
> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?
>
> Thoughts?

I think json is the best format for storing the data on-disk. It's intended
to be a data serialization format to convert data from a non-specific memory
format to a storable on-disk format and back again, so this is a perfect use
for it.

That said, I actually do like the yaml output as well, but I think the
better use-case for that would be in the form of a secondary tool that maybe
could be a part of portage's 'q' commands (qpkg, qfile, qlist, etc) to read
the JSON-formatted VDB data and export it in yaml for review. Something
like 'qvdb --yaml sys-libs/glibc-2.35-r2' to dump the VDB data to stdout
(and maybe do other tasks, but that's a discussion for another thread).

As far as support for the old format goes, I think one year is too short.
Two years is preferable, though I would not be totally opposed to as long as
three years. Adoption could probably be helped by turning this vdb.py
script into something more functional that can actually walk the current VDB
and convert it to the new chosen format and write that out to an alternate
location that a user could then transplant into /var/db/pkg after verifying it.
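
To make the "walk and convert to an alternate location" idea a bit more
concrete, here is a rough Python sketch. It is not the attached vdb.py, and
everything in it is an assumption for illustration: the function names, the
<category>/<package-version>.json output layout, and the choice to
base64-encode files that are not valid UTF-8 (such as a compressed
environment file). It never writes to the source tree.

```python
import base64
import json
from pathlib import Path


def package_to_dict(pkg_dir):
    """Read each regular file in one package directory into a dict
    keyed by file name."""
    data = {}
    for entry in sorted(Path(pkg_dir).iterdir()):
        if not entry.is_file():
            continue
        raw = entry.read_bytes()
        try:
            data[entry.name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: binary payloads are kept as base64 text so the
            # result is still plain JSON.
            data[entry.name] = {"base64": base64.b64encode(raw).decode("ascii")}
    return data


def convert_tree(src_root, dst_root):
    """Walk <src_root>/<category>/<package-version>/ and write one JSON
    file per package under dst_root. Returns the number of packages
    converted. The source tree is only read, never modified."""
    src_root = Path(src_root)
    dst_root = Path(dst_root)
    converted = 0
    for category in sorted(p for p in src_root.iterdir() if p.is_dir()):
        for pkg in sorted(p for p in category.iterdir() if p.is_dir()):
            # Append ".json" to the full name; with_suffix() would mangle
            # version strings such as "glibc-2.35-r2".
            out = dst_root / category.name / (pkg.name + ".json")
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(
                json.dumps(package_to_dict(pkg), sort_keys=True),
                encoding="utf-8",
            )
            converted += 1
    return converted
```

Something like convert_tree("/var/db/pkg", "/var/tmp/vdb-json") would leave
the live VDB alone and give the user a separate tree to inspect before
swapping anything in. The destination path here is just an example.
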
|
One other thought -- I think there should be a tuning knob in make.conf to
enable compression of the VDB's new format or not. The specific compression
format I leave up for debate (I'd say go with zstd, though), but if I am
running on a filesystem that supports native compression (e.g., ZFS), I'd
want to turn VDB compression off and let ZFS handle that at the filesystem
level. But on another system with say, XFS, I'd want to turn that on to get
some benefits, especially on older hardware that's going to be more I/O bound.
|
E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~344 KiB:

# ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
# ls -lh --block-size=1 glibc.json
-rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json

# zstd glibc.json
glibc.json : 21.70% ( 344 KiB => 74.7 KiB, glibc.json.zst)

# ls -lh --block-size=1 glibc.json.zst
-rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst

(this is on a tmpfs ramdrive)
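
As a sketch of how such an on/off knob might look on the code side: the
`compress` argument below stands in for a hypothetical boolean make.conf
setting (no such setting exists today as far as I know), and the function
names are made up. Note that this uses gzip from Python's standard library
purely as a stand-in, since zstd bindings are not something I can assume are
available everywhere; the real thing would presumably swap in whichever
compressor gets chosen.

```python
import gzip
import json
from pathlib import Path


def write_vdb_json(path, data, compress):
    """Serialize `data` as JSON at `path`; return the path written.

    `compress` models a hypothetical make.conf knob. When it is true,
    ".gz" is appended to the file name and the payload is gzip-compressed
    (gzip is a stand-in for zstd here).
    """
    path = Path(path)
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    if compress:
        out = path.with_name(path.name + ".gz")
        out.write_bytes(gzip.compress(payload))
    else:
        out = path
        out.write_bytes(payload)
    return out


def read_vdb_json(path):
    """Load an entry written by write_vdb_json, choosing the decoder
    from the file suffix so both variants can coexist on disk."""
    path = Path(path)
    raw = path.read_bytes()
    if path.suffix == ".gz":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Having the reader key off the suffix means a system can flip the knob
without rewriting existing entries: on ZFS you would write plain .json and
let the filesystem compress, on XFS you would write the compressed variant.
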
|
--
Joshua Kinard
Gentoo/MIPS
kumba@g.o
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic