Gentoo Archives: gentoo-portage-dev

From:	Sid Spry <sid@××××.us>
To:	gentoo-portage-dev@l.g.o
Subject:	Re: [gentoo-portage-dev] Changing the VDB format
Date:	Mon, 11 Apr 2022 19:21:00
Message-Id:	`64fdae3d-c17b-46e9-88b0-e1656c4bd5d9@www.fastmail.com`
In Reply to:	Re: [gentoo-portage-dev] Changing the VDB format by Joshua Kinard

On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>>
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
>> print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K .
>> $ du -sh .
>> 556K .
>>
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>>
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>>
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>>
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>>
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>>
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>> be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>>
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>>
>> Thoughts?
>
> I think json is the best format for storing the data on-disk. It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand.
It's good at storing lots of data in a quasi-textual format, but it is strict
enough to be obnoxious to work with.

Could the files be concatenated instead? That is similar to the tar
suggestion, but would keep everything very portage-like: a single file, with
the contents of each current file assigned to a variable. I am betting someone
tried this at the start but settled on the current scheme. Does anyone know
why? (I assume this would have to be done in bash syntax.)
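
To make the concatenation idea a bit more concrete, here is a minimal Python
sketch. It assumes a single file of NAME=value lines, and it uses the standard
shlex module as an approximation of bash quoting rather than a real bash
parser. The function names dump_vars and load_vars are made up for
illustration and are not part of Portage.

```python
import shlex


def dump_vars(variables: dict[str, str]) -> str:
    """Serialize variables as shell-style NAME=value lines.

    shlex.quote() wraps values in single quotes whenever they contain
    characters that are not shell-safe (spaces, newlines, quotes, ...).
    """
    return "".join(
        f"{name}={shlex.quote(value)}\n" for name, value in variables.items()
    )


def load_vars(text: str) -> dict[str, str]:
    """Parse the output of dump_vars() back into a dict.

    shlex.split() honours the quoting, so multi-line values survive.
    This is an approximation of bash parsing: no expansion is performed.
    """
    result = {}
    for token in shlex.split(text):
        name, _, value = token.partition("=")
        result[name] = value
    return result
```

Simple values stay readable and hand-editable (for example EAPI=8), while
multi-line values such as CONTENTS end up as one single-quoted block.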

Alternatively, I think the tar suggestion is quite elegant. There are
streaming decompressors you can use from Python. It adds an extra step when
modifying the data, but that could be handled transparently by a dev mode: in
dev mode, leave the files in place after extraction and do not re-extract; in
release mode, replace the archive with what is on disk.

Sid.

Replies

Subject	Author
Re: [gentoo-portage-dev] Changing the VDB format	Joshua Kinard <kumba@g.o>
