* [gentoo-portage-dev] Changing the VDB format
From: Matt Turner @ 2022-03-14 1:06 UTC
To: gentoo-portage-dev; +Cc: Tim Harder
[-- Attachment #1: Type: text/plain, Size: 2590 bytes --]
The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, consumes a full 4K block on disk.
$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K .
$ du -sh .
556K .
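The gap between the two du figures is per-file block rounding. Here is
a minimal sketch that reports the same two numbers from Python,
assuming a POSIX system where st_blocks is counted in 512-byte units.
The function name vdb_disk_usage is made up for illustration and is
not part of Portage.

```python
from pathlib import Path


def vdb_disk_usage(pkgdir):
    """Return (file_count, apparent_bytes, allocated_bytes) for one VDB entry."""
    count = apparent = allocated = 0
    for entry in Path(pkgdir).iterdir():
        if not entry.is_file():
            continue
        st = entry.stat()
        count += 1
        # Sum of file lengths, comparable to `du --apparent-size`.
        apparent += st.st_size
        # Space actually reserved on disk, comparable to plain `du`.
        # Assumption: POSIX st_blocks in 512-byte units (not on Windows).
        allocated += st.st_blocks * 512
    return count, apparent, allocated
```

Calling vdb_disk_usage('/var/db/pkg/sys-apps/portage-3.0.30-r1') on the
directory above should give figures close to the du output, though the
allocated size depends on the file system.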
During normal operations, portage has to open and read each of these
35+ files per package individually.
I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file with a commonly used format should:
- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily
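To make the access pattern concrete, here is a rough sketch of both
read paths. It assumes a combined file named vdb.json that holds a
flat mapping of variable names to strings. Both the file name and that
layout are placeholders, since nothing here has been decided yet.

```python
import json
from pathlib import Path


def read_split(pkgdir):
    """Current layout: one open/read/close per variable file."""
    return {
        f.name: f.read_text().rstrip("\n")
        for f in Path(pkgdir).iterdir()
        if f.is_file() and f.name.isupper()
    }


def write_combined(pkgdir, path):
    """Store all variables of pkgdir in a single json file (hypothetical name)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(read_split(pkgdir), fh)


def read_combined(path):
    """Proposed layout: one open, one parse."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With this sketch, read_combined() returns the same mapping as
read_split() while touching one file instead of 35+.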
I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain-text for ease of manipulation, so I have not
considered anything like sqlite.
I expected to prefer toml, but I actually find it to be rather gross looking.
$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428
toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...
Compare with the raw contents of the files:
$ ls -lh --block-size=1 \
    | grep -v '\(environment.bz2\|repository\|\.ebuild\)' \
    | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658
Yes, the json is actually smaller, because it does not repeat the
directory prefixes that CONTENTS duplicates on every line (CONTENTS is
375320 bytes by itself, or 89% of the total size).
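A toy illustration of where the saving comes from, counting only the
bytes spent on path names and ignoring json punctuation, hashes and
timestamps. The helper names build_tree and key_bytes are made up for
this sketch.

```python
def build_tree(paths):
    """Nest absolute paths so each directory name is stored once."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree


def key_bytes(tree):
    """Total length of all names stored in the nested tree."""
    return sum(len(name) + key_bytes(child) for name, child in tree.items())


paths = ["/usr/lib/a", "/usr/lib/b", "/usr/lib/c"]
flat = sum(len(p) for p in paths)    # every line repeats "/usr/lib/"
nested = key_bytes(build_tree(paths))
print(flat, nested)                  # 30 9
```

The deeper the install paths and the more files per directory, the
larger the gap, which is why a package with many files under a few
long prefixes benefits most.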
I recommend json and think it is the best choice because:
- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not; it needs the
third-party PyYAML package, and the tomllib module coming in Python
3.11 can only read toml, not write it)
- Every programming language has multiple json parsers
-- lots of effort has been spent making them extremely fast.
I think we would need a significant transition period. I would
include support for the new format in Portage, and ship a tool with
Portage to convert between the old and new formats on disk. Maybe
after a year, drop the code from Portage that supports the old format?
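For the conversion tool, going back from the nested representation to
CONTENTS lines looks straightforward. Here is a sketch under the
assumption that the tree has the shape produced by the attached
vdb.py, with 'hash'/'size' for objects and 'target'/'size' for
symlinks. flatten_contents is a made-up name, not an existing Portage
function.

```python
def flatten_contents(tree, prefix=""):
    """Yield CONTENTS-style lines from a nested dict as built by vdb.py."""
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        # Children of a directory are always dicts, so a string value
        # under 'hash' or 'target' identifies a leaf entry unambiguously.
        if isinstance(node.get("hash"), str):
            yield f"obj {path} {node['hash']} {node['size']}"
        elif isinstance(node.get("target"), str):
            yield f"sym {path} -> {node['target']} {node['size']}"
        else:
            yield f"dir {path}"
            yield from flatten_contents(node, path)


tree = {
    "usr": {
        "bin": {
            "emerge": {"hash": "abc", "size": "123"},
            "ebuild": {"target": "../lib/ebuild", "size": "456"},
        }
    }
}
for line in flatten_contents(tree):
    print(line)
```

Because Python dicts keep insertion order, the lines come back in the
order they were read, as long as each directory's entries were
contiguous in the original file.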
Thoughts?
[-- Attachment #2: vdb.py --]
[-- Type: text/x-python, Size: 2334 bytes --]
#!/usr/bin/env python

import argparse
import json
import pprint
import sys
from pathlib import Path

import toml  # third-party
import yaml  # third-party (PyYAML)


def main(argv):
    pp = pprint.PrettyPrinter(indent=2)

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--json', action='store_true')
    group.add_argument('--toml', action='store_true')
    group.add_argument('--yaml', action='store_true')
    group.add_argument('--pprint', action='store_true')
    parser.add_argument('vdbdir', type=str)
    opts = parser.parse_args(argv[1:])

    vdb = Path(opts.vdbdir)
    if not vdb.is_dir():
        print(f'{vdb} is not a directory', file=sys.stderr)
        sys.exit(1)

    d = {}
    for file in vdb.iterdir():
        # Only the upper-case files hold variables; this skips
        # environment.bz2, repository and the saved .ebuild.
        if not file.name.isupper():
            continue

        value = file.read_text().rstrip('\n')

        if file.name == 'CONTENTS':
            # Turn the flat list of paths into a nested dict so that
            # shared directory prefixes are stored only once.
            contents = {}
            for line in value.splitlines(keepends=False):
                # Splitting on ' ' assumes paths without spaces.
                (kind, *rest) = line.split(sep=' ')
                parts = rest[0].split(sep='/')

                p = contents
                if kind == 'dir':
                    assert len(rest) == 1
                    for part in parts[1:]:
                        p = p.setdefault(part, {})
                else:
                    # setdefault rather than get, so an entry whose parent
                    # 'dir' line has not been seen yet does not crash.
                    for part in parts[1:-1]:
                        p = p.setdefault(part, {})

                    # Note: the trailing field of 'obj' and 'sym' lines is
                    # the file's mtime, although it is stored under the key
                    # 'size' here.
                    if kind == 'obj':
                        assert len(rest) == 3
                        p[parts[-1]] = {'hash': rest[1], 'size': rest[2]}
                    elif kind == 'sym':
                        # rest is [path, '->', target, mtime]
                        assert len(rest) == 4
                        p[parts[-1]] = {'target': rest[2], 'size': rest[3]}
            d[file.name] = contents
        elif file.name in ('DEFINED_PHASES', 'FEATURES', 'HOMEPAGE',
                           'INHERITED', 'IUSE', 'IUSE_EFFECTIVE', 'LICENSE',
                           'KEYWORDS', 'PKGUSE', 'RESTRICT', 'USE'):
            # Space-separated list variables become json/toml/yaml lists.
            d[file.name] = value.split(' ')
        else:
            d[file.name] = value

    if opts.json:
        json.dump(d, sys.stdout)
    if opts.toml:
        toml.dump(d, sys.stdout)
    if opts.yaml:
        yaml.dump(d, sys.stdout)
    if opts.pprint:
        pp.pprint(d)


if __name__ == '__main__':
    main(sys.argv)