On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>>
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
>> print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K .
>> $ du -sh .
>> 556K .
>>
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>>
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>>
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>>
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>>
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>>
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>> be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>>
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>>
>> Thoughts?
>
> I think json is the best format for storing the data on-disk. It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand. It's
good at storing lots of data in a quasi-textual format, but it is strict enough to
be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like: have the contents assigned to
variables. I am betting someone tried this at the start but settled on the current
scheme. Does anyone know why? (This would have to be done in bash syntax,
I assume.)
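
To make the concatenation idea concrete, here is a rough Python sketch, not anything Portage does today. It assumes one `NAME=value` assignment per variable, quoted so that bash could source the file, and it reads the file back with `shlex` instead of running a shell. The function names `dump_vars` and `load_vars` are made up for illustration, and the sketch assumes every variable name is a valid bash identifier.

```python
import shlex


def dump_vars(variables):
    """Serialize {name: value} as bash-compatible assignments, one per variable.

    Hypothetical helper: shlex.quote() adds single quotes only when the
    value needs them, so multi-line values such as RDEPEND survive intact.
    """
    return "".join(
        f"{name}={shlex.quote(value)}\n"
        for name, value in sorted(variables.items())
    )


def load_vars(text):
    """Parse the assignments back into a dict without invoking a shell."""
    result = {}
    # shlex.split() applies POSIX quoting rules, so NAME='a b' is one token.
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not an assignment: {token!r}")
        result[name] = value
    return result


if __name__ == "__main__":
    entry = {"EAPI": "8", "SLOT": "0", "RDEPEND": ">=dev-lang/python-3.9\nsys-apps/sed"}
    print(dump_vars(entry), end="")
```

The result stays hand-editable in any text editor, and one read fetches every variable, at the cost of having to get the shell quoting right when editing multi-line values.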

Alternatively, I think the tar suggestion is quite elegant. There are streaming
decompressors you can use from Python. It adds an extra step to modify, but
that could be handled transparently by a dev mode. In dev mode, leave the files
after extraction and do not re-extract; in release mode, replace the archive with
what is on disk.
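
As a sketch of the streaming read, assuming each package's VDB entry were stored as a single compressed tar with one member per variable (my assumption about the layout, not an agreed format), Python's standard `tarfile` module can read it in one sequential pass using its `r|*` stream mode. The helper names `pack_vdb_entry` and `read_vdb_entry` are hypothetical, and the archive is kept in memory here only to keep the example self-contained.

```python
import io
import tarfile


def pack_vdb_entry(variables):
    """Pack {filename: text} into a gzip-compressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, text in sorted(variables.items()):
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_vdb_entry(blob):
    """Read every member in order; "r|*" streams and auto-detects compression."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                # In stream mode each member must be read before moving on.
                result[member.name] = archive.extractfile(member).read().decode("utf-8")
    return result


if __name__ == "__main__":
    blob = pack_vdb_entry({"EAPI": "8", "SLOT": "0"})
    print(read_vdb_entry(blob))
```

The dev-mode idea would sit on top of this: extract once, edit the loose files, and repack only when switching back to release mode.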

Sid.