Hi,

I've recently been thinking about this too.

On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> I recommend json and think it is the best choice because:

[snip]

> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
>   be in Python 3.11)
> - Every programming language has multiple json parsers
>   -- lots of effort has been spent making them extremely fast.

I would like to suggest using "tar". The reasoning behind this is a bit
convoluted, but I'll try to be as clear and sound as I can:
- "new style" bin-packages use tar too
- a tar file keeps all individual files/members, so e.g. legacy tools
  can unpack it and look at the VDB that way
- a tar file allows streaming, so a single file read, for efficient
  retrieval
- a single tar file for the entire VDB allows making it "atomic": one
  can modify tar archives later on to add new VDB entries or perform
  updates, and again, when this is not done in place (e.g. with memory
  backing), it could be done atomically
- a tar file could be used for the (rsync) tree metadata (md5-cache) in
  the same way, e.g. re-using the streaming approach, or unpacking it
  for legacy tools
- a tar file could be used for the Packages file instead of a flat file
  with keys: basically just write VDB entries with some additional
  keys, which is very similar in practice
- tar files are slightly easier to manage from the command line; the
  tools to do so have existed for a long time and are already installed
  (jq isn't pulled in by @system these days, I think)
- tar files can easily (and optionally) be compressed while retaining
  streaming abilities (for these usages this is very likely to pay
  off), with a much higher dictionary benefit for a single tar than for
  many files
- a single tar file is much more efficient to GPG-sign (which would
  allow some securing of the VDB, not sure if useful though)
- going back to the first point, a VDB entry from a binary package
  could simply be dropped into the VDB tar, and vice versa
- going back to metadata, dep-resolving could simply load all
  available/installed packages of the system into memory with two reads
  (if it has enough memory, which is pretty common these days), which
  should allow for vast speedups, especially on cold(ish) filesystems.
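
To make the streaming and "atomic" points above a bit more concrete,
here is a rough Python sketch using only the standard tarfile module.
It is not Portage code: the function names, the member naming scheme
(<category>/<package>-<version>/<variable>) and the idea of rewriting
the whole archive on each update are assumptions for illustration only.

```python
import io
import os
import tarfile
import tempfile


def load_vdb_tar(path):
    """Read every regular member in one sequential pass over the file.

    Mode "r|*" is tarfile's stream mode; it also reads compressed
    archives transparently.
    """
    entries = {}
    with tarfile.open(path, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each member must be read before
                # advancing to the next one.
                entries[member.name] = tar.extractfile(member).read()
    return entries


def write_vdb_tar(path, entries):
    """Write {member_name: bytes} to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, tarfile.open(fileobj=tmp, mode="w") as tar:
            for name, data in sorted(entries.items()):
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        # Readers see either the old or the new archive, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def update_vdb_tar(path, name, data):
    """Add or replace one member without modifying the archive in place."""
    entries = load_vdb_tar(path) if os.path.exists(path) else {}
    entries[name] = data
    write_vdb_tar(path, entries)
```

Rewriting the full archive for every change is obviously more I/O than
appending, so this only shows the shape of the idea; whether it pays
off would have to be measured.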

> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?

Here I believe that with the tar format, the code could initially be
written so that, instead of accessing a file directly, it opens the tar
file, locates the member it needs, and retrieves that. This is a bit
naive, but probably sort of manageable, and it allows having a switch
that specifies which format to write. It's easy to detect automatically
which format you have. I.e. nothing has to change for users unless they
actively choose to switch.
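
As a sketch of what such a naive accessor with automatic format
detection might look like (again not actual Portage code: the function
name, the "<vdb_root>.tar" location and the member naming are all
hypothetical):

```python
import os
import tarfile


def read_vdb_var(vdb_root, cpv, var):
    """Return the raw bytes of one VDB variable, whichever layout exists.

    Assumed (hypothetical) layouts:
      legacy: <vdb_root>/<cpv>/<var>   one file per variable
      tar:    <vdb_root>.tar           members named <cpv>/<var>
    """
    tar_path = vdb_root.rstrip("/") + ".tar"
    # Detection: if a valid tar archive is present, prefer it.
    if os.path.isfile(tar_path) and tarfile.is_tarfile(tar_path):
        with tarfile.open(tar_path, mode="r") as tar:
            member = tar.getmember(f"{cpv}/{var}")  # KeyError if absent
            return tar.extractfile(member).read()
    # Otherwise fall back to the one-file-per-variable layout.
    with open(os.path.join(vdb_root, cpv, var), "rb") as f:
        return f.read()
```

Callers would not need to know which format is on disk, e.g.
read_vdb_var("/var/db/pkg", "app-misc/foo-1.0", "EAPI") works for both.
Opening the archive for every lookup is the naive part; a real
implementation would likely keep it open or cache the member index.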

Like you, I think the main reason for doing this should be performance,
basically allowing faster operations.

I feel though that we should aim to use a single solution to maintain
the number of "trees" that we have: metadata, VDB, Packages/binpkgs,
for they all seem to exhibit similar (I/O) behaviour when in use.

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level