Gentoo Archives: gentoo-dev

From: Jason Zaman <perfinion@g.o>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Sun, 28 Jan 2018 07:01:26
In Reply to: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure by "Michał Górny"
On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> Migrating mirrors to the hashed structure
> -----------------------------------------
> The hard link solution allows us to save space on the master mirror.
> Additionally, if ``-H`` option is used by the mirrors it avoids
> transferring existing files again. However, this option is known
> to be expensive and could cause significant server load. Without it,
> all mirrors need to transfer a second copy of all the existing files.
>
> The symbolic link solution could be more reliable if we could rely
> on mirrors using the ``--links`` rsync option. Without that, symbolic
> links are not transferred at all.

These rsync options might help for mirrors too:

    --compare-dest=DIR      also compare destination files relative to DIR
    --copy-dest=DIR         ... and include copies of unchanged files
    --link-dest=DIR         hardlink to files in DIR when unchanged

> Using hashed structure for local distfiles
> ------------------------------------------
> The hashed structure defined above could also be used for local distfile
> storage as used by the package manager. For this to work, the package
> manager authors need to ensure that:
>
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
>    directory where distfiles specific to the package are linked
>    in a flat structure.
>
> b. All tools are updated to support the nested structure.
>
> c. The package manager provides a tool for users to easily manipulate
>    distfiles, in particular to add distfiles for fetch-restricted
>    packages into an appropriate subdirectory.
>
> For extended compatibility, the package manager may support finding
> distfiles in flat and nested structure simultaneously.

Trying nested first and then falling back to flat would make it easy for users who have to download distfiles for fetch-restricted packages, because the instructions stay as "move it to /usr/portage/distfiles".

Alternatively, the tool could have a mode that goes through all files in the base dir and moves each one to where it should be in the nested tree. Then you save everything to the same dir and run edist --fix.

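A minimal sketch of the nested-then-flat lookup and of a "fix" mode that sorts the base dir into the nested tree. The names `subdir_for`, `find_distfile` and `fix_flat` are hypothetical, and the bucketing rule (first two hex digits of a BLAKE2b hash of the filename) is only an assumption for illustration; the thread does not settle on one.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List, Optional


def subdir_for(filename: str) -> str:
    """Assumed bucketing rule: first two hex digits of a filename hash."""
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def find_distfile(distdir: Path, filename: str) -> Optional[Path]:
    """Look in the nested location first, then fall back to the flat one."""
    nested = distdir / subdir_for(filename) / filename
    if nested.is_file():
        return nested
    flat = distdir / filename
    return flat if flat.is_file() else None


def fix_flat(distdir: Path) -> List[Path]:
    """Move every regular file in the base dir into its nested subdirectory."""
    moved = []
    for entry in sorted(distdir.iterdir()):
        if not entry.is_file():
            continue  # skip the bucket directories themselves
        target = distdir / subdir_for(entry.name) / entry.name
        target.parent.mkdir(exist_ok=True)
        shutil.move(str(entry), str(target))
        moved.append(target)
    return moved
```

With this shape the user instructions can stay "drop the file into the base dir": the lookup still finds it there, and running the fix mode later only relocates it.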
> Rationale
> =========
> Algorithm for splitting distfiles
> ---------------------------------
> In the original debate that occurred in bug #534528 [#BUG534528]_,
> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, it's significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.

Is the filename the original upstream one or the renamed one? E.g. with SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?

I think I'm in favour of using the initial part of the filename anyway. Sure, it's not balanced, but it's still a hell of a lot more balanced than today and it's really easy.

Another thing I'm wondering is whether we can just use the same dir layout as the packages themselves. That would fix texlive, since it has a whole lot of separate packages, e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz. There is a problem if many packages use the same distfiles (quite extensive for SELinux: every single one of the sec-policy/selinux-* packages has identical distfiles), so I'm not sure how to deal with that. This would also make it easy in future to make the sandbox restrict access to files outside of that package, if we wanted to do that.

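To make the shared-distfile problem with a cat/pkg layout concrete, here is a small sketch that reports which distfiles would end up under more than one package directory. The function name `shared_distfiles` and the package and file names in the usage example are made up for illustration; the mapping would have to come from the real tree.

```python
from collections import defaultdict
from typing import Dict, List


def shared_distfiles(pkg_distfiles: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each distfile used by more than one package to its owning packages."""
    owners = defaultdict(list)
    for pkg, files in pkg_distfiles.items():
        for name in files:
            owners[name].append(pkg)
    # Only distfiles with several owners are a problem for a cat/pkg layout.
    return {name: sorted(pkgs) for name, pkgs in owners.items() if len(pkgs) > 1}


# Illustrative input only; these are not real Manifest entries.
example = {
    "sec-policy/selinux-foo": ["policy-1.0.tar.bz2"],
    "sec-policy/selinux-bar": ["policy-1.0.tar.bz2"],
    "app-cat/pkg": ["pkg-1.0.tgz"],
}
duplicates = shared_distfiles(example)
```

Every entry in the result would need either a duplicate copy per package or a link, which is the open question above.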
> The alternate option of using file hash has the advantage of having
> a more balanced split. Furthermore, since hashes are stored
> in Manifests using them is zero-cost. However, this solution has two
> significant disadvantages:
>
> 1. The hash values are unknown for newly-downloaded distfiles, so
>    ``repoman`` (or an equivalent tool) would have to use a temporary
>    directory before locating the file in appropriate subdirectory.
>
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
>    hash mismatches would be placed in the wrong subdirectory,
>    potentially causing confusing errors.

Not just this, but on principle, I also think you should be able to read an ebuild and compute the URL to download the file from the mirrors without any extra knowledge (and especially without having to download the distfile first).

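A sketch of what "compute the URL from the ebuild alone" could look like with filename-based bucketing. Assumptions, none of which come from the draft: the helper names `distfile_names` and `mirror_url`, the `distfiles/<bucket>/<name>` path layout, the BLAKE2b two-hex-digit bucket, and that the name after `->` is the one that counts. The SRC_URI parsing is deliberately simplified and ignores USE conditionals.

```python
import hashlib
from typing import List


def distfile_names(src_uri: str) -> List[str]:
    """Return local distfile names from a simplified SRC_URI string.

    Handles plain URIs and 'uri -> name' renames only.
    """
    tokens = src_uri.split()
    names = []
    i = 0
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i + 1] == "->":
            names.append(tokens[i + 2])  # renamed distfile
            i += 3
        else:
            names.append(tokens[i].rsplit("/", 1)[-1])  # basename of the URI
            i += 1
    return names


def mirror_url(mirror: str, filename: str) -> str:
    """Build a mirror URL from the filename alone (assumed layout)."""
    bucket = hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]
    return f"{mirror.rstrip('/')}/distfiles/{bucket}/{filename}"
```

Nothing here needs the file contents or a Manifest lookup, which is the property a content-hash layout would lose.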
> Using filename hashes has proven to provide a similar balance
> to using file hashes. Furthermore, since filenames are known up front
> this solution does not suffer from the both listed problems. While
> hashes need to be computed manually, hashing short string should not
> cause any performance problems.
>
> .. figure:: glep-0075-extras/by-filename.png
>
>    Distribution of distfiles by first character of filenames
>
> .. figure:: glep-0075-extras/by-csum.png
>
>    Distribution of distfiles by first hex-digit of checksum
>    (x --- content checksum, + --- filename checksum)
>
> .. figure:: glep-0075-extras/by-csum2.png
>
>    Distribution of distfiles by two first hex-digits of checksum
>    (x --- content checksum, + --- filename checksum)

Do you have an easy way to calculate how big the distfiles are per category or per cat/pkg? I'd be interested to see.

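One possible way to get those numbers, sketched under the assumption that each package directory has a Manifest whose `DIST` lines carry the filename followed by the size in bytes. The function names are hypothetical. Note that a distfile shared by several packages is counted once per package, so totals across packages overstate the real disk usage.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict


def dist_sizes_by_package(repo: Path) -> Dict[str, int]:
    """Sum DIST sizes from every <category>/<package>/Manifest under repo."""
    sizes = defaultdict(int)
    for manifest in sorted(repo.glob("*/*/Manifest")):
        pkg = f"{manifest.parent.parent.name}/{manifest.parent.name}"
        for line in manifest.read_text(encoding="utf-8").splitlines():
            fields = line.split()
            # Expected shape: DIST <filename> <size> <HASH> <value> ...
            if len(fields) >= 3 and fields[0] == "DIST":
                sizes[pkg] += int(fields[2])
    return dict(sizes)


def dist_sizes_by_category(per_package: Dict[str, int]) -> Dict[str, int]:
    """Roll the per-package totals up to their categories."""
    sizes = defaultdict(int)
    for pkg, size in per_package.items():
        sizes[pkg.split("/", 1)[0]] += size
    return dict(sizes)
```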
> Backwards Compatibility
> =======================
> Mirror compatibility
> --------------------
> The mirrored files are propagated to other mirrors as opaque directory
> structure. Therefore, there are no backwards compatibility concerns
> on the mirroring side.
>
> Backwards compatibility with existing clients is detailed
> in `migrating mirrors to the hashed structure`_ section. Backwards
> compatibility with the old clients will be provided by preserving
> the flat structure during the transitional period.

Even if there was no transition, things wouldn't be terrible, because portage would fall back to just downloading from SRC_URI directly if the mirrors fail.
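The fallback order can be pictured roughly like this. It is an illustration of the idea only, not portage's actual fetch code: `fetch_with_fallback` and the `try_fetch` callback are hypothetical, and the flat `distfiles/<name>` mirror path is an assumption.

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    filename: str,
    mirrors: Iterable[str],
    upstream_uri: str,
    try_fetch: Callable[[str], bool],
) -> Optional[str]:
    """Try each mirror in order, then the upstream SRC_URI entry.

    Returns the URL that succeeded, or None if every candidate failed.
    """
    candidates = [f"{m.rstrip('/')}/distfiles/{filename}" for m in mirrors]
    candidates.append(upstream_uri)  # last resort: fetch from upstream directly
    for url in candidates:
        if try_fetch(url):
            return url
    return None
```

An old client that only knows the flat path would see every mirror attempt fail once the flat files are gone, and would then end up at the upstream URI.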