It has come to my attention that, during recent weeks, a small number of
users have been complaining about the size of the rsync tree.
My august colleagues have proposed many ingenious solutions, but
misfortunately they are all complicated and involve a lot of manual
work. I believe the following small changes (which can mostly be
automated) would prove of much larger benefit to the community for a
vastly reduced cost.

To begin with, I'd like to draw your attention to comments in ebuilds.
It is an oft-forgotten fact that these items provide absolutely no
benefit to the end user. "Surely", I hear you say, "it is not worth
getting hung up over such an insignificant triviality! What harm do a
few trifling little remarks do?". Yet, when actually measured, these
'innocent minutiae' (as you might call them had you a penchant for
obsolete vocabulary or a predilection for pomposity) account for
approximately 20% of the total ebuild content in the tree. It is obvious
that an immediate ban upon these silly things, alongside a small script
to remove them from the tree, would provide a very large gain for our
users without having to remove any existing code. Adding in a repoman
check to error out if such lines were present would clearly be a good
idea.

Next up are blank lines, which, as all the world knows, are of no use at
all to anyone. These account for a staggering 150KBytes of data in the
main tree, which, over a 9600 dialup line, would save us over two
minutes on an emerge sync. Again, removing these pointless wastes of
space via a bash script is trivial.
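The two-minute figure can be checked with a little arithmetic. This is a minimal sketch, assuming that a KByte is 1024 bytes, that a byte costs 8 bits on the wire, and that the 9600 line carries 9600 bits per second with no protocol overhead and no compression; the function name `transfer_seconds` is made up for illustration.

```python
def transfer_seconds(size_bytes: int, line_rate_bps: int) -> float:
    """Ideal time to move size_bytes over a link of line_rate_bps bits per second."""
    return size_bytes * 8 / line_rate_bps


# The 150 KBytes of blank lines quoted above, taken as 150 * 1024 bytes.
blank_line_bytes = 150 * 1024
seconds = transfer_seconds(blank_line_bytes, 9600)
print(f"{seconds:.0f} s, or {seconds / 60:.1f} min")  # prints "128 s, or 2.1 min"
```

Under those assumptions the saving is 128 seconds, which is indeed a little over two minutes.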
Staying with the blank spaces thing, leading whitespaces (which serve no
practical purpose and are only used to make the code "look pretty" --
although how a bash script could ever be considered "pretty" is beyond
my limited mind) account for nearly half a megabyte of data. Clearly
these should immediately be removed and any developer using them in the
future should have their cvs access suspended pending a review of their
status within the project -- as devrel and our managers will tell you,
being nice to the users is our number one priority.
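For anyone wishing to repeat the measurements, here is a rough sketch of how the three kinds of waste discussed so far (comment lines, blank lines and leading whitespace) might be counted. It is not the script used for the figures above. The `SAMPLE` fragment and the function name `count_overhead` are hypothetical, only whole-line comments are counted, and a `#` at the start of a line inside a here-document would be miscounted as a comment.

```python
def count_overhead(text: str) -> dict:
    """Count bytes spent on comment lines, blank lines and leading whitespace."""
    counts = {"comment": 0, "blank": 0, "indent": 0, "total": len(text.encode())}
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip(" \t")
        if stripped.strip() == "":
            # Empty or whitespace-only line: the whole line is overhead.
            counts["blank"] += len(line.encode())
        elif stripped.startswith("#"):
            # Whole-line comment, including its newline and any indentation.
            counts["comment"] += len(line.encode())
        else:
            # Code line: only the leading spaces and tabs are counted.
            counts["indent"] += len(line) - len(stripped)
    return counts


# Hypothetical ebuild fragment, for illustration only.
SAMPLE = (
    "# Hypothetical ebuild fragment\n"
    "\n"
    "src_compile() {\n"
    "\teconf\n"
    "\temake\n"
    "}\n"
)

counts = count_overhead(SAMPLE)
for kind in ("comment", "blank", "indent"):
    print(f"{kind}: {counts[kind]} of {counts['total']} bytes")
```

Running this over every ebuild in a checkout and summing the results would give tree-wide totals to compare against the numbers quoted here.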
There are other trivial ways to save space too. The commonly used helper
function "emake", for example, is a shocking five bytes in length.
Replacing this with a much more helpfully named "e", and likewise
replacing "econf" with "c", would gain something like 50KBytes. If we
also replace src_unpack, src_compile and src_install with more
appropriate alternatives we could shave off a further 300KBytes. I have
no doubt that the reader could extend this logic to the other portage
internals and common function names, bringing the total up to half a
megabyte or more.
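The saving from a rename is simply the number of occurrences multiplied by the number of bytes dropped. A minimal sketch follows, using only the two renames named above ("emake" to "e" and "econf" to "c"); the replacements for the src_* functions are not specified here, so they are left out. The function name `bytes_saved` and the sample fragment are hypothetical.

```python
import re


def bytes_saved(text: str, renames: dict) -> int:
    """Bytes saved by replacing each whole-word old name with its new name."""
    saved = 0
    for old, new in renames.items():
        # Match the name only when it is not part of a longer identifier.
        pattern = rf"(?<![\w-]){re.escape(old)}(?![\w-])"
        hits = len(re.findall(pattern, text))
        saved += hits * (len(old) - len(new))
    return saved


# Hypothetical ebuild fragment, for illustration only.
fragment = "src_compile() {\n\teconf\n\temake\n}\n"
print(bytes_saved(fragment, {"emake": "e", "econf": "c"}))  # prints 8
```

Each rename here saves four bytes per call, so the tree-wide total depends entirely on how often the helpers are called.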
This can be extended to other functions, of course. In particular I'd
like to draw your attention to the absurdly named "flag-o-matic.eclass".
Merely inheriting this eclass adds at least thirteen bytes (that's over
a hundred bits!) of bloat to an ebuild, and that's before we start on
the ridiculously verbose function names. What's all this "replace-flags"
nonsense I ask you? Any educated programmer can see that "rf" is a far
more useful name. Even those who are not convinced that space needs to
be saved must surely notice how much developer time would be saved
through reduced typing.

It remains a mystery to me how anyone could possibly have overlooked the
following suggestion. Currently, we install 'dependency information'
inside ebuilds. This is blatantly pointless -- as RedHat have so ably
demonstrated with their 'rpm' installer (and, albeit in a non-Linux
environment, I am assured that Microsoft are in the same boat), there is
no need for automatic dependency tracking and resolution. Our users are
more than capable of working this out for themselves. Similarly, the
HOMEPAGE variable is entirely pointless and has been superseded by
Google.

Oh, and then we come to metadata.xml. As all the world knows, xml is a
massive waste of space, and (as a data interchange format not a data
storage format) utterly unsuited for configuration files. A typical
metadata.xml file is 95%+ noise. By replacing these with flat text files
listing the maintainers, we could save somewhere in the region of one
and a half megabytes.

Also, no-one has yet considered all the useless fluff in the tree that
nobody actually uses. By removing all ebuilds and eclasses related to
emacs, kde, gnome, php, gaim or java from the tree, as well as
anything which is only supplied as a binary, we could save... Well, I'll
let you do the calculations yourselves. Although mathematics is not the
main focus of my degree, I believe I understand enough to know that the
result is a very big number.

Similarly, all those "compile fix" patches we supply are obviously
worthless. If anyone has any doubt, I suggest they just look at how
many users are using broken CFLAGS and compilers -- clearly, working
code is not a major concern. We should of course leave in security
patches, since security is our number one priority.

ChangeLogs are the next thing to fall under my scrutiny. Clearly these
are entirely worthless, since anyone who cares can just read the cvs
logs and use diff. Kiss goodbye to 14MBytes of junk. Hang on? Did I just
say 14MBytes? Yes. Fourteen Megabytes. That's a one, then a four, then
six zeros. That's fourteen million bytes, or over one hundred and ten
million bits. When syncing my GPRS phone whilst sitting inside a large
metal cage in north Yorkshire, that could save me over TWELVE HOURS on
an emerge sync.

I understand that my previous point may cause a small amount of disquiet
amongst a small proportion of our userbase. After all, how are they
supposed to decide whether to update if they do not know what an update
will change? To them, I must point out that whilst such an attitude is
appropriate for a small hobbyist distribution aimed at skilled users, it
is utterly at odds with what enterprise users require. For them, it is
important that they can perform updates without having to know what they
are doing -- remember that in a corporate environment, any information
is too much information, and time spent reading ChangeLogs is time not
spent doing useful work. Please do not forget that better enterprise
support is our number one priority.

Finally, I must draw KEYWORDS to your scrutiny, and in particular the
misguided choice of ~ to indicate unstable. In ASCII, the tilde
character is represented by the octet 0x7E (hexadecimal), or, in binary,
01111110. A cursory glance at this will show that it contains
significantly more 1 bits than 0 bits. As anyone who has had a basic
schooling in the field of compression can tell you, 1 bits do not
compress as well as 0 bits (they don't have as much empty space in the
middle), so clearly we would be better off picking something else. I
propose the ( character, which has only one 1 bit for every three 0 bits.
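The bit counts themselves are easy to verify. The sketch below assumes each ASCII character is stored as a full 8-bit octet with a leading zero; the function name `bit_counts` is made up for illustration. It checks only the counts and says nothing about how well either character compresses.

```python
def bit_counts(ch: str) -> tuple:
    """Return the 8-bit pattern of ch with its number of 1 bits and 0 bits."""
    bits = format(ord(ch), "08b")
    return bits, bits.count("1"), bits.count("0")


for ch in "~(":
    bits, ones, zeros = bit_counts(ch)
    print(f"{ch!r}: 0x{ord(ch):02X} = {bits} ({ones} one bits, {zeros} zero bits)")
# prints:
# '~': 0x7E = 01111110 (6 one bits, 2 zero bits)
# '(': 0x28 = 00101000 (2 one bits, 6 zero bits)
```

So the tilde carries six 1 bits against two 0 bits, while the ( character carries two 1 bits against six 0 bits.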
Also, I suggest we drop the amd64 keyword and just use x86 to save
space, since we all know fine well that amd64 is just like x86 with a
few extra bits stuck onto the end. Or rather, the start, since x86 gets
its bytes backwards...

Gentlemen, ladies, jforman, I believe those remedies outlined herein are
a far more sensible solution than any other current proposal. I eagerly
await the implementation.

Ciaran McCreesh : Gentoo Developer (Vim, Fluxbox, Sparc, Mips)
Mail : ciaranm at gentoo.org
Web : http://dev.gentoo.org/~ciaranm