Gentoo Archives: gentoo-science

From: Auke Booij <auke@××××××.com>
To: gentoo-soc@l.g.o, gentoo-science@l.g.o
Subject: [gentoo-science] G-CRAN weekly report #7 (warning: big read)
Date: Mon, 12 Jul 2010 20:39:25
Message-Id: AANLkTikupO3Ei5CdsJIJQcsTUPbJeCvb16mplqHqEdPc@mail.gmail.com
As the subject says, this report is pretty long. It's intended for
those who haven't closely followed my work up until now and would like
to catch up, so go grab a cup of coffee if you really want to read
this to the end.

Subjects in this report (in order):
-intro of the project
-what I have been up to last week
-instructions on installing packages from bioconductor and CRAN
-g-common, the interface (or actually lack of interface) this project will have
-plans for the coming week and next week

Perhaps an introduction to the circumstances is in order. R is a
language for statisticians. With statistics being such a wide topic,
there are thousands of additional packages you can install to further
analyze data, and the Bioconductor project adds another field to R by
introducing genomics. My job is to cleanly enable Gentoo users to
install the latest versions of these packages systemwide, as opposed
to directly calling R's package installers and ending up with dangling
files. Last week, I had reached the point where some packages installed
correctly, but there were some rough edges too. For packages not
relying on external (non-R) libraries, this should all be smoothed out
now.

I spent a lot of time communicating with several parties last week.
There was a minor issue with the Bioconductor repositories, I've
spoken to some people about g-common, talked a bit with the CRAN
maintainers and had some technical discussions with rafaelmartins,
who's a GSoC student working on g-octave, as you may know.

Then there are some helpful dependency resolution changes.
Dependencies on R packages now work perfectly fine, and external
dependencies are going to be tackled soon (but it won't be pretty).

So why is this helpful? It means you can install most Bioconductor
packages flawlessly.

As promised in an earlier email to the gentoo-science ML, here are some
instructions. Please note that this will of course not be the way
you'll eventually use g-cran; I'm still working on the interface
(more on that later).

First, create two overlays. I'm simply calling them bioconductor_1 and
bioconductor_2. One of them primarily contains code, the other
consists primarily of gene databases.
# mkdir -p /usr/local/portage/bioconductor_1/profiles
# mkdir -p /usr/local/portage/bioconductor_2/profiles
Now we need to set the repo_name and categories of these overlays, too.
# echo "bioconductor_1" >> /usr/local/portage/bioconductor_1/profiles/repo_name
# echo "bioconductor_2" >> /usr/local/portage/bioconductor_2/profiles/repo_name
# echo "dev-R" >> /usr/local/portage/bioconductor_1/profiles/categories
# echo "dev-R" >> /usr/local/portage/bioconductor_2/profiles/categories
It's time to actually get the tree. Make sure you've installed g-cran
(it's in the science overlay), sync the repositories and then generate
the tree:
# g-cran /usr/local/portage/bioconductor_1 sync http://www.bioconductor.org/packages/devel/bioc
# g-cran /usr/local/portage/bioconductor_2 sync http://www.bioconductor.org/packages/devel/data/annotation
# g-cran /usr/local/portage/bioconductor_1 generate-tree
# g-cran /usr/local/portage/bioconductor_2 generate-tree

You can now add the overlays to your favorite package manager and
start emerging (*ahem* - installing) packages. If all is well, you
should be able to install, for example, dev-R/zebrafishdb (this is a
bioconductor_2 database package that pulls in several bioconductor_1
packages). I have absolutely no clue as to what you can do with these
packages, but I suppose some biology fans out there can clarify that.
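
For Portage specifically, one common way to add the overlays is to list
them in PORTDIR_OVERLAY in /etc/make.conf. The fragment below is a sketch
under that assumption; other package managers use their own configuration
files, and your make.conf may already set this variable differently.

```shell
# Fragment for /etc/make.conf (assumes Portage; paths are the ones used above).
# Keeps any overlays that are already configured and appends the two new ones.
PORTDIR_OVERLAY="${PORTDIR_OVERLAY} /usr/local/portage/bioconductor_1 /usr/local/portage/bioconductor_2"
```

With that in place, the example package would be installed with:
# emerge --ask dev-R/zebrafishdb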

Now, it may be that portage complains about missing Manifest files. If
that's the case, then also run:
# for x in /usr/local/portage/bioconductor_{1,2}/dev-R/*; do touch "${x}/Manifest"; done
I hope that does the trick; please tell me whether it does, and whether
it's needed at all. Once you've done this, and assuming the trick actually
works, you should be able to install dev-R/zebrafishdb.
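
A slightly more defensive variant of the Manifest loop could look like the
sketch below. The function name touch_manifests is made up for this
example and is not part of g-cran. It only creates a Manifest where none
exists, and it skips anything in dev-R that is not a package directory.

```shell
# Hypothetical helper (not part of g-cran): create an empty Manifest in every
# package directory of an overlay's dev-R category.
# Usage: touch_manifests <overlay-directory>
touch_manifests() {
    overlay_dir=$1
    for pkg in "${overlay_dir}"/dev-R/*; do
        # Skip stray files, and the unexpanded glob if the category is empty.
        [ -d "${pkg}" ] || continue
        # Leave an existing Manifest untouched.
        [ -e "${pkg}/Manifest" ] || touch "${pkg}/Manifest"
    done
}
```

After defining it, you would call it once per overlay:
# touch_manifests /usr/local/portage/bioconductor_1
# touch_manifests /usr/local/portage/bioconductor_2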

If you don't need no stinkin' databases of deoxyribonucleic acid, but
are interested in CRAN, just create a cran overlay as we did for
bioconductor_1 and bioconductor_2, but use http://cran.r-project.org
as the source repository, and 'cran' for the overlay name. Better yet,
find a mirror close to you at http://cran.r-project.org/mirrors.html
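
Spelled out, the CRAN setup follows the same steps as the Bioconductor
overlays. The sketch below wraps them in one function. The names
setup_r_overlay and G_CRAN are made up for this example; only the two
g-cran invocations (sync and generate-tree) are taken from the commands
in this mail.

```shell
# Hypothetical wrapper around the steps above; not part of g-cran itself.
# G_CRAN can be overridden (for example with "echo") to preview the g-cran
# calls without running them.
G_CRAN=${G_CRAN:-g-cran}

# Usage: setup_r_overlay <overlay-directory> <repo-name> <repository-url>
setup_r_overlay() {
    overlay_dir=$1   # e.g. /usr/local/portage/cran
    repo_name=$2     # e.g. cran
    repo_url=$3      # e.g. http://cran.r-project.org, or a mirror near you
    mkdir -p "${overlay_dir}/profiles" || return 1
    echo "${repo_name}" > "${overlay_dir}/profiles/repo_name"
    echo "dev-R" > "${overlay_dir}/profiles/categories"
    # Same invocation form as the Bioconductor commands earlier in this mail.
    "${G_CRAN}" "${overlay_dir}" sync "${repo_url}" || return 1
    "${G_CRAN}" "${overlay_dir}" generate-tree
}
```

For CRAN that would be:
# setup_r_overlay /usr/local/portage/cran cran http://cran.r-project.org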

Okay, so that was quite a journey to get a simple sqlite database of
gene data. g-common is what will be making all this easier.
Unfortunately I haven't heard much lately from the other two students I
was cooperating with, so I'm going to invent something of my own. The
plan has remained roughly the same, but time after time I struggle to
explain it, so please bear with me as you read this.

[start explanation of g-common]
Current projects to install non-ebuild packages generate ebuild files
on request, put them in an overlay and tell portage to install them.
The problem with this approach is that the ebuilds are only generated
when you know what you want to install, i.e. the overlay doesn't get
fully populated upfront. This approach implies you cannot search for
packages in such repositories, you cannot depend on packages in such
repositories, and you can't trivially update packages in such
repositories. I'd like to generate a full package tree at sync time,
no matter if you want to use it or not. Further, this syncing should
work like any other overlay: ideally, support for non-ebuild
repositories is transparent to the users. I'm going to do this via an
abstraction layer called g-common, for which support needs to be
written for all package managers. But once that support is written,
and the non-ebuild repository reading code is adjusted to work with
g-common, there is nothing stopping you from using a non-ebuild
repository like a regular ebuild overlay.
How this works is not exactly trivial to explain, but the important
part is that even though tools like g-cran are doing the actual work, the
package manager thinks it's dealing with a regular PMS-compliant tree.
At sync time, the package manager simply calls the g-common method for
syncing a tree, which in turn calls the appropriate repository driver
to fetch the new package listing from the true remote repository. To
integrate this well, some patching is needed. At install time, all the
various src_unpack, src_install, etc. phases result in calls to
g-common, and again those result in calls to the appropriate
repository driver, which then executes the phase, but all this is sort
of PMS-compliant. Call it over-engineering, but it'll feel like magic
and I'm going to prove it.
[end explanation of g-common]
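
The dispatch idea in the explanation above can be illustrated with a tiny
sketch. Every name in it (g_common, driver_cran_sync, driver_cran_install,
the argument order) is hypothetical, since g-common's real interface has
not been designed yet. It only shows one entry point forwarding an action
to the driver that owns a repository.

```shell
# All names below are hypothetical; this only illustrates the dispatch idea.

# A repository driver exposes one function per action. This stand-in for an
# R driver just reports what it would do.
driver_cran_sync()    { echo "cran driver: fetch package list into $1"; }
driver_cran_install() { echo "cran driver: install into $1"; }

# Usage: g_common <driver> <action> <overlay-directory>
# The package manager calls this single entry point; g_common forwards the
# action to the matching function of the named driver.
g_common() {
    driver=$1
    action=$2
    overlay_dir=$3
    handler="driver_${driver}_${action}"
    # Refuse actions the driver does not implement.
    if ! command -v "${handler}" >/dev/null 2>&1; then
        echo "g_common: driver '${driver}' has no '${action}' action" >&2
        return 1
    fi
    "${handler}" "${overlay_dir}"
}
```

In this sketch, a sync of a CRAN overlay would reach the driver through a
call such as:
# g_common cran sync /usr/local/portage/cran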

The plan for this week is to /finally/ get some work done on g-common
and perhaps prepare the code for external dependency resolution. On
Saturday, I'm unfortunately leaving for vacation, so you won't see me
doing much. After that vacation, first of all there's GUADEC 2010
which I'm going to attend, but of course I'm also going to continue
developing g-common and finish external dependency resolution.

Now, if you've come to this point in my email, I'd really like to
thank you, because I know how easy it is to simply mark an email as
read and move on. You are why I'm developing this, thanks a lot!

The next weekly report will be in two weeks,
Auke Booij / tulcod.