Gentoo Archives: gentoo-science

From: Auke Booij <auke@××××××.com>
To: gentoo-soc@l.g.o, gentoo-science@l.g.o
Subject: [gentoo-science] G-CRAN weekly report #7 (warning: big read)
Date: Mon, 12 Jul 2010 20:39:25
Message-Id: AANLkTikupO3Ei5CdsJIJQcsTUPbJeCvb16mplqHqEdPc@mail.gmail.com
As the subject says, this report is pretty long. It's intended for
those who haven't closely followed my work up until now and would like
to catch up, so go grab a cup of coffee if you really want to read
this to the end.

Subjects in this report (in order):
- introduction to the project
- what I was up to last week
- instructions on installing packages from Bioconductor and CRAN
- g-common, the interface (or actually lack of interface) this project will have
- plans for the coming week and next week

Perhaps an introduction of the circumstances is in order. R is a
language for statisticians. With statistics being such a wide topic,
there are thousands of additional packages you can install to further
analyze data, and the Bioconductor project adds another field to R by
introducing genomics. My job is to enable Gentoo users to cleanly
install the latest versions of these packages system-wide, as opposed
to directly calling R's package installers and ending up with dangling
files. As of last week, I had reached the point where some packages
installed correctly, but there were some rough edges too. For packages
not relying on external (non-R) libraries, this should all be smoothed
out now.

I spent a lot of time communicating with several parties last week.
There was a minor issue with the Bioconductor repositories, I spoke to
some people about g-common, talked a bit with the CRAN maintainers,
and had some technical discussions with rafaelmartins, who's a GSoC
student working on g-octave, as you may know.

Then there are some helpful dependency resolution changes.
Dependencies on R packages now work perfectly fine, and external
dependencies are going to be tackled soon (but it won't be pretty).

So why is this helpful? It means you can install most Bioconductor
packages without trouble.

As promised in an earlier email to the gentoo-science ML, here are
some instructions. Please note that this is of course not how you'll
eventually use g-cran; I'm still working on the interface (more on
that later).

First, create two overlays. I'm simply calling them bioconductor_1 and
bioconductor_2. bioconductor_1 primarily contains code; bioconductor_2
consists primarily of gene databases.
# mkdir -p /usr/local/portage/bioconductor_1/profiles
# mkdir -p /usr/local/portage/bioconductor_2/profiles
Now we need to set the repo_name and categories of these overlays, too.
# echo "bioconductor_1" >> /usr/local/portage/bioconductor_1/profiles/repo_name
# echo "bioconductor_2" >> /usr/local/portage/bioconductor_2/profiles/repo_name
# echo "dev-R" >> /usr/local/portage/bioconductor_1/profiles/categories
# echo "dev-R" >> /usr/local/portage/bioconductor_2/profiles/categories
It's time to actually get the tree. Make sure you've installed g-cran
(it's in the science overlay), sync the repositories and then generate
the tree:
# g-cran /usr/local/portage/bioconductor_1 sync http://www.bioconductor.org/packages/devel/bioc
# g-cran /usr/local/portage/bioconductor_2 sync http://www.bioconductor.org/packages/devel/data/annotation
# g-cran /usr/local/portage/bioconductor_1 generate-tree
# g-cran /usr/local/portage/bioconductor_2 generate-tree

You can now add the overlays to your favorite package manager and
start emerging (*ahem* - installing) packages. If all is well, you
should be able to install, for example, dev-R/zebrafishdb (this is a
bioconductor_2 database package that pulls in several bioconductor_1
packages). I have absolutely no clue as to what you can do with these
packages, but I suppose some biology fans out there can clarify that.
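For Portage users, "adding the overlays" usually comes down to listing
them in PORTDIR_OVERLAY. The fragment below is only a sketch: it
assumes Portage as the package manager and /etc/make.conf as its
configuration file, neither of which is spelled out above. Other
package managers have their own configuration.

```shell
# Sketch of a make.conf fragment (assumed location: /etc/make.conf).
# Portage reads this file as shell-style variable assignments, so the
# existing value is kept and the two new overlays are appended to it.
PORTDIR_OVERLAY="${PORTDIR_OVERLAY} /usr/local/portage/bioconductor_1 /usr/local/portage/bioconductor_2"
```

With that in place, something like `emerge --ask dev-R/zebrafishdb`
should find the package in the overlays.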

Now, it may be that Portage complains about missing Manifest files. If
that's the case, then also run:
# for x in /usr/local/portage/bioconductor_{1,2}/dev-R/*; do touch "${x}/Manifest"; done
I hope that does the trick; please tell me if it does, and whether
it's needed at all. Once you've done this and the trick actually
works, you should be able to install dev-R/zebrafishdb.

If you don't need no stinkin' databases of deoxyribonucleic acid, but
are interested in CRAN, just create a cran overlay as we did for
bioconductor_1 and bioconductor_2, but use http://cran.r-project.org
as the source repository, and 'cran' for the overlay name. Better yet,
find a mirror close to you at http://cran.r-project.org/mirrors.html
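The overlay setup is the same for every repository, so it can be
wrapped in a small helper. This is a sketch only: make_overlay is a
made-up name, not part of g-cran, and it writes the files with '>'
rather than '>>' so that running it twice doesn't duplicate entries.

```shell
# make_overlay ROOT NAME  (hypothetical helper, not part of g-cran)
# Creates ROOT/NAME with the profiles/repo_name and profiles/categories
# files that the manual steps above set up by hand.
make_overlay() {
    root=$1
    name=$2
    mkdir -p "${root}/${name}/profiles" || return 1
    # repo_name holds just the overlay's name.
    echo "${name}" > "${root}/${name}/profiles/repo_name"
    # All generated packages live in the dev-R category.
    echo "dev-R" > "${root}/${name}/profiles/categories"
}
```

For CRAN you would run `make_overlay /usr/local/portage cran` as root,
followed by `g-cran /usr/local/portage/cran sync http://cran.r-project.org`
(or your mirror) and `g-cran /usr/local/portage/cran generate-tree`.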

Okay, so that was quite a journey to get a simple SQLite database of
gene data. g-common is what will make all this easier. Unfortunately,
I haven't heard much lately from the other two students I was
cooperating with, so I'm going to invent something of my own. The plan
has remained roughly the same, but time after time I struggle to
explain it, so please bear with me as you read this.

[start explanation of g-common]
Current projects to install non-ebuild packages generate ebuild files
on request, put them in an overlay and tell Portage to install them.
The problem with this approach is that the ebuilds are only generated
once you know what you want to install, i.e. the overlay doesn't get
fully populated up front. This implies you cannot search for packages
in such repositories, you cannot depend on packages in such
repositories, and you can't trivially update packages from them. I'd
like to generate a full package tree at sync time, whether or not you
want to use it. Further, this syncing should work like that of any
other overlay: ideally, support for non-ebuild repositories is
transparent to users. I'm going to do this via an abstraction layer
called g-common, for which support needs to be written in every
package manager. But once that support is written, and the non-ebuild
repository reading code is adjusted to work with g-common, there is
nothing stopping you from using a non-ebuild repository like a regular
ebuild overlay.
How this works is not exactly trivial to explain, but the important
part is that even though tools like g-cran are doing the real work,
the package manager thinks it's dealing with a regular PMS-compliant
tree. At sync time, the package manager simply calls the g-common
method for syncing a tree, which in turn calls the appropriate
repository driver to fetch the new package listing from the true
remote repository. To integrate this well, some patching is needed. At
install time, all the various src_unpack, src_install, etc. phases
result in calls to g-common, and those in turn result in calls to the
appropriate repository driver, which then executes the phase, but all
of this is sort of PMS-compliant. Call it over-engineering, but it'll
feel like magic and I'm going to prove it.
[end explanation of g-common]
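
The call chain described above (package manager, then g-common, then
repository driver) can be illustrated with a toy dispatcher. Every
name here (g_common, driver_cran_sync, driver_cran_src_install) is
invented for the illustration; g-common itself has not been written
yet, so this says nothing about its real interface.

```shell
# Toy illustration of the dispatch idea; all names are hypothetical.

# Stand-in repository driver for CRAN-style repositories.
driver_cran_sync() {
    echo "cran: fetching package listing for $1"
}
driver_cran_src_install() {
    echo "cran: installing $1"
}

# g_common DRIVER ACTION [ARGS...]
# Forwards a request from the package manager to the function
# driver_<DRIVER>_<ACTION>, passing the remaining arguments along.
g_common() {
    driver=$1
    action=$2
    shift 2
    handler="driver_${driver}_${action}"
    # Refuse actions the driver does not implement.
    if ! command -v "$handler" >/dev/null 2>&1; then
        echo "g-common: driver '$driver' has no '$action' action" >&2
        return 1
    fi
    "$handler" "$@"
}
```

A package manager's sync hook would then boil down to something like
`g_common cran sync /usr/local/portage/cran`, and an ebuild phase to
`g_common cran src_install <package>`.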

The plan for this week is to /finally/ get some work done on g-common
and perhaps prepare the code for external dependency resolution. On
Saturday, I'm unfortunately leaving for vacation, so you won't see me
doing much. After that vacation, first of all there's GUADEC 2010,
which I'm going to attend, but of course I'm also going to continue
developing g-common and finish external dependency resolution.

Now, if you've made it to this point in my email, I'd really like to
thank you, because I know how easy it is to simply mark an email as
read and move on. You are why I'm developing this, thanks a lot!

The next weekly report will be in two weeks,
Auke Booij / tulcod.