[gentoo-science] G-CRAN weekly report #7 (warning: big read) - gentoo-science

From:	Auke Booij <auke@××××××.com>
To:	gentoo-soc@l.g.o, gentoo-science@l.g.o
Subject:	[gentoo-science] G-CRAN weekly report #7 (warning: big read)
Date:	Mon, 12 Jul 2010 20:39:25
Message-Id:	`AANLkTikupO3Ei5CdsJIJQcsTUPbJeCvb16mplqHqEdPc@mail.gmail.com`

As the subject says, this report is pretty long. It's intended for
those who haven't closely followed my work up until now and would like
to catch up, so go grab a cup of coffee if you really want to read
this to the end.

Subjects in this report (in order):
- intro to the project
- what I have been up to last week
- instructions on installing packages from Bioconductor and CRAN
- g-common, the interface (or actually lack of interface) this project will have
- plans for the coming week and next week

Perhaps an introduction to the circumstances is in order. R is a
language for statisticians. With statistics being such a wide topic,
there are thousands of additional packages you can install to further
analyze data, and the Bioconductor project adds another field to R by
introducing genomics. My job is to cleanly enable Gentoo users to
install the latest versions of these packages systemwide, as opposed
to directly calling R's package installers and ending up with dangling
files. Last week, I was at the point where some packages installed
correctly, but there were some rough edges too. For packages not
relying on external (non-R) libraries, this should all be smoothed out
now.

I've spent a lot of time communicating with several parties last week.
There was a minor issue with the Bioconductor repositories. I've also
spoken to some people about g-common, talked a bit with the CRAN
maintainers and had some technical discussions with rafaelmartins,
who's a GSoC student working on g-octave, as you may know.

Then there are some helpful dependency resolution changes.
Dependencies on R packages now work perfectly fine, and external
dependencies are going to be tackled soon (but it won't be pretty).

So why is this helpful? It means you can install most Bioconductor
packages flawlessly.

As promised in an earlier email to the gentoo-science ML, here are some
instructions. Please note that this will of course not be the way
you'll eventually use g-cran, but I'm still working on the interface
(more on that later).

First, create two overlays. I'm simply calling them bioconductor_1 and
bioconductor_2. One of them primarily contains code, the other
consists primarily of gene databases.
# mkdir -p /usr/local/portage/bioconductor_1/profiles
# mkdir -p /usr/local/portage/bioconductor_2/profiles
Now we need to set the repo_name and categories of these overlays, too.
# echo "bioconductor_1" >> /usr/local/portage/bioconductor_1/profiles/repo_name
# echo "bioconductor_2" >> /usr/local/portage/bioconductor_2/profiles/repo_name
# echo "dev-R" >> /usr/local/portage/bioconductor_1/profiles/categories
# echo "dev-R" >> /usr/local/portage/bioconductor_2/profiles/categories
It's time to actually get the tree. Make sure you've installed g-cran
(it's in the science overlay), sync the repositories and then generate
the tree:
# g-cran /usr/local/portage/bioconductor_1 sync http://www.bioconductor.org/packages/devel/bioc
# g-cran /usr/local/portage/bioconductor_2 sync http://www.bioconductor.org/packages/devel/data/annotation
# g-cran /usr/local/portage/bioconductor_1 generate-tree
# g-cran /usr/local/portage/bioconductor_2 generate-tree
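The overlay skeleton from the steps above could also be created with a
small helper. This is only a sketch: the function name make_overlay and
the OVERLAY_ROOT variable are made up for this example and are not part
of g-cran. It writes the files with > instead of >>, so running it
twice does not leave duplicate lines in repo_name.

```shell
#!/bin/sh
# Hypothetical helper, not part of g-cran: create the minimal overlay
# skeleton (profiles/repo_name and profiles/categories) for one overlay.
# OVERLAY_ROOT is an assumed variable; it defaults to /usr/local/portage.
OVERLAY_ROOT="${OVERLAY_ROOT:-/usr/local/portage}"

make_overlay() {
    name="$1"
    dir="${OVERLAY_ROOT}/${name}"
    mkdir -p "${dir}/profiles" || return 1
    # Overwrite rather than append, so a second run changes nothing.
    echo "${name}" > "${dir}/profiles/repo_name"
    echo "dev-R" > "${dir}/profiles/categories"
}
```

Calling "make_overlay bioconductor_1" and "make_overlay bioconductor_2"
as root would then replace the mkdir and echo lines; the g-cran sync
and generate-tree calls stay as they are.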

You can now add the overlays to your favorite package manager and
start emerging (*ahem* - installing) packages. If all is well, you
should be able to install, for example, dev-R/zebrafishdb (this is a
bioconductor_2 database package that pulls in several bioconductor_1
packages). I have absolutely no clue as to what you can do with these
packages, but I suppose some biology fans out there can clarify that.
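For Portage, one common way to add the overlays is the PORTDIR_OVERLAY
variable in /etc/make.conf. A minimal sketch, assuming the overlay
paths used above; other package managers have their own configuration
for this.

```shell
# Fragment for /etc/make.conf (Portage). The paths are the ones created
# above; adjust them if your overlays live elsewhere.
PORTDIR_OVERLAY="${PORTDIR_OVERLAY} /usr/local/portage/bioconductor_1 /usr/local/portage/bioconductor_2"
```

With that in place, "emerge -av dev-R/zebrafishdb" should at least find
the package.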

Now, it may be that Portage complains about missing Manifest files. If
that's the case, then also run:
# for x in /usr/local/portage/bioconductor_{1,2}/dev-R/*; do touch "${x}/Manifest"; done
I hope that does the trick. Please tell me if it does, and whether
it's needed at all. Once you've done this and the trick actually
works, you should be able to install dev-R/zebrafishdb.
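A slightly more careful variant of the loop above skips anything that
is not a package directory and leaves existing Manifest files alone. It
is a sketch under the same assumptions (packages live in the dev-R
category of the overlay); the function name touch_manifests is invented
here.

```shell
#!/bin/sh
# Hypothetical helper: create an empty Manifest in every package
# directory of an overlay's dev-R category, but only where none exists.
touch_manifests() {
    overlay="$1"
    for pkg in "${overlay}"/dev-R/*; do
        # Skip the unexpanded glob (empty category) and stray files.
        [ -d "${pkg}" ] || continue
        [ -e "${pkg}/Manifest" ] || touch "${pkg}/Manifest"
    done
}
```

It would be called once per overlay, for example
"touch_manifests /usr/local/portage/bioconductor_1".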

If you don't need no stinkin' databases of deoxyribonucleic acid, but
are interested in CRAN, just create a cran overlay as we did for
bioconductor_1 and bioconductor_2, but use http://cran.r-project.org
as the source repository, and 'cran' for the overlay name. Better yet,
find a mirror close to you at http://cran.r-project.org/mirrors.html
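Spelled out, the CRAN variant would follow the same pattern as the
Bioconductor steps. The sketch below only prints the commands, so they
can be checked before being run as root; print_overlay_commands is a
name made up for this example, and the g-cran invocations simply mirror
the ones shown earlier.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (not run) the commands that set up
# an overlay, following the same pattern as the Bioconductor steps.
print_overlay_commands() {
    name="$1"
    url="$2"
    dir="/usr/local/portage/${name}"
    echo "mkdir -p ${dir}/profiles"
    echo "echo \"${name}\" >> ${dir}/profiles/repo_name"
    echo "echo \"dev-R\" >> ${dir}/profiles/categories"
    echo "g-cran ${dir} sync ${url}"
    echo "g-cran ${dir} generate-tree"
}

print_overlay_commands cran http://cran.r-project.org
```

Replacing the URL with a nearby mirror only changes the second
argument.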

Okay, so that was quite a journey to get a simple sqlite database of
gene data. g-common is what will be making all this easier.
Unfortunately I haven't heard much lately from the other two students
I was cooperating with before, so I'm going to invent something of
my own. The plan has remained roughly the same, but time after time
I struggle to explain it, so please bear with me as you read this.

[start explanation of g-common]
Current projects to install non-ebuild packages generate ebuild files
on request, put them in an overlay and tell Portage to install them.
The problem with this approach is that the ebuilds are only generated
when you know what you want to install, i.e. the overlay doesn't get
fully populated upfront. This approach implies you cannot search for
packages in such repositories, you cannot depend on packages in such
repositories, and you can't trivially update packages in such
repositories. I'd like to generate a full package tree at sync time,
no matter if you want to use it or not. Further, this syncing should
work like any other overlay: ideally, support for non-ebuild
repositories is transparent to the users. I'm going to do this via an
abstraction layer called g-common, for which support needs to be
written for all package managers. But once that support is written,
and the non-ebuild repository reading code is adjusted to work with
g-common, there is nothing stopping you from using a non-ebuild
repository like a regular ebuild overlay.
How this works is not exactly trivial to explain, but the important
part is that even though tools like g-cran are really doing the work,
the package manager thinks it's dealing with a regular PMS-worthy tree.
At sync time, the package manager simply calls the g-common method for
syncing a tree, which in turn calls the appropriate repository driver
to fetch the new package listing from the true remote repository. To
integrate this well, some patching is needed. At install time, all the
various src_unpack, src_install, etc. phases result in calls to
g-common, and again those result in calls to the appropriate
repository driver, which then executes the phase, but all this is sort
of PMS-compliant. Call it over-engineering, but it'll feel like magic
and I'm going to prove it.
[end explanation of g-common]
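The call chain described above can be pictured with a toy dispatcher.
Everything in it is hypothetical: g-common is still only a plan at this
point, and the function names (g_common, driver_cran_sync,
driver_cran_phase) and the driver naming scheme are invented purely to
illustrate how a sync or phase call from the package manager could be
forwarded to a repository driver.

```shell
#!/bin/sh
# Toy illustration only; none of these names are real g-common API.

# A stand-in "repository driver" for repositories handled by g-cran.
driver_cran_sync() {
    echo "cran driver: fetching package listing for $1"
}
driver_cran_phase() {
    echo "cran driver: running $2 for $1"
}

# g_common ACTION DRIVER TARGET [PHASE]
# The package manager would call this; it forwards to the driver.
g_common() {
    action="$1"
    driver="$2"
    target="$3"
    phase="${4:-}"
    case "${action}" in
        sync)  "driver_${driver}_sync" "${target}" ;;
        phase) "driver_${driver}_phase" "${target}" "${phase}" ;;
        *)     echo "g_common: unknown action ${action}" >&2; return 1 ;;
    esac
}

g_common sync cran /usr/local/portage/bioconductor_2
g_common phase cran dev-R/zebrafishdb src_install
```

The point of the indirection is that the package manager only ever
talks to g_common; which driver does the actual work is decided per
repository.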

The plan for this week is to /finally/ get some work done on g-common
and perhaps prepare the code for external dependency resolution. On
Saturday, I'm unfortunately leaving for vacation, so you won't see me
doing much. After that vacation, first of all there's GUADEC 2010
which I'm going to attend, but of course I'm also going to continue
developing g-common and finish external dependency resolution.

Now, if you've come to this point in my email, I'd really like to
thank you, because I know how easy it is to simply mark an email as
read and move on. You are why I'm developing this, thanks a lot!

The next weekly report will be in two weeks,
Auke Booij / tulcod.
