Re: [gentoo-soc] [GSoC-status] Tree-wide collision checking and files database - gentoo-soc

From:	Eitan Mosenkis <eitan@××××××××.net>
To:	gentoo-soc@l.g.o
Subject:	Re: [gentoo-soc] [GSoC-status] Tree-wide collision checking and files database
Date:	Fri, 12 Jun 2009 16:28:31
Message-Id:	`36df18050906120928s499a3187nd95270bce3116edb@mail.gmail.com`
In Reply to:	Re: [gentoo-soc] [GSoC-status] Tree-wide collision checking and files database by Jeremy Olexa

I wonder... if you're going to be churning out a bunch of binpkgs, my
project (web-based system image generator) could almost certainly put
them to use. As for USE flags, perhaps try enabling all flags:
you'll never be able to test every combination of flags, or even every
flag individually, and turning them all on would be a quick and dirty
guess at how to end up with the largest set of files, which is what
you want for collision checking. You could also try having two builds,
one with all flags set and one with none set, which would probably get you
a few files in the list that you'd miss otherwise. Still, you'll
probably run into occasional problems with packages where portage will
tell you that you have to change your USE flags to make something install.
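
A minimal sketch of the two-build idea, in Python. It assumes each build's
results are already available as a mapping from package name to the set of
paths it installs; the function names and the sample packages are made up
for illustration and are not part of collagen or portage.

```python
from collections import defaultdict


def merged_file_lists(all_on, all_off):
    """Union the per-package file sets from the two builds.

    all_on / all_off: dict mapping package name -> set of installed paths,
    taken from the "all USE flags set" and "no USE flags set" builds.
    """
    merged = defaultdict(set)
    for build in (all_on, all_off):
        for package, files in build.items():
            merged[package].update(files)
    return dict(merged)


def find_collisions(file_lists):
    """Return {path: [packages]} for every path owned by more than one package."""
    owners = defaultdict(set)
    for package, files in file_lists.items():
        for path in files:
            owners[path].add(package)
    return {path: sorted(pkgs) for path, pkgs in owners.items() if len(pkgs) > 1}


# Hypothetical sample data: app/bar only installs the shared library
# when its flags are off, so a single all-on build would miss the clash.
all_on = {
    "app/foo": {"/usr/bin/foo", "/usr/lib/libshared.so"},
    "app/bar": {"/usr/bin/bar"},
}
all_off = {
    "app/foo": {"/usr/bin/foo"},
    "app/bar": {"/usr/bin/bar", "/usr/lib/libshared.so"},
}
collisions = find_collisions(merged_file_lists(all_on, all_off))
```

With this sample data the collision on /usr/lib/libshared.so only shows up
because the two builds are merged before checking.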

On Fri, Jun 12, 2009 at 11:05 AM, Jeremy Olexa<darkside@g.o> wrote:
> On Fri, Jun 12, 2009 at 8:32 AM, Stanislav
> Ochotnicky<sochotnicky@×××××.com> wrote:
>> Hi everyone,
>>
>> some of you already know that work on the GSoC project "Tree-wide collision
>> checking and provided files database" started a few weeks ago.
>> For the rest, here is a short introduction to the project (collagen)
>> and its goals.
>>
>> Collagen aims to improve the quality of ebuilds in the portage tree. It does
>> this by compiling as many ebuilds as possible. It specifically takes
>> into account the various atoms in the DEPEND variable. For example, if a
>> package's ebuild states that it needs =dev-libs/glib-2*, that package should
>> be compilable with every version of glib-2* in portage (taking keywords into
>> account). Therefore collagen will install one version of glib-2*, then the
>> ebuild in question, collect information, and uninstall the ebuild and that
>> glib version. It repeats this process for every glib-2* in the tree.
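
The install/collect/uninstall cycle described above could be planned roughly
like this. It is only a sketch: `plan_dependency_tests` and the step names are
hypothetical, and nothing here calls portage; it just lists the steps a
tinderbox would have to run for one package.

```python
def plan_dependency_tests(package, dep_versions):
    """List the (action, target) steps needed to test `package`
    against each version in `dep_versions`, one version at a time."""
    steps = []
    for dep in dep_versions:
        steps.extend([
            ("install", dep),        # one specific version of the dependency
            ("install", package),    # the ebuild under test
            ("collect", package),    # record build result and installed files
            ("uninstall", package),
            ("uninstall", dep),      # leave a clean slate for the next version
        ])
    return steps


# "app/example-1.0" is a made-up package name; the glib versions are the
# two revisions mentioned later in this thread.
plan = plan_dependency_tests(
    "app/example-1.0",
    ["dev-libs/glib-2.18.4-r1", "dev-libs/glib-2.18.4-r2"],
)
```

The plan grows by five steps per dependency version, which is why the number
of versions tested matters so much for total build time.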
>
> Testing against every version of the deps as required seems like it is
> diverging from the original "Tree-wide collision checking and provided
> files database" - Would you say that the goal of this project is
> becoming more QA orientated? Something like: "Matchbox: A tinderboxen
> master server to provide QA for ebuilds"
>
> If you were strictly collision checking, then you don't care about
> every version of glib-2*; you only care about the package in question
> and what installed files it provides. However, for the provided files,
> you do care about every version of glib-2*, not for the other package,
> but to list the installed files of glib-2*.
>
> After writing that down, I can see why you want to compile, check,
> uninstall, re-compile, repeat... but I worry about how efficient it is
> and what ways there are to improve that.
>
>>
>> Original idea was to have two sides:
>>  * master server (matchbox)
>>  * slaves compiling packages (tinderboxes)
>>
>> Master server decides what needs to be compiled (automatically or
>> semi-automatically). Tinderbox asks for job, master provides package
>> name (and optionally version). Tinderbox then goes and tries to compile
>> package with different sets of dependencies reporting results to
>> Matchbox.
>>
>> It seems that the whole process could be sped up by hosting binary
>> packages on one central server (Binary host). Obviously various builds
>> of the same package would be created, and therefore unique names could be
>> created by hashing some metadata into part of the filename. At
>> first thought I would use USE flags and DEPEND as the metadata to hash.
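
One way such a name could be derived, as a rough sketch only. The naming
scheme, the `binpkg_name` function, the choice of SHA-1 and the 12-character
digest prefix are all assumptions for illustration, not how portage names
binary packages. Flags and dependency atoms are sorted first so that the
same configuration always hashes to the same name.

```python
import hashlib


def binpkg_name(pf, use_flags, depend):
    """Build a filename for one particular build of package `pf`.

    pf:        package name and version, e.g. "foo-1.0" (hypothetical)
    use_flags: iterable of enabled USE flags
    depend:    iterable of the dependency atoms the build was made against
    """
    # Sort and de-duplicate so ordering differences do not change the hash.
    meta = " ".join(sorted(set(use_flags))) + "\n" + " ".join(sorted(set(depend)))
    digest = hashlib.sha1(meta.encode("utf-8")).hexdigest()[:12]
    return f"{pf}-{digest}.tbz2"


name_a = binpkg_name("foo-1.0", ["gtk", "ssl"], ["=dev-libs/glib-2.18.4-r2"])
name_b = binpkg_name("foo-1.0", ["ssl", "gtk"], ["=dev-libs/glib-2.18.4-r2"])
name_c = binpkg_name("foo-1.0", ["ssl"], ["=dev-libs/glib-2.18.4-r2"])
```

Here name_a and name_b are identical (same flags in a different order),
while name_c differs because one flag is missing.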
>
> This is a cool aspect of the project, I hope you can work with solar
> and zmedico to improve binpkgs. USE flags seem to be the trouble spot
> of binpkgs.
>
>>
>> So far two other projects came to light as possible source of
>> inspiration and/or collaboration:
>>  * catalyst (mainly tinderbox generating part)
>>  * AutotuA (automatic generic job framework)
>>
>> Especially AutotuA seems like good candidate for merging.
>>
>> It doesn't seem possible to compile every project with every version of
>> every dependency, therefore I'd like to ask for your opinion especially
>> about this part. One idea I had was to restrict testing to the highest
>> revision (-rN) of a given version. For example, we have
>> glib-2.18.4-r1 and glib-2.18.4-r2, therefore we will only test against
>> glib-2.18.4-r2 and will assume that r1 would be OK too (or users would
>> upgrade since it's a bugfix release).
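
A small sketch of that filtering step, assuming versions arrive as plain
strings with an optional "-rN" suffix. `highest_revisions` is a made-up
helper, it does not use portage's own version comparison, and it ignores
keywords entirely (stable vs. ~arch), which the reply below argues matters.

```python
import re


def highest_revisions(package_versions):
    """Keep only the highest -rN revision for each name-version string."""
    best = {}
    for pv in package_versions:
        # Split "glib-2.18.4-r2" into base "glib-2.18.4" and revision 2;
        # a string without a -rN suffix counts as revision 0.
        match = re.fullmatch(r"(.+?)(?:-r(\d+))?", pv)
        base = match.group(1)
        rev = int(match.group(2) or 0)
        if base not in best or rev > best[base][0]:
            best[base] = (rev, pv)
    return sorted(pv for _, pv in best.values())


# The first two entries are the example from the message above;
# "glib-2.20.1" is an extra, hypothetical entry without a revision.
selected = highest_revisions(["glib-2.18.4-r1", "glib-2.18.4-r2", "glib-2.20.1"])
# selected == ["glib-2.18.4-r2", "glib-2.20.1"]
```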
>
> IMO, you have two choices. Latest stable or latest ~arch. Stable users
> will not upgrade from glib-2.18.4-r1 to -r2 until -r2 is stable so
> that argument is out.
>
>>
>> Another approach to optimizing use of resources would be to have a
>> priority list of packages that need most testing. I imagine this could
>> be created by analyzing logs from gentoo mirrors, and figuring out which
>> packages are downloaded most frequently.
>
> Mirror log analysis is a fundamentally hard thing to do given the vast
> network of mirrors that we have.
>
>>
>> We would probably need at least one tinderbox per glibc version if I am
>> not mistaken since this cannot be freely up/downgraded.
>
> It's free to upgrade ;) Can't downgrade. Given how large the glibc
> tracker bugs get, I don't think this project should use the latest
> glibc available. Unless you are trying to hunt down bugs, but I think
> you will get buried with compile failures. If the goal of this project
> is to data mine the installed package's information, that is not
> dependent on a glibc version. Please think about this some more before
> going down that road, I want this project to be successful ;)
>
> -Jeremy
>
>
>> This email was meant just as a teaser, more information (data model, UML
>> diagrams) is available on project website (look for Documents):
>> http://soc.gentooexperimental.org/projects/show/collision-database
>>
>> I'd love to hear some suggestions, opinions and criticism. You can
>> use this thread, or even various options on gentooexperimental.org.
>>
>> --
>> Stanislav Ochotnicky
>> Working for Gentoo Linux http://www.gentoo.org
>> Implementing Tree-wide collision checking and provided files database
>> http://soc.gentooexperimental.org/projects/show/collision-database
>> Blog: http://inputvalidation.blogspot.com/search/label/gsoc
>>
>>
>> jabber: sochotnicky@×××××.com
>> icq: 74274152
>> PGP: https://dl.getdropbox.com/u/165616/sochotnicky-key.asc
>>
>
>
