Re: [gentoo-portage-dev] search functionality in emerge - gentoo-portage-dev

From:	tvali <qtvali@×××××.com>
To:	gentoo-portage-dev@l.g.o
Subject:	Re: [gentoo-portage-dev] search functionality in emerge
Date:	Sun, 23 Nov 2008 20:00:31
Message-Id:	`cea53e3c0811231200u5b91aa8dqee4435ccdb6b8663@mail.gmail.com`
In Reply to:	Re: [gentoo-portage-dev] search functionality in emerge by Emma Strubell

1

Yes if it would be a low-level implementation to portage, speeding up it's

2

native code for searching and using indexes, then it would make everything

3

faster, including emerge (because emerge does search first for package

4

relations). I have actually wanted to do it myself several years ago, so

5

reacting here to have my ideas discussed, too.

6

7

Douglas Anderson 16:46 reply is about locks and I think that it would need

8

to rethink portages locking methods - what, when and why it locks. This is

9

probably quite hard task by itself. Anyway, as portage actually lets user

10

make two emerges at the same time, locks might be OK, too.

11

12

I think that the best thing would be bottom-up refactoring - first to make a

13

list of lowest-level functions, which have to do with reading data from

14

portage tree or writing into it; then making indexer class, which will be

15

used by all of those low-level functions.

16

17

To have it OOP, it should be implemented in such way:

18

19

   - Low-level portage tree handler does everything with portage tree, no

20

   function in portage uses it directly.

21

   - Tree handler has several needed and several optional methods - so that

22

   implementing new handler would be easy, but creating things like native

23

   regex search would be possible.

24

   - One could implement a new tree handler with SQLite or other interface

25

   instead of filesystem and do other tricks through this interface; for

26

   example, boost it.

27

28

So, nice way to go would be:

29

30

   1. Implementing portage tree handler and it's proxy, which uses current

31

   portage tree in non-indexed way and simply gives methods for the same kind

32

   of access, as currently implemented one.

33

   2. Refactoring portage to rely only on portage tree handler and use

34

   direct portage tree nowhere. To test if it is so, portage tree should be

35

   moved to another directory, about which only this handler knows, and check

36

   if portage works well. Indexing all places, where portage uses it's tree

37

   handler (by std. comment, for example) and making clear, which methods would

38

   contain all boostable code of it.

39

   3. Implementing those methods in proxy, which could simulate fast regex

40

   search and other stuff using simplest possible interface of portage tree

41

   handler (smth. like four methods add, remove, get, list). Proxy should be

42

   able to use handler's methods if they are implemented.

43

   4. Refactoring portage to use advanced methods in proxy.

44

   5. Now, having taken all code together into one place and looking this

45

   nice and readable code, real optimizations could be discussed here, for

46

   example. Ideally, i think, portage would have such tree handlers:

47

      - Filesystem handler - fast searches over current portage tree

48

      structure.

49

      - SQL handler - rewrite of tree functions into SQL queries.

50

      - Network-based handler - maybe sometimes it would nice to have

51

      portage tree only in one machine of cluster, for example if I

52

want to have

53

      100 really small computers with fast connection with mother-computer and

54

      portage tree is too big to be wholly copied to all of them.

55

      - Memory-buffered handler with daemon, which is actually proxy to some

56

      other handler - daemon, which reads whole tree (from filesystem

57

or SQL) into

58

      memory on boot or first use, creates really fast index (because

59

now it does

60

      matter to have better indexing) and optionally deletes some [less needed]

61

      parts of it's index from memory when it's becoming full and behaves as

62

      really simple proxy if it stays full. This should be implemented after

63

      critical parts of filesystem or SQL handler.

64

65

2008/11/23 Emma Strubell <emma.strubell@×××××.com>

66

67

> Wow, that's extremely helpful!! I happen to particularly enjoy tries, so

68

> the suffix trie sounds like a great idea. The trie class example is really

69

> helpful too, because this will be my first time programming in Python, and

70

> it's a bit easier to figure out what's going on syntax-wise in that simple

71

> trie class than in the middle of the portage source code!

72

>

73

> Seriously, thanks again :]

74

>

75

>

76

> On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston <lucianposton@×××××.com>wrote:

77

>

78

>> > Thanks for the replies! I know there are a couple programs out there

79

>> that

80

>> > basically already do what I'm looking to do... Unfortunately I wasn't

81

>> aware

82

>> > of these pre-existing utilities until after I submitted my project

83

>> proposal

84

>> > to my professor. So, I'm looking to implement a better search myself.

85

>> > Preferably by editing the existing portage code, not writing a separate

86

>> > program. So if anyone can offer any help regarding the actual

87

>> implementation

88

>> > of search in portage, I would greatly appreciate it!

89

>>

90

>> Most of the search implementation is in

91

>> /usr/lib/portage/pym/_emerge/__init__.py in class search.  The class's

92

>> execute() method simply iterates over all packages (and descriptions

93

>> and package sets) and matches against the searchkey.  You might need

94

>> to look into pym/portage/dbapi/porttree.py for portdbapi as well.

95

>>

96

>> If you intend to index and support fast regex lookup, then you need to

97

>> do some fancy indexing, which I'm not terribly familiar with.  You

98

>> could follow in the steps of eix[1] or other indexed search utilities

99

>> and design some sort of index layout, which is easier than the

100

>> following thought.  You might consider implementing a suffix trie or

101

>> similar that has sublinear regexp lookup and marshalling the structure

102

>> for the index.  I couldn't find a python implementation for something

103

>> like this, but here is a general trie class[2] that you might start

104

>> with if you go that route.  There is a perl module[3],

105

>> Tie::Hash::Regex, that does that, but properly implementing that in

106

>> python would be a chore. :)

107

>>

108

>> That project sounds interesting and fun. Good luck!

109

>>

110

>> Lucian Poston

111

>>

112

>> [1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout

113

>> [2]

114

>> http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx

115

>> [3]

116

>> http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm<http://search.cpan.org/%7Edavecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm>

117

>>

118

>>

119

>

120

121

122

--

123

tvali

124

125

Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad.

126

Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi

127

täica pea nagu prügikast...

1	Yes if it would be a low-level implementation to portage, speeding up it's
2	native code for searching and using indexes, then it would make everything
3	faster, including emerge (because emerge does search first for package
4	relations). I have actually wanted to do it myself several years ago, so
5	reacting here to have my ideas discussed, too.
6
7	Douglas Anderson 16:46 reply is about locks and I think that it would need
8	to rethink portages locking methods - what, when and why it locks. This is
9	probably quite hard task by itself. Anyway, as portage actually lets user
10	make two emerges at the same time, locks might be OK, too.
11
12	I think that the best thing would be bottom-up refactoring - first to make a
13	list of lowest-level functions, which have to do with reading data from
14	portage tree or writing into it; then making indexer class, which will be
15	used by all of those low-level functions.
16
17	To have it OOP, it should be implemented in such way:
18
19	- Low-level portage tree handler does everything with portage tree, no
20	function in portage uses it directly.
21	- Tree handler has several needed and several optional methods - so that
22	implementing new handler would be easy, but creating things like native
23	regex search would be possible.
24	- One could implement a new tree handler with SQLite or other interface
25	instead of filesystem and do other tricks through this interface; for
26	example, boost it.
27
28	So, nice way to go would be:
29
30	1. Implementing portage tree handler and it's proxy, which uses current
31	portage tree in non-indexed way and simply gives methods for the same kind
32	of access, as currently implemented one.
33	2. Refactoring portage to rely only on portage tree handler and use
34	direct portage tree nowhere. To test if it is so, portage tree should be
35	moved to another directory, about which only this handler knows, and check
36	if portage works well. Indexing all places, where portage uses it's tree
37	handler (by std. comment, for example) and making clear, which methods would
38	contain all boostable code of it.
39	3. Implementing those methods in proxy, which could simulate fast regex
40	search and other stuff using simplest possible interface of portage tree
41	handler (smth. like four methods add, remove, get, list). Proxy should be
42	able to use handler's methods if they are implemented.
43	4. Refactoring portage to use advanced methods in proxy.
44	5. Now, having taken all code together into one place and looking this
45	nice and readable code, real optimizations could be discussed here, for
46	example. Ideally, i think, portage would have such tree handlers:
47	- Filesystem handler - fast searches over current portage tree
48	structure.
49	- SQL handler - rewrite of tree functions into SQL queries.
50	- Network-based handler - maybe sometimes it would nice to have
51	portage tree only in one machine of cluster, for example if I
52	want to have
53	100 really small computers with fast connection with mother-computer and
54	portage tree is too big to be wholly copied to all of them.
55	- Memory-buffered handler with daemon, which is actually proxy to some
56	other handler - daemon, which reads whole tree (from filesystem
57	or SQL) into
58	memory on boot or first use, creates really fast index (because
59	now it does
60	matter to have better indexing) and optionally deletes some [less needed]
61	parts of it's index from memory when it's becoming full and behaves as
62	really simple proxy if it stays full. This should be implemented after
63	critical parts of filesystem or SQL handler.
64
65	2008/11/23 Emma Strubell <emma.strubell@×××××.com>
66
67	> Wow, that's extremely helpful!! I happen to particularly enjoy tries, so
68	> the suffix trie sounds like a great idea. The trie class example is really
69	> helpful too, because this will be my first time programming in Python, and
70	> it's a bit easier to figure out what's going on syntax-wise in that simple
71	> trie class than in the middle of the portage source code!
72	>
73	> Seriously, thanks again :]
74	>
75	>
76	> On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston <lucianposton@×××××.com>wrote:
77	>
78	>> > Thanks for the replies! I know there are a couple programs out there
79	>> that
80	>> > basically already do what I'm looking to do... Unfortunately I wasn't
81	>> aware
82	>> > of these pre-existing utilities until after I submitted my project
83	>> proposal
84	>> > to my professor. So, I'm looking to implement a better search myself.
85	>> > Preferably by editing the existing portage code, not writing a separate
86	>> > program. So if anyone can offer any help regarding the actual
87	>> implementation
88	>> > of search in portage, I would greatly appreciate it!
89	>>
90	>> Most of the search implementation is in
91	>> /usr/lib/portage/pym/_emerge/__init__.py in class search. The class's
92	>> execute() method simply iterates over all packages (and descriptions
93	>> and package sets) and matches against the searchkey. You might need
94	>> to look into pym/portage/dbapi/porttree.py for portdbapi as well.
95	>>
96	>> If you intend to index and support fast regex lookup, then you need to
97	>> do some fancy indexing, which I'm not terribly familiar with. You
98	>> could follow in the steps of eix[1] or other indexed search utilities
99	>> and design some sort of index layout, which is easier than the
100	>> following thought. You might consider implementing a suffix trie or
101	>> similar that has sublinear regexp lookup and marshalling the structure
102	>> for the index. I couldn't find a python implementation for something
103	>> like this, but here is a general trie class[2] that you might start
104	>> with if you go that route. There is a perl module[3],
105	>> Tie::Hash::Regex, that does that, but properly implementing that in
106	>> python would be a chore. :)
107	>>
108	>> That project sounds interesting and fun. Good luck!
109	>>
110	>> Lucian Poston
111	>>
112	>> [1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout
113	>> [2]
114	>> http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
115	>> [3]
116	>> http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm<http://search.cpan.org/%7Edavecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm>
117	>>
118	>>
119	>
120
121
122	--
123	tvali
124
125	Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad.
126	Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi
127	täica pea nagu prügikast...

Gentoo Archives: gentoo-portage-dev