[gentoo-user] Re: [OT sphinx] Any users of sphinx here - gentoo-user

From:	Harry Putnam <reader@×××××××.com>
To:	gentoo-user@l.g.o
Subject:	[gentoo-user] Re: [OT sphinx] Any users of sphinx here
Date:	Sat, 12 Jun 2010 23:06:34
Message-Id:	`87sk4rrj2t.fsf@newsguy.com`
In Reply to:	Re: [gentoo-user] Re: [OT sphinx] Any users of sphinx here by Brandon Vargo

Brandon Vargo <brandon.vargo@×××××.com> writes:

> do. When I go to find code that I have written, I do not remember
> variable names, lines of code, etc that I can match with a regular
> expression. Thus, that kind of search is pointless for me. I remember
> what the code does, the project for which I wrote the code, and
> approximately where the code is located within the project. I remember
> function calls for libraries that I probably used. If I cannot find what
> I am looking for, I use grep on the name of a function call I remember,
> or I have a ctags file containing all the information I need about
> function definitions.

Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub function names because who knows what I might call stuff in any
particular script.

For example... I once was shown how to compile an element of @ARGV
as a regular expression in perl, in one step:

   my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I
remember at a moment's notice how to write it.
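
A minimal sketch of what that one-liner does, for anyone who has not met
the idiom. The sub names `compile_one_step` and `compile_two_steps` are
made up for illustration; inside a sub, a bare `shift` takes from @_
instead of @ARGV, which makes the idiom easy to try out.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post).
sub compile_one_step {
    # [shift] builds an anonymous array holding the shifted value;
    # @{ ... } dereferences it, and the regex interpolates the result.
    return qr/@{[shift]}/;
}

# The same thing written in two steps, for comparison.
sub compile_two_steps {
    my $pattern = shift;
    return qr/$pattern/;
}

my $one = compile_one_step('ha+sh');
my $two = compile_two_steps('ha+sh');
print "both match\n" if 'rehaash' =~ $one && 'rehaash' =~ $two;
```

At file level, outside any sub, the same expression shifts from @ARGV, as
in the line quoted above.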

I used `grep -r' or `egrep -r' as you've mentioned; now I use my own
perl script (recently written [since posting original query]) that
uses regex and File::Find, where the user feeds the regex and the
approximate location to begin the search on the cmd line.
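
The script itself is not shown here, so the following is only a rough
sketch of that kind of search, assuming plain line-by-line matching. The
name `search_tree` and the "file:line:text" output format are
assumptions, not taken from the actual script.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $dir and return "file:line_number:text"
# for every line that matches $re.
sub search_tree {
    my ($re, $dir) = @_;
    my @hits;
    find(
        {
            no_chdir => 1,    # with no_chdir, $_ holds the full path
            wanted   => sub {
                my $path = $_;
                return unless -f $path;
                open my $fh, '<', $path or return;    # skip unreadable files
                while (my $line = <$fh>) {
                    chomp $line;
                    push @hits, join(':', $path, $., $line) if $line =~ $re;
                }
                close $fh;
            },
        },
        $dir
    );
    return @hits;
}

# Command-line use: the regex first, then the directory to start from.
if (@ARGV == 2) {
    my $re = qr/@{[shift]}/;
    print "$_\n" for search_tree($re, shift);
}
```

Every file is opened and read on every search, which is why a tool like
this slows down as the collection grows.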

In my case that would be an nfs share /projects/reader/perl which is
kept in my ENV as $perlp

So:

  script.pl 'qr/.*?@' $perlp

Will find a number of examples of using that particular technique.

What prompted my query here was looking for a way to search several
thousand html pages that are a collection of Perl books on CD.

These are 2 of the O'Reilly Perl CDbooks.  (I spent $150 for the first
one, and I think the second was a little cheaper; it was yrs ago.) The
books on CD have built-in search tools but those only work on a
Windows OS and aren't up to much anyway.

I've since downloaded the data from the CDs onto an opensolaris zfs
server and access them through NFS.

I was attempting to use `webglimpse'
(http://webglimpse.net/download.php) for the task, hence the interest
in indexing.  But I suspect a search for a particular technique I read
about, but have forgotten how to code, would be best done using
regular expressions.  This would be long after I've forgotten
which section or even which book I read about it in.

The tool I've written can be made to strip html if necessary and can
be made to include (by regex) only certain kinds of filenames, but
uses no index so consequently is pretty slow... but still very useful
and is fully perl regex capable.
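
As an illustration only, those two options could look something like the
sketch below. The names `strip_tags` and `wanted_files` are hypothetical,
and the tag stripping is deliberately crude; the actual tool may do this
quite differently.

```perl
use strict;
use warnings;

# Crude tag stripper: removes anything that looks like <...> on one line.
# It does not handle tags split across lines, comments, or entities.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;
    return $line;
}

# Keep only the filenames that match a user-supplied regex.
sub wanted_files {
    my ($name_re, @files) = @_;
    return grep { $_ =~ $name_re } @files;
}

print strip_tags('<a href="ch04.htm">4.7.5. <b>Prototypes</b></a>'), "\n";
print join(' ', wanted_files(qr/\.html?$/, 'idx_p.htm', 'cover.gif', 'toc.html')), "\n";
```

Anything that needs to cope with real-world html would be better served
by a proper parser module than by a one-line substitution.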

It returns up to 4 lines of context, 2 above the line with the hit,
and 1 below (where possible), along with the line numbers and the
absolute filename where the hit was found.
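
One way to collect that kind of context, sketched under the assumption
that a whole file fits in memory as a list of lines. The name
`hits_with_context` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: for each line matching $re, return the block of
# lines from $before above the hit to $after below it, numbered from 1
# and clipped at the start and end of the file.
sub hits_with_context {
    my ($re, $before, $after, @lines) = @_;
    my @blocks;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ $re;
        my $first = $i - $before < 0       ? 0       : $i - $before;
        my $last  = $i + $after  > $#lines ? $#lines : $i + $after;
        push @blocks, [ map { sprintf '%d %s', $_ + 1, $lines[$_] } $first .. $last ];
    }
    return @blocks;
}

my @lines = ('dereferencing', 'modulus operator', 'prototype symbol (hash)', '%= operator');
print join("\n", @$_), "\n---\n" for hits_with_context(qr/hash/, 2, 1, @lines);
```

Hits that sit close together produce overlapping blocks in this sketch;
merging them is left out to keep it short.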

Here is an example search being timed:
-------        ---------       ---=---       ---------      --------
(I purposely picked something that would be found many times)

 time ./pgrep3  /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

 (So above we are searching a collection from the O'Reilly CDbooks for
 the term `hash'..)

 (Just one example of the thousands of lines returned)
  [...]

   /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
  135         dereferencing with : [104]4.8.2. Dereferencing
  136         modulus operator : [105]4.5.3. Arithmetic Operators
  137         prototype symbol (hash) : [106]4.7.5. Prototypes
  138         %= (assignment) operator : [107]4.5.6. Assignment Operators
---

 [...]

 Total files searched: 522
 Total lines searched: 431689
 real    1m48.344s
 user    1m25.234s
 sys     0m14.336s

-------        ---------       ---=---       ---------      --------
Almost 2 minutes to search 431689 lines

So it is slow, maybe even very slow by comparison to tools using an
indexed search.

I don't really mind the sloth, but of course it would not scale
very much above the scope of use I'm putting it to. I do like the
precision search capability and plenty of context. All of the above is
also possible with grep, egrep... and friends too, of course, but only
with quite a lot more cmdline manipulation and piping.

I'm currently working on using something like this basic search script
to return URLs linking to the page and lines found, and working the
whole thing into something that can be carried out with a web browser.

Something pretty similar to webglimpse, I guess, but without the
benefit of indexing.

Also webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like and boolean
query capability.

