Gentoo Archives: gentoo-user

From: Harry Putnam <reader@×××××××.com>
To: gentoo-user@l.g.o
Subject: [gentoo-user] Re: [OT sphinx] Any users of sphinx here
Date: Sat, 12 Jun 2010 23:06:34
Message-Id: 87sk4rrj2t.fsf@newsguy.com
In Reply to: Re: [gentoo-user] Re: [OT sphinx] Any users of sphinx here by Brandon Vargo
Brandon Vargo <brandon.vargo@×××××.com> writes:

> do. When I go to find code that I have written, I do not remember
> variable names, lines of code, etc that I can match with a regular
> expression. Thus, that kind of search is pointless for me. I remember
> what the code does, the project for which I wrote the code, and
> approximately where the code is located within the project. I remember
> function calls for libraries that I probably used. If I cannot find what
> I am looking for, I use grep on the name of a function call I remember,
> or I have a ctags file containing all the information I need about
> function definitions.

Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub function names, because who knows what I might call stuff in any
particular script.

For example... I once was shown how to compile an element of @ARGV as
a regular expression in perl, in one step:

my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I
remember at a moment's notice how to write it.
27
28 I used `grep -r' or 'egrep -r' as you've mentioned, now I use a
29 my own perl script (recently written [since posting original query])
30 that uses regex and File::Find, where user feeds the regex and the
31 approximate location to begin the search, on the cmd line.
32
33 In my case that would be an nfs share /projects/reader/perl which is
34 kept in my ENV as $perlp
35
36 So:
37 script.pl 'qr/.*?@' $perlp
38
39 Will find a number of examples of using that particular technique.
40
41 What prompted my query here, was looking for a way to search several
42 thousand html pages that are a collection of Perl books on CD.
43
44 These are 2 of the Oreilly Perl CDbooks. (I spent $150 for the first
45 one, and I think the second was a little cheaper, it was yrs ago) The
46 Books on CD have built in search tools but those only work on a
47 windows OS and aren't up to much anyway.
48
49 I've since downloaded the data from the CDS onto an opensolaris zfs
50 server and access them through NFS.
51
52 I was attempting to use `webglimpse'
53 (http://webglimpse.net/download.php) for the task, hence the interest
54 in indexing. But I suspect a search for a particular technique I read
55 about, but have forgotten how to code, would be best searched for
56 using regular expressions. This would be long after I've forgotten
57 which section or even which book I read about it in.
58
59 The tool I've written can be made to strip html if necessary and can
60 be made to include (by regex) only certain kinds of filenames, but
61 uses no index so consequently is pretty slow... but still very useful
62 and is fully perl regex capable.
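
The two options mentioned here could look roughly like this. Both helper names are hypothetical, and the tag stripping is deliberately crude: it works one line at a time and does not cope with tags that span lines, comments or entities, for which a real parser such as HTML::Parser would be the safer choice.

```perl
use strict;
use warnings;

# Both helpers are hypothetical; the names are not from the post.

# Return 1 if the filename matches the user-supplied pattern, else 0.
sub name_wanted {
    my ($filename, $name_re) = @_;
    return $filename =~ $name_re ? 1 : 0;
}

# Crude tag removal for a single line of text.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

print strip_tags('<li><a href="ch04.htm">prototype symbol (hash)</a></li>'), "\n";
```

Filtering on the filename before opening a file keeps images and other non-text files out of the search entirely.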

It returns up to 4 lines per hit: the line with the hit, 2 lines of
context above it and 1 below (where possible), along with the page
number and the absolute filename where the hit was found.
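
A sketch of that context window, assuming the lines of one file are already in memory and that each output line is prefixed with its 1-based line number, as the sample output further down appears to show. The name hits_with_context is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: for each line matching $re, return a block holding
# up to 2 lines before it and 1 line after it, each prefixed by its
# 1-based line number.
sub hits_with_context {
    my ($re, @lines) = @_;
    my @blocks;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ $re;
        my $first = $i >= 2      ? $i - 2 : 0;          # clamp at start of file
        my $last  = $i < $#lines ? $i + 1 : $#lines;    # clamp at end of file
        my @block;
        for my $n ($first .. $last) {
            push @block, ($n + 1) . " $lines[$n]";
        }
        push @blocks, join("\n", @block);
    }
    return @blocks;
}

my @lines = ('alpha', 'beta', 'a hash here', 'delta', 'epsilon');
print "$_\n---\n" for hits_with_context(qr/hash/, @lines);
```

The "where possible" part is the clamping: a hit on the first or last line of a file simply gets a shorter block.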

Here is an example search being timed:
------- --------- ---=--- --------- --------
(I purposely picked something that would be found many times)

time ./pgrep3 /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

(So above we are searching a collection from the O'Reilly CDbooks for
the term `hash'..)

(Just one example of the thousands of lines returned)
[...]

/var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
135 dereferencing with : [104]4.8.2. Dereferencing
136 modulus operator : [105]4.5.3. Arithmetic Operators
137 prototype symbol (hash) : [106]4.7.5. Prototypes
138 %= (assignment) operator : [107]4.5.6. Assignment Operators
---

[...]

Total files searched: 522
Total lines searched: 431689
real 1m48.344s
user 1m25.234s
sys 0m14.336s

------- --------- ---=--- --------- --------
Almost 2 minutes to search 431689 lines

So it is slow, maybe even very slow by comparison to tools using an
indexed search.

I don't really mind the sloth, but of course it would not scale
much beyond the scope of use I'm putting it to. I do like the
precision search capability and plenty of context. All of the above is
also possible with grep, egrep... and friends too, of course, but only
with quite a lot more cmdline manipulation and piping.

I'm currently working on using something like this basic search script
to return URLs linking to the page and lines found, and working the
whole thing into something that can be carried out with a web browser.
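
One piece of that is mapping a hit's absolute filename to a URL. A rough sketch follows, assuming /var/www/localhost/htdocs is the web server's document root and http://localhost is the base URL; both are guesses based on the path in the sample search, and path_to_url is a hypothetical name. Plain html pages have no per-line anchors, so the line number would still have to be shown alongside the link.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an absolute file path under the document root
# into a URL. Returns nothing if the path is outside the root.
sub path_to_url {
    my ($path, $docroot, $base_url) = @_;
    return unless index($path, "$docroot/") == 0;
    my $rel = substr($path, length($docroot) + 1);
    my @segments = split m{/}, $rel;
    for my $segment (@segments) {
        # Percent-encode everything outside the URL "unreserved" set
        # (assumes ASCII filenames).
        $segment =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    }
    return "$base_url/" . join('/', @segments);
}

print path_to_url(
    '/var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm',
    '/var/www/localhost/htdocs',    # assumed document root
    'http://localhost',             # assumed base URL
), "\n";
```

Encoding each path segment separately keeps the slashes intact while still escaping characters such as the `+' in cdbk+.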

Something pretty similar to webglimpse, I guess, but without the
benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like and boolean
query capability.