On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
> I've been looking for a perl based search tool that uses some kind of
> indexing to index and render searchable my home library of software
> manual and the like. Quite a few html pages involved, maybe 15-16,000.
>
> Webglimpse is something I've worked with before and know a bit about
> but thought I might like to see what else is available.
>
> Googling lead to a tool called Sphinx that apparently is coupled with
> a data base tool like mysql. It is advertised as the kind of search
> tool I'm after and has a perl front-end also available in portage
> (dev-perl/Sphinx-Search).
>
> The trouble is I haven't been able to figure out the first thing about
> using it. The overview, and Introduction, like a lot of such
> documents fails to give a really basic idea of what the tool does.
>
> The call it a `full text search engine', but never really say what
> that means.
>
> There are 12-15 FEATURES listed, and none appear to describe sensibly
> what they really do.
>
> The faq is a string a questions about using sql.. really.
>
> So far I haven't found a good statement of what the darn thing really
> does or how to aim it at data.
>
> The manual is probably great if you already know a lot about using
> sphinx but very thin for my case.
>
> I've not even been able to get a rough idea of how to aim the darn
> thing at the desired (Local lan) web site.
>
> Or, to show how thin it really is or how dumb I really am, I've been
> unable to tell if it can even do what I want to do.
>
> I've posted on a sphinx list on gmane... but it appears to be only
> moderately active and haven't gotten any replies...
>
> I hoped some one here may be familiar with sphinx and willing to coach
> me a bit or at least let me know if it can even do what I want to do.
>
> Also any other perl based search tools involving indexing and some
> kind of versatile search query capability.. like regular expressions
> I'd be interested to know about.

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically, what Sphinx does is let you search
databases. You specify one or more SQL sources of data and associated
queries, and Sphinx provides an API (or an emulated SQL server) that
makes searching easy. Sphinx is for full-text database searching; it
does not index files or websites directly. (Note that this is not
strictly true; it can search XML files directly, but you still specify
XML attributes instead of database columns, etc., so it is treating the
XML as a data store and not as a generic document.) I recall reading
that Craigslist uses Sphinx to search their database of listings.

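To give a rough idea of what "putting your HTML pages into a database"
could involve, here is a minimal Python sketch. It is only an
illustration under some assumptions of mine: the table name "pages", its
columns, and the helper names are made up, and SQLite is used purely so
the sketch runs on its own. For Sphinx itself you would load a table of
the same shape into a database it can read as an SQL source, such as
MySQL.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its plain text, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def load_pages(conn, root):
    """Store every *.html file under root as one row; return the file count.

    The table layout (id, path, text) is a hypothetical choice for this
    sketch, with id as the unique key a search index would refer to.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, path TEXT UNIQUE, text TEXT)"
    )
    count = 0
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT OR REPLACE INTO pages (path, text) VALUES (?, ?)",
            (str(path), html_to_text(html)),
        )
        count += 1
    conn.commit()
    return count
```

Calling load_pages(sqlite3.connect("library.db"), "/path/to/manuals")
would fill the table; both names there are placeholders.
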
As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it news_catalog --
that will index this data. news_catalog will be associated with an SQL
query that allows Sphinx to access the data it needs to index. Let's
use "SELECT id, author, category, text FROM catalog" as our query. Note
that catalog is a table or view in your database, though this query can
also use complex joins, etc., as long as the database supports it. Via
the Sphinx API, I can say I want to search for "Europe | America" and it
will return a list of news articles containing the terms Europe,
America, or both, as a pipe is the OR operator. It actually returns a
list of ids which correspond to the id I specified in my query; the
first column in the query must always be a unique key. My application is
responsible for fetching the actual data from the original database
using that id and presenting the data in a useful way to the user.
Extended query syntax allows for other boolean operators, searching
specific fields, strict order, exact match, field start/end, etc. The
documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current reference
manual.

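The division of labour in that example (the search returns ids, the
application fetches the rows) can be sketched in Python as follows. This
is not the Sphinx API: toy_search is a hypothetical stand-in of mine
that only understands "term | term" queries and looks at the text column
alone, and SQLite with made-up sample rows stands in for the real
database so the sketch is self-contained.

```python
import sqlite3


def build_catalog():
    """Create an in-memory 'catalog' table with a few invented news posts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog ("
        "id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO catalog VALUES (?, ?, ?, ?)",
        [
            (1, "alice", "world", "Elections held across Europe"),
            (2, "bob", "sports", "Local team wins the cup"),
            (3, "carol", "world", "Trade talks between Europe and America"),
        ],
    )
    return conn


def toy_search(conn, query):
    """Hypothetical stand-in for the search engine: returns matching ids only.

    Splits the query on '|' and treats the pieces as OR-ed whole-word terms.
    """
    terms = [t.strip().lower() for t in query.split("|") if t.strip()]
    ids = []
    for row_id, text in conn.execute("SELECT id, text FROM catalog ORDER BY id"):
        words = text.lower().split()
        if any(term in words for term in terms):
            ids.append(row_id)
    return ids


def fetch_articles(conn, ids):
    """The application's job: turn ids back into full rows for display."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT id, author, category, text FROM catalog "
        f"WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_catalog()
    ids = toy_search(conn, "Europe | America")
    for article in fetch_articles(conn, ids):
        print(article)
```

With a real Sphinx setup, the call to toy_search would be replaced by a
query through whichever client API you use, and fetch_articles would
stay essentially the same.
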
If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use in
your application; it does not provide a user interface. As an
alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly, in addition to PDF,
Word, Excel, PowerPoint, etc., with the help of external converters. It
also includes a user interface. I am not familiar with Webglimpse, but
from a quick glance it looks like it does something similar to ht://Dig.

Regards,

Brandon Vargo