Gentoo Archives: gentoo-user

From: Brandon Vargo <brandon.vargo@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] [OT sphinx] Any users of sphinx here
Date: Sun, 06 Jun 2010 04:13:07
Message-Id: 1275797518.900.60.camel@venus
In Reply to: [gentoo-user] [OT sphinx] Any users of sphinx here by Harry Putnam
1 On Fri, 2010-06-04 at 17:52 -0500, Harry Putnam wrote:
2 > I've been looking for a perl based search tool that uses some kind of
3 > indexing to index and render searchable my home library of software
4 > manual and the like. Quite a few html pages involved, maybe 15-16,000.
5 >
6 > Webglimpse is something I've worked with before and know a bit about
7 > but thought I might like to see what else is available.
8 >
9 > Googling led to a tool called Sphinx that apparently is coupled with
10 > a data base tool like mysql. It is advertised as the kind of search
11 > tool I'm after and has a perl front-end also available in portage
12 > (dev-perl/Sphinx-Search).
13 >
14 > The trouble is I haven't been able to figure out the first thing about
15 > using it. The overview, and Introduction, like a lot of such
16 > documents fails to give a really basic idea of what the tool does.
17 >
18 > They call it a `full text search engine', but never really say what
19 > that means.
20 >
21 > There are 12-15 FEATURES listed, and none appear to describe sensibly
22 > what they really do.
23 >
24 > The faq is a string of questions about using sql.. really.
25 >
26 > So far I haven't found a good statement of what the darn thing really
27 > does or how to aim it at data.
28 >
29 > The manual is probably great if you already know a lot about using
30 > sphinx but very thin for my case.
31 >
32 > I've not even been able to get a rough idea of how to aim the darn
33 > thing at the desired (Local lan) web site.
34 >
35 > Or, to show how thin it really is or how dumb I really am, I've been
36 > unable to tell if it can even do what I want to do.
37 >
38 > I've posted on a sphinx list on gmane... but it appears to be only
39 > moderately active and haven't gotten any replies...
40 >
41 > I hoped some one here may be familiar with sphinx and willing to coach
42 > me a bit or at least let me know if it can even do what I want to do.
43 >
44 > Also any other perl based search tools involving indexing and some
45 > kind of versatile search query capability.. like regular expressions
46 > I'd be interested to know about.
47
48 If you can put your HTML pages into a database, Sphinx might be able to
49 help you with your issue. Basically what Sphinx does is let you search
50 databases. You specify one or more SQL sources of data and associated
51 queries, and Sphinx provides an API (or an emulated SQL server) that
52 makes searching easy. Sphinx is for full text database searching; it
53 does not index files or websites directly. (Note that this is not
54 actually true; it can search XML files directly, but you still specify
55 XML attributes instead of database columns, etc, so it is treating the
56 XML as a data store and not as a generic document.) I recall reading
57 that Craigslist uses Sphinx to search their database of listings.
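
To give a rough idea of what "putting your HTML pages into a database" could look like, here is a minimal Python sketch. It is only an illustration under several assumptions: the `pages` table, its columns, and the helper names are made up for this example, and SQLite is used purely to keep the sketch self-contained (a real setup would use a database Sphinx can read from, such as MySQL). The text extraction is deliberately naive.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, dropping the tags.

    Naive on purpose: text inside <script> or <style> is kept too.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible-ish text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def load_pages(conn, root):
    """Insert every *.html file under root into a hypothetical pages table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, path TEXT UNIQUE, text TEXT)"
    )
    count = 0
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT OR REPLACE INTO pages (path, text) VALUES (?, ?)",
            (str(path), html_to_text(html)),
        )
        count += 1
    conn.commit()
    return count
```

With a table like this in place, the indexing query would be something along the lines of "SELECT id, path, text FROM pages", with the unique id first.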
58
59 As an example of how it works, suppose I am making a news website and
60 have a bunch of news posts, each of which has an author, category, and
61 text. With Sphinx, I can set up a source -- let's call it news_catalog --
62 that will index this data. news_catalog will be associated with an SQL
63 query that will allow Sphinx to access the data it needs to index. Let's
64 use "SELECT id, author, category, text FROM catalog" as our query. Note
65 that catalog is a table or view in your database, though this query can
66 also use complex joins, etc, as long as the database supports it. Via
67 the Sphinx API, I can say I want to search for "Europe | America" and it
68 will return a list of news articles containing the terms Europe,
69 America, or both, as a pipe is the OR operator. It actually returns a
70 list of ids which correspond to the id I specified in my query; a unique
71 key is always the first column in the query. My application is
72 responsible for fetching the actual data from the original database
73 using that id and presenting the data in a useful way to the user.
74 Extended query syntax allows for other boolean operators, searching
75 specific fields, strict order, exact match, field start/end, etc. The
76 documentation has lots of examples; look at
77 http://www.sphinxsearch.com/docs/current.html for the current reference
78 manual.
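
The two-step flow described above (search returns ids, the application then fetches the rows) can be sketched in Python as follows. To be clear, this is not the Sphinx API: a tiny in-memory inverted index stands in for Sphinx so the example runs on its own, and the function names, sample rows, and the use of SQLite are all assumptions made for illustration. It only handles single-word terms joined by a pipe.

```python
import re
import sqlite3


def build_index(conn):
    """Map each lowercase word to the set of ids of the rows containing it."""
    index = {}
    # Same shape as the query in the text: the unique id comes first.
    for row in conn.execute("SELECT id, author, category, text FROM catalog"):
        doc_id, fields = row[0], row[1:]
        for word in re.findall(r"\w+", " ".join(fields).lower()):
            index.setdefault(word, set()).add(doc_id)
    return index


def search_any(index, query):
    """Return sorted ids matching any single-word term of an 'a | b' query."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


def fetch_articles(conn, ids):
    """Second step: the application loads the actual rows by id."""
    if not ids:
        return []
    marks = ",".join("?" for _ in ids)
    sql = (
        "SELECT id, author, category, text FROM catalog "
        f"WHERE id IN ({marks}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


def demo():
    """Build a toy catalog, search it, and fetch the matching rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog ("
        "id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO catalog VALUES (?, ?, ?, ?)",
        [
            (1, "alice", "world", "Elections in Europe"),
            (2, "bob", "sports", "Local team wins"),
            (3, "carol", "world", "Trade between America and Europe"),
        ],
    )
    ids = search_any(build_index(conn), "Europe | America")
    return ids, fetch_articles(conn, ids)
```

The point of the split is that the search side only ever hands back ids; everything about presenting the article stays in the application.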
79
80 If you have a bunch of HTML files on a disk or website that you want to
81 index and search, I do not think Sphinx is the software you want. Yes,
82 you could load your data into a database and then use Sphinx, but that
83 does not seem like the best solution. Sphinx provides the API for use in
84 your application; it does not provide a user interface. As an
85 alternative, I recommend you look at something like ht://Dig
86 (htdig.org), which will search HTML pages directly in addition to PDF,
87 Word, Excel, PowerPoint, etc. with the help of external converters. It
88 also includes a user interface. After glancing at webglimpse, with which
89 I am not familiar, it looks like it does something similar to ht://Dig.
90
91 Regards,
92
93 Brandon Vargo
