From: "Kris Lizuck" <lizuck@comnet.ca>
Date: Mon, 17 Jun 1996 20:31:48 -0400
Newsgroups: comp.infosystems.www.authoring.cgi
> Given a search string, internet search engines like Yahoo or Lycos
> etc., index the search results from highest to lowest points. How is
> this points rating calculated? Is only the HTML content-lines of files
> searched for occurrences of the search string, or are all lines
> (comments and HTML text) searched?
Hi Amit:
It really depends on the search engine.
I've just finished evaluating a whole slew of free/cheap-ware search
engines and the variety of indexing/searching methods has been almost
astounding. Generally, though, you can say the following of any search
engine:
- The index for a document collection is produced separately. The size
of the file(s) this creates depends on how thorough the indexing is and
whether or not the engine needs to support complex (read: concept or
'fuzzy') searches. In some cases this file is binary; sometimes it's
plain text.
- The "rank" of a hit increases depending (foremost) on the number of
search terms the engine can match within a document. Additionally (and
this gets engine-dependent), the following may be used to weight the
strength of a match:
  - The relative placement of search terms
    (e.g. FOO followed by BAR versus BAR followed by FOO, versus FOO
    within X words of BAR, etc.)
  - Spelling errors and case-sensitive matching
  - HTML "container" the term was found in
    (e.g. <H1>FOO</H1> ranks MUCH higher than <P>foo</P>, which ranks
    higher than <!--foo-->)
  - Distance from the top of a file
    (e.g. is "foo" mentioned in the first 15 lines, or does it show up 3
    lines from the bottom of a file, where it is most likely a "see-also"
    or cross-reference?)
Each engine uses its own algorithm to weight its results - there's no hard
and fast rule or 'best method'.
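
To make the weighting factors above concrete, here is a rough sketch in
Python of how such a score might be put together. It is not the algorithm
of any engine mentioned in this post. Every weight, the 15-line cutoff
multiplier, and the names score() and extract_words() are assumptions
made up for illustration, and it only recognizes <H1>, <P> and comment
containers.

```python
import re

# All weights below are invented for illustration; no real engine is implied.
CONTAINER_WEIGHT = {"h1": 5.0, "p": 1.0, "comment": 0.2}
DISTINCT_TERM_WEIGHT = 10.0  # matching more of the query terms matters most
ADJACENCY_BONUS = 3.0        # query terms side by side, in query order
EARLY_BONUS = 2.0            # multiplier for hits near the top of the file
EARLY_CUTOFF = 15            # "first 15 lines", as in the post

# One alternative per container; the group name identifies the container.
SEGMENT_RE = re.compile(
    r"<!--(?P<comment>.*?)-->|<h1>(?P<h1>.*?)</h1>|<p>(?P<p>.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)


def extract_words(html):
    """Return (word, container, line) tuples in document order.

    The line number is that of the start of the enclosing container.
    """
    words = []
    for m in SEGMENT_RE.finditer(html):
        container = m.lastgroup
        line = html.count("\n", 0, m.start()) + 1
        for w in re.findall(r"\w+", m.group(container).lower()):
            words.append((w, container, line))
    return words


def score(html, query):
    """Score one document against a whitespace-separated query."""
    terms = query.lower().split()
    words = extract_words(html)
    total = 0.0
    matched = set()
    for word, container, line in words:
        if word in terms:
            matched.add(word)
            weight = CONTAINER_WEIGHT[container]
            if line <= EARLY_CUTOFF:
                weight *= EARLY_BONUS
            total += weight
    total += DISTINCT_TERM_WEIGHT * len(matched)
    # Bonus each time the whole query appears as consecutive words.
    # This simplification ignores container boundaries.
    seq = [w for w, _, _ in words]
    if len(terms) > 1:
        for i in range(len(seq) - len(terms) + 1):
            if seq[i:i + len(terms)] == terms:
                total += ADJACENCY_BONUS
    return total


docs = {
    "a.html": "<h1>Foo bar</h1>\n<p>Something else.</p>",
    "b.html": "<p>Intro text.</p>\n<p>bar and foo here</p>",
    "c.html": "<!-- foo -->\n<p>Nothing relevant.</p>",
}
ranked = sorted(docs, key=lambda name: score(docs[name], "foo bar"),
                reverse=True)
print(ranked)
```

With these made-up weights, the document with both terms adjacent in a
heading sorts first, the one with both terms in body text second, and
the one with a single term in a comment last. Changing the weights
changes the ordering, which is why different engines rank the same
documents differently.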
If you're looking at a way of implementing a search engine on a document
collection, you might try one of the following (all are free, and are
ranked in order of my personal preference, though your mileage may vary):
1) Excite for Web Servers (EWS) (at www.excite.com)
-- VERY easy to set up, and very powerful
-- written almost entirely in Perl
2) SWISH with a WWWWAIS gateway
-- can't think of a web site, but easily found
-- setup and indexing are more involved, but search capabilities are
excellent for smaller sites.
-- search and index binaries in very vanilla C; CGI scripts in Perl
3) GlimpseHTTP or WebGlimpse (which is in beta)
-- Also can't think of a web site (sorry!)
-- setup is not as harsh as SWISH's, but search is not as full-featured
either
-- As with SWISH, some is written in C; other pieces in Perl
I hope that's been of some help to you - I apologize for my lack of
detail, but you can e-mail me for a list of the search engine evaluations
I just whipped up for my job. (If there's interest, I can post them here as
well -- the list is not really that long.)
Cheers,
Kris Lizuck
lizuck@comnet.ca
From John's Useful Posting Archive (JUPA)
Maintained by John Callender