Re: Internal working of Internet Search Engines.

From: "Kris Lizuck" <lizuck@comnet.ca>
Date: Mon, 17 Jun 1996 20:31:48 -0400
Newsgroups: comp.infosystems.www.authoring.cgi

> Given a search string, internet search engines like Yahoo or Lycos
> etc., index the search results from highest to lowest points. How is
> this points rating calculated? Is only the HTML content-lines of files
> searched for occurrences of the search string, or are all lines
> (comments and HTML text) searched?

Hi Amit:

It really depends on the search engine.

I've just finished evaluating a whole slew of free/cheap-ware search
engines and the variety of indexing/searching methods has been almost
astounding.  Generally, though, you can say the following of any search
engine:

 - The index for a document collection is produced separately.  The size
of the file(s) this creates depends on how thorough the indexing is and
whether or not the engine needs to support complex (read: concept or
'fuzzy') searches.  In some cases this file is binary; sometimes it's
plain text.
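As a rough sketch of that separate indexing step (not any particular engine's format -- document names, tokenization, and the dictionary layout here are all illustrative), a plain-text inverted index maps each term to the documents and word positions where it occurs:

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: dict of {doc_name: text}.
    Returns {term: {doc_name: [word positions]}} -- an inverted index."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        # Crude tokenizer: lowercase alphanumeric runs.  Real engines
        # differ widely here (stemming, stop words, etc.).
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term][name].append(pos)
    return index

docs = {"a.html": "Foo and bar", "b.html": "bar bar baz"}
index = build_index(docs)
```

Keeping positions (not just counts) is what lets an engine support the adjacency and "within X words" searches mentioned below.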

 - The "rank" of a hit increases depending (foremost) on the number of
search terms the engine can match within a document.  Additionally (and
this gets engine-dependent), the following may be used to weigh the
strength of a match:
     - The relative placement of search terms
       (i.e. FOO adjacent to BAR versus BAR adjacent to FOO, versus FOO
within X words of BAR, etc.)
     - Spelling errors and case-sensitive matching
     - HTML "container" the term was found in
       (i.e. <H1>FOO</H1> ranks MUCH higher than <P>foo</P>, which ranks
higher than <!--foo-->)
     - Distance from the top of a file
       (i.e. is "foo" mentioned in the first 15 lines, or does it show up 3
lines from the bottom of the file, where it is most likely a "see-also" or
cross-reference?)
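To make the weighting concrete, here's a toy scorer combining two of the factors above -- the HTML container a term appears in, and its distance from the top of the file.  The weight values and the line-based HTML detection are purely illustrative; no real engine works exactly this way:

```python
import re

# Hypothetical container weights: a heading match counts far more than
# body text, which counts more than a comment.
CONTAINER_WEIGHTS = {"h1": 10.0, "p": 2.0, "comment": 0.5}

def score(doc_lines, terms):
    """Score a document (as a list of lines) against search terms.
    Earlier lines weigh more; <H1> text weighs more than <P> text,
    which weighs more than <!-- comments -->."""
    total = 0.0
    for lineno, line in enumerate(doc_lines):
        position = 1.0 / (1 + lineno)   # decays toward the bottom of the file
        lower = line.lower()
        if "<h1>" in lower:
            container = CONTAINER_WEIGHTS["h1"]
        elif "<!--" in lower:
            container = CONTAINER_WEIGHTS["comment"]
        else:
            container = CONTAINER_WEIGHTS["p"]
        for term in terms:
            hits = len(re.findall(re.escape(term.lower()), lower))
            total += hits * container * position
    return total

doc = ["<h1>Foo</h1>", "<p>foo and bar</p>", "<!-- foo -->"]
# Heading and early-line occurrences dominate the total.
result = score(doc, ["foo"])
```

Swap in different weights or a real HTML parser and you get a different ranking -- which is exactly why the same query returns differently ordered hits from different engines.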

Each engine uses its own algorithm to weight its results - there's no hard
and fast rule or 'best method'.

If you're looking at a way of implementing a search engine on a document
collection, you might try one of the following (all are free, and are
ranked in order of my personal preference, though your mileage may vary):

1)  Excite for Web Servers (EWS)  (at www.excite.com)
  -- VERY easy to set up, and very powerful
  -- written almost entirely in Perl

2)  Swish with a WWWWAIS gateway
  -- can't think of a web site, but easily found
  -- set-up and indexing are more involved, but search capabilities are
excellent for smaller sites.
  -- search and index binaries in very vanilla C; cgi scripts in Perl

3) GlimpseHTTP or WebGlimpse (which is in beta)
  -- Also can't think of a web site (sorry!)
  -- set up not as harsh as SWISH, but search is not as full-featured
either
  -- As with SWISH, some is written in C; other pieces in Perl

I hope that's been of some help to you.  I apologize for my lack of
detail, but you can e-mail me for a list of the search engine evaluations
I just whipped up for my job.  (If there's interest, I can post them here
as well -- it's not really that long.)

Cheers,
Kris Lizuck
lizuck@comnet.ca

From John's Useful Posting Archive (JUPA)
Maintained by John Callender