From: "Kris Lizuck" <lizuck@comnet.ca>
Date: Mon, 17 Jun 1996 20:31:48 -0400
Newsgroups: comp.infosystems.www.authoring.cgi

> Given a search string, internet search engines like Yahoo or Lycos
> etc., index the search results from highest to lowest points. How is
> this points rating calculated? Is only the HTML content-lines of files
> searched for occurrences of the search string, or are all lines
> (comments and HTML text) searched?

Hi Amit:

It really depends on the search engine. I've just finished evaluating
a whole slew of free/cheap-ware search engines, and the variety of
indexing/searching methods has been almost astounding.

Generally, though, you can say the following of any search engine:

- The index for a document collection is produced separately. The size
  of the file(s) this creates depends on how thorough the indexing is
  and whether or not the engine needs to support complex (read: concept
  or 'fuzzy') searches. In some cases this file is binary; sometimes
  it's plain text.

- The "rank" of a hit increases depending (foremost) on the number of
  search terms the engine can match within a document. Additionally
  (and this gets engine-dependent), the following may be used to weigh
  the strength of a match:

    - The relative placement of search terms (e.g. FOO adjacent to BAR
      versus BAR adjacent to FOO, versus FOO within X words of BAR,
      etc.)
    - Spelling errors and case-sensitive matching
    - The HTML "container" the term was found in (e.g. <H1>FOO</H1>
      ranks MUCH higher than <P>foo</P>, which ranks higher than
      <!--foo-->)
    - Distance from the top of a file (e.g. is "foo" mentioned in the
      first 15 lines, or does it show up 3 lines from the bottom of a
      file, where it is most likely a "see-also" or cross-reference?)

Each engine uses its own algorithm to weight its results - there's no
hard and fast rule or 'best method'.
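As a rough illustration of the weighting factors described above (and
not the algorithm of any particular engine), here is a minimal Python
sketch. It scores a document by the number of distinct search terms
matched, then adds a smaller bonus that depends on the HTML container
and on how near the top of the file the hit occurs. The function name
score(), the weight values, and the 100-point per-term bonus are all
made up for the example; proximity and spelling tolerance are left
out to keep it short.

```python
import re

# Hypothetical weights, chosen only for illustration; every real
# engine picks its own values.
CONTAINER_WEIGHTS = {"h1": 10.0, "p": 2.0, "comment": 0.5}
TOP_LINES = 15     # hits starting within the first 15 lines get a bonus
TOP_BONUS = 1.5    # multiplier applied to those early hits

# Toy parser: recognises only non-nested comments, <h1> and <p> blocks.
SEGMENT_RE = re.compile(
    r"<!--(?P<comment>.*?)-->|<h1>(?P<h1>.*?)</h1>|<p>(?P<p>.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)


def score(document, terms):
    """Return a toy relevance score for `document` against `terms`."""
    terms = [t.lower() for t in terms]
    matched = set()   # distinct search terms found anywhere
    total = 0.0       # weighted sum of individual hits
    for m in SEGMENT_RE.finditer(document):
        container = m.lastgroup              # "comment", "h1" or "p"
        words = re.findall(r"\w+", m.group(container).lower())
        line_no = document.count("\n", 0, m.start()) + 1
        for term in terms:
            hits = words.count(term)         # case-insensitive match
            if not hits:
                continue
            matched.add(term)
            weight = CONTAINER_WEIGHTS[container]
            if line_no <= TOP_LINES:
                weight *= TOP_BONUS
            total += hits * weight
    # The count of distinct terms matched dominates; the container and
    # position weights only break ties between similar documents.
    return len(matched) * 100.0 + total


docs = {
    "a.html": "<h1>Foo</h1>\n<p>bar</p>",
    "b.html": "<p>foo</p>\n<!--bar-->",
    "c.html": "<p>foo</p>",
}
ranked = sorted(docs, key=lambda name: score(docs[name], ["foo", "bar"]),
                reverse=True)
print(ranked)
```

With these made-up weights, a document that matches both terms always
outranks one that matches a single term, and a heading hit outranks a
paragraph or comment hit, which mirrors the ordering described above.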
If you're looking at a way of implementing a search engine on a
document collection, you might try one of the following (all are free,
and are ranked in order of my personal preference, though your mileage
may vary):

1) Excite for Web Servers (EWS) (at www.excite.com)
   -- VERY easy to set up, and very powerful
   -- written almost entirely in Perl

2) Swish with a WWWWAIS gateway
   -- can't think of a web site, but easily found
   -- set-up and indexing are more involved, but search capabilities
      are excellent for smaller sites.
   -- search and index binaries in very vanilla C; cgi scripts in Perl

3) GlimpseHTTP or WebGlimpse (which is in beta)
   -- Also can't think of a web site (sorry!)
   -- set up not as harsh as SWISH, but search is not as full-featured
      either
   -- As with SWISH, some is written in C; other pieces in Perl

I hope that's been of some help to you - I apologize for my lack of
detail, but you can e-mail me for a list of the search engine
evaluations I just whipped up for my job (If there's interest, I can
post them here as well -- it's not really that long)

Cheers,

Kris Lizuck
lizuck@comnet.ca
From John's Useful Posting Archive (JUPA)
Maintained by John Callender