An overwhelming number
of search engines on the WWW, such as Google, AltaVista, Lycos and
InfoSeek, are spider-based. An understanding of how they work can greatly
help you make the best of them.
Though the term "search engine" is often used to describe
all kinds of retrieval tools, spider-based search engines differ
considerably from human-powered directories. We discussed human-powered
directories in the last issue; this week we take a close look at
spider-based search engines.
Unlike directory-type search engines, spider-based search
engines (also called crawlers, robots or worms) seek out web pages
by 'crawling' through the WWW and automatically index sites using
their own indexing rules or algorithms.
Once you tell the search engine what your URL is,
its software robot goes there automatically and indexes everything
it needs. How much it indexes, and in what depth, depends upon
its algorithm - a closely guarded secret in many cases.
Parts of a Spider-Based Search Engine
Spider-based search engines have three major elements:
- Spider
- Index
- Search
The spider or crawler, as its name implies, crawls through
the WWW, finds a web page, reads it, and then follows links to other
pages within the site. It repeats this process at regular intervals
to check for new information or changes in the page.
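The find-read-follow loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular engine works: the TOY_WEB dictionary and the crawl function are hypothetical names, and the toy dictionary stands in for real pages fetched over the network.

```python
from collections import deque

# Hypothetical toy "web": URL -> (page text, links found on that page).
# It stands in for real pages that a spider would fetch over HTTP.
TOY_WEB = {
    "http://example.com/": (
        "Welcome to our export directory",
        ["http://example.com/a", "http://example.com/b"],
    ),
    "http://example.com/a": ("Spices and tea exporters", ["http://example.com/b"]),
    "http://example.com/b": ("Textile exporters", ["http://example.com/"]),
}


def crawl(start_url, web):
    """Start at start_url, read each page, then follow its links."""
    seen = {start_url}          # URLs already queued, so no page is read twice
    queue = deque([start_url])  # pages waiting to be read
    pages = {}                  # URL -> text collected by the spider
    while queue:
        url = queue.popleft()
        if url not in web:      # dead link: nothing to read
            continue
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


collected = crawl("http://example.com/", TOY_WEB)
```

Running the same function again at a later time is the simplest form of the regular re-visit; a real spider would also have to pace its requests and respect each site's rules.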
Information collected by the spider goes into the second
part of the search engine - the index. The index is like a giant
book containing a copy of every web page that the spider finds.
If a web page changes, then this book is updated with new information.
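A minimal sketch of the "giant book" idea, assuming the index simply keeps one copy per URL and replaces it when the page content changes. The PageIndex class and its update method are hypothetical names chosen for illustration; real indexes are far more elaborate.

```python
import hashlib


class PageIndex:
    """Keeps a copy of every page the spider hands over (illustrative only)."""

    def __init__(self):
        self.copies = {}        # URL -> stored copy of the page text
        self.fingerprints = {}  # URL -> hash of the stored copy

    def update(self, url, text):
        """Store the page; return True if it was new or had changed."""
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.fingerprints.get(url) == fingerprint:
            return False        # page unchanged since the last visit
        self.copies[url] = text
        self.fingerprints[url] = fingerprint
        return True


index = PageIndex()
first_visit = index.update("http://example.com/a", "Spices and tea exporters")
same_again = index.update("http://example.com/a", "Spices and tea exporters")
after_edit = index.update("http://example.com/a", "Spices, tea and coffee exporters")
```

Comparing fingerprints rather than whole pages is just one convenient way to notice a change; it is an assumption of this sketch, not something the engines have disclosed.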
The above two parts work in the background; we only
get to see the third part of a search engine - the search software.
This is a computer program that sifts through the millions of pages
recorded in the index to find matches to a search and rank them
in order of relevance. The order of relevance is decided entirely
by the engine's own algorithm.
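Since the real ranking algorithms are secret, the following is only a toy stand-in: it assumes relevance is simply the number of times the query words occur in a page. The search function and SAMPLE_INDEX are hypothetical names used for illustration.

```python
# Hypothetical sample index: URL -> stored page text.
SAMPLE_INDEX = {
    "http://example.com/a": "tea exporters and tea growers",
    "http://example.com/b": "textile exporters",
    "http://example.com/c": "contact us",
}


def search(query, index):
    """Return URLs of matching pages, most relevant first."""
    terms = query.lower().split()
    scored = []
    for url, text in index.items():
        words = text.lower().split()
        # Toy relevance score: total occurrences of the query words.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in scored]


results = search("tea exporters", SAMPLE_INDEX)
```

Here the page that mentions "tea" twice ranks above the page that only matches "exporters", and the page with no matching word is left out altogether.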
Features of Spider-Based Search Engines and Implications
for Search Results
The ability of a spider to crawl through millions of
web pages and create an index without human intervention makes it
a very powerful search tool with extremely broad coverage. Its second
ability, checking for changes or new information in indexed pages
by re-visiting them at regular intervals and keeping the index up to date,
again without human intervention, is really awesome.
However, the greatest strength of a spider-based search
engine is also its greatest weakness. Great coverage and the absence
of human editing mean a significant amount of junk or useless information
in the search results. This is particularly so when the search query is
loosely worded.
The key to getting the best out of a spider-based search
engine is to understand some basics of searching. We shall discuss
a few tips that can get you significantly better search results in
the coming issues.
Source: FAIDA - Newsletter on Business Opportunities from India and Abroad
Vol: 3, Issue 10
June 27, 2002
Author: Dr. Amit K. Chatterjee
(Amit worked in blue-chip Indian companies and MNCs for 15 years in various
capacities, including Research and Information Analysis, Market Development,
MIS and R&D Information Systems, before starting his e-commerce
venture in 1997. The views expressed in this column are
his own. He may be reached at amit@infobanc.com.)