Wednesday, April 23, 2008

Building Index Table

In my previous post we discussed Search Engines in general. We also saw that there are a few basic functions of a Search Engine. They are:
  1. Building Index Table
  2. Performing the Search
  3. Building the Result for us
Building Index Table
In this post we will focus predominantly on how the Index Table is built. The index tables are the heart and brain of a search engine. The process of building them begins well before the search engine goes live, and it continues for as long as the search engine exists. In a way we can say that the quality of the indexing determines the quality of the results from the search engine.

The indexing is done by a piece of software called a Crawler, aka Spider. The crawler, as the name suggests, crawls over a web page and collects virtually all the information it can from it. The input to the crawler is the starting URL of a web page. Once the crawler receives the URL it performs the following:
  1. Build an Index Table entry for each and every word on the page. Since a word may appear more than once in the document, it stores the word, the URL and the number of occurrences of the word in that document. This is done for almost all the words found on the page.

  2. Once the crawler is done building the table for each and every word on the page, it navigates to the first link on the page, which is again a URL, and crawls that new page, i.e. it performs the same activity as before: building an index table entry for each and every word on the page. At this point two situations are possible.

    • It encounters a word that is not yet in the index table (from this or any previous document). In that case it simply adds the new word to the index table along with the URL and the number of times the word is found in the document.

    • The word already exists in the table. In that case it locates the word in the index table and adds a reference to the second URL where the word is found, along with the number of times the word occurs in that document.

  3. Once the crawler is finished with the current page, it moves on as described in Step 2.

  4. If there are no unvisited links found on the current page, it goes back to the previous page, starts from the next link found there, and repeats steps 2 and 3 (the whole procedure is sketched in code right after this list).
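
To make the steps above concrete, here is a minimal sketch in Python of a depth-first crawler building such an index table. The helper functions fetch_page, extract_words and extract_links are simplistic stand-ins of my own, not part of any real crawler; a real one would parse HTML properly and handle errors.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urljoin
from urllib.request import urlopen

# The index table: word -> {url: number of occurrences in that document}
index_table = defaultdict(dict)

def fetch_page(url):
    """Download the raw HTML of a page (no error handling, for illustration only)."""
    return urlopen(url).read().decode("utf-8", errors="ignore")

def extract_words(html):
    """Very crude tokenizer: strip tags, then keep runs of letters."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

def extract_links(html, base_url):
    """Pull href targets out of the page and make them absolute URLs."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(url, visited):
    """Steps 1-4: index every word on the page, then follow each link in turn."""
    if url in visited:
        return
    visited.add(url)
    html = fetch_page(url)
    # Step 1: record this URL and the occurrence count for every word on the page
    for word, count in Counter(extract_words(html)).items():
        index_table[word][url] = count
    # Steps 2-4: recurse into each link; returning from the recursion is the
    # "go back to the previous page" of step 4
    for link in extract_links(html, url):
        crawl(link, visited)

# crawl("http://example.com/", set())
```

As written, this recursion would try to walk the entire web, which is exactly the flaw discussed next.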
The flaw of this method is that the web is practically infinite, so steps 2 - 4 will never finish. The best possible outcome of this procedure is that only a fraction of web pages gets indexed. Even Google, which is assumed to be the most powerful search engine, can index only about 1-2% of the web pages on the World Wide Web.

This is not an efficient mechanism for building the index table. There has to be a limit at which the crawler stops going further down the hierarchy and moves on to the other links in the list on the original page. In a future post in this series I will discuss different approaches to crawling pages; the choice has to be made between Depth-First and Breadth-First. For now we assume that the index table has been built and the search engine is ready to perform the search operation.
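
Just to make the contrast concrete, here is a rough sketch of the Breadth-First alternative with a crawl limit. It reuses the hypothetical helpers and index_table from the earlier snippet, and the max_depth cutoff is my own assumption rather than a fixed rule.

```python
from collections import Counter, deque

def crawl_bfs(seed_url, max_depth=2):
    """Breadth-first crawl: index pages level by level and stop at max_depth."""
    visited = set()
    queue = deque([(seed_url, 0)])          # (url, distance from the seed page)
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        html = fetch_page(url)
        for word, count in Counter(extract_words(html)).items():
            index_table[word][url] = count
        # Instead of recursing into a link immediately, enqueue it one level deeper
        for link in extract_links(html, url):
            queue.append((link, depth + 1))
```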

Performing Search Operation
The index table is used when we type in a keyword to perform the search operation. In simplistic terms, the engine goes through the index table, looks up the keyword and the documents (URLs) where it appears, and then builds the list of documents to retrieve. But as I said earlier, this is the simplest case. The actual result-building mechanism has much more to it than just retrieving the documents and presenting them to the user.
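
In that simplest case the lookup could be as small as the sketch below, which ranks documents only by how often the keyword occurs in them; as noted, a real engine does far more than this.

```python
def search(keyword):
    """Return the URLs that contain the keyword, most occurrences first."""
    postings = index_table.get(keyword.lower(), {})
    return sorted(postings, key=postings.get, reverse=True)

# search("crawler")  ->  list of URLs ordered by occurrence count
```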

In the next post in this series I will look at how the result is built and shown to the user, what affects the page rank, and a few more details about the same.

Until Next Time... :)
