Wednesday, May 14, 2008

Can we beat Google for Web Search?

Sounds like the most difficult question we have faced so far? In today's world we rely on Google so much that we cannot imagine a day at work without Google search, or that there could be a better search engine than Google. Google has created so much hype and made us so dependent on it that we benchmark every new search engine against Google, including those that existed before Google, like Yahoo, AltaVista etc.

But while Google is becoming more powerful with each new application it releases and each upgrade it makes to its search engine, there are still a few important things missing in the search engine, which happens to be the core of Google.

Most of us who are interested in search engines and how they work have read the paper published by Page and Brin on the original Google search engine architecture and the initial version of the page ranking algorithm. Over the last few years, though, they are believed to have changed the original page rank algorithm. There are a few problems with this search engine.
  1. The page rank is based on words and documents.
  2. Google search is based on the current web, whereas the web is growing and evolving with every passing minute. The paradigm of the World Wide Web is Persistent Publish and Read, which holds good to an extent, but the web we are looking at today is evolving. We are no longer in the era of one publisher and many readers; today we have more content producers than readers on the web.
  3. The page ranking algorithm uses the index table, and the crawler (software) traverses the links available on a page to navigate to the next page, and so on. The philosophy that Google and many other search engines have adopted is to represent pages as a set of nodes (or documents) connected to each other by static links (HREF). They see it as some sort of tree structure, whereas the web is not exactly like that. There are pages that have no links at all, neither incoming nor outgoing. Such pages are left behind by Google search. An example is my poetry page, which is very much hidden from Google search though it has been on the web for almost 3 years now. The Google crawler managed to reach the main page of my homepage but could not get to the poetry page, as there is no link to the poetry page from the main page.
  4. We do not maintain a registry, outside the page itself, that is based on the relevance of web pages. The Google search engine uses the keywords found in the page while indexing, but there is a chance that a relevant page might not contain the keyword at all.
  5. Though Google is planning to use Latent Semantic Indexing in its next upgrade to page ranking, the accuracy of the results is still doubtful.
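
Point 3 can be illustrated with a small sketch. The code below is a toy link-following crawler, not Google's actual crawler; the page names and the `LINKS` graph are made up for illustration. It shows that a page with no incoming link is never discovered when the crawl starts from the home page.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to (HREFs).
# "poetry.html" exists on the server, but no other page links to it.
LINKS = {
    "index.html": ["about.html", "blog.html"],
    "about.html": ["index.html"],
    "blog.html": ["index.html", "about.html"],
    "poetry.html": [],
}

def crawl(start, links):
    """Breadth-first traversal that follows only static links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

reached = crawl("index.html", LINKS)
unreached = sorted(set(LINKS) - reached)
print("reached:", sorted(reached))
print("never discovered:", unreached)
```

Under these assumptions the poetry page stays invisible to the crawler until some already-known page links to it or its address is supplied in some other way.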
That was the problem, but what is required to beat the Google search engine? As discussed in my previous post on a similar topic, I stressed the need for a Semantic Search for the web. Semantic search is missing in the Google search engine, and unless it is made available, Google search (and for that matter all other search engines) will still give us irrelevant results (in abundance) when we query them.
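
The keyword problem from point 4 can be sketched in a few lines. This is a toy inverted index over made-up documents, assumed only for illustration and far simpler than any real search engine. It shows a relevant page being missed because it never contains the query word.

```python
from collections import defaultdict

# Made-up documents for illustration only.
DOCS = {
    "doc1": "cheap car rental in the city",
    "doc2": "hire an automobile for your holiday",  # relevant to "car", but never says "car"
    "doc3": "poetry about the monsoon",
}

def build_index(docs):
    """Map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(DOCS)
hits = keyword_search(index, "car")
print(hits)
```

A semantic search would need some notion that "automobile" and "car" mean the same thing, which plain keyword matching does not have.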

Until Next Time... :)
