Thursday, November 20, 2008

Facts, Beliefs, Truths, Goals, Statements

Of late I've been busy reading about the semantics of facts, goals and the like. There are many things we come across in our day-to-day life: statements, facts, goals, beliefs, truths, obligations and so on. Most of what we know (or don't know) falls into one or more of these categories. This got me thinking: what is this all about? What do all of these have in common? What differentiates them? And many more questions along these lines.

While researching these topics recently, I could draw the following relations (a rough sketch in code follows the list):
  1. Statement is the common ancestor of Goals, Beliefs, Obligations, Truths, Facts etc.
  2. Except for Goals, everything is valid for a given time (has a time component) and place.
  3. Expressing a goal needs two states for the same object: the initial state, and the state of the object at the time t when we claim the goal has been achieved. The statement that an entity has achieved its goal is always a comparison against the state of the entity at the time the process to achieve the goal began.
  4. Beliefs change over a period of time.
  5. Facts are discovered, not invented. They are present whether we know about them or not.
  6. Facts belong to Closed-World semantics.
  7. Beliefs belong to Open-World semantics and, as discussed above, they are constructed and can be destroyed as well.
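
These relations can be roughed out in code. Here is a minimal sketch, assuming a simple Python class hierarchy; the class and field names are mine, not from any standard:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Statement:                 # the common ancestor (relation 1)
    content: str

@dataclass
class Fact(Statement):           # valid for a given time and place (relation 2)
    time: datetime
    place: str

@dataclass
class Belief(Statement):         # constructed, and destroyable (relations 4, 7)
    held_since: datetime
    held_until: Optional[datetime] = None   # None while the belief still holds

@dataclass
class Goal(Statement):           # needs two states of one object (relation 3)
    initial_state: str
    target_state: str

    def achieved(self, state_at_t: str) -> bool:
        # achievement is always judged relative to the state at the time
        # the process began, by comparing the state at time t to the target
        return state_at_t == self.target_state
```
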
I still have a long way to go on this. Any pointers from readers would be welcome.

PS: It's been almost two months since my last post on this blog, but I plan to come back in full swing soon and start posting regularly.

Until next time... !!!

Monday, September 01, 2008

Predicates and URIs

While going through different papers I realized that almost every research paper defines its own predicates for the same common operations on two objects; say, isMotherOf(x,y) in one paper could be mother(x,y) in another, and so on. The question arises: if many such predicates exist in different knowledge bases, how will someone know that they are related and do virtually the same thing?

Predicates are like functions or relation-builders, as they establish a relationship between two objects. So we can very well say that predicates are the basis of our reasoning mechanism and of determining what the two objects are all about. They also play an important role in defining the context of the two objects.

This gives me a strong argument for making each predicate a URI, so that if two knowledge bases use the same predicate or relation, they mean the same thing. Using this principle a reasoning engine can be developed that extracts the meaning of the statements, facts etc. in the system. If we design a system on this principle, the amount of ambiguity we need to deal with will be much smaller.

RDF in fact requires predicates to be URIs (only the object position may hold plain literal text), which fits this principle well. Removing ambiguity once and for all is a distant dream, though: even if we follow these principles, we will still end up with duplicate predicates in the system. But somewhere along the line we can develop a mapping that treats such predicates as aliases of each other, and based on popularity the less-used one can be phased out in due course.
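
To make this concrete, here is a minimal sketch using the rdflib Python library (the example namespace and data are mine): two independently built knowledge bases that share a predicate URI can be merged directly, and a query over the shared predicate sees both.

```python
from rdflib import Graph, Namespace

# A shared vocabulary: because the predicate is a URI, every knowledge
# base that uses it means exactly the same relation.
REL = Namespace("http://example.org/relations/")
PEOPLE = Namespace("http://example.org/people/")

g1 = Graph()
g1.add((PEOPLE.Mary, REL.isMotherOf, PEOPLE.John))

g2 = Graph()   # built independently, but reusing the same predicate URI
g2.add((PEOPLE.Sue, REL.isMotherOf, PEOPLE.Ann))

merged = g1 + g2   # the graphs merge cleanly because the URIs line up
for mother, child in merged.subject_objects(REL.isMotherOf):
    print(mother, "is mother of", child)
```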

Until Next Time.....

Monday, July 21, 2008

Entities and Representing Facts

In my previous post I started a discussion about Inferencing and Facts. The biggest challenge we have is how to represent these facts in a computer system. Last week I was reading the paper "A Library of Generic Concepts for Composing Knowledge Bases" by Ken Barker, Bruce Porter and Peter Clark, Proceedings of K-CAP 01, October 22-23, 2001.

In this paper the authors draw a distinction between Entities and Events: Entities are things that are, and Events are things that happen. Events comprise states and actions; a state is a static situation brought about or changed by actions.

To elaborate further: entities carry the state information, i.e. the facts, and actions apply to those entities. So if we revisit the earlier post, facts can be represented as attributes of an object. For example, "the Sun rises in the east" can be represented as a Rising Direction (predicate) attribute of Sun (subject) with the value East (an instance of type Direction). Similarly we can represent the other facts mentioned in the earlier post.
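
A minimal sketch of this subject-predicate-value representation in Python (the Fact class and attribute names are just for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str     # the entity the fact describes
    predicate: str   # the attribute, e.g. rising direction
    value: str       # the attribute value, e.g. an instance of Direction

facts = [
    Fact("Sun", "risingDirection", "East"),
    Fact("Australia", "seasonInJune", "Winter"),
]

# everything known about one entity
print([f for f in facts if f.subject == "Sun"])
```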

Any second thoughts?

Until Next Time ....

Thursday, June 05, 2008

Inferencing and Facts

In my previous post I started a discussion on Reasoning and Inferencing. Going by the very definition of inferencing, it is the act of reaching a conclusion based on certain facts. But what is a fact? Facts could be:
  • Universal truths, like the Sun rises in the east, or it is winter in Australia (in June).
  • Statements about one object instance, e.g. a Toyota Yaris YRS with rego ABC 123. Here we are considering only one particular car.
  • A general statement about all objects of one type, e.g. Toyota cars are easier to maintain than Hondas (I am not going to start a car-manufacturer war here).
  • A statement applicable to more than one type of object, e.g. if the battery is flat then neither a battery-operated nor a petrol vehicle will start.
But the actual problem is not what a fact is. The problem begins when we want to store facts so that computer systems can understand and reason over them. We need to store them in a knowledge base (KB), which is nothing but a collection of facts about one or more entities. To store facts in a KB we need a few issues sorted out (a toy sketch in code follows the list):
  1. How do we represent the facts in a computer system?
  2. How do we link the facts to the entities they describe?
  3. How do we retrieve the facts and relate them to the entities?
  4. How do we find all the facts that are known about an entity?
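
As a toy illustration of questions 1, 2 and 4, here is a minimal sketch, assuming facts are stored as attribute/value pairs indexed by the entity they describe (all names are mine):

```python
from collections import defaultdict

class KnowledgeBase:
    """Nothing but a collection of facts about one or more entities."""

    def __init__(self):
        # entity -> list of (attribute, value) pairs; the dictionary key
        # is the link between a fact and the entity it describes (Q2)
        self._facts = defaultdict(list)

    def add_fact(self, entity, attribute, value):
        self._facts[entity].append((attribute, value))   # representation (Q1)

    def facts_about(self, entity):
        """All the facts known about an entity (Q4)."""
        return list(self._facts[entity])

kb = KnowledgeBase()
kb.add_fact("Yaris-ABC123", "make", "Toyota")
kb.add_fact("Yaris-ABC123", "rego", "ABC 123")
print(kb.facts_about("Yaris-ABC123"))
```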

There are many such buzzing questions that need to be answered before we go ahead and build a system that infers over these facts. I would love to hear readers' opinions on them.

Until Next Time...!!!

Thursday, May 22, 2008

Reasoning and Inferencing

Whenever I read a journal or article on Artificial Intelligence, it sounds like science fiction to me. Okay, okay, I may not be up to date on what is happening in the AI field, but more often than not I find that most of the scenarios described in those fictions (?) are related to inference.

I read somewhere long back that inference is the act of reaching a conclusion based on facts already present in the system. What are facts? In my opinion they have to do with the statements presented before us. But do computers understand statements as we do? I guess not. In relation to a computer system, then, the facts are the objects and their attributes. So we have a few objects and their state information (as attributes), and we need to deduce a conclusion from that. How do we do that?

In order to combine these facts the system needs a certain ability: a set of rules that lets us combine the facts and infer something. This ability is nothing else but reasoning. By reasoning we mean establishing semantic relationships here. As we discussed in earlier posts, we need proper annotation in order to do semantic search and establish semantic relationships among the entities in the system.
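
As a toy sketch of that ability, assume facts are (subject, predicate, object) triples and a rule is just a function that derives new triples; the loop below is a naive forward-chaining inferencer, not any particular engine:

```python
facts = {
    ("Socrates", "isA", "man"),
    ("man", "subclassOf", "mortal"),
}

def subclass_rule(facts):
    """If x isA c and c subclassOf d, then infer x isA d."""
    derived = set()
    for (x, p1, c) in facts:
        for (c2, p2, d) in facts:
            if p1 == "isA" and p2 == "subclassOf" and c == c2:
                derived.add((x, "isA", d))
    return derived

# forward chaining: keep applying the rule until no new fact appears
new = subclass_rule(facts) - facts
while new:
    facts |= new
    new = subclass_rule(facts) - facts

print(("Socrates", "isA", "mortal") in facts)   # True
```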

I guess I am getting more and more philosophical on this topic. I remember a phrase from our childhood: more study, more confusion; less study, less confusion; no study, no confusion :).

I would love to hear readers' opinions on the relationship between inference and reasoning.

Until Next Time... :)

Wednesday, May 14, 2008

Can we beat Google for Web Search?

Sounds like the most difficult question we have faced so far? These days we rely on Google so much that we cannot imagine a day at work without Google search, nor that there could be a better search engine than Google. Google has created so much hype, and made us so dependent on it, that we benchmark every search engine against Google, including those that existed before it, like Yahoo and AltaVista.

But while Google becomes more powerful with each new application it releases and each upgrade it makes to its search engine, there are still a few important things missing from the search engine, which happens to be the core of Google.

Most of us who are interested in search engines and how they work have read the paper published by Page and Brin on the original Google search engine architecture, including the initial version of the PageRank algorithm (sketched after the list below). Over the last few years they are believed to have changed the original algorithm. There are a few problems with this style of search engine:
  1. The ranking is based purely on words and documents.
  2. Google search is based on a snapshot of the current web, whereas the web grows and evolves with every passing minute. The original paradigm of the World Wide Web was persistent publish-and-read, which holds good to an extent, but the web we are looking at today keeps evolving. We are no longer in the era of one publisher and many readers; today we have more content producers than readers on the web.
  3. The ranking algorithm uses the index table, and the crawler (software) traverses the links available on a page to navigate to the next page, and so on. The philosophy Google and many other search engines have adopted is to represent pages as a set of nodes (documents) connected by static links (HREFs), seeing the web as some sort of tree structure. But the web is not exactly like that: there are pages with no links at all, neither incoming nor outgoing, and such pages are left behind by Google search. An example is my poetry page, which is very much hidden from Google search even though it has been on the web for almost three years. The Google crawler managed to reach the main page of my homepage but could not get to the poetry page, as there is no link to it from the main page.
  4. We do not maintain any registry of page relevance outside the page itself. Google's indexing uses the keywords found in the page, but there is every chance that a relevant page does not contain the keyword at all.
  5. Though Google is reportedly planning to use Latent Semantic Indexing in an upcoming ranking upgrade, the accuracy of the results is still doubtful.
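
For reference, here is a minimal sketch of the original PageRank idea from the Page and Brin paper, as a power iteration over a toy link graph (damping factor 0.85; real implementations handle dangling pages, personalization and scale):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to via static HREFs."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# a page with no incoming links (like my poetry page) gets only the
# baseline (1 - damping) / n share, and a crawler never even finds it
toy_web = {"home": ["blog"], "blog": ["home"], "poetry": []}
print(pagerank(toy_web))
```

Note that points 3 and 4 are exactly about what this model cannot see: unlinked pages, and relevance that lives outside the page.
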
So much for the problems; what, then, is required to beat the Google search engine? As discussed in my previous post on a similar topic, I stressed the need for semantic search for the web. Semantic search is missing from the Google search engine, and unless it is made available, Google search (and for that matter every other search engine) will keep giving us irrelevant results (in abundance) when we query it.

Until Next Time... :)

Friday, May 09, 2008

Why Semantic Search?

In my previous post I discussed search engines in general and how they build the index table, which is the core of any search engine. One thing that became very clear from those studies is that the search engines available today are very limited in functionality. Keyword search does not leave much room for returning relevant results. In Google, if we enter Paris Hilton as the search keywords, we also get the Hilton Hotel in Paris returned as a result, and on the first page at that; the search engine cannot tell that we are not looking for a Hilton in Paris but for the celebrity Paris Hilton. Conversely, if we enter Hilton Paris we also get Paris Hilton in the results. Either way, some of the results we get are simply not relevant to what we are looking for.

Last night I was reading about Latent Semantic Indexing (LSI), and that did give some hope. I found a page at SEOBook that explains LSI in a much simpler way. There are other references as well, but apart from Wikipedia this is the one page that explains it in layman's terms.

But the million-dollar question is whether LSI will take away the pain of wading through irrelevant results when we query a search engine. In my opinion that is still not very clear, as LSI is ultimately still based on the keywords found in the documents, and that alone will not take the pain away unless we use semantic search. But then, why semantic search?
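
For the curious, here is a minimal sketch of the LSI machinery using numpy (tiny hand-made term-document counts; real systems use weighted counts and far larger matrices). Documents and queries are compared in a low-rank "concept" space obtained from the SVD rather than by raw keyword overlap, but the input is still just keyword counts, which is the limitation above:

```python
import numpy as np

# rows = terms, columns = documents (raw counts, toy data)
terms = ["paris", "hilton", "hotel", "celebrity"]
A = np.array([
    [2.0, 1.0, 0.0],   # "paris"
    [1.0, 1.0, 0.0],   # "hilton"
    [2.0, 0.0, 1.0],   # "hotel"      doc 0 is about the hotel
    [0.0, 2.0, 0.0],   # "celebrity"  doc 1 is about the person
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                               # keep the k strongest "concepts"
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def fold_in(term_vector):
    """Project a bag-of-words query into the latent concept space."""
    return np.diag(1.0 / sk) @ Uk.T @ term_vector

q = fold_in(np.array([1.0, 1.0, 0.0, 1.0]))   # "paris hilton celebrity"
for j in range(A.shape[1]):
    d = Vtk[:, j]                   # document j in concept space
    sim = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
    print(f"doc {j}: similarity {sim:.2f}")
```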

Semantic search, as most of us know, is based on the meanings conveyed by objects. The term meaning has more depth than appears on the surface. Semantic search is not new; it has been around for centuries. In ancient times philosophers gave the world the mantra for how to perform semantic search; it's just that only a handful of people (technologists) today take the pains to read that literature. Where current search engines fail is in restricting the results to what the user wants: we are allowed to input only a bunch of keywords.

In semantic search the driving factor is context, as different terms (or concepts, as John F. Sowa describes them) have different meanings or interpretations depending on where they are used. If we build a search engine around these philosophies then we can definitely achieve semantic search (to a great extent).

Until Next Time... :)

Monday, April 28, 2008

Reusing Ontology

In one of my earlier posts I put emphasis on why we need a common ontology. Over the last couple of days, while reading through different papers and books, I came across a few cases that explain why.

One of the basic ideas of the Semantic Web is to let users reuse an existing ontology if it meets their needs; in the worst case, we should at least be able to reuse a part of it to fulfil our requirements.

At the same time, if every user of the web builds their own ontology then there is no common language and no shared understanding of anything. There would be no interoperability of any kind between two agents using those ontologies, and no global processing would be possible either: messages would not be exchanged, and different machines could not interpret each other's messages.
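
To make this concrete, here is a minimal sketch with the rdflib Python library that reuses the well-known FOAF ontology instead of coining a private "person" vocabulary (the individuals are made up). Any agent that understands FOAF can interpret these statements without any prior negotiation:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF   # FOAF: a widely reused ontology

g = Graph()
alice = URIRef("http://example.org/people/alice")
bob = URIRef("http://example.org/people/bob")

# reuse foaf:Person and foaf:knows rather than inventing myOnt:Human etc.
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, bob))

print(g.serialize(format="turtle"))
```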

Thus without ontology reuse the very basic idea of the Semantic Web is void. Reusing an ontology is, to an extent, even more important than reusing a URI.

Until Next Time.....

Wednesday, April 23, 2008

Building Index Table

In my previous post we discussed search engines in general. We also noted that a search engine has a few basic functions:
  1. Building Index Table
  2. Performing the Search
  3. Building the Result for us
Building Index Table
In this post we will focus predominantly on how the index table is built. Index tables are the heart and brain of a search engine. Building them is an ongoing process that begins well before the search engine goes live and continues for as long as the search engine exists. In a way we can say that the indexing process determines the quality of the results the search engine returns.

Indexing is done by a piece of software called a crawler, aka spider. The crawler, as the name suggests, crawls over web pages and collects virtually all the information it can from each page. The input to the crawler is a starting URL. Once the crawler receives the URL it performs the following (a sketch in code follows the steps):
  1. It builds an index entry for each and every word on the page. Since a word may appear more than once in the document, it stores the word, the URL and the number of occurrences of the word in the document. This is done for almost all the words found on the page.

  2. Once the crawler has built entries for every word on the page, it navigates to the first link, which is again a URL, and crawls the new page, i.e. it performs the same activity as before, indexing each word on the page. At this point two situations are possible:

    • It encounters a word that is not yet in the index table, so it just adds the new word along with the URL and the number of times the word occurs in the document.

    • The word already exists in the table, in which case it locates the word in the index table and adds a reference to the second URL where the word is found, along with the number of times the word occurs in that document.

  3. Once the crawler is finished with the current page, it moves on as described in step 2.

  4. If no unvisited link is found on the current page, it goes back to the previous page, starts from the next link found there, and repeats steps 2 and 3.
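
Here is a minimal sketch of steps 1-4 in Python, using a small in-memory "web" in place of real HTTP fetching and HTML parsing (the page data and all names are mine):

```python
from collections import defaultdict, deque

# toy web: url -> (words on the page, links found on the page)
WEB = {
    "home":   (["my", "home", "page"], ["blog"]),
    "blog":   (["blog", "semantic", "web", "semantic"], ["home", "papers"]),
    "papers": (["papers", "semantic", "search"], []),
}

def crawl(start_url):
    index = defaultdict(dict)        # word -> {url: occurrence count}
    visited = set()
    frontier = deque([start_url])    # a queue: breadth-first traversal
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        words, links = WEB[url]      # a real crawler fetches and parses here
        for word in words:           # step 1: count every word on the page
            index[word][url] = index[word].get(url, 0) + 1
        for link in links:           # steps 2-4: queue the unvisited links
            if link not in visited:
                frontier.append(link)
    return index

index = crawl("home")
print(index["semantic"])             # {'blog': 2, 'papers': 1}
```

Using the deque as a stack instead of a queue would give the depth-first variant discussed below.
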
The flaw in this method is that the web is practically infinite, so steps 2-4 will never really finish. The best possible outcome is that a fraction of the web's pages gets indexed. Google, assumed to be the most powerful search engine, can index only around 1-2% of the pages on the World Wide Web.

This is not an efficient mechanism for building the index table. There has to be a limit at which the crawler stops going further down the hierarchy and moves on to the other pages in the list (on the original page). In a future post of the series I will discuss different approaches to crawling pages; a choice has to be made between depth-first and breadth-first traversal. For now we assume that the index table has been built and the search engine is ready to perform search operations.

Performing Search Operation
The index table is used when we type in a keyword to perform a search. In simplistic terms, the engine looks the keyword up in the index table, finds the documents (URLs) where it appears, and builds the list of documents to retrieve. But as I said earlier this is the simplest case; the actual result-building mechanism involves much more than just retrieving documents and presenting them to the user.
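
In terms of the index built in the sketch above, the simplest possible lookup might be the following, with the occurrence count serving as a crude relevance score (real engines combine many more signals):

```python
def search(index, keyword):
    """Return the URLs containing the keyword, most occurrences first."""
    postings = index.get(keyword, {})
    return sorted(postings, key=postings.get, reverse=True)

print(search(index, "semantic"))   # ['blog', 'papers']
```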

In the next post in this series I will look at how the result is built and shown to the user, what affects the page rank, and a few more details.

Until Next Time... :)

Tuesday, March 25, 2008

Search Engines

After a long gap I am posting something to my blog. Well, there have been numerous activities, the most important of them being getting married last month. The whole of February was filled with travel, meeting family and friends, etc. Then came the day I realized the holiday was over and it was time to come back to the real world. Well, I am back in the real world now, and I will keep posting interesting things as I discover them along the way of my research on the semantic web.

The most common use of the internet today is searching: locating and accessing information or resources on the web, for example finding out more about Formula 1 cars.

Today's search engines are based on keywords: they retrieve documents that contain the given keywords. As long as a document contains the keyword, it will be included in the search results and shown to the user. The current web then passes on to the user the pain of reading and interpreting whether the page makes any sense. To understand this, let us see how search engines are constructed. In this and a few upcoming posts I will discuss search engines in detail and why they function the way they do. In this post I will focus primarily on the problem (as described above) and some more detail about search engines.

Today the web contains hundreds of millions of pages. To locate the handful of pages we might be interested in among those hundreds of millions, we use search engines like Google and Yahoo, the most popular ones, besides others like AltaVista and Live Search. In spite of the differences they claim, a large part of them remains the same: the fundamentals of building a search engine are almost identical everywhere.

In future posts I will discuss how search engines are constructed and why they can only do keyword search. The next post will be about creating the index table that search engines use.

Until Next Time....

Monday, January 28, 2008

Work on Annotation Found Elsewhere

Recently, during my regular research work on annotation, I came across some work done by the W3C on annotation. I was surprised to see that work on annotation was quite active back in the late 1990s.

Annotation project using RDF at W3C:
Some interesting work elsewhere:
I hope you enjoy reading these links as much as I did.

Sunday, January 06, 2008

Uniqueness of Entities

I was having a discussion with a friend of mine about what makes an object (entity) identifiable. The conversation started with the different attributes of an object, and then we got into a situation where we had to distinguish two objects that had similar attributes.

More often than not in an enterprise system we face a situation where two objects turn up with similar attributes, the very attributes that primarily (at a high level) identify them. In such situations the only way out is to find another attribute attached to the object that is bound to be unique. In a database application we have primary keys generated by a sequence generator, which guarantees uniqueness. A commonly used real-world example is the Social Security Number in the USA, which is bound to be unique; a credit card number is supposed to be unique as well.

In light of the above examples, the question is: what is the unique identifier of an object? Is it something attached to the object as an attribute, or is it the thing that defines the object (rather, the object itself)? The more we think and discuss it, the more we conclude that there is no fixed rule as such. The object being its own unique identifier, and the object having an attribute that uniquely identifies it, each have their pros and cons. COM uses a GUID as the unique identifier for the objects it creates; a GUID is likely to remain unique even when generated on many computers simultaneously, for years, without interruption.
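
Both styles are easy to sketch in Python; the standard uuid module generates GUID-style identifiers (the Person class and its fields are purely illustrative):

```python
import uuid

class Person:
    def __init__(self, name, dob):
        # style 1: a generated identifier, unique regardless of the
        # attributes (two people may well share a name and birth date)
        self.id = uuid.uuid4()
        # style 2: a "natural key" composed of identifying attributes,
        # unique only as long as the combination stays unique
        self.natural_key = (name, dob)

a = Person("John Smith", "1980-01-01")
b = Person("John Smith", "1980-01-01")
print(a.id != b.id)                    # True: the GUIDs still differ
print(a.natural_key == b.natural_key)  # True: the natural key collides
```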

I personally find myself torn between the two approaches. The object being its own unique identifier has benefits: two objects can be told apart even when they have no attributes to distinguish them. The advantage of an attribute that makes an object unique, on the other hand, is that we can always work out a better combination (of attributes) if the current one can no longer guarantee uniqueness.

I would like to hear which approach readers would prefer in such scenarios, and about your experiences as well. Hope this year brings lots of joy, happiness and good times to all of us.

Until Next Time... :)