Wednesday, October 3, 2007

Natural Language Search - Mining the Web for Meaning

Do you have a question? Chances are you can find the answer in Wikipedia's 2+ million articles or somewhere else on the web. How?

The Present: Keyword-Based Search

How do we search today? Search engines use bots to crawl the web to find documents, process them, and then build an index; by 2006, Google had indexed more than 25 billion web pages. In response to a user's query, a search engine consults this huge index to find a set of matching documents. So far so good. Then it ranks the potential matches to present the most relevant results first. The ranking of the matches, and possibly the short snippet shown for each result, are tailored to the query and to any other information available to the search engine.
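
To make the indexing step concrete, here is a minimal sketch of an inverted index in Python. The pages, their contents, and the queries are all made up for illustration; real engines add tokenization, stemming, and far more machinery.

    from collections import defaultdict

    # Toy crawl results: document id -> page text (all hypothetical).
    pages = {
        1: "the iPod is a portable media player",
        2: "Steve Jobs introduced the iPod in 2001",
        3: "portable players changed the music industry",
    }

    # Inverted index: word -> set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    def search(query):
        """Return ids of documents containing every query word."""
        words = query.lower().split()
        results = set(index.get(words[0], set()))
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    print(search("iPod"))            # {1, 2}
    print(search("portable player")) # {1}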

It is challenging to rank potentially thousands or millions of matches and surface the relevant ones. Yet relevance is critical, especially for mobile users. Keyword-based search engines such as Google rank pages using a number of criteria and features: PageRank (link graph analysis), keyword frequency, keyword proximity, and many others. Many of these smart algorithms are discussed in my earlier post on Building Smart Web 2.0 Applications. Google was clearly the innovator in this area, which made it the undisputed leader in the search space. What is the next step to improve relevance?
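
As a rough sketch of how such signals might be combined, consider the toy scoring function below. The features, weights, and linear combination are invented for illustration and bear no relation to Google's actual (and secret) formula.

    def score(doc_words, query_words, pagerank,
              w_freq=1.0, w_prox=2.0, w_rank=5.0):
        """Combine illustrative ranking features into one score.

        The weights and the linear combination are made up; real
        engines blend many more signals, tuned empirically.
        """
        # Keyword frequency: how often query words appear in the document.
        freq = sum(doc_words.count(q) for q in query_words)

        # Keyword proximity: reward query words that appear close together.
        positions = [i for i, w in enumerate(doc_words) if w in query_words]
        span = (max(positions) - min(positions) + 1) if positions else 0
        proximity = len(positions) / span if span else 0.0

        return w_freq * freq + w_prox * proximity + w_rank * pagerank

    doc = "steve jobs said the ipod changed how we listen to music".split()
    print(score(doc, ["steve", "ipod"], pagerank=0.8))  # 6.8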

The Promise of the Semantic Web

The natural language of web pages is difficult for computers to understand and process. The vision of the Semantic Web promises information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing, and combining information on the web. Realizing this idea would require authors and publishers to make their information easier for computers to process by using special markup languages.

There are many projects that aim to capture the knowledge of the world in a structured form that software can process; among the most interesting are Freebase and Google Base. However, most of the web remains unstructured text, so the idea of the Semantic Web remains largely unrealized. How is it possible to mine the meaning of those billions of web pages?
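
To make "structured, machine-processable knowledge" concrete, here is a toy sketch of the subject-predicate-object triples that underlie Semantic Web formats like RDF and knowledge bases like Freebase. The facts and the query pattern are invented for illustration.

    # Tiny knowledge base of (subject, predicate, object) triples,
    # the basic data model behind RDF and projects like Freebase.
    triples = [
        ("Steve Jobs", "is_ceo_of", "Apple"),
        ("Apple", "makes", "iPod"),
        ("iPod", "is_a", "portable media player"),
    ]

    def ask(subject=None, predicate=None, obj=None):
        """Match triples against a pattern; None acts as a wildcard."""
        return [t for t in triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

    # "What does Apple make?" -> [('Apple', 'makes', 'iPod')]
    print(ask(subject="Apple", predicate="makes"))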

The Future: Natural Language Search

Imagine if you could ask a search engine the following question and get relevant results: "what did steve jobs say about the iPod?"

True natural language queries have linguistic structure that keyword-oriented search engines ignore. This includes queries where the function words matter, where word order means something, and where relationships between words that keywords cannot easily capture are stated explicitly. Instead of ignoring the function words, a natural language search engine respects their meaning and uses it to give better results.
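
A quick way to see what bag-of-keywords matching loses: two queries with opposite meanings reduce to the same keyword set once function words are dropped. This sketch, with a made-up stopword list, shows it.

    # A typical keyword engine drops function words and ignores order.
    STOPWORDS = {"from", "to", "the", "what", "did", "about"}

    def keyword_bag(query):
        """Reduce a query to an unordered set of content words."""
        return {w for w in query.lower().split() if w not in STOPWORDS}

    q1 = "flights from Boston to New York"
    q2 = "flights from New York to Boston"

    # Opposite meanings, identical keyword bags -- the function words
    # "from" and "to" carried the meaning that was thrown away.
    print(keyword_bag(q1) == keyword_bag(q2))  # True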

In fact, one of the most buzzed-about startups at the TechCrunch 40 conference aims to build exactly such a natural language search engine. It is a big challenge.

Powerset has licensed key Natural Language Processing (NLP) technology from Xerox PARC. Its search engine examines the actual meaning and relationships of the words in each sentence, in both the indexed web pages and the queries.
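
Powerset's pipeline is proprietary, but a toy illustration of the idea of indexing grammatical relations rather than bare keywords might look like the sketch below. The pattern matching here is far cruder than a real NLP parser; the sentences and verb list are invented.

    import re

    # Extremely naive subject-verb-object extraction: assumes simple
    # "X said/introduced/... Y" sentences. Real parsers build full
    # syntactic structures; this only illustrates the idea.
    PATTERN = re.compile(r"(.+?)\s+(said|says|introduced|announced)\s+(.+)")

    def extract_relation(sentence):
        match = PATTERN.match(sentence.strip())
        if match:
            subject, verb, obj = match.groups()
            return (subject.strip(), verb, obj.strip().rstrip("."))
        return None

    sentences = [
        "Steve Jobs said the iPod would change music.",
        "Apple introduced the iPod in 2001.",
    ]

    # Indexing relations lets "what did Steve Jobs say about the iPod?"
    # match on (subject='Steve Jobs', verb='said') instead of keywords.
    print([extract_relation(s) for s in sentences])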

The NLP technology they are using has been under development for 30+ years now. Powerset's unfair advantage is that it has reduced the time to index one sentence from two minutes down to one second. Currently they are limited to crawling a select few sources: Wikipedia, the New York Times, and ontological resources like Freebase and WordNet. They plan to use Amazon's EC2 and build out their own data centers to scale. Indexing billions of web pages will take time, but natural language search is certainly an interesting wave in the ocean of web innovations.

Check out Powerset's blog or sign up for Powerset Labs to experience their latest technology.
