Thursday, May 14, 2009

Google Rich Snippets, Information Discovery and Open Web

This week Google launched rich snippets support, which allows website owners to surface interesting content in search results by embedding structured data (such as microformats and RDFa) in their webpages. While Yahoo! SearchMonkey has done this before, the move from Google got a lot of attention because of its market share in search and how it can positively impact traffic to a website with interesting and unique content. So I was genuinely excited to see what this is all about and how we can integrate the rich local data that we have at Center'd into snippets. And I was a bit disappointed after learning more about what this can do in general. Here is why.

The Google Webmaster Central blog post describes the process as follows:

Rich Snippets give users convenient summary information about their search results at a glance. We are currently supporting data about reviews and people. When searching for a product or service, users can easily see reviews and ratings, and when searching for a person, they'll get help distinguishing between people with the same name. It's a simple change to the display of search results, yet our experiments have shown that users find the new data valuable -- if they see useful and relevant information from the page, they are more likely to click through. Now we're beginning the process of opening up this successful experiment so that more websites can participate. As a webmaster, you can help by annotating your pages with structured data in a standard format.

Great - sounds simple, right? Now look at the currently supported formats here in their documentation: the whole thing is limited to Reviews, People, Products, and Businesses and Organizations. That's it? That seems like a very limited set of formats for expressing the web as structured data! And if you dig a bit deeper, even for these types of data the schema is very restrictive. Take Businesses and Organizations, for example; the following are the properties that are "supported" (a small markup sketch follows the list):

Google recognizes the following Organization properties, and may include their content in search results. Where the RDFa Organization and microformats hCard property names differ, the hCard property name appears in parentheses.

name (org/name)
The name of the business. If you use microformats, you should use both org and name, and ensure that these have the same value.

url
Link to a web page

address (adr)
The location of the business

street-address
The street address. Child of address.

locality
The city. Child of address.

region
The geographic region. Child of address.

postal-code
The postal code. Child of address.

country-name
The country. Child of address.

tel
The telephone number
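To make this concrete, here is a minimal sketch, in Python and purely illustrative, of a page fragment annotated with just the properties above (I'm assuming standard hCard class names like vcard, fn, org, and adr; the business itself is made up):

```python
# Illustrative only: build an hCard fragment for a business using just the
# properties Google lists above. Class names follow common hCard usage
# (vcard, fn, org, url, adr, tel); the business data is made up.

ADDRESS_PROPS = ["street-address", "locality", "region", "postal-code", "country-name"]

def business_hcard(name, url, tel, address):
    """Return an hCard-annotated HTML fragment for a business listing."""
    addr = "\n    ".join(
        f'<span class="{prop}">{address[prop]}</span>'
        for prop in ADDRESS_PROPS if prop in address
    )
    return (
        f'<div class="vcard">\n'
        f'  <a class="fn org url" href="{url}">{name}</a>\n'
        f'  <div class="adr">\n    {addr}\n  </div>\n'
        f'  <div class="tel">{tel}</div>\n'
        f'</div>'
    )

print(business_hcard(
    "Some Restaurant",                       # hypothetical business
    "http://example.com",
    "+1 415 555 0100",
    {"street-address": "123 Main St", "locality": "San Francisco",
     "region": "CA", "postal-code": "94105", "country-name": "USA"},
))
# Note: there is no supported property for things like "good for kids" or
# "romantic", so that kind of data has nowhere to go in this vocabulary.
```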

Now, that's disappointing. This limited level of structured data support is not going to help users or publishers, because it reduces all websites to a common denominator!

Consider this: say you are looking for a restaurant in San Francisco, and there are three different websites with three different kinds of data about this place. Website A has ratings, website B has menu information, and website C has information about the activities the place supports. When you run the query in Google, what would you expect to see? You want as much information as possible (reviews, menu, and service-related info) from all three websites that have it. But in the current model, where every website is forced to express its "structured data" with a limited set of fields, you would see nearly identical, almost duplicate information from each. Note that no one gains in this process: the content publishers and websites lose their uniqueness, and users don't get the full information they want.

Now, I buy the argument that the standards can be extended, and that it's impossible for anyone to define a "structured format" that can fit every bit of information on the web. That's why I am kind of old-school in thinking that we need to get smarter about mining this data with machines. The web is open by definition because there are no rigid rules about how content, and the intent underneath it, gets expressed. That's what makes the web so fascinating: there is all kinds of data waiting to be understood, mined, and re-surfaced at end-points such as search engines. By putting a rigid structure at the discovery level (i.e. search engines), we lose the openness of the web. Keyword search is already a limiting paradigm for discovering information, since you can never discover content that you can't define (in a keyword); now, by hiding information that isn't structured, you will never know what you would have discovered accidentally.

Hopefully that won't happen with this semantic web push from Google.

Related: Great discussions at Tim O'Reilly's radar and Ian Davis's blog on this topic.

Disclaimer: At Center'd we have unique data about a place's capabilities (such as whether it's good for kids, whether it's romantic, etc.) in a structured format, and it cannot currently be expressed in the proposed solution. I have contacted Google about this issue and will update if I hear back :)

Tuesday, May 12, 2009

Can you please tell me your intent? Says Google with Search Options

Tons of news coming in on the new Google Search Options release; Search Options is a "tool belt" that can be used to organize search results based on one's intent, Google says. This is a significant step in a new direction for Google, a direction that seems to openly admit a couple of things:

a. "guessing" user's intent is super-hard
b. relevance is not uni-dimensional anymore (yes, Pagerank only represents only a fraction of it)

Hence, let the user control what and how they want to see the information. Looks plain and simple, right? No - there is more.

While the usefulness of this tool in "organizing" search results is going to be limited for end-users, there is a significant amount of "learning data" that Google will be able to collect about keyword queries and the related user intent from the billions of click-streams flowing through Search Options. This is exactly how they won the first round of the "web search" game - and it's time for the second round, perhaps?
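To illustrate the kind of signal I mean, here is a toy sketch, entirely hypothetical and mine rather than anything Google has described: log which Search Option a user clicks for a given query, and turn those clicks into a per-query intent distribution.

```python
from collections import Counter, defaultdict

# Hypothetical click log of (query, search_option_clicked) pairs.
click_log = [
    ("iran election", "recent results"),
    ("iran election", "recent results"),
    ("iran election", "timeline"),
    ("romantic restaurants sf", "reviews"),
    ("romantic restaurants sf", "reviews"),
]

def intent_profiles(log):
    """Aggregate facet clicks into a per-query intent distribution."""
    counts = defaultdict(Counter)
    for query, option in log:
        counts[query][option] += 1
    return {
        query: {opt: n / sum(c.values()) for opt, n in c.items()}
        for query, c in counts.items()
    }

print(intent_profiles(click_log))
# roughly: {'iran election': {'recent results': 0.67, 'timeline': 0.33},
#           'romantic restaurants sf': {'reviews': 1.0}}
```

At billions of clicks, that kind of per-query profile is exactly the sort of asset only a search engine with Google's volume can build.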

Is this about real-time search?

Many bloggers are screaming and shouting that this is all about real-time search (and, by extension, an answer to Twitter search!). I disagree. While recency (or real-time-ness) is a dimension of the new web search relevance paradigm, it's not all about that. This fundamental shift in the relevance paradigm is going to force all of us to think about alternative ways to "crawl" and index the web, and it is forcing Google too. The problem we are just beginning to notice with "real-time" search has been waiting to happen for a long time.

The web is full of spam, and it takes a bit of learning for any search engine to figure out how to differentiate spam content from real content. Google was far ahead of this learning curve with the brilliant feedback systems built into their search results: each time a user clicked on a link in the search results, it was counted as a vote of confidence. That, coupled with volume, built an unbeatable asset - the world's largest and finest uni-dimensional relevance database for the web (don't underestimate the power of this database; they had billions of these clicks recorded across the web before Microsoft even started building their search engine). Now, with the evolution of a new relevance paradigm that includes relevance as we know it plus new dimensions such as time, location, and rich media, there is a need for a more detailed and elaborate feedback system that lets users express their intent so that it can be captured, processed, understood, and applied back to the web - and Google Search Options is just one way of doing that.
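As a toy illustration of what a multi-dimensional relevance function might look like (my own sketch with made-up weights, not anything Google has published), imagine blending the classic click-vote score with recency and location signals:

```python
import math
import time

def relevance(doc, now=None, w_clicks=0.6, w_recency=0.25, w_distance=0.15):
    """Toy multi-dimensional relevance score; weights and decay are arbitrary."""
    now = now or time.time()
    # Classic signal: clicks on this result act as votes of confidence.
    click_score = math.log1p(doc["clicks"])
    # New dimension, time: one-day half-life decay (an assumption).
    age_days = (now - doc["published_at"]) / 86400
    recency_score = 0.5 ** age_days
    # New dimension, location: results closer to the searcher score higher.
    distance_score = 1.0 / (1.0 + doc["km_from_user"])
    return w_clicks * click_score + w_recency * recency_score + w_distance * distance_score

doc = {"clicks": 1200, "published_at": time.time() - 2 * 3600, "km_from_user": 3.0}
print(relevance(doc))
```

The point is not the formula itself but the feedback loop: each new dimension needs its own stream of user signals to tune those weights.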

What does this mean? A new world of search is in order. Perhaps we need a hybrid of "crawl/index" and "subscribe/index"; perhaps HTTP and XMPP should be integrated into a new kind of web server that can serve content and notify about it at the same time; perhaps search is push instead of pull (a rough sketch of that idea is below). What does this mean for SEO? :) I can go on and on here. All of this indicates one thing for sure: the web is still evolving, the web is still full of spam (now even in real time - get that! :) and we need a new way to organize the web's information.
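As a thought experiment (again, my own sketch, not a real protocol proposal), a "subscribe/index" loop would look less like a crawler and more like a callback that updates the index the moment a publisher pushes a change:

```python
import time

# Toy contrast between crawl/index (pull) and subscribe/index (push).
# No real HTTP or XMPP here; both functions and the index are stand-ins.

index = {}  # url -> (content, indexed_at)

def crawl(urls, fetch):
    """Pull model: we decide when to fetch, so changes wait for the next crawl."""
    for url in urls:
        index[url] = (fetch(url), time.time())

def on_publish(url, content):
    """Push model: the publisher notifies us the moment its content changes."""
    index[url] = (content, time.time())

# A pushed update is indexed immediately, with no crawl cycle in between.
on_publish("http://example.com/specials", "tonight: live jazz on the patio")
print(index)
```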

Photo: intent by outlier*

On my mind: romantic things to do in new york

Monday, May 11, 2009

Google News update - Spam vs. Relevance

Today Google News launched an update to the service that includes a "detail" page organizing salient sources with enhanced content such as photos, timelines, and quotes. I love Google News, but I'm not sure I like this addition. I use Google News regularly, and I find this new update confusing.

[Screenshot of the new Google News detail page]

TechCrunch said it's an issue with the algorithm and went on to compare Techmeme with Google News. As a geek who plays with document clustering in more than one domain, I don't think it's purely an algorithm issue. The issue seems to be primarily about filtering spam (or rather, about including stories that are less "authentic" but still very strongly related - do you see the difference?). That's a problem that can be solved in a number of different ways (of course, the easiest is to follow Techmeme's approach of crawling only a known set of sources); a toy sketch follows below. But I must admit I'm surprised the Google QA teams let this slip through. I also still don't get why the detail page groups news by location. Any ideas on why that is being done?
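For what it's worth, here is a toy sketch of the two levers I have in mind (my own illustration, not how Google News or Techmeme actually work): group near-duplicate stories by word overlap, then keep only clusters anchored by a source from a trusted list.

```python
# Toy story grouping: Jaccard similarity on title words, then drop clusters
# that contain no story from a trusted source. Sources and threshold are made up.

TRUSTED = {"nytimes.com", "bbc.co.uk"}

def similar(a, b, threshold=0.5):
    wa, wb = set(a["title"].lower().split()), set(b["title"].lower().split())
    return len(wa & wb) / len(wa | wb) >= threshold

def cluster(stories):
    clusters = []
    for story in stories:
        for group in clusters:
            if similar(story, group[0]):
                group.append(story)
                break
        else:
            clusters.append([story])
    return clusters

def keep_authentic(clusters):
    return [c for c in clusters if any(s["source"] in TRUSTED for s in c)]

stories = [
    {"title": "Google News adds detail pages", "source": "nytimes.com"},
    {"title": "Google News detail pages launch", "source": "someblog.example"},
    {"title": "Ten ways to make money online fast", "source": "spamfarm.example"},
]
print(keep_authentic(cluster(stories)))
```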

Overall, I think this release is a bit of a disappointment, but hey, it's still a great way to browse news on the web, especially in my mother tongue, Telugu!

On my mind: kid friendly restaurants in san francisco