Thursday, May 14, 2009

Google Rich Snippets, Information Discovery and Open Web

This week Google launched support for rich snippets, which lets website owners surface interesting content in search results by embedding structured data (such as Microformats and RDFa) in their webpages. While Yahoo! SearchMonkey has done this before, this move from Google got a lot of attention because of its market share in search and how it can positively impact traffic to a website with interesting and unique content. So I was genuinely excited to see what this is all about and how we could integrate the rich local data that we have at Center'd into snippets. But I was a bit disappointed after learning more about what it can actually do. Here is why.

The Google webmaster blog post describes the feature as follows:

Rich Snippets give users convenient summary information about their search results at a glance. We are currently supporting data about reviews and people. When searching for a product or service, users can easily see reviews and ratings, and when searching for a person, they'll get help distinguishing between people with the same name. It's a simple change to the display of search results, yet our experiments have shown that users find the new data valuable -- if they see useful and relevant information from the page, they are more likely to click through. Now we're beginning the process of opening up this successful experiment so that more websites can participate. As a webmaster, you can help by annotating your pages with structured data in a standard format.

Great, sounds simple, right? But look at the currently supported formats in their documentation: the whole thing is limited to Reviews, People, Products, and Businesses and Organizations. That's it? That seems to be a very limited set of formats for expressing the web in a structured way! And if you dig a bit deeper, even for these types of data the schema is very restrictive. Let's pick Businesses and Organizations as an example. The following are the properties that are "supported":

Google recognizes the following Organization properties, and may include their content in search results. Where the RDFa Organization and microformats hCard property names differ, the hCard property name appears in parentheses.

name (org/name)
The name of the business. If you use microformats, you should use both org and name, and ensure that these have the same value.

url
Link to a web page

address (adr)
The location of the business

street-address
The street address. Child of address.

locality
The city. Child of address.

region
The geographic region. Child of address.

postal-code
The postal code. Child of address.

country-name
The country. Child of address.

tel
The telephone number
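
To get a feel for how little this covers, here is a rough Python sketch that renders a business as an hCard-style fragment using only the properties listed above. The business record and the render_hcard helper are made up for illustration. The class names follow the hCard microformat as I understand it, where the name property is spelled fn and an organization is marked by putting fn and org on the same element. Treat it as a sketch, not as markup that Google is guaranteed to accept.

```python
from html import escape

# Hypothetical business record; every value here is invented for illustration.
business = {
    "name": "Example Cafe",
    "url": "http://www.example.com/",
    "street-address": "123 Main St",
    "locality": "San Francisco",
    "region": "CA",
    "postal-code": "94105",
    "country-name": "USA",
    "tel": "415-555-0100",
}

# The address sub-properties from the list above, in display order.
ADDRESS_FIELDS = ["street-address", "locality", "region", "postal-code", "country-name"]


def render_hcard(biz):
    """Return an hCard-style HTML fragment for a business dict (hypothetical helper)."""
    lines = ['<div class="vcard">']
    # Assumption: hCard marks an organization with fn and org on one element.
    lines.append('  <span class="fn org">%s</span>' % escape(biz["name"]))
    if "url" in biz:
        lines.append('  <a class="url" href="%s">website</a>' % escape(biz["url"]))
    address = [(field, biz[field]) for field in ADDRESS_FIELDS if field in biz]
    if address:
        lines.append('  <div class="adr">')
        for field, value in address:
            lines.append('    <span class="%s">%s</span>' % (field, escape(value)))
        lines.append('  </div>')
    if "tel" in biz:
        lines.append('  <span class="tel">%s</span>' % escape(biz["tel"]))
    lines.append('</div>')
    return "\n".join(lines)


markup = render_hcard(business)
print(markup)
```

Notice that the whole record fits in eight keys. Anything else a site knows about the place has nowhere to go.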

Now, that's disappointing. This limited level of support for structured data is not going to help users or publishers, because it reduces every website to a common denominator!

Consider this: say you are looking for a restaurant in San Francisco, and three different websites each have different data about this place. Website A has ratings, website B has the menu, and website C has information about the activities the place supports. When you run the query in Google, what would you expect to see? You want as much information as possible, such as reviews, menu, and service-related info, from all three websites. But in the current model, where every website is forced to express its "structured data" with a limited set of fields, you would see near-identical, almost duplicate information from each of them. Note that no one gains in this process: the content publishers lose their uniqueness, and the users don't get the full information they want.
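
The restaurant scenario can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the three site records are invented, the extra field names (rating, menu, activities) are hypothetical, and I am looking only at the Organization properties quoted earlier rather than at everything Google might parse.

```python
# Organization properties from the list quoted above (address children flattened).
SUPPORTED = {
    "name", "url", "street-address", "locality",
    "region", "postal-code", "country-name", "tel",
}

# Facts about the restaurant that all three hypothetical sites share.
common = {
    "name": "Example Bistro",
    "locality": "San Francisco",
    "region": "CA",
    "tel": "415-555-0100",
}

# Each site adds one unique piece of data (field names are made up).
site_a = dict(common, rating="4.5 out of 5")
site_b = dict(common, menu=["clam chowder", "sourdough"])
site_c = dict(common, activities=["live music", "trivia night"])


def project(record):
    """Keep only the fields the limited schema can express."""
    return {key: value for key, value in record.items() if key in SUPPORTED}


def dropped(record):
    """List the fields that the limited schema has no place for."""
    return sorted(key for key in record if key not in SUPPORTED)


sites = [site_a, site_b, site_c]
projections = [project(site) for site in sites]

# After projection, the three sites can no longer be told apart.
all_identical = all(p == projections[0] for p in projections)
print("identical after projection:", all_identical)
for label, site in zip("ABC", sites):
    print("site", label, "loses:", dropped(site))
```

Each site loses exactly the field that made it worth visiting, and what remains is the same record three times.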

Now, I buy the argument that the standards can be extended, and also that it's impossible for anyone to define a "structured format" that can fit every bit of information present on the web. That's why I am kind of "old-school" in thinking that we need to get smarter about mining this data using machines. The web is open by definition because there are no rigid rules about expressing content and the intent underneath. That's what makes the web so fascinating: there are all kinds of data to be understood, mined, and re-surfaced at end-points such as search engines. So by putting a rigid structure at the discovery level (i.e. search engines), we lose the openness of the web. Keyword search is already a limiting paradigm for discovering information, since you can never discover content that you can't define (in a keyword). Now, by hiding information that's not structured, you will never know what you would have discovered accidentally.

Hopefully that won't happen with this semantic web push from Google.

Related: Great discussions at Tim O'Reilly's radar and Ian Davis's blog on this topic.

Disclaimer: At Center'd we have unique data about a place's capabilities (such as whether it is good for kids, whether it is romantic, etc.) in a structured format, and it cannot currently be expressed in the proposed solution. I have contacted Google about this issue and will update if I hear back :)
