Tuesday, May 09, 2006

Tags, Taxonomies and User Based Classification

With the popularity of web sites like Flickr and del.icio.us, there is a lot of talk about tags as a way to organize and retrieve content. It seems that many web 2.0 developers embrace tags as a cornerstone of information organization and retrieval. Although tags are not a new idea, they recently took center stage, pushing taxonomies aside as the primary means of information classification, at least in the new web 2.0 world. Tags do have their value, but their current implementation has a number of shortcomings too. Their primary value, in my opinion, comes from simplicity: the simplicity of how they are used from the end user's point of view but, maybe more importantly, the simplicity of the underlying system implementation. They are nothing more than an easy-to-implement but powerful indexing scheme. Taxonomies, on the other hand, offer value by letting users navigate to content even when they do not know the specific category it belongs to. Their biggest drawback is cost: they are relatively expensive to build and maintain.

I like tags for their simplicity, but I believe their current implementation has some significant limitations. For example, del.icio.us tags can only be one word long, so categorizing something with a multi-word expression becomes difficult. That brings up another issue: tags like 'web2.0', 'web_2_0' or other variations are clearly the same thing to a human reader, but to a machine they are completely separate entities. Even the singular and plural forms of a word are considered separate entities, which significantly undermines the intended purpose of tags, namely finding content easily. Beyond the syntactic aspect of tags as words, there is a semantic aspect too. The tag 'apple' might mean the company Apple to one person and the fruit to another. The social aspect of tagging relies on users consistently applying the same tags to a given type of content. If I tag a political blog as 'liberal' or 'conservative' but not as 'politics', somebody searching for political blogs using the tag 'politics' will not find it, although it clearly belongs to that category.
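To make the syntactic part of the problem concrete, here is a minimal sketch in Python of how spelling variants of a tag could be folded into one key. It is only an illustration, not how del.icio.us or any other site works: the function names normalize_tag and group_tags are made up for this example, and the plural rule is deliberately naive.

```python
import re


def normalize_tag(tag: str) -> str:
    """Fold case and punctuation so spelling variants share one key.

    Hypothetical helper for illustration only.
    """
    # Lowercase, then drop everything that is not a letter or a digit.
    key = re.sub(r"[^a-z0-9]+", "", tag.lower())
    # Naive plural rule: strip one trailing "s" from longer words,
    # but leave words ending in "ss" (such as "class") alone.
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def group_tags(tags):
    """Group raw tags under their normalized key, keeping input order."""
    groups = {}
    for tag in tags:
        groups.setdefault(normalize_tag(tag), []).append(tag)
    return groups


if __name__ == "__main__":
    raw = ["web2.0", "web_2_0", "Web-2.0", "blog", "blogs", "apple"]
    for key, variants in group_tags(raw).items():
        print(key, variants)
```

A rule like this only addresses spelling. It cannot tell whether 'apple' means the company or the fruit, and it will not connect 'liberal' to 'politics'; those are semantic problems that string cleanup does not touch.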

There is a nice recent post from Joshua Porter on tags (The Del.icio.us Lesson). Joshua argues that the primary reason for tagging is the benefit to the end user, who applies tags in order to easily find his or her own content, and that the social aspect comes second. I agree, but I would add that as a user I am very interested in finding relevant and interesting content on the topics I care about, and as a content publisher I am very interested in describing my content in a way that lets other users find it easily. If many users find something interesting, I will likely find it interesting too. I believe that human-based classification is always superior to any machine-based classification, and the success of sites like del.icio.us and Wikipedia shows that the power of social networking and collaboration should not be underestimated. I also believe we are only beginning to learn how to harness that power to create fundamentally better means of information organization and retrieval.

Wednesday, May 03, 2006

Keyword vs. conceptual based web search

Today’s web is dominated by keyword-based search. Although the keyword as a concept is more natural to a machine than to a human, it has become ubiquitous over time. Users have adjusted their behaviour: they know how to formulate a query so that the machine has the best chance of returning the most relevant answers. In a way, users have adjusted to the machines instead of the machines adjusting to the users. Not only that, but specialists around the world now sell their skills in adjusting web sites so that search engines can find them more easily. In other words, we are adjusting the information, and the way we present it to humans, in order to satisfy machines. The obvious question is: is this the best way of finding and organizing information, and is this where the future of the web is going? My answer would be: obviously not.

Keywords became popular because they are so simple from a machine-processing point of view. Any document that contains a particular keyword is indexed by the search engine under that keyword. Ranking those pages and scaling with the size of the Internet is a bit more complicated, and that is where Google earned its success.
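The indexing step described above can be sketched in a few lines of Python. This is a toy inverted index under my own assumptions (the names tokenize, build_index and search are invented for the example, and the documents are placeholders); it does no ranking at all, which is the hard part that real search engines solve.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Placeholder documents, made up for this example.
    docs = {
        1: "Apple releases a new laptop",
        2: "Apple pie recipe",
        3: "Laptop reviews",
    }
    index = build_index(docs)
    print(search(index, "apple"))
    print(search(index, "apple laptop"))
```

Note that a search for 'apple' returns both the document about the company and the one about the pie: the index knows which words occur where, and nothing about what they mean.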

As a means of human-to-machine interaction, keywords are very limited, for some obvious reasons. A single word might have multiple meanings, and multiple words sometimes make sense only when they are used together as part of a phrase. Judging the relevance of a document solely on the fact that it contains a particular word cannot produce successful results. That is especially visible with blog search sites. Web search engines use the markup of the text on a web page to give different weight to the different words they index. In the case of blogs, which contain a bunch of flat text, their ability to differentiate is substantially lower.

Keywords also make it impossible to specify any constraint as part of the input query. If I ask for ‘all news articles on X from last year’, what I really want are all news articles about X that were published last year, not the documents that contain those exact words. The only way to accomplish this today is to go through all the news sources, search each for X, and pick out the articles published last year. And there could be more complex queries that are simply impossible to answer using existing methodologies.

The alternative would be a “smart” system that can recognize the meaning and intent of the user’s query and provide either a direct answer or the most relevant references where the answer might be found. It may sound too idealistic, but that is the direction we should be heading in. Instead of just words, the system should understand the concepts behind those words and phrases. If I ask for ‘web 2.0 companies’, I would like to get either a list of web 2.0 companies, ranked by some means, or a list of web sites that provide that information. The system should also understand the relationships between different concepts and the attributes each concept might have. Similarly to the previous example, if I ask for ‘web 2.0 startup from California funded last year’, I would like to get a list of those companies as the result. To provide that, the system first needs to understand the basic query the way we humans do, and then it needs to have sufficient information about startup companies: where they are located and when they were funded. Extracting that information from web pages, in this case probably news sources, is a very hard problem. However, having sufficient metadata would fundamentally enable answering this kind of question.
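As a rough illustration of why metadata matters, here is a small Python sketch of the last step only: answering the query once it has already been understood and the facts have already been extracted. Everything in it is an assumption made for the example: the Company record, its fields, the find_companies function and the sample entries are all hypothetical placeholders, and the two hard problems (understanding the query and extracting the facts) are not addressed at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    """Hypothetical metadata record; the fields are chosen for this example."""
    name: str
    category: str
    state: str
    funded_year: int


def find_companies(companies, category, state, funded_year):
    """Return names of companies whose metadata matches every constraint."""
    return [
        c.name
        for c in companies
        if c.category == category
        and c.state == state
        and c.funded_year == funded_year
    ]


# Placeholder records, not real companies or real funding data.
companies = [
    Company("Example Startup A", "web 2.0", "California", 2005),
    Company("Example Startup B", "web 2.0", "New York", 2005),
    Company("Example Startup C", "web 2.0", "California", 2003),
    Company("Example Startup D", "biotech", "California", 2005),
]

if __name__ == "__main__":
    # 'web 2.0 startup from California funded last year', with "last year"
    # already resolved to a concrete year by some earlier step.
    print(find_companies(companies, "web 2.0", "California", 2005))
```

Once the information exists in this form, the query becomes a simple filter over attributes. A keyword index cannot express it, because none of the constraints has to appear as a literal word on the page.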

The Semantic Web initiative was started with the idea of providing a framework to enable exactly this kind of thing. However, Semantic Web adoption suffers from a classic ‘chicken and egg’ problem. Programmers would build web sites with that level of metadata if there were a search engine that benefited from it. On the other hand, you can’t build a commercial search engine if there aren’t enough sites providing that information. But maybe there are alternative approaches to gathering and organizing this information. The success of web sites like del.icio.us and Wikipedia shows the power of social networking and group collaboration.

There should be a better way of searching the web, and we are just at the beginning of our quest to organize and find the world’s information.