Saturday, March 3, 2012

Algorithms for Content Discovery

A thriving SNS such as Twitter suffers from two challenges related to information gathering: filtering and discovery. On the one hand, an active Twitter or Facebook user can easily receive hundreds of messages every day in her stream. Users therefore often express the desire to filter the stream down to those messages that are indeed of interest. On the other hand, many users also want to discover useful information outside their own streams, such as interesting URLs on Twitter posted by friends of friends.
When companies face these problems, there are a number of recommender designs they can try. Below is a list of such design dimensions, which I will elaborate on further.
1. Candidate-Set: We can narrow the candidate set down to users in the immediate vicinity, for example followees-of-followees. Presumably, sampling tweets from people closer to us produces more relevant results than selecting candidates from a random sample of users. Another choice is to sample postings from popular users, which uses "social proof" to decide which tweets to promote.
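To make the followee-of-followees idea concrete, here is a minimal sketch. The follow graph is a hypothetical in-memory dict, not any real Twitter API; the point is only how the candidate set of "nearby" users is derived.

```python
def followees_of_followees(follows, user):
    """Users followed by the people `user` follows,
    excluding the user and their direct followees."""
    direct = follows.get(user, set())
    fof = set()
    for followee in direct:
        fof |= follows.get(followee, set())
    return fof - direct - {user}

# Hypothetical follow graph: who follows whom.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
}

print(followees_of_followees(follows, "alice"))  # -> {'dave', 'erin'}
```

Tweets authored by this set would then form the candidate pool; the popular-users variant would instead start from a fixed list of high-follower accounts.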
2. Ranking-Topic: Using topic relevance is an established approach to computing recommendations. The topic interests of a user are modeled from text content the user has interacted with before, and candidate items are ranked by how well they match those interests. We can compute topic relevance from the user alone, or also from the users in the immediate vicinity, to rank tweets.
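One simple way to realize this, sketched below under the assumption of a plain bag-of-words model (the paper's actual text model may differ): build a word-count profile from the user's past text and rank candidate tweets by cosine similarity to that profile.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_rank(user_history, candidates):
    """Rank candidate tweets by similarity to the user's text history."""
    profile = Counter(w for text in user_history for w in text.lower().split())
    return sorted(candidates,
                  key=lambda t: cosine(profile, Counter(t.lower().split())),
                  reverse=True)

history = ["machine learning on graphs", "learning algorithms"]
tweets = ["new learning algorithms survey", "cat pictures", "graphs everywhere"]
print(topic_rank(history, tweets)[0])  # -> new learning algorithms survey
```

The followee variant would simply extend `user_history` with text from the user's followees before building the profile.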
3. Ranking-Social: Here we can use "social proof" to boost postings. Each retweet or favorite carries a specific weight, which indicates how popular the tweet was.
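A weighted social score might look like the following sketch. The weights here are purely illustrative, not taken from the study:

```python
# Illustrative weights: a retweet is assumed to signal
# stronger endorsement than a favorite.
RETWEET_WEIGHT = 2.0
FAVORITE_WEIGHT = 1.0

def social_score(tweet):
    """Weighted popularity score from retweet and favorite counts."""
    return (RETWEET_WEIGHT * tweet["retweets"]
            + FAVORITE_WEIGHT * tweet["favorites"])

tweets = [
    {"url": "a.example", "retweets": 3, "favorites": 10},  # score 16
    {"url": "b.example", "retweets": 8, "favorites": 1},   # score 17
]
top = max(tweets, key=social_score)
print(top["url"])  # -> b.example
```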
These proposed approaches seem intuitive and effective. But just how effective will they be? If we had to pick one dimension, which would we concentrate on?
In 2011, Jilin Chen carried out a comprehensive study of this recommender design space. There are twelve possible algorithms in total (2 candidate-set * 3 ranking-topic * 2 ranking-social), and Chen ran a field experiment in which each algorithm independently recommended its five highest-ranked URLs. He pooled and randomized these URLs, then had 44 volunteers vote on whether each URL was interesting or not.
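The full design space is just the cross product of the three dimensions. In the sketch below the option names are my own shorthand (assumed, not quoted from the paper), chosen so the counts multiply out to twelve algorithms:

```python
from itertools import product

# Illustrative option names for each design dimension.
candidate_sets = ["followee-of-followees", "popular-users"]
topic_rankings = ["none", "self-topic", "followee-topic"]
social_rankings = ["none", "social-vote"]

# Every combination is one recommender algorithm: 2 * 3 * 2 = 12.
algorithms = list(product(candidate_sets, topic_rankings, social_rankings))
print(len(algorithms))  # -> 12
```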
The result is right below --- it might be fun to guess which of these three dimensions would be most effective, and which would be least effective.

[spoiler alert!]
Were you correct? Here is a bullet-point analysis of the result:
1. "Social proof" is the absolute dominating dimension in producing interesting URLs.
2. The choice of candidate set (popular users versus the near vicinity) doesn't have much effect.
3. Topic relevance was definitely important, and using just the user's own topic relevance fared better than also including the topic relevance of followees.
However, it is important to note that social proof alone is not enough to create a thriving community. If that were true, Reddit or Digg would satisfy all our needs. Chen mentions the need to balance serendipity and relevance. For example, if I write a lot about Caltech on Facebook, that doesn't mean I only want to hear about Caltech. Also, even if a posting is not popular, it can still mean something to a minority or to immediate friends. As networks evolve, the challenge may be to give adequate attention to every piece of created content, and to encourage users to participate at their convenience.

Chen, J. 2011. Personalized recommendation in social network sites.

1 comment:

  1. Good introduction and discussion of the results on interesting content discovery. I agree that social proof should be combined with other factors to create popular and relevant recommendations.