The Federal Government recently declassified an interesting letter sent to the NSA by none other than the legendary John Nash. In it, Nash predicts the future of cryptosystems and computational complexity with astonishing accuracy. He describes the framework of analyzing tasks in terms of the length of the input, and remarks that problems that scale exponentially with the input should be considered "hard", one of the foundations of modern complexity theory. In addition, he remarks on exploiting this hardness in cryptosystems, and argues that the security of a good system depends on breaking it being exponentially hard. He also included a novel encryption/decryption scheme of his own design. This letter anticipates complexity theory and cryptography by decades each, and is a testament to Nash's genius.
John Nash has been an extremely influential figure in the subject material of this course. He is probably most well-known for his formulation of Nash Equilibrium, a concept essential to modern game theory and conflict resolution. Cryptography is also essential to the internet and modern networks, since hiding and encrypting data is fundamental to internet security, preventing identity theft and other crimes of exploitation. The man was a genius far ahead of his time. Now who will step up and be the next to predict the future of computer science?
Tuesday, March 13, 2012
Thursday, March 8, 2012
The Game of Competition
For most of my life, there have been three big names in
technology: Microsoft, Apple, and Google. Each has a different backstory, which
most of us know quite well, but all three ended up with a monopoly on a
product. Google had its search engine, Microsoft had the PC, and Apple had the
iPhone. Each product has been updated over the years, but there was one element
noticeably lacking to aide in innovation: competition.
Fortunately for the consumers, competition is now beginning
to grow and the spoils are innumerable. In the beginning of the twenty-first
century, Apple finally had the Mac computer market begin to click, challenging the
PC market. Suddenly the crummy Windows OS updates such as Windows ME, Windows
2005, and Windows Vista weren’t forced upon the consumers because there was an
alternative. The innovation that Apple displayed forced Windows to change their
strategy and actually make innovative improvements to their product. The jury
is still out on how successful this will be, but Windows 7 and Windows 8 show
significant promise. The story is the same for the iPhone (with the Android
being the competition) and Google (with Bing being the competition).
As these three companies continue to attack each other
dominating products, the products get better faster and options available to
the consumer increase dramatically. Each of these three companies continue to
butt heads in the technology market and it will be an exciting to follow which
will reign supreme in the end.
Sources:
http://skattertech.com/2011/07/the-value-of-competition-a-microsoft-apple-google-analysis/
Sources:
http://skattertech.com/2011/07/the-value-of-competition-a-microsoft-apple-google-analysis/
Last lecture
Thanks to everyone who came to the last lecture! We had a nice discussion of the projects that you guys are considering for next term. Lots of interesting ideas! And then we did a little review quiz covering the big ideas from throughout the term. You all did really well, so it seems like many of the topics we covered will stick with you for a little while...
As I said in class, I really hope that you all can fill out the TQFRs. I'm particularly interested in hearing comments about whether the class should be 9 or 12 units next year, and whether the blog posts were a valuable part of the course.
Thanks for all the hard work everyone!
As I said in class, I really hope that you all can fill out the TQFRs. I'm particularly interested in hearing comments about whether the class should be 9 or 12 units next year, and whether the blog posts were a valuable part of the course.
Thanks for all the hard work everyone!
On data security and reliablity at data centers
Cloud storage is becoming very popular these days. Several companies are offering this service - Google cloud storage, iCloud by Apple, Ubuntuone, to name a few. These service providers essentially maintain and operate huge data centers. Although Google lists and gives information about around 10 data centers on their website , it is estimated (source, dates to 2008) that google has around 19 data centers in the US, around 12 in Europe, 3 in Asia, one each in Russia and South America. Managing massive amount of data has several challenges. We had a brief tour of the idea of data centers in our last class. We saw that apart from the challenges regarding handling servers and data, there are whole lot of issues like power systems, cooling systems etc.
The main aim of Data Centers is of course to host the data of the users reliably and securely.
Security Issues :
The users are concerned about the security of their data. Not only does the data need to be encrypted, but also lot of security measures have to be taken in order to ensure that
1. No unauthorized access is allowed physically.
2. Disposal of corrupted disks in a systematic and careful manner.
3. Avoid misuse of data by personnel within the data centers.
4. A reliable security system in place in case of emergency and so on.
This video shows the steps that google has taken in order to ensure data security.
Reliability Issues :
Failures can happen due to different reasons. Apart from possible disk errors, there could be physical damage due to wear and tear, fire, natural calamities etc. This gives rise to the need of distributed storage of data (store redundant data at different physical locations). Also, the system should immediately hand-off the user from data center with trouble to another data center with minimal possible disruption.
When a node fails, the data will be reconstructed using the redundant sources. This involves downloading data from the other nodes holding the redundant copy. Given the fact that data size is massive, considerable amount of bandwidth is consumed. So, do we keep huge number of redundant copies of all data and download all the lost data? Can we do better? There is a lot of work done on distributed data storage which proposes a smarter way of storing and recovering data which can decrease the amount of data to be stored and the download bandwidth for reconstruction.
The main idea is to encode the data and store pieces of it in different nodes in a redundant manner. For eg. say 1 Peta byte of data encoded into n symbols and stored in M nodes. The coding is done in such a way that when a node fails, it needs to connect to only d < n nodes to reconstruct the data.
(For more details: http://www.eecs.berkeley.edu/~rashmikv/research.html
http://scholar.google.com/scholar?hl=en&q=regenerative+codes+for+distributed+storage&btnG=Search&as_sdt=0,5&as_ylo=&as_vis=0 )
Google gives a really nice overview of different aspects of a data center here.
The main aim of Data Centers is of course to host the data of the users reliably and securely.
Security Issues :
The users are concerned about the security of their data. Not only does the data need to be encrypted, but also lot of security measures have to be taken in order to ensure that
1. No unauthorized access is allowed physically.
2. Disposal of corrupted disks in a systematic and careful manner.
3. Avoid misuse of data by personnel within the data centers.
4. A reliable security system in place in case of emergency and so on.
This video shows the steps that google has taken in order to ensure data security.
Reliability Issues :
Failures can happen due to different reasons. Apart from possible disk errors, there could be physical damage due to wear and tear, fire, natural calamities etc. This gives rise to the need of distributed storage of data (store redundant data at different physical locations). Also, the system should immediately hand-off the user from data center with trouble to another data center with minimal possible disruption.
When a node fails, the data will be reconstructed using the redundant sources. This involves downloading data from the other nodes holding the redundant copy. Given the fact that data size is massive, considerable amount of bandwidth is consumed. So, do we keep huge number of redundant copies of all data and download all the lost data? Can we do better? There is a lot of work done on distributed data storage which proposes a smarter way of storing and recovering data which can decrease the amount of data to be stored and the download bandwidth for reconstruction.
The main idea is to encode the data and store pieces of it in different nodes in a redundant manner. For eg. say 1 Peta byte of data encoded into n symbols and stored in M nodes. The coding is done in such a way that when a node fails, it needs to connect to only d < n nodes to reconstruct the data.
(For more details: http://www.eecs.berkeley.edu/~rashmikv/research.html
http://scholar.google.com/scholar?hl=en&q=regenerative+codes+for+distributed+storage&btnG=Search&as_sdt=0,5&as_ylo=&as_vis=0 )
Google gives a really nice overview of different aspects of a data center here.
Wednesday, March 7, 2012
Udacity, Coursera, MITx: Education for the Future
History of Internet Education
Education on the Internet is not a new concept. Even if we filter out all of the content websites that seek to inform audiences (e.g. Wikipedia, Quora, HowStuffWorks), the Internet was seen as an opportunity for extending traditional education beyond the walls of universities as early as 1993, when the Jones International University was founded as the first online-only institute of higher education1. Many universities followed suit (DeVry, University of Phoenix, and ITT Tech Online, to name some of the more popular ones), enabling Internet users to gain degrees from home. These programs were not without drawbacks. In general, they failed to have the prestige of their brick-and-mortar counterparts, and they were often just as expensive In addition, they did not tend to expose students to the same collaborative environments as traditional universities.
In an attempt to knock down the cost barriers to education by using the power of the Internet, the University of Tübingen in Germany started a program in 2000 in which lectures were recorded and published online, free for all to access2. It was a daring move, since there were some fears that losing the exclusivity of the university would devalue the degree program and drive students away. No such effect was observed, however, and just two years later, MIT followed suit with a similar program of their own3. This program, now known as OpenCourseWare (OCW), offered not just lectures, but homework assignments, lecture notes, and even exams in some cases. It has enjoyed considerable popularity, with hundreds of universities and education centers contributing their own content to the OpenCourseWareConsortium4. Still, the OCW programs fall short in a couple ways. There is no reliable way to gain any form of certification for completing work (important for people attempting to self-teach in order to move to better careers), and there is no synchronous "course offering" that drives students to collaborate and complete work in a timely manner.
Modern
Recently, a series of efforts coming from various places have been tackling the problems with OCW in a convincing manner. Last October, an effort from Stanford led by Sebastian Thrun and Peter Norvig brought a complete course on artificial intelligence to the Internet. The idea was not fundamentally different from OCW, but the manner in which the course was offered--following a schedule just like a traditional class, with lectures and homeworks posted gradually over a span of two months--proved to be hugely successful, attracting an astounding 160,000 people to enroll5. The course acted as a proof of concept, showing that the model of bringing traditional courses to the Internet for free had potential. After this success, Thrun co-founded a new company, Udacity, to continue the push towards free, approachable education.
Udacity is not alone in this push towards popularizing massive open online courses. A second company (also coming from Stanford) called Coursera has teamed up with University of Michigan, Stanford, and UC Berkeley to offer a handful of courses in a similar fashion6. MIT has also thrown its hat into the ring with MITx, offering an introductory course on electronics7. All three of these efforts share the same major features: a synchronized learning environment with high-quality course material, free for any person who can find access to the Internet.
There is still a long way to go before these programs can reasonably hope to compete with traditional universities as the new standard of education. Course offerings from all of these efforts tend to have a somewhat narrow skew towards computer science and technology, and it is unlikely that typical prospective employers would be eager to hire self-taught students in favor of those who have a degree from a known university on their resumes. Still, the trio of Udacity, Coursera, and MITx offer hope that we are making progress towards a society where higher education is just as accessible for any eager learners as the traditional K-12 system, and it is only possible due to the power of collaboration on the Internet.
(Is it sad that I hope to live to see a day where my Caltech degree becomes meaningless? :) )
References
Advertising Edge with Social Medium
As social medium like Facebook and Google+ start to take a growing cut
of digital advertising dollars, a increasingly convincing draw for advertisers
to choose Facebook and Google+ over conventional internet advertising venues is
the power of social connections. Both Facebook and Google+ have seen
significant increases in clicks when advertisements had social annotations
connected to them like liking and +1’ing an advertisement. Clicking these annotations says that specific uses approve of these advertisements and allows connections to see this approval as well. To look at some of these statistical differences, Nielsen
analyzed results from 79 Facebook/Google+ campaigns through the span of half a
year to determine how well social messages succeeded in getting through to the
audience.
The obvious advantage to advertising on Facebook/Google+ rather than a
website in line with a specific audience’s interests is the ability for people
to recommend an advertisement by liking or +1’ing which allows someone’s social
connections on these sites to see them as well. On average, social
advertisements generate a 55% increase in recall and social advertisements saw a
5% -10% increase in clicks. Nielsen says
this decided increase can be explained by consumers’ attitudes. In surveys
carried out in 2011, Nielsen saw 76 percent of U.S. Internet consumers said they most trusted recommendations from personal
acquaintances whereas 49 percent of U.S. Internet consumers trusted consumer opinions online. Therefore, this not so
small increase of 27 percent should serve as justification for the increase in
advertisement performance. It makes sense for consumers to trust recommendations
from connections than recommendation from random strangers. Now what is
interesting is the ability of Facebook and Google+ to continue finding clever
ways to push these numbers even higher. Google+ has allowed advertisements to
be delivered to end users when something in searched in Google and Facebook saw
a 46% increase in clicks when they provided end users with the “sponsored
stories” feature. Through both of these steps, we can see how social medium is
really taking a step forward in the advertising game. Through the mass data of
social networking and social connections, advertising has found a new venue in
Facebook and Google+ that show higher returns than any other online sites.
http://mashable.com/2012/03/06/google-plus-ads/
http://blog.nielsen.com/nielsenwire/?p=31136
Tuesday, March 6, 2012
Facebook's killing Feature, 'You May Know...'
Most, if not all, of the students in this class must know about one of the Facebook's killing features, 'You May Know..' This feature suggests a list of people to become friends with. I personally thought that the feature was indeed quite accurate, and wanted to know how Facebook determines which people to suggest. It was not so hard to find a research article written by a member from Facebook describing the algorithm behind the scene, and here I share it. Please, follow the attached link to find the original article. http://dl.acm.org/citation.cfm?id=1935914
Recommending a friend for a user had been studied even before Facebook's paper was published. One such example is “Make New Friends, but Keep the Old” – Recommending People on Social Networking Sites, which can be found at http://dl.acm.org/citation.cfm?id=1518735
We find that the methods used in both of the papers share two main themes: 1) recommendation based on personal information/interactions, such as the field of interest in his/her profile, the types of postings, the frequency of chat between the two, and etc, and the 2) social network structure centered at the user to suggest friends. In Facebook's paper, these two notions are naturally implemented in a simple, beautiful way. The key idea is well explained in the top right corner in the page 3 of the original paper, which I repeat here in my own words.
The algorithm is called ''Supervised Random Walks,'' and the basic idea is to construct the measure of closeness for each person, who is not yet a friend, by randomly traveling in the friendship network beginning from the person to whom Facebook will suggest friends. This process is the same as what we have learned in calculating PageRank for a given web structure. In fact, as we were finding pages that were more important than the others, this random walking process provides us which person is more important/close to the origin user. On top of this idea, Facebook adds more sophisticated approach. That is, rather than assuming equal probabilities for all links going out from a node, the algorithm assigns different weights to each connected link, and the weights are determined by the personal information/interaction of the two users. In practice, the function that gives the weight from the personal information/interaction can be learned by previous data.
http://dl.acm.org/citation.cfm?id=1935914
http://dl.acm.org/citation.cfm?id=1518735
Recommending a friend for a user had been studied even before Facebook's paper was published. One such example is “Make New Friends, but Keep the Old” – Recommending People on Social Networking Sites, which can be found at http://dl.acm.org/citation.cfm?id=1518735
We find that the methods used in both of the papers share two main themes: 1) recommendation based on personal information/interactions, such as the field of interest in his/her profile, the types of postings, the frequency of chat between the two, and etc, and the 2) social network structure centered at the user to suggest friends. In Facebook's paper, these two notions are naturally implemented in a simple, beautiful way. The key idea is well explained in the top right corner in the page 3 of the original paper, which I repeat here in my own words.
The algorithm is called ''Supervised Random Walks,'' and the basic idea is to construct the measure of closeness for each person, who is not yet a friend, by randomly traveling in the friendship network beginning from the person to whom Facebook will suggest friends. This process is the same as what we have learned in calculating PageRank for a given web structure. In fact, as we were finding pages that were more important than the others, this random walking process provides us which person is more important/close to the origin user. On top of this idea, Facebook adds more sophisticated approach. That is, rather than assuming equal probabilities for all links going out from a node, the algorithm assigns different weights to each connected link, and the weights are determined by the personal information/interaction of the two users. In practice, the function that gives the weight from the personal information/interaction can be learned by previous data.
http://dl.acm.org/citation.cfm?id=1935914
http://dl.acm.org/citation.cfm?id=1518735
Don't forget the TQFRs
As you'll see on HW7, it's really important to me that you guys take the time to fill out the TQFR reports. I really benefit from the feedback you give on the course and certainly will take it into account when I update the course next year. So, please fill them out ... and please do more than just give ratings, also write comments (since those are the most informative things).
If you manage to beat the percentage of surveys filled out by last year's class, everyone will get 5 points, and if you manage to get 100% completion percentage I'll give everyone 10 extra credit points!
If you manage to beat the percentage of surveys filled out by last year's class, everyone will get 5 points, and if you manage to get 100% completion percentage I'll give everyone 10 extra credit points!
Final two lectures
Well, we finally got around to talking about data centers on Monday in class. After giving an overview of what goes into a data center, we spent most of the time talking about 2 issues: energy usage and programming with MapReduce. You'll get a lot of practice with MapReduce on the HW, so hopefully by the end of the week you'll be comfortable.
And now we only have one lecture left. I hope everyone can try to come. We'll spend the time on two things: 1) letting you guys tell each other about the projects you're planning for 145, and 2) doing a little review of the material in the course. I really hope people come for the review. It's your chance to convince me that I'm making the right decision in not having a final!
And now we only have one lecture left. I hope everyone can try to come. We'll spend the time on two things: 1) letting you guys tell each other about the projects you're planning for 145, and 2) doing a little review of the material in the course. I really hope people come for the review. It's your chance to convince me that I'm making the right decision in not having a final!
Monday, March 5, 2012
How Does Facebook Decide Which Ads to Show?
The clickmaniac project raised the question of how Facebook decides which ads to show when and to whom. Facebook is given a set of advertisers and their bids for various ads, each targeted towards a particular demographic. Facebook also knows each advertiser's daily budget. The best way to go about showing ads using this information is not clear. I attempted to do some research and see if I could provide some insight as to how Facebook does it.
First I found the site http://www.facebook.com/help/?faq=173594992698822, a site run by Facebook that provides some explanation as to how their ad auctions work. The key features are as follows. As we talked about in class they use some variant of a second-price auction in which each advertiser pays the minimum amount they would have needed to bid in order to win their slot. However, they also take into account the ad's performance. It is not specified how exactly this is done but they say that ads that have a high clickthrough rate in the past are more likely to be shown again. This explains why a tactic that I thought up for clickmaniac doesn't work. I tried bidding a very high price per click for a very unpopular ad that was targeted towards the kind of people who would not want to click on it. The hope was that the high bid would cause it to "win the auction" and get shown the most. Since no one would click on it, it wouldn't cost us anything. As expected no one clicked on the ad, but it wasn't shown nearly as much as we had hoped it would be, probably because Facebook realized it had a low clickthrough rate. It of course makes sense for Facebook to show ads with high clickthrough rate (at least if advertisers are paying per click) in order to maximize their profit.
Another source I looked at is the news article www.stanford.edu/~saberi/sara.pdf. It details a theoretical solution to the ad auction problem (in the context of Google) proposed by a group of computer science researchers. They made an analogy between the ad auction problem and a problem called the online matching problem which they had already studied. The online matching problem is as follows:
"A matchmaker and n boys are gathered in a room, in which n girls appear, one at a time. Each girl has a list of boys who are acceptable to her, which she reveals to the matchmaker as she appears. The matchmaker matches her to one of the boys on her list or, if they are all taken, leaves her unmatched. The matchmaker’s goal is to maximize the total number of matches."
In the analogy, the girls are the advertisers and the boys are the advertising slots. The difficult part of the problem is the incomplete information -- the fact that the matchmaker doesn't know who is on each girl's list until it's that girl's turn to be matched. Using knowledge of the optimal solution to the online matching problem (which is detailed in the article), the researchers were able to derive the optimal solution for their formulation of the ad auction problem. An interesting attribute of the solution is that it depends both on the advertisers' bids and the fraction of their budget that is left. The algorithm favors displaying ads of advertisers who have a lot of budget left. I can think of one intuitive explanation for this. Each advertiser provides various ads that can be shown upon various Google search queries. Having many diverse advertisers allows Google to show ads upon almost any query. However, once an advertiser's budget is used, Google can no longer make money on that advertiser's ads and has less flexibility. If too many advertisers run out of budget, Google may be left with no ads to show on a particular query. (This is analagous to leaving a girl unmatched in the online matching problem.) It is in Google's favor to resist depleting any advertiser's budget in order to keep its options open for continuing to solve the matching problem in the future.
Sources:
http://www.facebook.com/help/?faq=173594992698822
www.stanford.edu/~saberi/sara.pdf
First I found the site http://www.facebook.com/help/?faq=173594992698822, a site run by Facebook that provides some explanation as to how their ad auctions work. The key features are as follows. As we talked about in class they use some variant of a second-price auction in which each advertiser pays the minimum amount they would have needed to bid in order to win their slot. However, they also take into account the ad's performance. It is not specified how exactly this is done but they say that ads that have a high clickthrough rate in the past are more likely to be shown again. This explains why a tactic that I thought up for clickmaniac doesn't work. I tried bidding a very high price per click for a very unpopular ad that was targeted towards the kind of people who would not want to click on it. The hope was that the high bid would cause it to "win the auction" and get shown the most. Since no one would click on it, it wouldn't cost us anything. As expected no one clicked on the ad, but it wasn't shown nearly as much as we had hoped it would be, probably because Facebook realized it had a low clickthrough rate. It of course makes sense for Facebook to show ads with high clickthrough rate (at least if advertisers are paying per click) in order to maximize their profit.
Another source I looked at is the news article www.stanford.edu/~saberi/sara.pdf. It details a theoretical solution to the ad auction problem (in the context of Google) proposed by a group of computer science researchers. They made an analogy between the ad auction problem and a problem called the online matching problem which they had already studied. The online matching problem is as follows:
"A matchmaker and n boys are gathered in a room, in which n girls appear, one at a time. Each girl has a list of boys who are acceptable to her, which she reveals to the matchmaker as she appears. The matchmaker matches her to one of the boys on her list or, if they are all taken, leaves her unmatched. The matchmaker’s goal is to maximize the total number of matches."
In the analogy, the girls are the advertisers and the boys are the advertising slots. The difficult part of the problem is the incomplete information -- the fact that the matchmaker doesn't know who is on each girl's list until it's that girl's turn to be matched. Using knowledge of the optimal solution to the online matching problem (which is detailed in the article), the researchers were able to derive the optimal solution for their formulation of the ad auction problem. An interesting attribute of the solution is that it depends both on the advertisers' bids and the fraction of their budget that is left. The algorithm favors displaying ads of advertisers who have a lot of budget left. I can think of one intuitive explanation for this. Each advertiser provides various ads that can be shown upon various Google search queries. Having many diverse advertisers allows Google to show ads upon almost any query. However, once an advertiser's budget is used, Google can no longer make money on that advertiser's ads and has less flexibility. If too many advertisers run out of budget, Google may be left with no ads to show on a particular query. (This is analagous to leaving a girl unmatched in the online matching problem.) It is in Google's favor to resist depleting any advertiser's budget in order to keep its options open for continuing to solve the matching problem in the future.
Sources:
http://www.facebook.com/help/?faq=173594992698822
www.stanford.edu/~saberi/sara.pdf
Saturday, March 3, 2012
Internet Browser Shares
According to statistics compiled by the Web tracking firm Net Applications, Google's Chrome browser's share fell slightly for a second month.
Net Applications acknowledged that until recently it had been over-counting Chrome's share due to Chrome's pre-rendering function, implemented last August. When users enter information in the address bar or the Google search bar Chrome will preemptively load some of the results; this caused Net Applications to over-estimate the number of sites visited by Chrome users. In February, the pre-rendering "accounted for 4.3% of Chrome's daily unique visitors". No other browser utilizes pre-rendering.
Additionally, Chrome's PageRank was still being penalized after a marketing campaign violated the company's rules against paid links; this dropped the Chrome website from the second result on the first page of a search for "browser" to the sixth item on the fifth page. It seems likely that Chrome will return to rising this month.
For comparison, Firefox has 20.92% of the browser share, Chrome 18.90%, Safari 5.24%, and the combined shares of Internet Explorer 6-9 is 52.84%. These represent changes of -1.77%, +7.49%, +1.11%, and -6.38%, respectively.
Source: computerworld.com
Net Applications acknowledged that until recently it had been over-counting Chrome's share due to Chrome's pre-rendering function, implemented last August. When users enter information in the address bar or the Google search bar Chrome will preemptively load some of the results; this caused Net Applications to over-estimate the number of sites visited by Chrome users. In February, the pre-rendering "accounted for 4.3% of Chrome's daily unique visitors". No other browser utilizes pre-rendering.
Additionally, Chrome's PageRank was still being penalized after a marketing campaign violated the company's rules against paid links; this dropped the Chrome website from the second result on the first page of a search for "browser" to the sixth item on the fifth page. It seems likely that Chrome will return to rising this month.
For comparison, Firefox has 20.92% of the browser share, Chrome 18.90%, Safari 5.24%, and the combined shares of Internet Explorer 6-9 is 52.84%. These represent changes of -1.77%, +7.49%, +1.11%, and -6.38%, respectively.
Source: computerworld.com
Algorithms for Content Discovery
A thriving SNS such as Twitter suffers from two challenges related to information gathering: filtering and discovery. On the one hand, an active Twitter or Facebook user can easily receive hundreds of messages every day in her stream. Users therefore often express the desire to filter the stream down to those messages that are indeed of interest. On the other hand, many users also want to discover useful information outside their own streams, such as interesting URLs on Twitter posted by friends of friends.
When companies face these problems, there are a number of recommender designs they can try. Below is a table of such decisions, which I will elaborate further.
1. Candidate-Set: We can narrow down the candidate set to users in the immediate vicinity, for example followee-of-folowees. Presumably, if we sample tweets from people closer to us, it would produce more relevant results than selecting candidates from a random sample of users. Other choice is to sample postings from popular users. This would be using "social proof" to filter tweets to promote.
2. Ranking-Topic: Using topic relevance is an established approach to compute recommendations. The topic interest of a user is modeled from text content the user has interacted before, and candidate items are ranked by how well they match the topic interest. We calculate the topic relevance of the user, or the users in the immediate vicinity to rank tweets.
3. Ranking-Social: Here, we can use "social proof" to boost the postings. Each retweet or favorite carries a specific weight, and that tells how popular the tweet was.
These proposed approaches seem intuitive and effective. But just how effective will they be? If we had to pick one dimension, which would we concentrate on?
Jilin Chen, in 2011, carried out a comprehensive study on such recommender design. There are twelve possible algorithms in total (2 candidate-set * 3 ranking-topic * 3 ranking-social), and Chen ran the field experiment by having each algorithm to independently recommend its five highest ranked URLs. He combined and randomized these URLs, then had 44 volunteers vote on each url whether it was interesting or not.
The result is right below --- it might be fun to take a guess on which of these three dimensions would be most effective, and which would be least ineffective.
[spoiler alert!]
Were you correct? Here is the bullet point analysis of this result:
1. "Social proof" is the absolute dominating dimension in producing interesting URLs.
2. Selecting tweets from popular users or near-vicinity doesn't have much effect.
3. Topic relevance was definitely important, and using just the topic relevance of the user fared better than including topic relevance of the followees.
However, it is important to note that social proof alone is not enough to create a thriving community. If that was true, Reddit or Digg would be enough to satisfy all our needs. Chen mentions a need to balance serendipity and relevance. For example, if I write a lot about Caltech in Facebook, that doesn't mean that I only want to hear things about Caltech. Also, even though a posting may not be popular, it could mean something to the minority or immediate friends. As networks evolve, the challenge may be to give adequate attention for every created content, and encourage users to participate at their convenience.
Chen, J. 2011. Personalized recommendation in social network sites. http://purl.umn.edu/117321
Friday, March 2, 2012
Foursquare's New Recommendation Engine
In today’s world of mass internet use, information about
users is worth more than ever. As scary and alarming as it may be, news of companies
collecting user data to target ads and using social profiles to personalize
search results is becoming the norm. While receiving targeted ads in return for
Facebook posts and likes does not sound positive to most users, there are
certainly benefits to be gained from web companies analyzing some data about
you.
Take Foursquare, for example. This social service allows
users to “check in” to places and publish that information to their social
network. Users willingly share location information with their friends, and one
can see a lot of potential in analyzing this data for useful purposes. In fact,
Foursquare has recently announced that they are in the process of doing
precisely that.
Their newly-updated Explore service intends to take users’
check-in data and provide personalized recommendations for places to go. The
engine will use statistics such as the time of day, a user’s friends’ check-ins
and likes, as well as the user’s history to choose places that it thinks the
user will like. With over a billion check-ins providing plenty of data, this service
has the potential to become a very powerful recommendation engine with accurate
and useful suggestions.
While its implementation requires tracking users more than
some would like, Explore is an undoubtedly useful feature and will benefit consumers
a great deal. The CEO of Foursquare says he eventually wants to have recommendation
dots on any virtual map, and with Foursquare’s social integration and growing
popularity, this can be a reality sooner than we think.
As in the example of Explore, analyzing user data can add
tremendous value to society. However, the unauthorized sharing of this data
with third-party advertisers will justifiably spark heated opposition, and the
challenge for companies like Foursquare is to properly address these privacy
concerns. In the new age of Internet companies playing the role of Big Brother,
the quest for balancing creepiness and features will continue to dominate news
headlines. The real question is not whether these companies will gather information,
but how evil they will want to be with it.
References:
(1) CNET article
(2) Foursquare blog post about Explore
Homework 7
The last HW is now posted. At this point, I've lectured about everything you'll need for all the problems except for problem 4, which focuses on MapReduce. We'll cover MapReduce in class on Monday... Good luck!
Thursday, March 1, 2012
Foursquare’s Transition from GoogleMap to OpenStreetMap
Yesterday, Foursquare announced, on their official blog, their decision to embrace the OpenStreetMap movement. Instead of continuing to use GoogleMap API, Foursquare decided to partner with a startup MapBox, using MapBox Street to power all of their maps, starting today. OpenStreetMap is an open, crowd-sourced global atlas. Because MapBox internally uses OpenStreetMap, foursquare no longer relies on any proprietary map information.
Although Foursquare claimed that they made this big technical transition because they would like to have more design flexibility, to support other startups, and to be able to use the Leaflet javascript library, many believe that the ultimate reason behind Foursquare’s transition is GoogleMap API’s new pricing policy. The pricing for GoogleMap APIs may not be very expensive, but for a startup with large amount of map usage, such as Foursquare, GoogleMap may be unaffordable. Thus, even though OpenStreetMap does not offer very good coverage in many places, especially relatively remote ones, Foursquare is somehow forced to transition to open-source data.
To me, this transition means much more than just moving to more economically affordable data. It really reflects an important change recently in many of the startups that use maps to provide location-related services. A few years ago, all the GPS device manufacturers would still have to make their own maps, and even Internet startups like Yelp or Foursquare would have to rely on map databases from big companies, such as Google. However, we are seeing increasingly more new companies that use and contribute to open-source maps. Factual seeks to integrate available open-source data (mostly geographical information) in an easy-to-use and coherent fashion; Waze uses crowd-sourced map and crowd-uploaded real time traffic data to provide driving directions. None of these promising startups relies on big company’s databases. Although open-source maps currently do not have the coverage of GoogleMap, they are getting better and better each day, as more and more people contribute to them. Waze, for example, updates the map automatically according to uploaded driving data. Google may have cars that drive around the US very frequently to track down geographical changes, but if enough people use open and crowd-sourced maps, they can be updated even faster and more accurately than GoogleMap. In a few years, we may see that crowd-sourced maps will gradually take over the Internet map market.
Clickmaniac results
As you know, clickmaniac ended yesterday. Here's the final standings for posterity! There were lots of changes in the final days and here's how it ended up...
Reach:
1) Tobias, Gustaf, Shiyu 910,760
2) Hari, Micheal B. 823,841
3) Kevin, Nathan, Mikhail, Stasja 668,576
Connections
1) Robert, Brock 1304
2) Kevin, Nathan, Mikhail, Stasja 1192
3) Brian M, Lucia, Tuling, Timothy 189
Clicks
1) Kevin, Nathan, Mikhail, Stasja 1660
2) Robert, Brock 1128
3) Hari, Michael B. 1082
Reach:
1) Tobias, Gustaf, Shiyu 910,760
2) Hari, Micheal B. 823,841
3) Kevin, Nathan, Mikhail, Stasja 668,576
Connections
1) Robert, Brock 1304
2) Kevin, Nathan, Mikhail, Stasja 1192
3) Brian M, Lucia, Tuling, Timothy 189
1) Kevin, Nathan, Mikhail, Stasja 1660
2) Robert, Brock 1128
3) Hari, Michael B. 1082
On Network Neutrality
One of the topics discussed in class is Network Neutrality (NN), a big buzzword in the recent history of the Internet. For those unfamiliar with it, this post will be a brief introduction to the concept along with its recent implementation in the U.S. law.
For start, it would be worthwhile to explore the definition of the idea first. As quoted from Columbia University's prof. Tim Wu's FAQ:
For start, it would be worthwhile to explore the definition of the idea first. As quoted from Columbia University's prof. Tim Wu's FAQ:
"Network neutrality is best defined as a network design principle. The idea is that a maximally useful public information network aspires to treat all content, sites, and platforms equally."While this can be treated very broadly, the NN idea as applied to the Internet is largely related to the ISPs' ability to control the flow of information through their networks. Well, what does neutral mean in this case then? Generally speaking, it would be discriminating against some kind of packets flowing through the net and preferentially treating other packets. One could imagine this could possibly be used to improve the Quality of Service for some ISPs, as they theoretically could speed up access to important or influential parts of the Internet, but such a system could possibly discriminate against free spread of information as well. And the latter possibility is one of the reasons that speak for advocating NN to make the Internet a free arena for information spread (one can read about it in Tim Wu's paper). One of the founding ideas of the Internet was to provide an open web of information, and the principles of NN aspire to protect this idea.
Some Insight Into Intermediaries
Everybody knows that the Internet is an
amazing invention. Through it, we can connect to friends, learn new
things, be entertained, and all sorts of other things all without
paying a dime. However, we have to remember that everything comes
with a price. And I'm not just talking about ads, everybody expects
to be assaulted by ads when going online, I'm talking about data
mining.
Now, this is not necessarily a bad
thing, in fact, in many cases it is a good thing. We get more
personalized content and importantly it lets publishers get more
information, so they can make more money from their ads, so we can
get more free content. However, to do this, our online information is
thrown about to a wide variety of companies forming the online
advertising pipeline: advertisers, intermediaries, and publishers.
And, although publishers and advertisers are what we see in this
pipeline, what are really interesting are the intermediaries, the
ones getting the right advertisers to the right publishers.
One of the most straightforward ways
they do this is with demographic, geographic, and behavioral
targeting. However, rather than just targeting publishers as in
traditional advertising, they can also target individuals. By
tracking cookies and trading information with the publishers, these
intermediary companies can track your online presence, so even if you
are currently looking up cute cat pictures on the internet, you could
still be getting ads for those heavy metal concert tickets you have
been eying recently.
But, that is just the beginning of what
they can do. Say you are a website and want to target customers who
are likely to buy again, how would you do this? One trick they use is
that when a customer comes to your site, you drop a cookie on them.
Then you send ads their way and see who comes back, creating a
profile of who are likely to come back. Then you just rinse and
repeat until the profile consists only of people likely to come back.
Pretty cool, huh?
However, they don't just match
advertisers with publishers, some also do quality control, making
sure the advertisers and the publishers make their money's worth. By
tracking which users get which ads and how they respond by collating
information among the publishers, it becomes possible for even third
party entities to track ad value.
The combination of these techniques and
many more shows just how complicated, yet amazing this whole industry
is. The fact that so many companies with limited information in this
industry form a distributed market that keeps the whole Internet
going and profitable is, quite frankly, mind-boggling.
Sources:
Mozilla Fights Online Tracking with "Collusion"
The Mozilla Foundation announced yesterday the release of a new Firefox add-on intent on helping users understand how tracking cookies can be used by advertisers and others to assemble a nearly comprehensive picture of users' activity online. Collusion is the latest tool released in Mozilla's ongoing effort to empower users to have more control over how they interact with the web. Its current implementation is focused on a graph representation of the flow of cookies, where nodes represent unique domains and edges represent that a request was made from a property of the first domain to one of the second. A demo of this visualization in action can be found on the above-linked Collusion page.
The idea is that within just a handful of visits to supposedly independent websites, it becomes quite clear that several advertisers have a very good idea of exactly how a given user interacts with the web. This is, of course, a manifestation of a couple of the universal properties of graph structure that we have discussed in class, in that the graph quickly converges to a giant strongly connected component with high connectivity. Mozilla goes one step further in the actual add-on by surrounding nodes that a user has explicitly requested with a white halo, and those that are "confirmed trackers by privacychoice.org" with a red one. Privacy advocates at Mozilla hope that the striking visualization will serve as a concerning wakeup call to web users heretofore unaware of the nature of tracking cookies. Empowering users to do something about it is mostly left for a later version. They state that future versions of the Collusion add-on will include the ability to block individual trackers as well as an opt-in to anonymously provide data to researchers and privacy advocates interested in investigating aggregate trends in internet tracking.
If this initiative, along with other Mozilla privacy initiatives like Do Not Track, is successful in encouraging a non-trivial portion of web users to take active control of their privacy on the web, it will be interesting to see what effects it has on the internet advertising industry. As we saw in our in-class experiment, this kind of rich data about user activity is extremely valuable to advertisers, and users opting out may make internet advertising a less lucrative game.
The idea is that within just a handful of visits to supposedly independent websites, it becomes quite clear that several advertisers have a very good idea of exactly how a given user interacts with the web. This is, of course, a manifestation of a couple of the universal properties of graph structure that we have discussed in class, in that the graph quickly converges to a giant strongly connected component with high connectivity. Mozilla goes one step further in the actual add-on by surrounding nodes that a user has explicitly requested with a white halo, and those that are "confirmed trackers by privacychoice.org" with a red one. Privacy advocates at Mozilla hope that the striking visualization will serve as a concerning wakeup call to web users heretofore unaware of the nature of tracking cookies. Empowering users to do something about it is mostly left for a later version. They state that future versions of the Collusion add-on will include the ability to block individual trackers as well as an opt-in to anonymously provide data to researchers and privacy advocates interested in investigating aggregate trends in internet tracking.
If this initiative, along with other Mozilla privacy initiatives like Do Not Track, is successful in encouraging a non-trivial portion of web users to take active control of their privacy on the web, it will be interesting to see what effects it has on the internet advertising industry. As we saw in our in-class experiment, this kind of rich data about user activity is extremely valuable to advertisers, and users opting out may make internet advertising a less lucrative game.
Subscribe to:
Posts (Atom)