Tuesday, March 13, 2012

John Nash Predicts Modern Computer Science

The Federal Government recently declassified an interesting letter sent to the NSA by none other than the legendary John Nash. In it, Nash predicts the future of cryptosystems and computational complexity with astonishing accuracy. He describes the framework of analyzing tasks in terms of the length of the input, and remarks that problems whose computation time scales exponentially with the input should be considered "hard", one of the foundations of modern complexity theory. In addition, he remarks on exploiting this hardness in cryptosystems, and argues that the security of a good system depends on breaking it being exponentially hard. He also included a novel encryption/decryption scheme of his own design. This letter anticipates complexity theory and cryptography by decades each, and is a testament to Nash's genius.

John Nash has been an extremely influential figure in the subject material of this course. He is probably most well-known for his formulation of Nash Equilibrium, a concept essential to modern game theory and conflict resolution. Cryptography is also essential to the internet and modern networks, since hiding and encrypting data is fundamental to internet security, preventing identity theft and other crimes of exploitation. The man was a genius far ahead of his time. Now who will step up and be the next to predict the future of computer science?

Thursday, March 8, 2012

The Game of Competition

For most of my life, there have been three big names in technology: Microsoft, Apple, and Google. Each has a different backstory, which most of us know quite well, but all three ended up with a monopoly on a product. Google had its search engine, Microsoft had the PC, and Apple had the iPhone. Each product has been updated over the years, but there was one element noticeably lacking to aid innovation: competition.

Fortunately for consumers, competition is now beginning to grow, and the spoils are innumerable. At the beginning of the twenty-first century, Apple's Mac finally began to click with consumers, challenging the PC market. Suddenly, crummy Windows releases such as Windows ME and Windows Vista weren't forced upon consumers, because there was an alternative. The innovation that Apple displayed forced Microsoft to change its strategy and actually make innovative improvements to its product. The jury is still out on how successful this will be, but Windows 7 and Windows 8 show significant promise. The story is the same for the iPhone (with Android as the competition) and Google (with Bing as the competition).

As these three companies continue to attack each other's dominant products, the products get better faster and the options available to the consumer increase dramatically. Each of these three companies continues to butt heads in the technology market, and it will be exciting to follow which reigns supreme in the end.


Last lecture

Thanks to everyone who came to the last lecture!  We had a nice discussion of the projects that you guys are considering for next term.  Lots of interesting ideas!  And then we did a little review quiz covering the big ideas from throughout the term.  You all did really well, so it seems like many of the topics we covered will stick with you for a little while...

As I said in class, I really hope that you all can fill out the TQFRs. I'm particularly interested in hearing comments about whether the class should be 9 or 12 units next year, and whether the blog posts were a valuable part of the course. 

Thanks for all the hard work everyone!

On data security and reliability at data centers

Cloud storage is becoming very popular these days. Several companies offer this service - Google Cloud Storage, Apple's iCloud, and Ubuntu One, to name a few. These service providers essentially maintain and operate huge data centers. Although Google lists and gives information about around 10 data centers on their website, it is estimated (the source dates to 2008) that Google has around 19 data centers in the US, around 12 in Europe, 3 in Asia, and one each in Russia and South America. Managing massive amounts of data poses several challenges. We had a brief tour of the idea of data centers in our last class. We saw that apart from the challenges of handling servers and data, there are a whole lot of other issues like power systems, cooling systems, etc.

The main aim of a data center is, of course, to host users' data reliably and securely.

Security Issues :
Users are concerned about the security of their data. Not only does the data need to be encrypted, but a lot of other security measures have to be taken as well:
1. Prevent unauthorized physical access.
2. Dispose of corrupted disks in a systematic and careful manner.
3. Prevent misuse of data by personnel within the data centers.
4. Keep a reliable security system in place in case of emergency, and so on.
This video shows the steps that Google has taken to ensure data security.

Reliability Issues :
Failures can happen for different reasons. Apart from possible disk errors, there can be physical damage due to wear and tear, fire, natural calamities, etc. This gives rise to the need for distributed storage of data (storing redundant data at different physical locations). Also, the system should immediately hand off users from a troubled data center to another data center with as little disruption as possible.

When a node fails, the data is reconstructed using the redundant sources. This involves downloading data from the other nodes holding redundant copies. Given that the data size is massive, a considerable amount of bandwidth is consumed. So, do we keep a huge number of redundant copies of all data and download all the lost data? Can we do better? There is a lot of work on distributed data storage proposing smarter ways of storing and recovering data, which can decrease both the amount of data stored and the download bandwidth needed for reconstruction.

The main idea is to encode the data and store pieces of it across different nodes in a redundant manner. For example, say 1 petabyte of data is encoded into n symbols and stored across M nodes. The coding is done in such a way that when a node fails, the replacement needs to connect to only d < n nodes to reconstruct the data.
(For more details: http://www.eecs.berkeley.edu/~rashmikv/research.html
http://scholar.google.com/scholar?hl=en&q=regenerative+codes+for+distributed+storage&btnG=Search&as_sdt=0,5&as_ylo=&as_vis=0 )
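To make the redundancy idea concrete, here is a toy sketch in Python (my own illustration, not an actual regenerating code): split the data into k chunks, store an XOR parity chunk on an extra node, and rebuild any single lost piece from the survivors. The real codes linked above are far more sophisticated, but the flavor is the same.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split data into k equal chunks (zero-padded) plus one XOR
    parity chunk; each of the k+1 pieces would live on its own node."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i*size:(i+1)*size].ljust(size, b'\x00') for i in range(k)]
    return chunks + [reduce(xor, chunks)]

def recover(nodes, lost):
    """Rebuild the piece at index `lost` by XOR-ing all the survivors
    (works because XOR-ing all k+1 pieces together gives zero)."""
    return reduce(xor, [c for i, c in enumerate(nodes) if i != lost])
```

With this scheme any one node can fail, but reconstruction still requires downloading from all k survivors; the regenerating codes above are precisely about reducing that repair bandwidth.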

Google gives a really nice overview of different aspects of a data center here.

Wednesday, March 7, 2012

Udacity, Coursera, MITx: Education for the Future

History of Internet Education

Education on the Internet is not a new concept. Even if we filter out all of the content websites that seek to inform audiences (e.g. Wikipedia, Quora, HowStuffWorks), the Internet was seen as an opportunity for extending traditional education beyond the walls of universities as early as 1993, when Jones International University was founded as the first online-only institute of higher education [1]. Many universities followed suit (DeVry, University of Phoenix, and ITT Tech Online, to name some of the more popular ones), enabling Internet users to earn degrees from home. These programs were not without drawbacks. In general, they lacked the prestige of their brick-and-mortar counterparts, and they were often just as expensive. In addition, they did not tend to expose students to the same collaborative environments as traditional universities.

In an attempt to knock down the cost barriers to education by using the power of the Internet, the University of Tübingen in Germany started a program in 2000 in which lectures were recorded and published online, free for all to access [2]. It was a daring move, since there were fears that losing the exclusivity of the university would devalue the degree program and drive students away. No such effect was observed, however, and just two years later, MIT followed suit with a similar program of its own [3]. This program, now known as OpenCourseWare (OCW), offered not just lectures but homework assignments, lecture notes, and in some cases even exams. It has enjoyed considerable popularity, with hundreds of universities and education centers contributing their own content to the OpenCourseWare Consortium [4]. Still, the OCW programs fall short in a couple of ways. There is no reliable way to gain any form of certification for completing work (important for people attempting to self-teach in order to move to better careers), and there is no synchronous "course offering" that drives students to collaborate and complete work in a timely manner.


Recently, a series of efforts from various places has been tackling the problems with OCW in a convincing manner. Last October, an effort from Stanford led by Sebastian Thrun and Peter Norvig brought a complete course on artificial intelligence to the Internet. The idea was not fundamentally different from OCW, but the manner in which the course was offered--following a schedule just like a traditional class, with lectures and homework posted gradually over a span of two months--proved hugely successful, attracting an astounding 160,000 enrollees [5]. The course acted as a proof of concept, showing that the model of bringing traditional courses to the Internet for free had potential. After this success, Thrun co-founded a new company, Udacity, to continue the push towards free, approachable education.

Udacity is not alone in this push towards popularizing massive open online courses. A second company (also born at Stanford) called Coursera has teamed up with the University of Michigan, Stanford, and UC Berkeley to offer a handful of courses in a similar fashion [6]. MIT has also thrown its hat into the ring with MITx, offering an introductory course on electronics [7]. All three of these efforts share the same major features: a synchronized learning environment with high-quality course material, free for anyone who can find access to the Internet.

There is still a long way to go before these programs can reasonably hope to compete with traditional universities as the new standard of education. Course offerings from all of these efforts skew rather narrowly towards computer science and technology, and it is unlikely that typical prospective employers would be eager to hire self-taught students over those with a degree from a known university on their resumes. Still, the trio of Udacity, Coursera, and MITx offers hope that we are making progress towards a society where higher education is just as accessible to any eager learner as the traditional K-12 system, and it is only possible due to the power of collaboration on the Internet.

(Is it sad that I hope to live to see a day where my Caltech degree becomes meaningless? :) )

  1. http://www.jiu.edu/about/history
  2. http://timms.uni-tuebingen.de/Info/Info.aspx (in German)
  3. http://ocw.mit.edu/about/our-history/
  4. http://www.ocwconsortium.org/en/aboutus/abouttheocwc
  5. http://www.udacity.com/us
  6. https://www.coursera.org/landing/hub.php
  7. http://mitx.mit.edu/

Advertising Edge with Social Media

As social media platforms like Facebook and Google+ start to take a growing cut of digital advertising dollars, an increasingly convincing draw for advertisers to choose Facebook and Google+ over conventional Internet advertising venues is the power of social connections. Both Facebook and Google+ have seen significant increases in clicks when advertisements had social annotations attached to them, like liking or +1'ing an advertisement. Clicking these annotations signals that specific users approve of these advertisements, and allows their connections to see this approval as well. To quantify some of these differences, Nielsen analyzed results from 79 Facebook/Google+ campaigns over the span of half a year to determine how well social messages succeeded in getting through to the audience.

The obvious advantage to advertising on Facebook/Google+ rather than on a website aligned with a specific audience's interests is the ability for people to recommend an advertisement by liking or +1'ing it, which allows their social connections on these sites to see it as well. On average, social advertisements generated a 55% increase in recall and a 5%-10% increase in clicks. Nielsen says this decided increase can be explained by consumers' attitudes. In surveys carried out in 2011, Nielsen found that 76 percent of U.S. Internet consumers said they most trusted recommendations from personal acquaintances, whereas only 49 percent trusted consumer opinions posted online. This not-so-small gap of 27 percentage points helps justify the difference in advertisement performance: it makes sense for consumers to trust recommendations from connections more than recommendations from random strangers.

What is interesting now is the ability of Facebook and Google+ to keep finding clever ways to push these numbers even higher. Google+ has allowed advertisements to be delivered to end users when something is searched in Google, and Facebook saw a 46% increase in clicks when it introduced the "sponsored stories" feature. Through both of these steps, we can see how social media is really taking a step forward in the advertising game. Through the mass data of social networking and social connections, advertising has found a new venue in Facebook and Google+ that shows higher returns than other online sites.


Tuesday, March 6, 2012

Facebook's Killer Feature, 'People You May Know'

Most, if not all, of the students in this class must know about one of Facebook's killer features, 'People You May Know.' This feature suggests a list of people to become friends with. I personally thought the feature was quite accurate, and wanted to know how Facebook determines which people to suggest. It was not hard to find a research article written by a member of Facebook describing the algorithm behind the scenes, which I share here. Please follow the attached link to find the original article: http://dl.acm.org/citation.cfm?id=1935914

Recommending a friend for a user had been studied even before Facebook's paper was published. One such example is “Make New Friends, but Keep the Old” – Recommending People on Social Networking Sites, which can be found at http://dl.acm.org/citation.cfm?id=1518735

We find that the methods used in both papers share two main themes: 1) recommendation based on personal information and interactions, such as the fields of interest in a user's profile, the types of postings, and the frequency of chat between the two users; and 2) recommendation based on the structure of the social network centered at the user. In Facebook's paper, these two notions are implemented together in a simple, beautiful way. The key idea is well explained at the top right of page 3 of the original paper, which I repeat here in my own words.

The algorithm is called ''Supervised Random Walks,'' and the basic idea is to construct a measure of closeness for each person who is not yet a friend by randomly walking the friendship network, starting from the person to whom Facebook will suggest friends. This process is the same as what we learned when calculating PageRank for a given web graph. Just as we found pages that were more important than others, this random walk tells us which people are more important to, or closer to, the origin user. On top of this idea, Facebook adds a more sophisticated touch: rather than assuming equal probabilities for all links going out of a node, the algorithm assigns a different weight to each connected link, and the weights are determined by the personal information and interactions of the two users. In practice, the function that produces the weights can be learned from historical data.
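A rough sketch of the walk itself in Python (the `weights` dictionary here stands in for the learned edge-strength function from the paper; everything else is my own simplification):

```python
import random
from collections import Counter

def suggest_friends(graph, weights, source, restart=0.15, steps=100_000, top=3):
    """Random walk with restart from `source`, counting visits.

    graph:   {node: [neighbors]} adjacency lists.
    weights: {(u, v): w} edge weights -- a stand-in for the learned
             edge-strength function (missing edges default to 1.0).
    Returns the `top` most-visited nodes that aren't already friends.
    """
    visits = Counter()
    node = source
    for _ in range(steps):
        # With probability `restart` (or at a dead end), jump back home.
        if random.random() < restart or not graph[node]:
            node = source
            continue
        nbrs = graph[node]
        ws = [weights.get((node, v), 1.0) for v in nbrs]
        node = random.choices(nbrs, weights=ws)[0]
        visits[node] += 1
    already = set(graph[source]) | {source}
    return [v for v, _ in visits.most_common() if v not in already][:top]
```

On a small graph where D is a friend-of-friend of A through two paths, D is visited far more often than more distant nodes, so D comes out on top, exactly the PageRank intuition from class.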


Don't forget the TQFRs

As you'll see on HW7, it's really important to me that you guys take the time to fill out the TQFR reports.  I really benefit from the feedback you give on the course and certainly will take it into account when I update the course next year.  So, please fill them out ... and please do more than just give ratings, also write comments (since those are the most informative things).

If you manage to beat the percentage of surveys filled out by last year's class, everyone will get 5 points, and if you manage to get 100% completion percentage I'll give everyone 10 extra credit points!

Final two lectures

Well, we finally got around to talking about data centers on Monday in class.  After giving an overview of what goes into a data center, we spent most of the time talking about 2 issues: energy usage and programming with MapReduce.  You'll get a lot of practice with MapReduce on the HW, so hopefully by the end of the week you'll be comfortable.
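For anyone who wants to play with the model before the HW, the whole map/shuffle/reduce structure can be mimicked in a few lines of plain Python (no framework and no distribution, just the shape of the computation, using word count as the classic example):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then sum each key's values -- the part a
    real MapReduce framework does for you across many machines."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["the cat sat", "the dog sat"])))
```

The point of the real thing, of course, is that the mappers and reducers run in parallel on thousands of machines while you only write these two little functions.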

And now we only have one lecture left.  I hope everyone can try to come.  We'll spend the time on two things: 1) letting you guys tell each other about the projects you're planning for 145, and 2) doing a little review of the material in the course.  I really hope people come for the review.  It's your chance to convince me that I'm making the right decision in not having a final! 

Monday, March 5, 2012

How Does Facebook Decide Which Ads to Show?

The clickmaniac project raised the question of how Facebook decides which ads to show when and to whom. Facebook is given a set of advertisers and their bids for various ads, each targeted towards a particular demographic. Facebook also knows each advertiser's daily budget. The best way to go about showing ads using this information is not clear. I attempted to do some research and see if I could provide some insight as to how Facebook does it.

First I found the site http://www.facebook.com/help/?faq=173594992698822, a page run by Facebook that provides some explanation of how their ad auctions work. The key features are as follows. As we talked about in class, they use some variant of a second-price auction, in which each advertiser pays the minimum amount they would have needed to bid in order to win their slot. However, they also take the ad's performance into account. It is not specified exactly how this is done, but they say that ads with a high clickthrough rate in the past are more likely to be shown again. This explains why a tactic I thought up for clickmaniac doesn't work. I tried bidding a very high price per click on a very unpopular ad targeted towards exactly the kind of people who would not want to click on it. The hope was that the high bid would cause it to "win the auction" and get shown the most; since no one would click on it, it wouldn't cost us anything. As expected, no one clicked on the ad, but it wasn't shown nearly as much as we had hoped, probably because Facebook noticed its low clickthrough rate. It of course makes sense for Facebook to favor ads with high clickthrough rates (at least when advertisers pay per click) in order to maximize profit.
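Here is a hypothetical sketch of how such a performance-weighted auction might work. Facebook's exact formula is not public, so this follows the generalized second-price style used in search advertising: rank ads by bid times estimated clickthrough rate, and charge each winner just enough to keep its slot. Note how a huge bid on a low-CTR ad, like my clickmaniac tactic, ends up ranked last.

```python
def run_auction(ads):
    """ads: list of (name, bid_per_click, estimated_ctr) tuples.

    Rank by bid * CTR (expected revenue per impression); each winner
    pays just enough to keep its slot: the next ad's score divided by
    the winner's own CTR.  Returns (name, price_per_click) in slot order.
    """
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    results = []
    for i, (name, bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, next_bid, next_ctr = ranked[i + 1]
            price = next_bid * next_ctr / ctr
        else:
            price = 0.0  # last slot: a reserve price would apply here
        results.append((name, round(price, 4)))
    return results

# "B" bids the most ($2.00) but has a terrible CTR, so it ranks last --
# exactly the fate of my high-bid, unclickable ad.
slots = run_auction([("A", 1.00, 0.05), ("B", 2.00, 0.01), ("C", 0.60, 0.04)])
```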

Another source I looked at is the article www.stanford.edu/~saberi/sara.pdf. It details a theoretical solution to the ad auction problem (in the context of Google) proposed by a group of computer science researchers. They drew an analogy between the ad auction problem and a problem called the online matching problem, which they had already studied. The online matching problem is as follows:

"A matchmaker and n boys are gathered in a room, in which n girls appear, one at a time. Each girl has a list of boys who are acceptable to her, which she reveals to the matchmaker as she appears. The matchmaker matches her to one of the boys on her list or, if they are all taken, leaves her unmatched. The matchmaker’s goal is to maximize the total number of matches."

In the analogy, the girls are the advertisers and the boys are the advertising slots. The difficult part of the problem is the incomplete information -- the fact that the matchmaker doesn't know who is on each girl's list until it is that girl's turn to be matched. Using knowledge of the optimal solution to the online matching problem (which is detailed in the article), the researchers were able to derive the optimal solution for their formulation of the ad auction problem. An interesting attribute of the solution is that it depends both on the advertisers' bids and on the fraction of each budget that is left: the algorithm favors displaying ads from advertisers who have a lot of budget remaining. I can think of one intuitive explanation for this. Each advertiser provides various ads that can be shown for various Google search queries. Having many diverse advertisers allows Google to show ads on almost any query. However, once an advertiser's budget is used up, Google can no longer make money on that advertiser's ads and has less flexibility. If too many advertisers run out of budget, Google may be left with no ads to show on a particular query. (This is analogous to leaving a girl unmatched in the online matching problem.) It is in Google's favor to resist depleting any advertiser's budget in order to keep its options open for continuing to solve the matching problem in the future.
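The budget-favoring behavior can be sketched with a bid-scaling rule of the kind used in this line of work: discount each advertiser's bid by a factor that shrinks as its budget is depleted. The 1 - e^(f-1) trade-off below is the classic choice from the online ad-allocation literature; I'm assuming it matches the article's, so treat this as an illustration rather than the authors' exact algorithm.

```python
import math

def choose_advertiser(bids, spent, budgets):
    """Pick who gets the next ad slot: scale each advertiser's bid by
    1 - e^(fraction_spent - 1), so a big bid from a nearly exhausted
    budget is discounted relative to the same bid from a fresh one.
    Returns None when every budget is exhausted.
    """
    def score(a):
        if spent[a] >= budgets[a]:
            return 0.0
        frac = spent[a] / budgets[a]
        return bids[a] * (1 - math.exp(frac - 1))
    best = max(bids, key=score)
    return best if score(best) > 0 else None
```

With this rule, an advertiser bidding slightly less but with most of its budget intact beats a higher bidder whose budget is nearly gone, which is exactly the "keep your options open" intuition above.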


Saturday, March 3, 2012

Internet Browser Shares

According to statistics compiled by the Web tracking firm Net Applications, Google's Chrome browser's share fell slightly for a second month.

Net Applications acknowledged that until recently it had been over-counting Chrome's share due to Chrome's pre-rendering function, implemented last August.  When users enter information in the address bar or the Google search bar Chrome will preemptively load some of the results; this caused Net Applications to over-estimate the number of sites visited by Chrome users.  In February, the pre-rendering "accounted for 4.3% of Chrome's daily unique visitors".  No other browser utilizes pre-rendering.

Additionally, Chrome's PageRank was still being penalized after a marketing campaign violated Google's own rules against paid links; this dropped the Chrome website from the second result on the first page of a search for "browser" to the sixth item on the fifth page.  It seems likely that Chrome will resume rising this month.

For comparison, Firefox has 20.92% of the browser share, Chrome 18.90%, Safari 5.24%, and the combined shares of Internet Explorer 6-9 is 52.84%.  These represent changes of -1.77%, +7.49%, +1.11%, and -6.38%, respectively.

Source: computerworld.com

Algorithms for Content Discovery

A thriving SNS such as Twitter suffers from two challenges related to information gathering: filtering and discovery. On the one hand, an active Twitter or Facebook user can easily receive hundreds of messages every day in her stream. Users therefore often express the desire to filter the stream down to those messages that are indeed of interest. On the other hand, many users also want to discover useful information outside their own streams, such as interesting URLs on Twitter posted by friends of friends.
When companies face these problems, there are a number of recommender designs they can try. Below is a list of such design decisions, which I will elaborate on further.
1. Candidate-Set: We can narrow the candidate set down to users in the immediate vicinity, for example followees-of-followees. Presumably, sampling tweets from people closer to us produces more relevant results than selecting candidates from a random sample of users. Another choice is to sample postings from popular users; this uses "social proof" to filter which tweets to promote.
2. Ranking-Topic: Using topic relevance is an established approach to computing recommendations. The topic interest of a user is modeled from text content the user has interacted with before, and candidate items are ranked by how well they match that interest. We can calculate topic relevance against the user, or against the users in the immediate vicinity, to rank tweets.
3. Ranking-Social: Here, we can use "social proof" to boost the postings. Each retweet or favorite carries a specific weight, and that tells how popular the tweet was.
These proposed approaches seem intuitive and effective. But just how effective will they be? If we had to pick one dimension, which would we concentrate on?
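As a toy sketch of how these signals might be blended (the function names, weights, and the word-count "topic model" here are my own simplifications, not the actual design of any of these systems):

```python
import math
from collections import Counter

def topic_relevance(profile_text, tweet_text):
    """Cosine similarity of word-count vectors -- a crude stand-in
    for a real topic model."""
    u = Counter(profile_text.lower().split())
    t = Counter(tweet_text.lower().split())
    dot = sum(u[w] * t[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in t.values()))
    return dot / norm if norm else 0.0

def rank_tweets(profile_text, tweets, w_topic=1.0, w_social=0.1):
    """tweets: list of (text, retweets, favorites) tuples.  Blend topic
    relevance with log-damped social proof; the weights are arbitrary
    illustration values."""
    def score(tw):
        text, rts, favs = tw
        social = math.log1p(rts + favs)  # damp runaway popularity
        return w_topic * topic_relevance(profile_text, text) + w_social * social
    return sorted(tweets, key=score, reverse=True)
```

Tuning the two weights is exactly the question the study below tries to answer empirically: how much should social proof count relative to topical fit?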
Jilin Chen, in 2011, carried out a comprehensive study of such recommender designs. There are twelve possible algorithms in total (2 candidate-set options × 2 ranking-topic options × 3 ranking-social options), and Chen ran a field experiment by having each algorithm independently recommend its five highest-ranked URLs. He combined and randomized these URLs, then had 44 volunteers vote on whether each URL was interesting or not.
The results are right below: it might be fun to take a guess at which of these three dimensions would be most effective, and which least effective.

[spoiler alert!]
Were you correct? Here is the bullet point analysis of this result:
1. "Social proof" is the absolute dominating dimension in producing interesting URLs.
2. Selecting tweets from popular users or near-vicinity doesn't have much effect.
3. Topic relevance was definitely important, and using just the topic relevance of the user fared better than also including the topic relevance of the followees.
However, it is important to note that social proof alone is not enough to create a thriving community. If it were, Reddit or Digg would be enough to satisfy all our needs. Chen mentions a need to balance serendipity and relevance. For example, if I write a lot about Caltech on Facebook, that doesn't mean I only want to hear about Caltech. Also, even if a posting is not popular, it could mean something to a minority or to immediate friends. As networks evolve, the challenge may be to give adequate attention to every piece of created content, and to encourage users to participate at their convenience.

Chen, J. 2011. Personalized recommendation in social network sites. http://purl.umn.edu/117321

Friday, March 2, 2012

Foursquare's New Recommendation Engine

In today’s world of mass internet use, information about users is worth more than ever. As scary and alarming as it may be, news of companies collecting user data to target ads and using social profiles to personalize search results is becoming the norm. While receiving targeted ads in return for Facebook posts and likes does not sound positive to most users, there are certainly benefits to be gained from web companies analyzing some data about you.
Take Foursquare, for example. This social service allows users to “check in” to places and publish that information to their social network. Users willingly share location information with their friends, and one can see a lot of potential in analyzing this data for useful purposes. In fact, Foursquare has recently announced that they are in the process of doing precisely that.
Their newly-updated Explore service intends to take users’ check-in data and provide personalized recommendations for places to go. The engine will use statistics such as the time of day, a user’s friends’ check-ins and likes, as well as the user’s history to choose places that it thinks the user will like. With over a billion check-ins providing plenty of data, this service has the potential to become a very powerful recommendation engine with accurate and useful suggestions.
While its implementation requires tracking users more than some would like, Explore is an undoubtedly useful feature and will benefit consumers a great deal. The CEO of Foursquare says he eventually wants recommendation dots on every virtual map, and with Foursquare's social integration and growing popularity, this could become a reality sooner than we think.
As in the example of Explore, analyzing user data can add tremendous value to society. However, the unauthorized sharing of this data with third-party advertisers will justifiably spark heated opposition, and the challenge for companies like Foursquare is to properly address these privacy concerns. In the new age of Internet companies playing the role of Big Brother, the quest for balancing creepiness and features will continue to dominate news headlines. The real question is not whether these companies will gather information, but how evil they will want to be with it.

(1) CNET article
(2) Foursquare blog post about Explore

Homework 7

The last HW is now posted.  At this point, I've lectured about everything you'll need for all the problems except for problem 4, which focuses on MapReduce.  We'll cover MapReduce in class on Monday...  Good luck!

Thursday, March 1, 2012

Foursquare’s Transition from GoogleMap to OpenStreetMap

Yesterday, Foursquare announced, on their official blog, their decision to embrace the OpenStreetMap movement. Instead of continuing to use the Google Maps API, Foursquare decided to partner with the startup MapBox, using MapBox Streets to power all of their maps, starting today. OpenStreetMap is an open, crowd-sourced global atlas. Because MapBox internally uses OpenStreetMap, Foursquare no longer relies on any proprietary map data.

Although Foursquare claimed they made this big technical transition because they wanted more design flexibility, wanted to support other startups, and wanted to use the Leaflet JavaScript library, many believe the ultimate reason behind the move is the Google Maps API's new pricing policy. The pricing may not seem very expensive, but for a startup with heavy map usage, such as Foursquare, Google Maps may be unaffordable. Thus, even though OpenStreetMap does not offer very good coverage in many places, especially relatively remote ones, Foursquare was in a sense forced to transition to open-source data.

To me, this transition means much more than just moving to more economically affordable data. It reflects an important recent change in many of the startups that use maps to provide location-related services. A few years ago, all the GPS device manufacturers still had to make their own maps, and even Internet startups like Yelp or Foursquare had to rely on map databases from big companies such as Google. However, we are seeing more and more new companies that use and contribute to open-source maps. Factual seeks to integrate available open-source data (mostly geographical information) in an easy-to-use and coherent fashion; Waze uses a crowd-sourced map and crowd-uploaded real-time traffic data to provide driving directions. None of these promising startups relies on a big company's databases. Although open-source maps currently do not have the coverage of Google Maps, they get better each day as more and more people contribute to them. Waze, for example, updates its map automatically based on uploaded driving data. Google may have cars that drive around the US very frequently to track down geographical changes, but if enough people use open, crowd-sourced maps, those maps can be updated even faster and more accurately than Google Maps. In a few years, we may see crowd-sourced maps gradually take over the Internet map market.

Clickmaniac results

As you know, clickmaniac ended yesterday. Here's the final standings for posterity! There were lots of changes in the final days and here's how it ended up...

1) Tobias, Gustaf, Shiyu 910,760
2) Hari, Michael B. 823,841
3) Kevin, Nathan, Mikhail, Stasja 668,576

1) Robert, Brock 1304
2) Kevin, Nathan, Mikhail, Stasja 1192
3) Brian M, Lucia, Tuling, Timothy 189

1) Kevin, Nathan, Mikhail, Stasja 1660
2) Robert, Brock 1128
3) Hari, Michael B. 1082

On Network Neutrality

One of the topics discussed in class is Network Neutrality (NN), a big buzzword in the recent history of the Internet. For those unfamiliar with it, this post is a brief introduction to the concept, along with its recent treatment in U.S. law.

To start, it is worth exploring the definition of the idea. As Columbia University professor Tim Wu puts it in his FAQ:
"Network neutrality is best defined as a network design principle. The idea is that a maximally useful public information network aspires to treat all content, sites, and platforms equally."
While this can be interpreted very broadly, the NN idea as applied to the Internet largely concerns ISPs' ability to control the flow of information through their networks. What would a violation of neutrality look like? Generally speaking, discriminating against some kinds of packets flowing through the net while treating other packets preferentially. One could imagine such control being used to improve quality of service, as ISPs could theoretically speed up access to important or popular parts of the Internet, but the same mechanism could also be used to discriminate against the free spread of information. That latter possibility is one of the main arguments for advocating NN: keeping the Internet a free arena for the spread of information (one can read about it in Tim Wu's paper). One of the founding ideas of the Internet was to provide an open web of information, and the principles of NN aspire to protect that idea.

Some Insight Into Intermediaries

Everybody knows that the Internet is an amazing invention. Through it, we can connect to friends, learn new things, be entertained, and do all sorts of other things without paying a dime. However, we have to remember that everything comes with a price. And I'm not just talking about ads; everybody expects to be assaulted by ads when going online. I'm talking about data mining.

Now, this is not necessarily a bad thing; in fact, in many cases it is a good thing. We get more personalized content, and publishers get more information, so they can make more money from their ads, so we can get more free content. To do this, however, our online information is passed around a wide variety of companies forming the online advertising pipeline: advertisers, intermediaries, and publishers. And although publishers and advertisers are the parts of this pipeline we actually see, the really interesting players are the intermediaries, the ones matching the right advertisers to the right publishers.

One of the most straightforward ways they do this is with demographic, geographic, and behavioral targeting. However, rather than just targeting through publishers as in traditional advertising, they can also target individuals. By tracking cookies and trading information with publishers, these intermediary companies can track your online presence, so even if you are currently looking up cute cat pictures on the Internet, you could still be getting ads for those heavy metal concert tickets you have been eyeing recently.

But that is just the beginning of what they can do. Say you run a website and want to target customers who are likely to buy again; how would you do this? One trick is to drop a cookie on every customer who visits your site. Then you send ads their way, see who comes back, and build a profile of who is likely to return. Rinse and repeat until the profile consists only of people likely to come back. Pretty cool, huh?
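The rinse-and-repeat idea above can be sketched in a few lines. This is a hypothetical illustration, not any real ad-network API: each visitor's cookie is modeled as a set of observed traits, and the profile is refined to the traits shared by everyone who came back.

```python
# Sketch of the retargeting loop: advertise, observe who returns,
# and keep only the traits common to the returning visitors.

def refine_profile(visitors, did_return):
    """Intersect the trait sets of visitors who came back."""
    returning = [v for v in visitors if did_return(v)]
    if not returning:
        return set()
    profile = set(returning[0])
    for v in returning[1:]:
        profile &= set(v)
    return profile

# Each visitor is a set of traits read from their cookie (illustrative).
visitors = [
    {"urban", "age-25-34", "cat-videos"},
    {"urban", "age-25-34", "metal-concerts"},
    {"rural", "age-45-54", "gardening"},
]
came_back = lambda v: "urban" in v  # stand-in for observed return visits
print(refine_profile(visitors, came_back))  # traits shared by returners
```

Iterating this on successive ad campaigns narrows the profile toward exactly the audience that keeps coming back.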

However, intermediaries don't just match advertisers with publishers; some also do quality control, making sure both advertisers and publishers get their money's worth. By tracking which users get which ads and how they respond, and collating that information across publishers, even third-party entities can track ad value.

The combination of these techniques, and many more, shows just how complicated yet amazing this whole industry is. The fact that so many companies, each with limited information, form a distributed market that keeps the whole Internet going and profitable is, quite frankly, mind-boggling.


Mozilla Fights Online Tracking with "Collusion"

The Mozilla Foundation announced yesterday the release of a new Firefox add-on intended to help users understand how tracking cookies can be used by advertisers and others to assemble a nearly comprehensive picture of users' activity online. Collusion is the latest tool released in Mozilla's ongoing effort to empower users to have more control over how they interact with the web. Its current implementation is focused on a graph representation of the flow of cookies, where nodes represent unique domains and a directed edge means that a page on the first domain made a request to the second. A demo of this visualization in action can be found on the above-linked Collusion page.

The idea is that within just a handful of visits to supposedly independent websites, it becomes quite clear that several advertisers have a very good idea of exactly how a given user interacts with the web. This is, of course, a manifestation of a couple of the universal properties of graph structure that we have discussed in class, in that the graph quickly converges to a giant strongly connected component with high connectivity. Mozilla goes one step further in the actual add-on by surrounding nodes that a user has explicitly requested with a white halo, and those that are "confirmed trackers by privacychoice.org" with a red one. Privacy advocates at Mozilla hope that the striking visualization will serve as a concerning wakeup call to web users heretofore unaware of the nature of tracking cookies. Empowering users to do something about it is mostly left for a later version. They state that future versions of the Collusion add-on will include the ability to block individual trackers as well as an opt-in to anonymously provide data to researchers and privacy advocates interested in investigating aggregate trends in internet tracking.
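The graph Collusion draws can be sketched with a simple data structure. This is a minimal illustration of the idea, not Mozilla's actual code: we record which domains each visited site makes requests to, and any domain contacted from multiple independent sites is a likely cross-site tracker (the domains below are made up).

```python
# Nodes are domains; a directed edge (a, b) records that a page on
# domain a triggered a request to domain b.
from collections import defaultdict

class TrackingGraph:
    def __init__(self):
        self.edges = defaultdict(set)

    def record_request(self, visited_domain, requested_domain):
        if visited_domain != requested_domain:
            self.edges[visited_domain].add(requested_domain)

    def third_parties(self):
        """Domains contacted from more than one visited site --
        the likely cross-site trackers."""
        seen_from = defaultdict(set)
        for site, targets in self.edges.items():
            for target in targets:
                seen_from[target].add(site)
        return {t for t, sites in seen_from.items() if len(sites) > 1}

g = TrackingGraph()
g.record_request("news.example", "ads.tracker.example")
g.record_request("shop.example", "ads.tracker.example")
g.record_request("shop.example", "cdn.shop.example")
print(g.third_parties())  # the tracker seen across independent sites
```

Even with only two visited sites, the shared advertiser shows up immediately, which is exactly why the Collusion graph fills in so quickly.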

If this initiative, along with other Mozilla privacy initiatives like Do Not Track, is successful in encouraging a non-trivial portion of web users to take active control of their privacy on the web, it will be interesting to see what effects it has on the internet advertising industry. As we saw in our in-class experiment, this kind of rich data about user activity is extremely valuable to advertisers, and users opting out may make internet advertising a less lucrative game.

Wednesday, February 29, 2012

End of clickmaniac

Clickmaniac is quickly winding down....  Remember that we'll start class today with a viewing & discussion of the results.  So, please make sure someone from your teams is present to be able to comment on what worked and what didn't...

Tuesday, February 28, 2012

The Golden Ratio φ and the Fibonacci Numbers

While mathematicians can be proud of π (or rather τ, because π is wrong), and physicists can claim the inverse fine structure constant α⁻¹ (≈ 137), I would like to claim φ for the computer scientists.
Being the golden ratio and all, it's pretty awesome. As Vi Hart explains in her video, the angles between leaves on a plant, the number of petals on a flower, and the number of spirals on a pinecone all follow the tenets of "the most irrational number" φ, which is the limit of the ratios of adjacent Fibonacci numbers. φ is also the only positive number for which φ² = φ + 1, or rearranged, φ² - φ - 1 = 0, which we can solve using the quadratic formula. We get φ = (1 + √5)/2 ≈ 1.618.
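Both facts above are easy to check numerically: solving φ² - φ - 1 = 0 with the quadratic formula gives φ = (1 + √5)/2, and the ratios of adjacent Fibonacci numbers home in on that value. A quick sketch:

```python
# The ratio of adjacent Fibonacci numbers converges to the golden
# ratio phi = (1 + sqrt(5)) / 2, the positive root of x^2 - x - 1 = 0.
from math import sqrt

phi = (1 + sqrt(5)) / 2

def fib_ratio(n):
    """Ratio F(n+2) / F(n+1) after n steps of the recurrence."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return b / a

for n in (5, 10, 20):
    print(n, fib_ratio(n), abs(fib_ratio(n) - phi))
```

The error shrinks geometrically, which is one way of seeing why φ is "the most irrational number": its rational approximations (the Fibonacci ratios) converge as slowly as possible among continued fractions.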
Plants aren't the only organisms to use Fibonacci numbers. In colonies of bees, drones have 1 parent (the queen), whereas queens have 2 (the queen and a drone). So a drone's family tree has 1 parent, 2 grandparents, 3 great-grandparents, and so on. That's pretty trivial, but how would you explain that a DNA molecule is approximately 34 angstroms long by 21 angstroms wide? Or that the ratio of lengths of the two subdivisions of the bronchi is 1:1.618? If you want something a little more tangible, the proportions of the human body are all in line with the golden ratio. From the organization of our face (i.e. the distances between our eyes and chin) to the length of our limbs, to the perfect human smile, the golden ratio plays a well-noted role in our body's measurements.

Perhaps it shouldn't be surprising, then, that there is a theory that the universe is shaped as a dodecahedron, which is in turn based on φ. Or that even in the nanoscale world, the frequencies of magnetic resonances occur in the ratio of φ. The "magic" of this number hasn't escaped human attention. Leonardo da Vinci, Michelangelo, and Raphael all used the golden ratio in their work. (I don't know about Donatello.) Dimensions in architecture from the Pyramids to the Parthenon to the U.N. building have all been inspired by φ.
Within optimization, the golden section search uses φ to find the extremum of unimodal functions; within financial markets, φ is used in trading algorithms. In my humble opinion, for all its glory I'm still somewhat skeptical of φ's "magic," because there may simply be an underlying force that results in φ (as Vi Hart considers in her third Plant video). But the golden ratio will certainly see much more use in human civilization because of its sheer usefulness. For more on the ubiquity and beauty of φ, I strongly suggest the following video:
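The golden section search mentioned above is short enough to sketch in full. It brackets the minimum of a unimodal function and shrinks the bracket by a factor of 1/φ per iteration; the example function and interval below are just for illustration.

```python
# Golden section search: narrow a bracket [lo, hi] around the minimum
# of a unimodal function f, shrinking it by 1/phi each iteration.
from math import sqrt

INV_PHI = (sqrt(5) - 1) / 2  # 1/phi, about 0.618

def golden_section_min(f, lo, hi, tol=1e-8):
    c = hi - (hi - lo) * INV_PHI  # interior probe points; c < d always
    d = lo + (hi - lo) * INV_PHI
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
        c = hi - (hi - lo) * INV_PHI
        d = lo + (hi - lo) * INV_PHI
    return (lo + hi) / 2

# Minimum of (x - 2)^2 is at x = 2.
print(golden_section_min(lambda x: (x - 2) ** 2, 0.0, 5.0))
```

The choice of 1/φ is what makes the method efficient: because 1 - 1/φ = 1/φ², one of the two probe points from each iteration lands exactly where a probe point is needed in the next, so a caching implementation needs only one new function evaluation per step.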
"This pattern is not just useful, not just beautiful, it's inevitable. This is why science and mathematics are so much fun! You discover things that seem impossible to be true and then get to figure out why it's impossible for them not to be." - Vi Hart
Oracle ThinkQuest's The Beauty of the Golden Ratio