Tuesday, January 31, 2012

The Staggering Numbers of the SOPA and PIPA Blackouts

In the aftermath of the postponement of the SOPA and PIPA legislation, it is hard to overstate the impact of the January 18th protest blackout. In possibly the biggest online protest ever, a staggering 115,000 websites blacked out logos, individual web pages, or even every page on their site to show what the web would be like after SOPA and PIPA [1]. Among these sites were giants such as Google, Twitter, and reddit, with Wikipedia taking it to the extreme by blacking out every page and replacing them with links to contact Senators.

The protest was amazingly effective: the 80 supporters and 31 opponents in Congress before the protest turned into 65 supporters and 101 opponents the following day [2]. With this massive shift in support, the bills' backers had no choice but to postpone them a mere three days after the protest [3].

However, what is truly mind-boggling is not just the effectiveness of the protest, but its sheer reach and the heavy tails of the web. 2.4 million tweets were sent in protest in just 16 hours, 4.5 million people signed a Google petition against SOPA and PIPA, 14 million people contacted lawmakers protesting the bills, and a staggering 160 million people, more than half the population of the United States, saw the blackout on Wikipedia, with 8 million of them contacting their representatives [4][5].

The fact that just a few websites can have such influence by reaching out to the tens of millions of concerned people is a true testament to the connectedness of the web.

Universality and Critical Phenomena

So far in class, we have learned about the universal properties of networks found in nature. Our approach has basically been to build simple, convincing models and to derive the expected properties from them. This approach is good in the sense that we can check that the properties emerge quite "naturally" from even the simplest of models, as we have seen in many of the homework problems. However, it does not really explain the key factor that gives rise to the interesting features of our networks.

As a senior physics major, I wanted to introduce the physicists' approach to explaining the related notions of universality and critical phenomena. These topics are studied in condensed matter theory. From a condensed matter physics perspective, many interesting features of complex systems arise from the sheer size of the system. (Remember? We were always taking the limit where n goes to infinity.) One important point of this field of study (and of problems found in web networks) is that we are dealing with systems of particles (nodes, pages, etc.) that are too large for us to have detailed information about every individual interaction (link). So, physicists begin with simple, well-known rules, called a Hamiltonian, governing the relationships (links) among a few particles (nodes) and then scale the system up to a large number of particles. While scaling up, physicists assume that some key features of a large system will not be affected by scaling up to an even larger system. For example, a hand-sized magnet and the Earth are both already large enough from the perspective of atoms. Physicists extract equations from this "scaling invariance" and use them to explain interesting features of the systems. This technique is called the Renormalization Group. (Note that in a technical sense, physicists do the inverse of my description: start with a large system, and then scale it down to a smaller one with the same properties.)

This seemingly abstract idea turns out to be very powerful in explaining many cool physical phenomena, such as superconductivity and phase transitions, and these concepts appear in nature repeatedly. Some of them are described in accessible terms in the article linked at the end of this post. Here, I address one intuitive example that is most closely related to the graph theory we learned last week.

Consider a large number of small magnets. A pair of magnets is either correlated or not with some probability p, which depends only on the temperature of the system. Here, "correlated" means that the two magnets are oriented in the same direction, and "not correlated" means their orientations are random. Suppose that a theoretical analysis of a simple system containing only two magnets tells us that this probability is a decreasing function of temperature. We can consider each magnet as a node in a graph and each correlation as an edge that appears with probability p. At very high temperature, p goes to zero, and from the theory of random graphs we know we end up with all isolated nodes. This implies that all magnets are randomly oriented, producing zero net magnetic field. As the temperature goes down, the system reaches a critical point where the graph begins to form many clusters, which means small groups of magnets begin to align parallel to each other. Finally, below a certain temperature T_c, where the probability p rises above log(n)/n, we see one giant connected graph: all magnets are oriented in the same direction. These "phase transitions" are exactly what we find when cooling down a ferromagnetic material: no net magnetization, fragmented local magnetization, and finally the formation of one big magnet.
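To make the analogy concrete, here is a small Python sketch (my own addition, not from the linked article; the sizes and probabilities are arbitrary) that builds an Erdos-Renyi random graph G(n, p) for a few values of p and reports the fraction of "magnets" sitting in the largest connected component. The jump from isolated nodes to a single giant, fully connected cluster as p grows past roughly 1/n and then log(n)/n is the graph-theoretic counterpart of the phase transitions described above.

    import math
    import random

    def largest_component_fraction(n, p):
        """Fraction of nodes in the largest connected component of G(n, p)."""
        # Build adjacency lists by flipping an independent coin for each pair.
        adj = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if random.random() < p:
                    adj[i].append(j)
                    adj[j].append(i)
        # Measure component sizes with an iterative depth-first search.
        seen = [False] * n
        best = 0
        for start in range(n):
            if seen[start]:
                continue
            stack, size = [start], 0
            seen[start] = True
            while stack:
                u = stack.pop()
                size += 1
                for v in adj[u]:
                    if not seen[v]:
                        seen[v] = True
                        stack.append(v)
            best = max(best, size)
        return best / n

    n = 2000
    for c in [0.5, 1.0, 2.0, math.log(n)]:   # p = c/n
        frac = largest_component_fraction(n, c / n)
        print(f"p = {c:5.2f}/n -> largest cluster holds {frac:.1%} of the magnets")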

The insight we can take from this example, and from many more in the reference source, is that some of the universal features of large networks originate not only from their particular graph structures but also from the sheer size of the systems.

Reference Source:
http://www.math.ubc.ca/~slade/pcm0007_final.pdf

Monday, January 30, 2012

Twitter’s Effects on Sports Journalism

Social networking has made it easier for everyone to stay connected and follow all the latest celebrity events. Another development is how news about politics, sports, and celebrities is being broken. If you flip on CNN, ESPN, or another news outlet, there is a high chance you will see them flash a tweet as a new story. This is becoming a new way for stories to break, especially in the sports and entertainment industries, and it is all coming at the expense of journalists and beat writers across the country.

Traditionally, people got their news by watching television or reading the newspaper. By the time it hits print, you are hours behind everyone else. Athletes themselves are now bypassing the middleman and talking straight to their fans [1]. Who needs a writer with "inside" sources when you can get the actual story from the source involved?

The biggest downside to Twitter journalism is the validity of the reports. Earlier this year, actor Rob Lowe [2], from "Parks and Recreation", reported via Twitter that Peyton Manning was retiring. Trying to get a jump on the story, many respected news outlets ran it, despite the only source being a tweet from a "friend" of someone involved. After strong denials from Manning and the team, it appears to have been a false rumor.

From new hires, to trades, to athletes' opinions, all of this breaks first on Twitter. This is fine when the player involved is the one tweeting; it is when the tweet is second-hand that Twitter information gets dicey. It no longer takes a connected journalist prying from source to source. All it takes is one person following an athlete involved, and the information is sent right to your phone or laptop.

Lecture 7

(I'll be busy after class, so I'm posting this now...)

Today we made our first venture into "exploiting" network structure instead of just trying to understand it.  Our first example of a case where using network structure makes a huge difference is "search" -- the big idea of Google back in 1998 was to use network structure to identify important pages and feed this information into its search engine.  In particular, they defined a notion called "PageRank", which we will explore over these two classes. Be sure to take a look at the original papers by Brin & Page on Google... they are quite interesting to read now in the context of what Google has become.
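If you want to play with the idea before diving into the papers, here is a minimal, hedged sketch of the textbook PageRank recursion (power iteration with a damping factor). It is emphatically not Google's production algorithm; the function name, damping value, and toy graph below are just illustrative choices.

    import numpy as np

    def pagerank(adj, damping=0.85, tol=1e-10):
        """adj[i] is the list of pages that page i links to."""
        n = len(adj)
        rank = np.full(n, 1.0 / n)
        while True:
            new = np.full(n, (1.0 - damping) / n)
            for i, out_links in enumerate(adj):
                if out_links:                     # spread rank over the out-links
                    new[out_links] += damping * rank[i] / len(out_links)
                else:                             # dangling page: spread everywhere
                    new += damping * rank[i] / n
            if np.abs(new - rank).sum() < tol:
                return new
            rank = new

    # Tiny example: pages 0 and 1 link to each other, page 2 links only to page 0.
    print(pagerank([[1], [0], [0]]))   # page 0 ends up with the most rank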

Note -- understanding pagerank and the other aspects of search engines that we discuss will be very important for you in "Rankmaniac Reloaded" on HW 4!

Office hours

Since nothing is due this week, we'll only have one evening of office hours... It'll be Thursday 7-9 in Ann 107.  You can come and bounce your Rankmaniac ideas off the TAs, or just start the theory part early.

Homework 4

Homework 4 was released today and is now available on the course web page.  So begins Rankmaniac 2012...  I look forward to seeing all the crazy things you do.  (Remember not to get me, or Caltech, in too much trouble though!)

I will be willing to post links to people's pages from the course website if so desired.  There is a limit of one page per group that I will link to, though. Post the link as a comment here if you'd like me to add it to the main page.

Good luck!

Facebook, the World's Largest Photo Library


The value of the Internet lies in the information that can be found through it, and as everyone who has gone online can attest, the Internet is a seemingly infinite fount of information.  When using our favorite websites it is easy to take this wealth of information for granted, but the sheer amount of data these sites have to process and store is unbelievably massive.
To get a better understanding of the scale of the data backing our favorite sites, we can look at the ubiquitous Facebook photo.  On Facebook, more than 250 million pictures are uploaded per day [1], and there are approximately 140 billion photos stored there in total.  To put that number into a quasi-comprehensible form, that is 10,000 times the number of photos in all of the books in the Library of Congress, and 4% of all photographs ever taken since the dawn of photography [2]. And while we lazily flip through our friends' Facebook photos, their datacenters are serving well over 600,000 photos a second [3].
These are very big numbers, and the datacenters that store this data are correspondingly gigantic.  Facebook's Prineville datacenter, for example, is 307,000 square feet and holds 60,000 servers.  A quick look at Wolfram Alpha will tell you that that is almost 5.5 football fields' worth of data storage capacity.  That's pretty ridiculous, and it is also the reason that America's largest photo library has evolved from this:
[image]
To this:
[image]
So the next time you are looking at pictures on Facebook and complaining about how slowly they load (as they have been recently), take a moment to think about the sheer amount of data coursing through the system and the humongous datacenters working tirelessly to provide you with something approximating a seamless user experience.
Sources:
1.  http://www.facebook.com/press/info.php?statistics
2. http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/
3. http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/



Entering the Cloud

You and your friends have recently graduated from college and have a great idea for a start-up. Excited about a potential rags-to-riches scenario, you start to think about what you need. You probably cannot run everything from your desktop computer, so you are going to need the backbone of all networking startups: a datacenter. This datacenter will provide your office space, power, storage, servers, etc. To keep it running, you will need a stack of the most up-to-date software plus an IT team to take care of it. By the time you have finished setting everything up, you have already invested large amounts of money without ever touching your start-up idea. Enter cloud computing.

Cloud computing is another way of running a business. Instead of running your applications yourself, your business will be running on a shared datacenter. Think of cloud computing as a model where anyone anywhere has on-demand network access to a shared pool of computing resources; these resources can be anything ranging from servers to applications to services. Chances are that you have already used cloud computing: Gmail. When you try to access your email, you do not need a server or database to store your emails nor do you need an IT team. Google takes care of all these issues while users only need a username and password to instantly access their emails.

So why cloud computing? One of its biggest strengths is scalability; as companies such as Reddit rapidly grow in size, borrowing dynamic resources from the cloud is much easier than physically buying more infrastructure. Unlike the old days, a business can be upgraded and up and running in a few days. Likewise, costs are lower because companies do not need to pay for the facilities or the people to maintain their virtual servers. Instead, they pay only for what they use (think of a water utility).

As cloud computing becomes more reliable and feasible, we are starting to see a trend from the traditional network to a “cloud” network. This new network can grow and shrink based on consumption needs and accommodate rapidly changing businesses. Sharing information technology is already well-established in society, especially among research institutes, and many feel this could take that level of collaboration to the next level.

Already, companies such as Microsoft and Amazon have devoted mass datacenters to provide commercial cloud services to organizations. As more and more businesses depend on the same cloud services, our network will become more and more closely connected.


Human Computation - ReCAPTCHA and Duolingo


CAPTCHA is an internet safeguard against automated form-filling programs. Created by Luis Von Ahn, an associate professor at Carnegie Mellon University, CAPTCHAs consist of randomly generated character images that only a human user can decipher in order to submit online forms. With 200 million CAPTCHAs used every day, Professor Ahn realized that he could use this large amount of “human labor” to solve large-scale problems. As a result, ReCAPTCHA was invented to help digitize books by having one of the random character images be a scanned word from an actual book. This showed Ahn that the internet is a very powerful tool to coordinate people’s minimal contributions for surprisingly useful goals. This led him to the idea of using millions of people to translate the web for free. Instead of paying for language software, people can learn foreign languages and practice translating real web content on his free website, Duolingo.com. Combining multiple unprofessional user translations actually results in an accurate translation of websites, which can speed up the process of translating and spreading information worldwide. This is a win-win strategy and this network of internet users is key towards the project’s success.

This innovative project is a great example of network economics, a topic that is going to be covered in class. Wikipedia defines network economics as "business economics that benefit from network effects". In other words, it's a business model where the number of users affects how much value a product has for others. Duolingo.com uses this network effect and appeals to two types of users. One type is language learners who want to learn a foreign language for free. The other is the worldwide readers and businesses that would like to have website content translated into multiple languages. As more Duolingo users participate, more of the internet can be translated accurately, economically, and efficiently.  This helps build and strengthen the program toward its overall goal of increased international accessibility and connectivity over the web. This is a very interesting and useful business model, and the internet serves as a vital platform for it. The internet provides a fast way to access a variety of information and is a powerful tool for combining user input with computer learning. This project is a great example of how the internet can help us create useful economic cycles and generate benefits for society.


Sunday, January 29, 2012

The Social Network's Big Impact on Political Changes


Change is happening in the Middle East. In what is known as the “Arab Spring”, people all over the Arab world are holding demonstrations to protest government corruption and police brutality. One of the main driving forces behind this era of revolution is social media and its networks. From the news articles, I’ve seen that the start of the Arab Spring can be traced to one act of protest.

About a year ago, 26-year-old Mohamed Bouazizi was selling fruits and vegetables in a small rural city in Tunisia when a policewoman asked him to hand over his cart. Bouazizi didn't have a permit to sell the goods, but as the sole source of income for his widowed mother and six siblings, he didn't have a choice. Naturally, Bouazizi refused to give his cart to the policewoman. As a result, she publicly humiliated him by slapping him. Angered by the humiliation, Bouazizi went to the front of a government building and set himself on fire. Bouazizi's act of opposition was immediately broadcast through the city and sparked more protests. The protests spread through Tunisia and set off more protests across the Middle East, and this became the Arab Spring.

It is amazing that such a giant revolution was sparked by one desperate act in a rural town. This is possible because social media networks have allowed information to spread easily. The accessibility of communication methods has decreased the number of isolated societal groups, creating a large network where all individuals have access to up-to-date news and media. Through the strongly connected component that is the social media network, news can pass from a small town all the way to several countries.  As stated in one article, "communication methods such as satellite TV, the Internet, and mobile phones have stimulated change by providing the Middle East and North Africa region with unparalleled channels of discourse".

In the past, information was mainly shared by newspapers or news stations, and if a journalist or news reporter said something that was not allowed by the government, they were simply fired. Now, with mobile phones, blogs, and the like, government-unapproved information can be shared quickly and easily, and opposition parties can be efficiently organized without the government noticing.

In the case of the Arab Spring, “technology acted as an accelerant that allowed images to spread and fill the gap that mainstream media had left behind”.  For instance, 90% of all individuals in the countries affected by the Arab Spring have access to satellite TV, which played a huge role in broadcasting police brutality. Such a strongly connected network can spread information to even the most isolated groups. With the new knowledge and awareness, people can form their own opinions and act on them.

In the past few years, social media networks have grown tremendously, reaching even remote cities. From the Arab Spring, we can see that this far reach of information, and the resulting decrease in isolated social groups, can spur tremendous political change.

Saturday, January 28, 2012

Twitter’s New Policy and Internet Censorship


This Thursday, Twitter announced its new policy via an official blog post titled "Tweets still must flow," stating that Twitter will give itself "the ability to reactively withhold content from users in a specific country — while keeping it available in the rest of the world." Although Twitter chose a relatively uncontroversial example – pro-Nazi content has to be banned in France and Germany for historical reasons – the blog post sparked huge controversy among the Internet community. A large number of users organized a service boycott today via tweets under the hashtag #TwitterBlackout.

A question that naturally arises is whether it is justified to censor certain tweets. On the one hand, deleting tweets that contain links for a pirated movie after receiving complaints from Hollywood is required by the Digital Millennium Copyright Act and is generally accepted by the Internet community; on the other hand, even being able to remove pro-Nazi tweets can have dire consequences, because it is very difficult to define which tweets are pro-Nazi and which tweets deserve censorship, as shown in this article.

Although this new policy may seem sudden, Twitter actually admitted to removing "illegal" tweets a year ago. Just like many other companies such as Google and Facebook, Twitter faces the dilemma of whether to censor certain content or risk being blocked from a country completely. Both Facebook and Twitter are currently blocked in countries such as Iran and China because of their reluctance to comply with censorship, and Google quit China in 2010 for similar reasons. Facebook and Google have certainly made their positions clear, but Twitter's policy change prompts us to reconsider this dilemma: censor, or risk removal?

Of course, a free social network is so powerful and useful precisely because of its small diameter and high connectivity – a message advocating social change can propagate through the whole network within just a few retweets. One can argue that revolutions such as the Arab Spring might never have happened without an uncensored Twitter, but some people have suggested otherwise from an interesting perspective: Twitter only censors tweets in certain countries, and those tweets remain available globally. Thus, if an activist tweets against an oppressive regime, his or her tweet is still globally visible and can at least reach people outside the country. This is far better than having the whole Twitter website blocked by the regime, since people under the regime still get to use the technology and can turn to it when big social changes do occur, not to mention the fact that Twitter is actually teaching everyone how to get around the restrictions.

There is certainly a tradeoff between trying to keep the website totally free and trying to reach as many people in the world as possible. It will be interesting, a few years from now, to see the different impacts of the different policies of Twitter, Facebook, and Google.

The Effect of Network Effects

A key principle behind network economics is network effects: “for some kinds of decisions, you incur an explicit benefit when you align your behavior with the behavior of others.” (E&K, p. 449)

I always thought these network effects were of the utmost importance. For example, we see Facebook, which people use because everyone else is using it, and it thus becomes an outstanding way to stay in contact with friends. We also see LinkedIn, "The World's Largest Professional Network," where the more people who join, the more important it is for others to join as well, so they can be connected to other professionals (currently the number of LinkedIn users is 135 million (Wiki), which I think is rather amazing for professional networking online). (By the way, I am not on LinkedIn or Facebook!) However, I came across an interesting article by Pascal-Emmanuel Gobry that got me thinking: "How Strong Are Network Effects Online, Really?"

In his article "How Strong Are Network Effects Online, Really?", Gobry states that "one way that network effects can be defeated is through ... 'verticalization.'" By "verticalization" Gobry means that services will build "niches" in specific verticals. Take, for example, Facebook, which perhaps reaps the most benefit from network effects. Although no one has supplanted Facebook with a new full-blown social networking service, "plenty of apps are taking specific use-cases of Facebook and turning them into full-blown services." Gobry gives the examples of Twitter, which made the "status update" feature of Facebook into its own service, and Instagram, which is a way to instantly upload photos (it is based on the photo-uploading feature of Facebook). Although Facebook is still going strong, it is undeniable that more and more people are using Twitter and Instagram, and this could potentially detract from the number of Facebook users (according to Gobry). From my personal experience, I have seen many people beginning to use Twitter and Instagram (it is true that the market for these services is growing), but they are still on Facebook as well ... I just cannot believe that these services will supplant Facebook! In his defense, Gobry does not believe "that Facebook is going to crash tomorrow or that Twitter will 'kill Facebook' or any of that crap. The point is to say that online network effects are probably overhyped."

Gobry also argues that "online network effects are strong barriers to entry to FRONTAL competition but not to LATERAL competition." By this he means, using Facebook as an example, that if Facebook in its early development had focused its attention on MySpace users (a frontal attack), it would never have taken off. "Instead Facebook targeted a population that was less into MySpace, attacking laterally--and won." The point of this analysis is that "Facebook won through superior execution and lateral attack at least as much as network effects."

Personally, I do not believe network effects will be defeated by "verticalization." Gobry does not say this will happen, but he does believe it is a possibility. In fact, I believe services reliant on network effects and services reliant on "verticalization" will co-exist, i.e., Facebook will not suffer as a result of Twitter/Instagram and vice versa. People will use Twitter and Instagram for specific functionalities, but they will still need Facebook to stay in touch with all their old friends; I believe that Twitter and Instagram are not in competition with Facebook but are essentially "separate markets."

However, I do agree that "superior execution" is just as important as network effects. Had Facebook directed its attention toward music groups (apparently guitar players and band members were huge fans of MySpace), I do not think it would have become as big as it is today, whereas by initially marketing to people who were not ardent users of MySpace, it really "took off."

Anyways, what do you guys think?

“How Strong Are Network Effects Online, REALLY?” by Pascal-Emmanuel Gobry

The Internet, News, and Policy


This is an exciting time to be living in, with the rapid growth of the Internet forecasting not only a distinct change in how we communicate with each other, but also an increased ability for all individuals to change the course of the future. Up until now, most people used only their televisions and radios to receive their news and thus shape their opinions. However, an inherent issue with these sources of news is that they are linear and not interactive. The content of news programs is decided on by only a few individuals for consumption by the public, and an inherent side effect is that many people will form their opinions on issues based on how the information was presented to them. Therefore, those who are in a position to decide what news to show, and how to show it, have a decided advantage in shaping future elections and policies.

As more and more people shift away from mainstream media to the Internet, more and more power returns to individuals to make informed decisions. Because of the few degrees that separate most information on the Internet, most knowledge about current events is literally a few clicks away, and because everyone can have a say on the Internet, all sides of every issue can be examined and discussed. Misinformation and propaganda will be reduced, as people will be able to easily fact-check using different sources. News will shift from what is essentially an oligopoly of several large news channels and radio stations to a free market of individuals. Right now we are at a tipping point, as this new generation of Internet users is balanced by an older generation that still relies on television and radio. The timing of recent bills such as PIPA and SOPA, which try to continue the process of increasing the government's control of the Internet, is perhaps unsurprising. The Internet in the United States is currently a remarkable entity: it is a truly open forum for the exchange of information and ideas, and it is the last medium left that allows each of us to exercise freedom of speech in an effective manner, because every voice can be heard and every person can effect change should their opinions spread virally. Since future policy could then be shaped by the voices of single individuals, it becomes impossible, for better or worse, for those in power alone to chart the path for this country, which is perhaps, to them, a scary proposition.

Friday, January 27, 2012

Google's Change of Privacy Policy and Terms of Service - the Aftermath

So if you haven't been living under a rock, you probably already know that Google is going to change its policies effective 1 March 2012. If you take your time and read some of the new policy (which might be an interesting experience, since how often do we actually read policies?), it is definitely much more digestible and easier to read than most policies and ToS agreements we encounter in everyday life. Yet, while our friendly internet overlord justifies the change as a mere unification of the existing policies of the 60 Google services offered to us for free, as is always the case with big companies, conspiracy theories arise even around somewhat innocuous and probably life-simplifying changes such as this one.

Soon - all used for ad targeting

The single most important consequence of reducing the number of policies to one is that Google will now combine data from all the services whose policies are unified in order to "improve user experience". One way to put it is creating more interesting content for the user based on his data, but the most obvious use of the data is tracking the user's history in order to improve ad targeting. And while you might not be using all of Google's 60 services, you are definitely dependent on quite a few of them. Your searches, some of your e-mail contents, and, most immediately, your Android phone activity will all now feed the ad targeting data and will not escape the attention of the ads, should you leave the data collection services active.

Right, should you leave them active. There is a way to opt out, but the shared criticism directed at Google is that the entire data collection process should be opt-in, as the FTC argues while simultaneously considering charges against the Mountain View giant. It is quite inconvenient that users are going to be opted in automatically, and it does take quite a bit of extra effort to opt out.

Most users probably do not know or care enough to opt out, so Google will most likely just benefit by default by getting a larger data pool. And what could that be used for? Well, more data is never worse. One use comes to mind: with more data, anything from clicks, to content read in Google Reader, to YouTube preferences can be used to determine suitable content for the user. Furthermore, it becomes easier to identify similar users and, therefore, to improve the clustering algorithms used to target ad content. And, as some studies show, better clustering is highly effective, with possibly sixfold increases in click-through rates for ads that use clustering in their guesses.
Hey, that article inspired me.

Whether it's just a side effect of Google's active effort to make our lives easier (and their continued effort to convince us that they're doing good) or an actual evil scheme to get more data is up for debate (questioning a company's principles is particularly fashionable nowadays). But didn't you know this already? The (not only) search giant really has known everything about you all along. Whether you are a Caltech undergrad or a 24-year-old woman who likes wombats - they know it.

http://www.google.com/policies/
http://www.washingtonpost.com/business/economy/google-privacy-policy-who-will-be-affected-and-how-you-can-choose-what-information-gets-shared/2012/01/26/gIQA69fNVQ_story.html
http://arstechnica.com/gadgets/news/2012/01/pascals-wager-googles-new-privacy-policy-could-anger-ftc.ars
http://www2009.eprints.org/27/1/p261.pdf
http://www.cnn.com/2012/01/27/tech/web/google-privacy-clarified/index.html
http://www.google.com/about/corporate/company/tenthings.html

Google+: Providing Companies With a New Way to Leverage the Social Web Graph



Earlier this month, Google launched a service to complement its search engine. A page titled "Search, plus Your World" was added to Google's website, detailing the company's desire to add social results into the mix of normal results. These new results include relevant tips, photos, and posts from friends on Google's social network, Google+. For example, if a person searches for the term "sushi", he gets the normal set of results plus sushi-relevant content or media that has been shared on Google+. Interestingly, this new feature impacts both Google and companies looking to promote their brands or products via the internet in a couple of important ways.

The article mentions that brands should seriously consider launching marketing campaigns via Google+, under the reasoning that it is valuable to get an early foothold in a developing social network. Because of features like "Search, plus Your World", this is certainly a valid point. Google clearly intends to leverage its new social product by integrating it into search, so there is no reason why companies shouldn't look to ride this wave by growing their brands via Google+; the social web graph is incredibly valuable since it provides a relatively low-cost method for spreading content and brands. This is because certain properties of the social web graph (some known to be universal properties of networks) make the graph's structure particularly exploitable for advertising campaigns. As stated in class the other week, the social graph is a very strongly connected entity: ~86% of it is in a strongly connected component, and ~100% is in the weakly connected component. This should come as no surprise; the reason people sign up for social networking websites is to connect with (or invite) their friends. In addition to strong connectivity, the social graph has a small diameter; the 90th-percentile diameter is 4.7. These two properties are extremely important because they support the idea that content spreads virally with effective coverage, especially when the social graph is online, where the interaction rate is quite high for certain demographics. (This idea isn't new; companies like LivingSocial have been exploiting this structure effectively via rewarding referral systems for quite some time now.)


Finally, although Google+ certainly has an impact on companies other than Google, its largest impact is likely on Google itself. In particular, it has the potential to boost the accuracy and relevancy of Google's search engine. So far Google has attempted to keep its normal search results and social results separate; all social results are labeled with an icon, and users have the option of turning off social results via a button near the top of the search results page. However, it is interesting to wonder whether Google will (or already does) incorporate information from Google+ into its current result-generating algorithm. Right now it makes sense for Google to keep the normal and social search results separate, because the Google+ crawling algorithm is likely still very young, and keeping the two separate lets Google experiment on how frequently link results pulled from Google+ are clicked (and how these click rates compare with the default search results). However, Google's ultimate objective is to return results that are relevant and satisfy users, and it is very likely that additional information from its growing social graph could, in the long run, help tune the algorithm to show more optimal results. If a correlation is found between the proximity of a particular person in the social web graph and the click rates for links posted by that person, this could have a huge impact on Google's search algorithm.

One might ask: why is this important? Why can't Facebook do this? Google's secret weapon is its current dominance of the search engine market, and the synergistic effect that could arise if the algorithm is tuned effectively. Although Facebook has chosen to share much of its social data with Bing, Google's sheer volume of searches per day simply dwarfs Bing's. If done correctly (and soon), Google could potentially leverage Google+, leaving Bing behind and securing its position as the world's top search engine.

What Does the Internet Look Like?

For me, one of the most interesting problems when dealing with large amounts of information is figuring out how to view all of that information in a meaningful way. We can discuss metrics (like the clustering coefficient, diameter, and average degree), but usually a picture is worth a thousand words. In this regard, data visualisation can help immensely. Data visualisation, in the broadest sense, simply means a way of displaying information. It becomes immensely important as we consider larger and larger data sets, with sometimes surprisingly beautiful results.
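As a toy illustration (my own sketch, not part of the projects discussed below; the graph generator, sizes, and file name are arbitrary choices), the snippet below computes a few of those summary metrics for a small scale-free graph using networkx and then draws it with a spring layout, which is the "numbers vs. picture" tradeoff in miniature.

    import networkx as nx
    import matplotlib.pyplot as plt

    G = nx.barabasi_albert_graph(200, 2, seed=1)   # stand-in for a "web-like" graph

    print("average degree   :", 2 * G.number_of_edges() / G.number_of_nodes())
    print("clustering coeff.:", nx.average_clustering(G))
    print("diameter         :", nx.diameter(G))    # this generator gives a connected graph

    nx.draw(G, pos=nx.spring_layout(G, seed=1), node_size=20, width=0.3)
    plt.savefig("toy_web_graph.png")               # the thousand-word version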

With that in mind, I set out to find some of the images that people have generated of the Internet and its structure.

One of the most impressive projects I discovered was the Opte Project. While it is a few years old, the project (which graphically represents the internet by mapping relations between class C networks instead of every single IP address) has produced some stunning images.



BitTorrent and Death by Heavy Tails

While studying the network structure of the internet and the web graph, it is hard to ignore P2P (peer-to-peer) file transfer: one study in 2010 suggested that 50-70% of internet traffic is generated by P2P applications. Consequently, a large number of papers have been published on modeling and optimizing P2P file transfer. Most of these studies focus on the well-known BitTorrent protocol, although some have considered alternatives optimized for high-volume flow on relatively small networks. Since BitTorrent is the most widespread P2P sharing protocol, it makes the most sense to focus our efforts on the torrent network alone. [3]

BitTorrent relies on an initial "seed" content provider and a torrent file of meta-tags detailing the packet structure of the seeded content, the tracker responsible for keeping track of the packet distribution on the network, and the hashes used to verify the authenticity of transferred packets. Peers, other users who want to download the seeded content, download this torrent file and then query all other peers on the network, as well as the original seed, for pieces of the files described in the torrent. On top of this basic framework, the BitTorrent protocol employs several important optimizations that attempt to improve the fairness of packet distribution and to incentivize higher upload rates from the members of its network. However, even with the incentive system in place, users often encounter the phenomenon of quickly downloading a file to 99% and then seeing their download speed slow to a grinding halt almost indefinitely. This phenomenon is a classic example of heavy-tailed behavior: in this case, packet availability in the P2P network is very heavy-tailed, such that the most common packets can be found easily and downloaded from the majority of peers, while the rarest packets are extremely scarce.[1] This result is easily derived by assuming that packets are shared between peers with uniform probability: the population of the packets that are shared first rises exponentially, drowning out the packets that were distributed by the seed later. A number of studies have proposed solutions to this heavy-tail problem, two of which I would like to discuss here.
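A crude way to see this (my own simplification, not taken from the cited papers; the rates below are arbitrary) is a rich-get-richer urn simulation: if every transfer copies a uniformly random existing copy somewhere in the swarm, then a packet gets duplicated with probability proportional to how many copies it already has, so packets released early by the seed snowball while packets released late stay rare.

    import random
    from collections import Counter

    urn = [0]               # one copy of packet 0, released by the seed
    next_packet = 1
    for _ in range(200_000):
        if random.random() < 0.005:          # occasionally the seed releases a new packet
            urn.append(next_packet)
            next_packet += 1
        else:                                # otherwise a peer copies a random existing
            urn.append(random.choice(urn))   # copy, i.e. a packet chosen in proportion
                                             # to its current availability

    copies = sorted(Counter(urn).values(), reverse=True)
    print("distinct packets          :", len(copies))
    print("most available packets    :", copies[:5])
    print("least available packets   :", copies[-5:])
    print("share held by the top 10% :",
          round(sum(copies[:len(copies) // 10]) / sum(copies), 3))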

The first paper focuses on the behavior of a torrent over time and the incentive system built into the protocol.[1] At the most basic level, a peer can be incentivized to upload more by being offered a higher download speed in return. However, this system fails to give peers any incentive to continue uploading files after completing their own download. Most mathematical models of this process assume that peer arrivals form a Poisson process whose rate decays exponentially with time, and that the time a peer remains in the network is constant. In this model, the lifespan of a particular torrent - the time during which a newly arrived peer can still download the entire file - is quite low. In real networks, this time is 8.89 days on average, with a download failure rate of approximately 10%.
A study by L. Guo et al. proposes an improved incentive system to reduce the download failure rate in the torrent network. The study finds that if torrents are allowed to interact with one another, with download rates for a peer determined by its upload performance across all active torrents, the download failure rate in the network falls about 6-fold, since peers are encouraged to come back to completed torrents. Thus, if seeds are convinced to stay 10 times longer, the download failure rate in a traditional torrent network is expected to fall by a factor of 10, while in the multi-torrent incentive system the reduction is by a factor of 10^6.[1]

The second study approaches the problem of heavy tails in packet availability more directly, by proposing a clustering mechanism in which the rarest packets are distributed preferentially to all members of a particular cluster in order to increase their overall availability. For example, if cluster 1 notices that cluster 2 lacks a particular packet that a member of cluster 1 has, that packet is preferentially distributed to all members of cluster 1, so that when the clusters are reorganized, the packet distribution across clusters is fairly even. The authors do not derive the expected packet distribution under this algorithm, but they do show in their simulations that the speed of packet distribution is significantly greater, suggesting that packet availability becomes, to some degree, less heavy-tailed.[2]

BitTorrent networks are some of the most prevalent networks on the internet and, as such, are extremely interesting from both a networking and a game-theoretic perspective. After all, seeding vs. leeching a torrent is a classic example of a Prisoner's Dilemma. The statistical outcomes of this pattern are quite interesting; one of them is the heavy-tailed distribution of file packet availability described above. Surely there are numerous other characteristics of this network that could be analyzed, perhaps from a game-theoretic standpoint rather than a probabilistic one, but that question is a subject for another time.

[1] Lei Guo, Songqing Chen, Zhen Xiao, Enhua Tan, Xiaoning Ding, and Xiaodong Zhang. 2005. Measurements, analysis, and modeling of BitTorrent-like systems. In Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement (IMC '05). USENIX Association, Berkeley, CA, USA, 4-4. Available at http://dl.acm.org/citation.cfm?id=1251090. Date accessed: 27 Jan, 2012
[2] WEI, G., LING, Y., GU, Y., GE, Y.. A Dependable Cluster Based Topology in P2P Networks. Journal of Communications, North America, 5, jan. 2010. Available at: http://ojs.academypublisher.com/index.php/jcm/article/view/2393. Date accessed: 27 Jan. 2012.
[3] Chaojiong Wang; Ning Wang; Howarth, M.; Pavlou, G.; , "A dynamic Peer-to-Peer traffic limiting policy for ISP networks," Network Operations and Management Symposium (NOMS), 2010 IEEE , vol., no., pp.317-324, 19-23 April 2010. Available at http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5488483&isnumber=5488304. Date accessed: 27 Jan, 2012

Thursday, January 26, 2012

World IPv6 Launch

Ars Technica reports that the Internet Society is leading the charge in organizing "World IPv6 Launch" on June 6, 2012. Much like the successful "World IPv6 Day" that occurred on June 8, 2011, this event has many of the major internet service providers (including AT&T, Comcast, and Time Warner) and web companies (including Facebook, Google, Bing, and Yahoo!) committed to furthering the public deployment and adoption of IPv6 technology. However, unlike "World IPv6 Day," which tested IPv6 services for a mere 24 hours, "World IPv6 Launch" will permanently enable IPv6 for many of the popular products we use daily.

For those unfamiliar, IPv4 is the first version of the Internet Protocol to be widely utilized, and it is still the current workhorse. While the protocol has served its purpose very reliably, its limited address space has become a major concern in recent years. IPv4 uses a 32-bit address space, which limits the number of unique addresses to 2^32 (4,294,967,296). While address availability wasn't a big issue in the late 90s and early 2000s, the recent proliferation of internet-connected devices has severely shrunk the pool of free IPv4 addresses. Certain companies have even given back addresses for the public good. IPv6 is the intended successor to IPv4. Among its many improvements, one of the biggest advances is the use of a 128-bit address space, which gives approximately 3.4×10^38 unique addresses. To give an idea of the sheer size of this number, we could assign an IPv6 address to every atom on the surface of the earth and still have some left over.
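A quick back-of-the-envelope check of those numbers (the "7 billion people" figure is my own round number, not one from the article):

    ipv4 = 2 ** 32
    ipv6 = 2 ** 128
    print(f"IPv4 addresses        : {ipv4:,}")          # 4,294,967,296
    print(f"IPv6 addresses        : {ipv6:.3e}")        # ~3.4e38
    print(f"IPv6 addresses/person : {ipv6 / 7e9:.3e}")  # ~4.9e28 each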

To conserve IPv4 addresses, many networks today employ network address translation (NAT) to share a single public IPv4 address among multiple private local addresses. In fact, most if not all off-the-shelf home routers have this capability baked in. With the deployment of IPv6, this could all change. In the future, IP addresses won't just be assigned to your computers or digital gadgets; everyone could have unique addresses for all their household appliances: refrigerators, stoves, microwaves, washing machines, practically anything with an on-off switch. And there would still be room for more.

IPv6 is well in the pipeline; what's left on the table is just the schedule for its worldwide deployment and adoption. The consequences of such a switch-over in technology, however, remain to be seen. With the possibility of myriad new online devices, the geography of the internet has the potential to shift dramatically. What will the internet look like in 5 years? 10 years? Only time will tell. The Internet Society's "World IPv6 Launch" is certainly a step in the direction of the future.

YouTube's Small World

On Monday January 23rd, YouTube posted a very interesting statistic on their blog. According to the blog post, an hour's worth of video content is put on YouTube every second. They even provided an interactive animation to share the statistic in terms of satellite orbits, bamboo growth, nyans in a nyan cat, etc. YouTube is indeed growing fast.

Although we may be used to associating social networks with services such as Facebook, the video-sharing site YouTube also has a social network among its videos. Using the videos as vertices and the related-video links as directed edges, a group of researchers [1] constructed a network graph for each data set from their crawler and made some interesting findings. They were able to find the small-world phenomenon in the YouTube video network.

The small-world phenomenon is characterized by graphs with both high clustering and small diameter. Instead of using the average diameter or the 90th percentile, the researchers used the diameter of the largest strongly connected component of each graph and found that it was about 8. This means that instead of six degrees of separation, there are about eight degrees from one video to another. Given that the clustering coefficient of comparable random graphs is close to 0, the YouTube graph's measured clustering coefficient of about 0.29 is considerably high. The combination of a large clustering coefficient and a small diameter is what gives you a sense of YouTube's small world.
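For anyone curious how such statistics are computed, here is a hedged sketch using networkx, with a random directed graph standing in for the crawled "related videos" graph (the real data set isn't available here, and the graph size and edge probability are arbitrary):

    import networkx as nx

    G = nx.gnp_random_graph(500, 0.01, seed=1, directed=True)   # stand-in video graph

    # Average clustering coefficient, treating related-video links as undirected.
    print("average clustering      :", nx.average_clustering(G.to_undirected()))

    # Diameter of the largest strongly connected component, as in the paper.
    largest_scc = max(nx.strongly_connected_components(G), key=len)
    print("diameter of largest SCC :", nx.diameter(G.subgraph(largest_scc)))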

It will be interesting to see how these numbers change in time, especially as the number of vertices in the graph grows. How many hours of video content have been added to YouTube since you started reading?

[1] Xu Cheng; Dale, C.; Jiangchuan Liu; , "Statistics and Social Network of YouTube Videos," Quality of Service, 2008. IWQoS 2008. 16th International Workshop on , vol., no., pp.229-238, 2-4 June 2008
doi: 10.1109/IWQOS.2008.32
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4539688&isnumber=4539656

Lecture 6

Today we focused on the remaining two universal properties of networks -- small diameter and high clustering.  Remember that, in isolation, neither small diameter nor high clustering coefficient is particularly "surprising" or difficult to explain.  It is only the combination that is "surprising"... and this is what is termed a "small world".

We saw that one possible explanation for small-world properties is a combination of local correlation with long-range random links.  However, this combination only explains the existence of short paths; it doesn't explain why we can find them so easily.
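A quick sketch using the standard Watts-Strogatz construction (one concrete model of "local correlation plus long-range random links"; the sizes and rewiring probabilities below are arbitrary choices of mine) shows the effect: a tiny amount of rewiring collapses the average path length while clustering stays high.

    import networkx as nx

    for p in [0.0, 0.01, 0.1]:   # fraction of ring-lattice edges rewired to random targets
        G = nx.watts_strogatz_graph(n=1000, k=10, p=p, seed=1)
        print(f"rewiring p={p:<5}  clustering={nx.average_clustering(G):.3f}  "
              f"avg path length={nx.average_shortest_path_length(G):.2f}")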

To explain this second observation (to me the more interesting one), we saw that the probability of a long-range link has to depend on distance in a very precise manner.  So, it is in some sense much rarer that we can find short paths with myopic algorithms than it is for the short paths to exist in the first place.

Remember though that the models we have discussed for small world graphs DO NOT have heavy-tailed degree distributions!  So, we really haven't given a "natural" mechanism that explains all four of our "universal" properties, we've just given mechanisms for each individually.   Can you think of how to combine our heavy-tailed degree distribution ideas with the small world ideas?

Wednesday, January 25, 2012

Anti-Counterfeiting Trade Agreement on the rise

Last week several major online companies blacked out parts or all of their websites in response to the Stop Online Piracy Act (SOPA) and the PROTECT IP Act (PIPA) being considered by the United States Congress. The response to the blackouts was strong, and several Congressmen and Senators changed their stances on the issue due to the enormous public pressure. However, at the same time that the United States is considering legislation to regulate the internet, an international treaty is set to be signed that would create international restrictions similar to the ones proposed by SOPA and PIPA.

The Anti-Counterfeiting Trade Agreement (ACTA) has been a work in progress since 2008, and the language of the treaty was finalized in April 2011. ACTA is much larger in scope than the US bills because it deals with all forms of counterfeiting, not just online piracy. However, the agreement includes many of the same provisions and ideas that made SOPA and PIPA so objectionable. ACTA holds anyone who contributes to piracy responsible for that piracy. Critics have argued that websites with user-generated content would now be responsible for making sure that nothing illegal is posted. This is a large burden when you consider that the largest of these websites generate millions of user posts per day.

Critics are also concerned about the language that the agreement uses. One of their largest concerns is the following excerpt from Article 10:
“Each Party shall further provide that its judicial authorities have the authority to order that materials and implements, the predominant use of which has been in the manufacture or creation of such infringing goods, be, without undue delay and without compensation of any sort, destroyed or disposed of outside the channels of commerce in such a manner as to minimize the risks of further infringements.”
They think that this could be used to shut down any site with infringing material, and they contend that there are no checks and balances on how it would be applied.

As more and more countries prepare to sign the treaty, protests are expected. Just today, Poland saw large protests in response to the announcement that its Prime Minister is expected to sign the treaty tomorrow. Poland will join the growing list of countries that have already signed, a list which as of this post includes Australia, Canada, Japan, South Korea, and the United States. Many countries in the European Union are expected to sign the treaty in the coming months; whether they face the resistance seen in Poland remains to be seen.

ACTA full text: http://www.international.gc.ca/trade-agreements-accords-commerciaux/assets/pdfs/acta-crc_apr15-2011_eng.pdf

BBC Article on ACTA protests: http://www.bbc.co.uk/news/world-europe-16735219

The Power of the Internet

Never has the Internet technology industry flexed its muscles as strongly as it did last Wednesday, January 18, 2012. That day, with the website blackouts against SOPA and PIPA, it demonstrated the sheer power of the Internet’s graph to affect people’s actions and decisions. In one day, 8 million people looked up who their congressman or congresswoman was using Wikipedia’s tool (and presumably wrote to them as well). Google convinced 4.5 million people to write to their congressman or congresswoman. Involvement on such a scale would have been unheard of, and impossible, only ten years ago.

I think these successes are indicative of how the web graph is structured. Rather than being totally decentralized, I feel that the web graph is very dense around a few pages, like Facebook, Twitter, Wikipedia, and Google. Many sites link into these websites, and they receive a massive amount of traffic. This is why the movement was so successful – the “hubs” of the graph were the ones leading the charge.

It’s quite funny to watch how the entertainment industry has reacted to these developments. These companies are staffed by people who don’t really understand the power of the Internet and social media. For example, they have launched “a television advertising campaign supporting the anti-piracy plans” (BBC). This is an almost laughably feeble move. There is no way they are going to be able to disseminate their message through a TV advertisement at the same rate, or on the same scale, as Reddit, Wikipedia, or Facebook can.

All this shows that knowing how to exploit the social and internet graphs is of enormous benefit, because such a graph has enormous power. It also demonstrates that the companies that control this graph have enormous power as well, perhaps more than entertainment companies that have been around for decades. The world is changing.

Sources:

http://mashable.com/2012/01/18/zuckerberg-sopa-is-poorly-thought-out-law/

http://www.bbc.co.uk/news/technology-16628143

Monday, January 23, 2012

The friendship paradox: why a web crawl gives a biased sample, and "your friends have more friends than you do"

The friendship paradox is the phenomenon that, on average, your friends have more friends than you do. (Or, as Satoshi Kanazawa puts it, your girlfriend is a whore.) In an undirected graph (such as a friendship network), the average degree of a node's neighbors is, on average, greater than the node's own degree. That is, the average (over all nodes u) of deg(u) is less than the average (over all nodes u) of the average (over nodes v adjacent to u) of deg(v). The paradox was first characterized by the sociologist Scott L. Feld ("Why your friends have more friends than you do," 1991).
This result is superficially paradoxical, but it always holds given the (trivial) condition that not all nodes have the same degree. Intuitively, high-degree nodes have more friends, and each of those friends has a link to the high-degree node, while low-degree nodes have few links pointing to them. High-degree nodes are thus counted more times, so any given link is more likely to point to a high-degree node. The proof is straightforward; Wikipedia gives one.
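A quick numerical check (my own sketch; the graph model and sizes are arbitrary) compares the average degree with the average, over all nodes, of their neighbors' mean degree; the gap is especially large when the degree distribution is heavy-tailed.

    import networkx as nx

    G = nx.barabasi_albert_graph(10_000, 3, seed=1)   # heavy-tailed degrees widen the gap

    avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
    avg_neighbor_degree = sum(
        sum(G.degree(v) for v in G[u]) / G.degree(u) for u in G
    ) / G.number_of_nodes()

    print(f"average degree                : {avg_degree:.2f}")           # about 6
    print(f"average of neighbors' degrees : {avg_neighbor_degree:.2f}")  # noticeably larger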
This idea is fairly robust across other types of graphs. The friendship paradox holds in graphs of sexual relations, even though these are approximately bipartite rather than mostly random. Similarly, in directed graphs, nodes link, on average, to nodes with higher in-degree, and by symmetry, nodes are linked to by nodes with higher out-degree.
Suppose we try to sample part of the web graph by breadth-first search. Obviously, considering the size of the network, we will terminate the crawl early, at some depth. Assuming we have only crawled a small part of the network, some hand-waving shows that we will have sampled a disproportionate number of pages with high in-degree. Hence the sample obtained from the crawl is biased with respect to the in-degree distribution, and presumably other graph statistics as well. (This still holds if we disregard pages with in-degree 0.)
The friendship paradox suggests a cheap way to allocate vaccinations without having to map out the entire social graph. We want to give vaccines to the most socially active people, to cut disease transmission on as many links as possible, but mapping the graph is expensive. Instead, pick some number of people uniformly at random from the entire population, then, from each of them, step to a randomly chosen friend (repeating the step a few times). The resulting set of people is likely to have high degree in the graph, by the friendship paradox.
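Here is a hedged sketch of that heuristic (the social graph and all the parameters below are made up for illustration):

import random
import networkx as nx

def pick_vaccinees(G, k, hops=2, seed=None):
    # Pick k people uniformly at random, then take a couple of random
    # friend-of-friend steps from each; the endpoints tend to have
    # above-average degree, by the friendship paradox.
    rng = random.Random(seed)
    chosen = set()
    for person in rng.sample(list(G.nodes()), k):
        for _ in range(hops):
            friends = list(G.neighbors(person))
            if friends:
                person = rng.choice(friends)
        chosen.add(person)
    return chosen

G = nx.barabasi_albert_graph(5000, 2, seed=1)
targets = pick_vaccinees(G, 100, seed=1)
print(sum(G.degree(v) for v in targets) / len(targets),     # well above...
      sum(d for _, d in G.degree()) / G.number_of_nodes())  # ...the overall mean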

Lecture 5

Today we talked a LOT about heavy-tails...and we managed to finish it off, despite the fire alarm!

The details of what we covered could fill a whole probability course, so don't worry if you don't feel that you could prove all the results -- focus on understanding the intuition behind what I told you.  (And if you want to learn more, ask me and I'll point you in the right direction.)

In my mind the key things to take away about heavy-tails are the following:
1) Know some examples of heavy-tailed distributions (Pareto, Weibull, and LogNormal)
2) Know some generic processes that create heavy-tails -- additive processes (generalized CLT), extremal processes, and multiplicative processes (a sketch of a multiplicative process appears after this list).
3) Know how to apply these ideas to networks in order to get a heavy-tailed degree distribution. (And be sure to look at the details of the proof for Preferential attachment, since that'll be useful on the HW.)
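
As a tiny illustration of the multiplicative mechanism (a sketch with arbitrary parameters, not something from lecture): multiplying many independent positive factors gives an approximately LogNormal outcome, whose extremes dwarf its mean in a way the additive analogue never does.

import numpy as np

rng = np.random.default_rng(0)

# 100,000 independent runs of a 50-step process with i.i.d. positive factors.
factors = rng.uniform(0.5, 1.6, size=(100_000, 50))
product = factors.prod(axis=1)   # approximately LogNormal (CLT applied to the logs)
total = factors.sum(axis=1)      # the additive analogue, approximately Normal

# The multiplicative outcome is heavy-tailed: its largest sample is orders of
# magnitude above its mean, while the additive outcome stays close to its mean.
print(product.max() / product.mean(), total.max() / total.mean())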

Like I said in class, I could go on-and-on about heavy-tails...  So, I probably tried to pack too much into one lecture.  You'll likely need to go back and look at the notes in order to really process everything we talked about. 

Sunday, January 22, 2012

The Building Blocks of Economic Complexity in a Global Product Space

TEDxBoston - César A. Hidalgo - Global Product Space - YouTube


Dr. Hidalgo offers another metric, specifically one of diversity, in addition to GDP to analyze the wealth of a country.

In his analysis, a capability is a resource, whether a physical good or an intangible skill or service. Countries vary in their capabilities, which leads them to make different products: to create certain products, a country must have the appropriate capabilities, and some valuable products require specific and rare ones. A large stock of capabilities can therefore be used as an indicator of wealth. Capabilities cannot be observed directly, but they can be estimated from product output using the method of reflections (paper). When GDP is plotted against the estimated number of capabilities (8:10, video), we get a better picture of wealth than GDP alone because capabilities add another dimension (Fig. 1, paper): despite having similar GDPs, a country with many capabilities tends to group with other obviously wealthy countries, while a country with few capabilities tends to group with obviously poor ones.
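For readers who want to see the mechanics, here is a minimal sketch of the method of reflections as I understand it from the paper (the toy country-by-product matrix below is invented; the real input would be export data):

import numpy as np

# Binary matrix M: M[c, p] = 1 if country c exports product p (toy data).
M = np.array([
    [1, 1, 1, 1, 0],   # a diversified country
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],   # exports one common product and one rare one
], dtype=float)

# Zeroth reflections: diversification of countries, ubiquity of products.
k_c = M.sum(axis=1)
k_p = M.sum(axis=0)

# Each iteration sets a country's value to the average value of its products,
# and a product's value to the average value of its exporters.
for _ in range(10):
    k_c, k_p = (M @ k_p) / M.sum(axis=1), (M.T @ k_c) / M.sum(axis=0)

print(k_c)  # successive reflections refine the crude diversification measure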

Looking at the product space (12:12, video), we can see that different sectors form clusters with different geopolitical tendencies (14:00, video). We can also watch a country's progress toward wealth: the products of a country such as Malaysia (14:35, video) migrate from nodes with low connectivity (and thus few capabilities) to nodes with higher connectivity (many capabilities), reflecting the wealth Malaysia has acquired over the years, in agreement with the historical record.

In conclusion, a strategy for building wealth is to move along edges of the product space from a country's current products toward products with higher connectivity.

Since a small post or video cannot do justice to the research, see his paper, "The Building Blocks of Economic Complexity" (Hidalgo and Hausmann, PNAS, 2009).

"The New, New Economy"

A few years back, the internet became the new platform for the economy. By allowing faster communication and farther reach, it let businesses quickly grow to a global level. Nowadays, businesses compete to give customers the most convenient online experience instead of cultivating actual relationships with them. However, advances in technology have paved the way for what Hixon calls the "new, new economy," whose platform is the smartphone.

With the number of smartphone users growing in the US and across the globe, companies that reach out through smartphone apps have a new way to access a large network of consumers. Businesses no longer appeal only to the comfort of doing business without leaving home, but to the convenience of doing business anywhere, anytime, from a smartphone.

With the exception of gaming apps, a business cannot be run with only an app as its storefront. The internet economy will not die with the emergence of the smartphone economy; however, companies that wish to survive the transition need to adjust to the new platform. For example, the internet changed the banking industry by offering customers secure online access to their accounts to make transfers, pay bills, and even deposit checks from their personal computers. Now, customers look for banks that offer these services at any time and from any place via mobile devices.

Finally, just as with the arrival of the internet, survival in the "new, new economy" will depend on the virality of the business; basically, on how fast you can get to the top and stay there. Consider the well-known story of the movie rental service Blockbuster, which fell because of its failure to move its business onto the new terrain of the internet. A business that wants to avoid that fate in the transition to the "new, new economy" will take heed and adjust quickly.

http://www.forbes.com/sites/toddhixon/2012/01/19/going-viral-on-the-mobile-web/

Talk on "Startups" Monday at 4pm

 Later in the afternoon, after our class on Monday, there will be a very interesting talk by a Caltech CS alum about what it's like to join a startup.  I highly recommend attending!  Details are below...

Title: Skills You Should Have at a Startup
Speaker: Anthony Chong (Caltech Alum)

Date: Monday, January 23, 2012
Time: 4:00p.m.
Place: 213 Annenberg

Abstract: After leaving Caltech, I moved in with a friend from high school as the first employee at Adaptly (a startup in NYC).  Rather than going for a safer future in industry or grad school, I figured a few years at a startup would provide a faster learning environment and a better vehicle for doing social good.  In the first year we went from three of us working out of the living room of our apartment to 30+ people working out of our offices in New York and a sales office in London. I plan to address three topics:
1.  What is it like to work at a startup?
2.  What are useful tools Caltech grads should know/get familiar with to contribute quickly to the software development life cycle?
3.  How do you get the most out of professional development at a startup?

Friday, January 20, 2012

Video Games and Game Theory

Can a greater knowledge of game theory provide a significant advantage in video games? One genre that stands out as a likely candidate is the MMORPG, or Massively Multiplayer Online Role-Playing Game. The general idea behind an MMORPG is that you, the player, take on a new identity in an online fantasy world; players live the life of their new persona and interact with other players through that identity. The big question is whether knowledge of game theory affects the decisions players make in this virtual world.

One aspect that could be affected by knowledge of game theory is cooperation between players in small groups. Many MMORPGs have missions or quests that can be completed either alone or with a group. Assuming that players act as they would in real life, as "intelligent rational decision-makers,"[1] a player will choose not to cooperate if a mission can be completed alone with a high chance of success, and will choose to work with others if the probability of succeeding alone is low. When a group of players works together to increase every member's chance of success, they are playing a non-zero-sum game: everyone can succeed, or fail, together. While all players now have a higher chance of success, failure is still possible. Many players use this aspect of game theory without knowing the terminology. The difference is that a player who understands it can deliberately assemble a group that gives good odds of success, while recognizing that adding too many players lowers the individual payoff, since the reward must be split among the group (see the toy calculation below).
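To make that trade-off concrete, here is a toy calculation (all numbers invented for illustration): suppose a quest pays a fixed reward that is split evenly among the n players, and a lone player rarely succeeds while a small group usually does. The expected individual payoff then peaks at a modest group size.

# Toy model: reward R is split evenly among n players; the success
# probability rises with group size but with diminishing returns.
R = 100.0

def p_success(n):
    return n ** 2 / (n ** 2 + 9)   # invented curve: 0.10 solo, 0.50 with three players

for n in range(1, 8):
    expected = p_success(n) * R / n
    print(n, round(p_success(n), 2), round(expected, 1))
# Expected individual payoff peaks at n = 3: beyond that, the extra chance of
# success no longer makes up for splitting the reward further.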



[1] Roger B. Myerson (1991). Game Theory: Analysis of Conflict. Harvard University Press, p. 1.