Monday, January 23, 2012

The friendship paradox: why a web crawl gives a biased sample, and "your friends have more friends than you do"

The friendship paradox is the phenomenon that on average, your friends have more friends than you do. (Or, as Satoshi Kanazawa puts it, your girlfriend is a whore.) In an undirected graph (such as a friendship network), on average, for any node u, the average degree of u’s neighbors is greater than u’s degree. That is, the average (over all nodes u) of deg(u) < the average (over all nodes u) of the average (over nodes v adjacent to u) of deg(v). The paradox was first characterized by the sociologist Scott L. Feld (Why your friends have more friends than you do, 1991).
This result is superficially paradoxical, but it always holds given the (trivial) condition that not all nodes have the same degree. Intuitively, high-degree nodes have more friends, and each of those friends has 1 link to the high-degree node. Low-degree nodes have few links pointing to them. High-degree nodes are being counted more times, so any given link is more likely to point to a high-degree node. The proof is straightforward; Wikipedia gives a proof.
This idea is fairly robust to other types of graphs. The friendship paradox holds in graphs of sexual relations, even though these are approximately bipartite graphs rather than mostly random. Similarly, in directed graphs, nodes link to nodes with higher in-degree on average, and by symmetry, nodes are linked to by nodes with higher out-degree.
Suppose we try to sample part of the web graph by breadth-first search. Obviously considering the size of the network, we will terminate the crawl early, at some depth. Assuming we’ve only crawled a small part of the network, some hand-waving proves that we will have sampled a disproportionate number of pages with high in-degree. Hence the sample obtained from the crawl is biased w.r.t. in-degree distribution, and presumably other graph statistics. (This still holds if we disregard pages with (in-)degree 0.)
The friendship paradox suggests a cheap way to allocate vaccinations without having to map out the entire social graph. We want to distribute vaccines to the most socially active people, to limit disease transmission on as many links as possible. But mapping is expensive. Instead, pick some number of people uniformly at random from the entire population. Then randomly move a friend of each a few times. The resulting set of people likely have high degree in the graph, by the friendship paradox.
SHA-256 token: 54641f52bbfd085c229515c5c4c2e82d01756820dc3da125a375cc476ac7677b


  1. I cringed a little at the title...but the idea in the post are really important. I'll probably bring this up in lecture when we talk about the spread of epidemics. In some settings, you can actually prove optimality results about the following immunization strategy: pick a random person, and then immunize a random friend of theirs. This is a way of getting at the high degree nodes even if you don't know the network structure.

  2. I don't seem to have a record of the author of this post. Can the author please send me an email?

  3. That is ridiculously cool. I wonder how much better you could do in identifying the structure of a network via random sampling. In a network of 10,000, for example, how many people would you need to ask to find sufficiently high degree nodes. Or asking another way, who do we need to infect to get everyone sick?

    1. Actually, that's subtly different. The bioterrorist would want to infect high-Pagerank nodes. For preventing epidemics, you'd probably want to look at betweenness centrality, because you want to make it harder for the disease to spread between disparate parts of the network, rather than just trying to prevent as many direct infections as possible.