Homework 2 is now posted to the course website. It is due next Friday at 1:30pm. Good luck!
If you have any questions about the HW, or just want to point out typos, please post them as comments on this post and then the TAs/prof will respond ASAP.
Don't forget to start early!
Friday, January 13, 2012
Is this a class where we can work in teams on homework (as in turn in only one copy for multiple people)? It obviously wasn't possible for the last one, but what about this week?
The collaboration policy is clearly specified on the course website (for all homeworks): "You are strongly encouraged to collaborate with your classmates on these problems, but each person must write up the final solutions individually. You should note on your homework specifically which problems were a collaborative effort and with whom."
Sometimes, individual problems might have a "no collaboration" rule, for example, problem 4 in HW1. In such cases, it will be clearly indicated at the beginning of the problem.
For the histogram for the web crawler, is text output sufficient (i.e. "parking.caltech.edu, 4")? Or do we need to generate images in our code?
And same with the ccdfs.
For both the histogram and the ccdfs, you need to generate images.
The python fetch code doesn't work with https:// pages from the command line, but works fine if you access the functions directly through python.
Changing line 100 from
if len(sys.argv)!=2 or not sys.argv[1].startswith("http://"):
to
if len(sys.argv)!=2 or not (sys.argv[1].startswith("http://") or sys.argv[1].startswith("https://")):
solves the problem :)
I am seriously confused as to what a ccdf is, especially how we're supposed to code it. I don't think most languages have libraries that can do that? At least Google didn't bring up anything, and Google doesn't really have good explanations as to what a ccdf actually is either.
TL;DR: what is a ccdf?
Q: what is a ccdf?
A: A ccdf is a function, say G, where G(x) is the fraction of samples taking a value greater than x. Clearly, G(x) is a non-increasing function taking values in [0,1] and approaches zero as x becomes large.
Following up on JK's answer:
Q: What is a ccdf?
A: ccdf --> complementary cumulative distribution function. The ccdf of a random variable X is \bar{F}(t) = Pr(X>t). Hopefully you remember these from Math 2. If not, we'll talk about them more in class on Wed.
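For those wondering how to code this up: no special library is needed. A minimal sketch of computing an empirical ccdf from a list of samples (the function name and interface here are illustrative, not part of any provided code):

```python
def ccdf(samples):
    """Return (xs, gs) where gs[i] is the fraction of samples
    strictly greater than xs[i], i.e. G(x) = Pr(X > x)."""
    n = len(samples)
    xs = sorted(set(samples))  # distinct sample values, ascending
    # For each value x, count the fraction of samples exceeding x.
    gs = [sum(1 for s in samples if s > x) / float(n) for x in xs]
    return xs, gs

# Example: for samples [1, 1, 2, 3, 3, 3],
# G(1) = 4/6, G(2) = 3/6, and G(3) = 0 (G reaches zero at the maximum).
xs, gs = ccdf([1, 1, 2, 3, 3, 3])
```

The resulting (xs, gs) pairs can then be plotted (e.g. on log-log axes) to produce the required image.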
A review of ccdfs in class would be very helpful.
Do we need to find a regression function, or is printing out the image of the ccdf sufficient?
Printing the image is OK. You do not need to work out the function.
An important note about the crawler: Apparently someone brought down a few library sites this morning with their crawler... Please be sure to keep your crawler "polite" by not having an extremely fast request rate.
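As an illustration only (the delay value and the fetch_page helper below are hypothetical, not part of the provided fetcher code), a polite request loop simply sleeps between consecutive requests:

```python
import time

def polite_fetch(urls, fetch_page, delay=1.0):
    """Fetch each URL via fetch_page, sleeping `delay` seconds between
    requests so the crawler never hammers any one server."""
    results = []
    for url in urls:
        results.append(fetch_page(url))
        time.sleep(delay)  # be polite: throttle the request rate
    return results
```

A one-second delay is a reasonable starting point; anything that keeps you well under a few requests per second per host should be safe.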
Does the provided code take care of opening only HTML files, or do we need to make HEAD requests ourselves? The writeup confuses me. We can include links to documents in our histogram data, right?
That is, non-HTML documents such as pdf files.
The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram.
When crawling the pages, I realise we're only to crawl the caltech pages, but when we keep track of the number of links, are those meant to include pages outside of the caltech domain as well?
If so, are these supposed to only be kept for histograms, or are they meant to be used in the clustering coefficient calculations too?
Thanks
"The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram."
The fetcher.py code has this (line 76):
if "text/html" in usock.info()['content-type']:
This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?
"This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?"
That line in the code simply ensures that the script does not parse for links if the url in the argument points to a non-html file. But isn't that exactly what you need? Non-html documents should be interpreted as nodes with no out-links.
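To illustrate the point, here is a hypothetical sketch of how a crawl loop might treat such URLs. The fetch_links parameter is an assumed stand-in for the fetcher's behavior of producing no links when the content-type check fails; it is not a function in the provided code:

```python
def out_links(url, fetch_links):
    """Return the out-links of a node. A non-html URL (for which
    fetch_links yields nothing) is a node with an empty link list."""
    links = fetch_links(url)
    return links if links is not None else []  # non-html: no out-links
```

So a pdf or image URL still appears as a node in your graph (and counts toward your histogram); it just contributes zero out-links of its own.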
What is meant in problem 3 when they say node degrees? Do they mean separation from other nodes?
Are we looking for just 1 value for every number from 1-379?
From basic graph theory, the degree of a node is the number of neighbors of that node - how many other nodes is it connected to? Each node will have a degree of at least one (since the graph is connected), and in this case, at most 378. Your task is to generate the histogram and ccdf for the data set containing 379 data points.
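For example, if the graph were stored as an adjacency list (a dict mapping each node to its set of neighbors; this representation is an assumption for illustration, not the format of the provided data set), the degrees are just the neighbor counts:

```python
def degrees(adj):
    """Given an adjacency list {node: set_of_neighbors},
    return {node: degree} where degree = number of neighbors."""
    return dict((node, len(neighbors)) for node, neighbors in adj.items())

# A tiny 3-node example: "a" is connected to both "b" and "c".
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
# degrees(adj) gives degree 2 for "a" and degree 1 for "b" and "c".
```

The 379 degree values produced this way are exactly the data points to feed into your histogram and ccdf.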