Friday, January 13, 2012

Homework 2

Homework 2 is now posted to the course website. It is due next Friday at 1:30pm.  Good luck!

If you have any questions about the HW, or just want to point out typos, please post them as comments on this post and then the TAs/prof will respond ASAP.

Don't forget to start early!

20 comments:

  1. Is this a class where we can work in teams on homework (as in turn in only one copy for multiple people)? It obviously wasn't possible for the last one, but what about this week?

    ReplyDelete
    Replies
    1. The collaboration policy is clearly specified on the course website (for all homeworks): "You are strongly encouraged to collaborate with your classmates on these problems, but each person must write up the final solutions individually. You should note on your homework specifically which problems were a collaborative effort and with whom."

      Sometimes, individual problems might have a "no collaboration" rule, for example, problem 4 in HW1. In such cases, it will be clearly indicated at the beginning of the problem.

      Delete
  2. For the histogram for the web crawler, is text output sufficient (i.e. "parking.caltech.edu, 4")? Or do we need to generate images in our code?

    And same with the ccdfs.

    ReplyDelete
  3. For both the histograms and the ccdfs, you need to generate images.
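    If it helps, here is a rough sketch (not part of the provided code) of producing both images with matplotlib; the degree data below is made up for illustration:

```python
# Hedged sketch: generate a histogram image and a ccdf image with matplotlib.
# The "degrees" list is fabricated sample data, not real crawl output.
import matplotlib
matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt

degrees = [1, 2, 2, 3, 3, 3, 5, 8]  # hypothetical degree counts

# Histogram image
plt.figure()
plt.hist(degrees, bins=range(1, max(degrees) + 2))
plt.xlabel("degree")
plt.ylabel("count")
plt.savefig("histogram.png")
plt.close()

# Empirical ccdf: G(x) = fraction of samples strictly greater than x
xs = sorted(set(degrees))
ccdf = [sum(d > x for d in degrees) / len(degrees) for x in xs]

plt.figure()
plt.step(xs, ccdf, where="post")
plt.xlabel("x")
plt.ylabel("G(x) = Pr(degree > x)")
plt.savefig("ccdf.png")
plt.close()
```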

    ReplyDelete
  4. The python fetch code doesn't work with https:// pages from the command line, but works fine if you access the functions directly through python.
    Changing line 100 from
    if len(sys.argv)!=2 or not sys.argv[1].startswith("http://"):
    to
    if len(sys.argv)!=2 or not (sys.argv[1].startswith("http://") or sys.argv[1].startswith("https://")):
    solves the problem :)

    ReplyDelete
  5. I am seriously confused as to what a ccdf is, especially how we're supposed to code it. I don't think most languages have libraries that can do that? At least Google didn't bring up anything, and Google doesn't really have good explanations of what a ccdf actually is either.

    TLDR: what is a ccdf?

    ReplyDelete
  6. Q: what is a ccdf?
    A: A ccdf is a function, say G, where G(x) is the fraction of samples taking a value greater than x. Clearly, G(x) is a non-increasing function taking values in [0,1] and approaches zero as x becomes large.

    ReplyDelete
  7. Following up on JK's answer:

    Q: What is a ccdf?
    A: ccdf --> complementary cumulative distribution function. The ccdf of a random variable X is \bar{F}(t) = Pr(X>t). Hopefully you remember these from Math 2. If not, we'll talk about them more in class on Wed.
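    As a quick sketch (using made-up sample data, not anything from the assignment), the empirical ccdf is just a counting exercise:

```python
# Hedged sketch of an empirical ccdf: G(x) is the fraction of samples
# taking a value strictly greater than x. Sample data is fabricated.
samples = [1, 1, 2, 3, 5]

def ccdf(samples, x):
    """Fraction of samples strictly greater than x."""
    return sum(s > x for s in samples) / len(samples)

print(ccdf(samples, 0))  # 1.0 (every sample exceeds 0)
print(ccdf(samples, 2))  # 0.4 (only 3 and 5 exceed 2)
print(ccdf(samples, 5))  # 0.0 (ccdf approaches zero for large x)
```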

    ReplyDelete
    Replies
    1. A review of ccdfs in class would be very helpful.

      Delete
  8. Do we need to find a regression function, or is printing out the image of the ccdf sufficient?

    ReplyDelete
    Replies
    1. Printing the image is OK. You do not need to work out the function.

      Delete
  9. An important note about the crawler: Apparently someone brought down a few library sites this morning with their crawler... Please be sure to keep your crawler "polite" by not having an extremely fast request rate.
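    One simple way to stay polite is a fixed delay between requests. This is just a sketch of the idea, not the provided starter code; `fetch_links` here is a hypothetical stand-in for whatever function your crawler uses to download and parse a page:

```python
# Hedged sketch of a "polite" crawl loop with a fixed delay between requests.
# fetch_links(url) is a hypothetical callable returning the links on a page.
import time

REQUEST_DELAY = 1.0  # seconds between requests; tune as appropriate

def polite_crawl(start_url, fetch_links, max_pages=100):
    seen, frontier = {start_url}, [start_url]
    graph = {}
    while frontier and len(graph) < max_pages:
        url = frontier.pop()
        graph[url] = fetch_links(url)
        for link in graph[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(REQUEST_DELAY)  # throttle so we don't hammer the server
    return graph
```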

    ReplyDelete
  10. Does the provided code take care of opening only HTML files, or do we need to make HEAD requests ourselves? The writeup confuses me. We can include links to documents in our histogram data, right?

    ReplyDelete
  11. That is, non-HTML documents such as PDF files.

    ReplyDelete
  12. The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram.

    ReplyDelete
  13. When crawling, I realise we're only supposed to crawl Caltech pages, but when we keep track of the number of links, are those meant to include pages outside of the Caltech domain as well?

    If so, are these supposed to be kept only for the histograms, or are they meant to be used in the clustering coefficient calculations too?

    Thanks

    ReplyDelete
  14. "The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram."

    The fetcher.py code has this (line 76):
    if "text/html" in usock.info()['content-type']:

    This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?

    ReplyDelete
    Replies
    1. "This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?"

      That line in the code simply ensures that the script does not parse for links if the url in the argument points to a non-html file. But isn't that exactly what you need? Non-html documents should be interpreted as nodes with no out-links.
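      In code, that convention might look like the following sketch; `fetch_page` is a hypothetical stand-in for the provided fetcher, returning a list of links for an html page and None for anything else:

```python
# Hedged sketch of the convention above: a non-html document is still a
# node in the graph, just one with no out-links. fetch_page is hypothetical.
def out_links(url, fetch_page):
    links = fetch_page(url)
    # None signals a non-html document; treat it as a node with no out-links.
    return links if links is not None else []
```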

      Delete
  15. What is meant in problem 3 when they say node degrees? Do they mean separation from other nodes?

    Are we looking for just 1 value for every number from 1-379?

    ReplyDelete
    Replies
    1. From basic graph theory, the degree of a node is the number of neighbors of that node - how many other nodes is it connected to? Each node will have a degree of at least one (since the graph is connected), and in this case, at most 378. Your task is to generate the histogram and ccdf for the data set containing 379 data points.
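      Concretely, if your crawl is stored as an adjacency list, the degrees fall out in one line; the tiny graph here is made up for illustration:

```python
# Hedged sketch: node degrees from an undirected adjacency list (toy graph).
graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
}

# Degree of a node = number of neighbors it is connected to.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)  # {'a': 2, 'b': 1, 'c': 1}
```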

      Delete