Homework 2 is now posted to the course website. It is due next Friday at 1:30pm. Good luck!
If you have any questions about the HW, or just want to point out typos, please post them as comments on this post and then the TAs/prof will respond ASAP.
Don't forget to start early!
Friday, January 13, 2012
Is this a class where we can work in teams on homework (as in turn in only one copy for multiple people)? It obviously wasn't possible for the last one, but what about this week?
The collaboration policy is clearly specified on the course website (for all homeworks): "You are strongly encouraged to collaborate with your classmates on these problems, but each person must write up the final solutions individually. You should note on your homework specifically which problems were a collaborative effort and with whom."
Sometimes, individual problems might have a "no collaboration" rule, for example, problem 4 in HW1. In such cases, it will be clearly indicated at the beginning of the problem.
For the histogram for the web crawler, is text output sufficient (i.e. "parking.caltech.edu, 4")? Or do we need to generate images in our code?
And same with the ccdfs.
For both the histogram and the ccdfs, you need to generate images.
The python fetch code doesn't work with https:// pages from the command line, but works fine if you access the functions directly through python.
Changing line 100 from
if len(sys.argv)!=2 or not sys.argv[1].startswith("http://"):
to
if len(sys.argv)!=2 or not (sys.argv[1].startswith("http://") or sys.argv[1].startswith("https://")):
solves the problem :)
I am seriously confused as to what a ccdf is, especially how we're supposed to code it. I don't think most languages have libraries that can do that? At least Google didn't bring up anything, and Google doesn't really have good explanations as to what a ccdf actually is either.
TL;DR: what is a ccdf?
Q: what is a ccdf?
A: A ccdf is a function, say G, where G(x) is the fraction of samples taking a value greater than x. Clearly, G(x) is a non-increasing function taking values in [0,1] and approaches zero as x becomes large.
Following up on JK's answer:
Q: What is a ccdf?
A: ccdf --> complementary cumulative distribution function. The ccdf of a random variable X is \bar{F}(t) = Pr(X>t). Hopefully you remember these from Math 2. If not, we'll talk about them more in class on Wed.
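For those wondering how to code this up: no special library is needed. A minimal sketch of computing an empirical ccdf from a list of samples (the function name and interface here are illustrative, not part of any provided code):

```python
def ccdf(samples):
    """Return (xs, gs) where gs[i] is the fraction of samples
    strictly greater than xs[i], i.e. G(x) = Pr(X > x)."""
    n = len(samples)
    xs = sorted(set(samples))  # distinct sample values, ascending
    # For each value x, count the fraction of samples exceeding x.
    gs = [sum(1 for s in samples if s > x) / float(n) for x in xs]
    return xs, gs

# Example: for samples [1, 1, 2, 3, 3, 3],
# G(1) = 4/6, G(2) = 3/6, and G(3) = 0 (G reaches zero at the maximum).
xs, gs = ccdf([1, 1, 2, 3, 3, 3])
```

The resulting (xs, gs) pairs can then be plotted (e.g. on log-log axes) to produce the required image.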
A review of ccdfs in class would be very helpful.
Do we need to find a regression function, or is printing out the image of the ccdf sufficient?
Printing the image is OK. You do not need to work out the function.
An important note about the crawler: Apparently someone brought down a few library sites this morning with their crawler... Please be sure to keep your crawler "polite" by not having an extremely fast request rate.
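As an illustration only (the delay value and the fetch_page helper below are hypothetical, not part of the provided fetcher code), a polite request loop simply sleeps between consecutive requests:

```python
import time

def polite_fetch(urls, fetch_page, delay=1.0):
    """Fetch each URL via fetch_page, sleeping `delay` seconds between
    requests so the crawler never hammers any one server."""
    results = []
    for url in urls:
        results.append(fetch_page(url))
        time.sleep(delay)  # be polite: throttle the request rate
    return results
```

A one-second delay is a reasonable starting point; anything that keeps you well under a few requests per second per host should be safe.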
Does the provided code take care of opening only HTML files, or do we need to make HEAD requests ourselves? The writeup confuses me. We can include links to documents in our histogram data, right?
That is, non-HTML documents such as pdf files.
The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram.
When crawling the pages, I realise we're only to crawl the caltech pages, but when we keep track of the number of links, are those meant to include pages outside of the caltech domain as well?
If so, are these supposed to only be kept for histograms, or are they meant to be used in the clustering coefficient calculations too?
Thanks
"The python fetcher code provided returns all the links on a page, including links to non-html documents. And you should include these in your histogram."
The fetcher.py code has this (line 76):
if "text/html" in usock.info()['content-type']:
This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?
"This seems to be filtering on the reported content type of the page (no fetching for non-html pages according to the meta type). Do you mean that we must implement something in addition to this for better filtering, or do you simply mean that we should make sure we include pages in our histogram even when the code reports failure because the requested URL is not html?"
That line in the code simply ensures that the script does not parse for links if the url in the argument points to a non-html file. But isn't that exactly what you need? Non-html documents should be interpreted as nodes with no out-links.
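To illustrate the point, here is a hypothetical sketch of how a crawl loop might treat such URLs. The fetch_links parameter is an assumed stand-in for the fetcher's behavior of producing no links when the content-type check fails; it is not a function in the provided code:

```python
def out_links(url, fetch_links):
    """Return the out-links of a node. A non-html URL (for which
    fetch_links yields nothing) is a node with an empty link list."""
    links = fetch_links(url)
    return links if links is not None else []  # non-html: no out-links
```

So a pdf or image URL still appears as a node in your graph (and counts toward your histogram); it just contributes zero out-links of its own.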
What is meant in problem 3 when they say node degrees? Do they mean separation from other nodes?
Are we looking for just 1 value for every number from 1-379?
From basic graph theory, the degree of a node is the number of neighbors of that node - how many other nodes is it connected to? Each node will have a degree of at least one (since the graph is connected), and in this case, at most 378. Your task is to generate the histogram and ccdf for the data set containing 379 data points.
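For example, if the graph were stored as an adjacency list (a dict mapping each node to its set of neighbors; this representation is an assumption for illustration, not the format of the provided data set), the degrees are just the neighbor counts:

```python
def degrees(adj):
    """Given an adjacency list {node: set_of_neighbors},
    return {node: degree} where degree = number of neighbors."""
    return dict((node, len(neighbors)) for node, neighbors in adj.items())

# A tiny 3-node example: "a" is connected to both "b" and "c".
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
# degrees(adj) gives degree 2 for "a" and degree 1 for "b" and "c".
```

The 379 degree values produced this way are exactly the data points to feed into your histogram and ccdf.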