We will be extending the deadline of HW2 until noon Thursday (due to the holiday).
I have a question about problem 3.
We're supposed to turn in a histogram of the number of hyperlinks per page in the Caltech domain. When we count the number of hyperlinks per page, do we count only the hyperlinks within the Caltech domain, or do we count hyperlinks to all other websites as well?
To Jiyoung: You can count the hyperlinks either way; just make sure you mention which definition your code uses in your submission.
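For concreteness, here is a minimal sketch in Python of counting links per page under either convention and building the histogram. This is not the provided script; the regex-based link extraction and the "pages" dictionary are illustrative assumptions.

import re
from collections import Counter

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def count_links(html, caltech_only=False):
    # Count href links in one page; optionally keep only Caltech-domain links.
    links = LINK_RE.findall(html)
    if caltech_only:
        links = [u for u in links if "caltech.edu" in u]
    return len(links)

# pages maps url -> html source (filled in by your crawler; toy example here).
pages = {"http://www.caltech.edu/": '<a href="http://www.cs.caltech.edu/">CS</a>'}
# Histogram: link count -> number of pages with that count.
histogram = Counter(count_links(h, caltech_only=True) for h in pages.values())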
Alright, and I have another question:
Do you expect us to crawl until we find the last page in the Caltech domain? I think it would take a long time to crawl through the whole domain.
Approximately how long should it take? Can we just make the crawler go through about 2000 pages and stop there?
To Jiyoung: Yes, there are hundreds of thousands of web pages in the Caltech domain. You should probably crawl N (N >= 2000) HTML pages and stop.
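For what it's worth, here is a minimal sketch of a bounded breadth-first crawl in Python, stopping after N pages as suggested above. It is a sketch, not the provided script; the seed URL, the regex link extraction, and the caltech.edu hostname check are assumptions for illustration.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def crawl(seed, limit=2000):
    # Breadth-first crawl restricted to caltech.edu, stopping after `limit` pages.
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to fetch or decode
        pages[url] = html
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link).split("?", 1)[0]  # drop dynamic parameters
            host = urlparse(absolute).hostname
            if host and host.endswith("caltech.edu") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = crawl("http://www.caltech.edu/")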
re: crawler
Can you please give us a hint on a general method for preventing the crawler from getting stuck in calendars?
To e^*: One crude way (this is what the script we have provided does) is to ignore any parameters in the URLs of dynamic pages. For example, if you extract the URL "www.cs.caltech.edu/~ujk/calendar/month.php?cal=jk&getdate=20100120", then ignore the part of the URL from the "?" onward.
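In Python, for example, that cleanup is a one-liner (a sketch; the function name is ours, not from the provided script):

def strip_params(url):
    # Drop everything from the first "?" onward, so dynamic pages such as
    # calendar views all collapse to a single URL.
    return url.split("?", 1)[0]

print(strip_params("www.cs.caltech.edu/~ujk/calendar/month.php?cal=jk&getdate=20100120"))
# -> www.cs.caltech.edu/~ujk/calendar/month.php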
I was just convinced by one of your compadres that instead of a noon time for the due date, it makes sense to push it back a little so that you don't have to worry about running over to Annenberg as your morning classes finish.
So, feel free to turn it in anytime before 1pm tomorrow. (We'll adjust the HW3 due time accordingly too.)