We will be extending the deadline of HW2 until noon Thursday (due to the holiday).
I have a question about problem 3.
We're supposed to turn in a histogram of the number of hyperlinks per page in the Caltech domain. When we count the number of hyperlinks per page, do we count only the hyperlinks within the Caltech domain, or do we count hyperlinks to all other websites as well?
To Jiyoung: You can count the hyperlinks either way; just make sure you mention which definition your code uses in your submission.
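For concreteness, here is a minimal sketch in Python of counting links per page under either convention and building the histogram. This is not the provided script; the regex-based link extraction and the "pages" dictionary are illustrative assumptions.

import re
from collections import Counter

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def count_links(html, caltech_only=False):
    # Count href links in one page; optionally keep only Caltech-domain links.
    links = LINK_RE.findall(html)
    if caltech_only:
        links = [u for u in links if "caltech.edu" in u]
    return len(links)

# pages maps url -> html source (filled in by your crawler; toy example here).
pages = {"http://www.caltech.edu/": '<a href="http://www.cs.caltech.edu/">CS</a>'}
# Histogram: link count -> number of pages with that count.
histogram = Counter(count_links(h, caltech_only=True) for h in pages.values())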
Alright, and I have another question:
Do you expect us to crawl until we find the last page in the Caltech domain? I think it would take a long time to crawl through the whole domain.
Approximately how long should it take? Can we just make the crawler go through about 2000 pages and stop there?
To Jiyoung: Yes, there are hundreds of thousands of web pages in the Caltech domain. You should probably crawl N (N >= 2000) HTML pages and stop.
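For what it's worth, here is a minimal sketch of a bounded breadth-first crawl in Python, stopping after N pages as suggested above. It is a sketch, not the provided script; the seed URL, the regex link extraction, and the caltech.edu hostname check are assumptions for illustration.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def crawl(seed, limit=2000):
    # Breadth-first crawl restricted to caltech.edu, stopping after `limit` pages.
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to fetch or decode
        pages[url] = html
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link).split("?", 1)[0]  # drop dynamic parameters
            host = urlparse(absolute).hostname
            if host and host.endswith("caltech.edu") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = crawl("http://www.caltech.edu/")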
re: crawler
Can you please give us a hint on a general method for preventing the crawler from getting stuck in calendars?
To e^*: One crude way (this is what the script we have provided does) is to ignore any parameters in the URLs of dynamic pages. For example, if you extract the URL "www.cs.caltech.edu/~ujk/calendar/month.php?cal=jk&getdate=20100120", then ignore the part of the URL from the "?" onward.
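In Python, for example, that cleanup is a one-liner (a sketch; the function name is ours, not from the provided script):

def strip_params(url):
    # Drop everything from the first "?" onward, so dynamic pages such as
    # calendar views all collapse to a single URL.
    return url.split("?", 1)[0]

print(strip_params("www.cs.caltech.edu/~ujk/calendar/month.php?cal=jk&getdate=20100120"))
# -> www.cs.caltech.edu/~ujk/calendar/month.php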
I was just convinced by one of your compadres that instead of a noon time for the due date, it makes sense to push it back a little so that you don't have to worry about running over to Annenberg as your morning classes finish.
So, feel free to turn it in anytime before 1pm tomorrow. (We'll adjust the HW3 due time accordingly too.)