Building on my previous post about specialized crawlers, I will show how I recently crawled a French sports newspaper named L’Equipe using scripts written in Perl. This is for educational purposes: it works for now, but it is bound to stop working as soon as the design of the website changes.
First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:
#!/usr/bin/perl
#assuming you're using a UNIX-based system...
use strict; #because it gets messy without it, and because it catches many mistakes early
use Encode; #you have to get the correct encoding settings of the pages
use LWP::Simple; #to get the webpages
use Digest::MD5 qw(md5_hex);
A word of explanation on the last line: we are going to use a hash function to shorten the links and to make sure we fetch each page only once.
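To make the idea concrete, here is a minimal sketch of that deduplication step. It is not the actual script: the helper names short_hash and seen_before are hypothetical, the second URL is a made-up example, and I store the shortened digests in a Perl hash rather than in an array, simply because lookups are easier that way. Only the 8-character truncation of md5_hex comes from the script itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: shorten a URL to the first 8 hex characters of its MD5 digest.
sub short_hash {
    my ($url) = @_;
    return substr(md5_hex($url), 0, 8);
}

my %seen;    # shortened digests of the URLs already handled

# Hypothetical helper: returns 1 if the URL was already recorded,
# otherwise records it and returns 0.
sub seen_before {
    my ($url) = @_;
    my $key = short_hash($url);
    return 1 if $seen{$key};
    $seen{$key} = 1;
    return 0;
}

my @queue = (
    "http://www.lequipe.fr/",
    "http://www.lequipe.fr/some-section/",    # made-up example URL
    "http://www.lequipe.fr/",                 # duplicate, should be skipped
);

# Keep only the URLs that have not been seen yet.
my @to_fetch = grep { !seen_before($_) } @queue;
print "$_\n" for @to_fetch;
```

Keeping only 8 hexadecimal characters (32 bits) saves memory, at the price of a small risk that two different links end up with the same shortened digest, in which case the second one would be skipped by mistake.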
my $url = "http://www.lequipe.fr/"; #the starting point
$page = get $url; #the variables ought to be defined somewhere before
$page = encode("iso-8859-1", $page); #because the pages are not in Unicode format
push (@done_md5, substr(md5_hex($url), 0, 8 ...
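As a rough illustration of how a list of links could be built from a fetched page, here is a small sketch. It is only a guess at one possible approach, not a reconstruction of the rest of the script: the HTML fragment stands in for the content returned by get, the paths in it are invented, and I assume a simple regular expression is good enough for the markup. A real parser such as HTML::LinkExtor would be more robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the page content returned by get($url).
my $page = <<'HTML';
<a href="/section-a/article1.html">One</a>
<a href="http://www.lequipe.fr/section-b/">Two</a>
<a href="/section-a/article1.html">One again</a>
<a href="http://example.org/elsewhere">Other site</a>
HTML

my $base = "http://www.lequipe.fr";
my (%seen, @links);

# Collect the href values, make site-relative ones absolute,
# and keep each same-site link only once.
while ($page =~ /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi) {
    my $link = $1;
    $link = $base . $link if $link =~ m{^/};
    next unless index($link, $base) == 0;    # skip links to other sites
    next if $seen{$link}++;                  # skip links already collected
    push @links, $link;
}

print "$_\n" for @links;
```

With the sample fragment above, the duplicate link and the link to the other site are both dropped, so two links remain.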