Following up on my previous post about specialized crawlers, I will show how I recently crawled the French sports newspaper L’Equipe using scripts written in Perl. This is for educational purposes: it works for now, but it is bound to stop working as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl #assuming you're using a UNIX-based system...
use strict; #because it gets messy without, and because Perl is faster that way
use Encode; #you have to get the correct encoding settings of the pages
use LWP::Simple; #to get the webpages
use Digest::MD5 qw(md5_hex);

Just a word about the last line: we are going to use a hash function to shorten the links and make sure each page gets fetched only once.

my $url = "http://www.lequipe.fr/"; #the starting point

my (@done, @done_md5);                           #the variables ought to be defined somewhere before
my $page = get $url;                             #fetch the starting page
$page = encode("iso-8859-1", $page);             #because the pages are not in Unicode format
push (@done_md5, substr(md5_hex($url), 0, 8));   #taking the first eight characters of the md5 hash of the url
push (@done, $url);                              #just to make sure, in a readable format

Now we have to find the links and analyze them to see which ones are useful. Here, those containing the word breves are of interest. A brève is a short report of something that happened, a few paragraphs long.

my @links = ();
my @temp = split ("<a href=", $page);   #split the page on the link tags
foreach my $n (@temp) {                 #taking the links one after another (not necessarily the fastest way)
    if ($n =~ m/\/breves20/) {          #if the link contains the expression
        if ($n =~ m/(http:\/\/www\.lequipe\.fr\/.+?)(")/) {   #absolute links
            $n = $1;                    #the first match
        } else {                        #relative links
            $n =~ m/("\/?)(.+?)(")/;
            $n = "http://www.lequipe.fr/" . $2;               #the second group
        }
        if (($n =~ m/breves20/) && ($n =~ m/\.html$/)) {      #just to check if the url looks good
            push (@links, $n);          #our links list
        }
    }
}

If this is not the first run, you may want to check whether you have already been through a given page, using the first eight characters of the md5 hash to save memory. I will not go into further detail here; I posted something about binary searches in Perl a few months ago.
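For illustration, here is a minimal sketch of such a check. It uses a plain hash lookup rather than a binary search, assumes the modules loaded at the top of the script, and builds on the @done_md5 list and $url variable from above; the name %seen_md5 is mine.

# minimal sketch: skip a url if its shortened hash was already recorded
my %seen_md5 = map { $_ => 1 } @done_md5;        # build a lookup table once
my $fingerprint = substr(md5_hex($url), 0, 8);
unless ($seen_md5{$fingerprint}) {
    push (@done_md5, $fingerprint);
    $seen_md5{$fingerprint} = 1;
    # ... fetch and process the page here
}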

Finally, you have to make sure there are no duplicates in the list and it is ready to be written to a file.

my %seen = ();
@links = grep { ! $seen{ $_ }++ } @links; # a fast and efficient way
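Writing the list out could then look like the following sketch; the file name links.txt is arbitrary.

# write the deduplicated list to a file, one url per line
open (my $fh, ">", "links.txt") or die "Cannot open links.txt: $!";
print $fh "$_\n" foreach @links;
close ($fh);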

Getting and cleaning the text

Then you have to go through the list you just made. A simple way is to fetch the pages one by one; since bandwidth is the limiting factor here, this is not necessarily slow.

First, you may want to load the list of what you have already done. You might also define an iteration limit for the loop you are about to start, so that you don’t realize after a few hours that something in the script was not working properly.

Then you start the loop and get the first page on the to-do list. You can collect the remaining links on the fly as shown in the first part.
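Here is a possible skeleton of that loop, assuming the to-do list was written to links.txt as above and reusing the modules loaded at the top of the script; the iteration limit of 1000 is arbitrary.

# load the to-do list written in the first part
open (my $in, "<", "links.txt") or die "Cannot open links.txt: $!";
chomp (my @todo = <$in>);
close ($in);

my $limit = 1000;                       # arbitrary safety limit so a faulty run stops by itself
my $count = 0;
foreach my $link (@todo) {
    last if ++$count > $limit;
    my $page = get $link;               # fetch the page (LWP::Simple)
    next unless defined $page;          # skip links that could not be fetched
    $page = encode("iso-8859-1", $page);    # same encoding step as in the first part
    # ... collect new links on the fly and extract the text as shown below
}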

Now we can find the author of the article, its title, its date and so on. I have an ugly but efficient way to do this: I cut off the parts that do not interest me, so the information becomes available faster through regular expressions. You can use splitting or regular expressions all the way; both work (at a certain cost).

### Cutting off
@temp = split ("<div id=\"corps\">", $page);
$page = $temp[1];
@temp = split ("<div id=\"bloc_bas_breve\">", $page);
$page = $temp[0];
### Getting the topic
my @text = ();                      # will hold the extracted fields
$page =~ m/(<h2>)(.+?)(<\/h2>)/;
my $info = $2;
$info = "Info: " . $info;
push (@text, $info);
### Finding the title
$page =~ m/(<h1>)(.+?)(<\/h1>)/;
my $title = $2;
$title = "Title: " . $title;
push (@text, $title);
### Finding the excerpt if there is one
my $excerpt;
if ($page =~ m/<strong>/) {
    $page =~ m/(<strong>)(.+?)(<\/strong>)/;
    $excerpt = $2;
    $excerpt = "Excerpt: " . $excerpt;
    push (@text, $excerpt);
    $page =~ s/<strong>.+?<\/strong>//;   # remove the excerpt so it does not appear twice in the text
}
else {
    push (@text, "Excerpt: ");
}

Remark: you could output the fields as XML as well. This example is just a demonstration and lacks many features.
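As a rough sketch of what that could look like, building on the $info, $title and $excerpt variables from the block above (the element names are mine, and a real script should escape &, < and > in the values):

# minimal sketch: wrap the extracted fields in ad hoc XML elements
my $xml = "<breve>\n";
$xml .= "  <info>$info</info>\n";
$xml .= "  <title>$title</title>\n";
$xml .= "  <excerpt>$excerpt</excerpt>\n";
$xml .= "</breve>\n";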

To get the text itself we could split the code into paragraphs, but that is not necessary here, as the HTML layout is basic. We just have to clean up what we cut out, starting with the tags.


# replacing paragraphs by newlines
$page =~ s/<\/p>/\n/g;
# removing html tags
$page =~ s/<.+?>//g;
# removing the left-hand ads ... and so on
$page =~ s/SmartAd.+$//g;
# example of space removal: whitespace at the beginning of a line gets deleted throughout the string
$page =~ s/^\s+//gm;
# ... and so on

Finally, you write the text you gathered (here, @text) to a file, record what you did, and close the loop.
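One possible way to do that last step, assuming the loop variable $link from the sketch above; the file names corpus.txt and done.txt are arbitrary.

# append the cleaned text to the output file, one article after another
open (my $out, ">>", "corpus.txt") or die "Cannot open corpus.txt: $!";
print $out "$_\n" foreach @text;
print $out "\n";                     # blank line between articles
close ($out);

# keep a record of the processed urls so they are not fetched again
open (my $log, ">>", "done.txt") or die "Cannot open done.txt: $!";
print $log "$link\n";
close ($log);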

That’s all for today, I hope it helps. Just contact me if you need the whole scripts; I thought they were too long to be displayed here.

Update:

As corpora built using similar crawling and scraping techniques are used internally at the ENS Lyon, I released a technical report on this topic. It is available online: Two comparable corpora of German newspaper text gathered on the web: Bild & Die Zeit.

The corpora are available upon request.