Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:

We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal part is specially crafted for each target, the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.


Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally a language detection, which does not deal with English text but ...
more ...

Two open-source corpus-builders for German and French


I already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I went on work on this issue, I decided to release a few useful scripts under an open-source license.

The crawlers are not just mere link-harvesters, they are designed to be used as corpus-builders. As one cannot republish anything but quotations of the texts, the purpose is to enable others to make their own version of the corpora. Since the newspapers are updated quite often, it is not imaginable to create exact duplicates, that said the majority of the articles will be the same.

Interesting features

The interesting facts are that the crawlers are relatively fast (even if they were not set up for speed) and do not need a lot of computational resources. They may be run on a personal computer.

Due to their specialization, they are able to build a reliable corpus consisting of texts and relevant metadata (e.g. title, author, date and url). Thus, one may gather millions of tokens from home and start exploring the corpus.

The HTML code as well as the superfluous text are stripped in ...

more ...

Crawling a newspaper website to build a corpus

Basing on my previous post about specialized crawlers, I will show how I to crawl a French sports newspaper named L’Equipe using scripts written in Perl, which I did lately. For educational purpose, it works by now but it is bound to stop being efficient as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl #assuming you're using a UNIX-based system...
use strict; #because it gets messy without, and because Perl is faster that way
use Encode; #you have to get the correct encoding settings of the pages
use LWP::Simple; #to get the webpages
use Digest::MD5 qw(md5_hex);

Just an explanation on the last line : we are going to use a hash function to shorten the links and make sure we fetch a single page just once.

my $url = ""; #the starting point

$page = get $url; #the variables ought to be defined somewhere before $page = encode(“iso-8859-1”, $page); #because the pages are not in Unicode format push (@done_md5, substr(md5_hex($url), 0, 8 ...

more ...