Introduction
I have already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I continued to work on this issue, I decided to release a few useful scripts under an open-source license.
The crawlers are not mere link harvesters: they are designed to be used as corpus builders. Since one cannot republish anything but quotations from the texts, the purpose is to enable others to build their own version of the corpora. Because the newspapers are updated quite often, it is not possible to create exact duplicates. That said, the majority of the articles will be the same.
Interesting features
The crawlers are relatively fast (even though they were not set up for speed) and do not need a lot of computational resources. They can be run on a personal computer.
Due to their specialization, they are able to build a reliable corpus consisting of texts and relevant metadata (e.g. title, author, date and URL). Thus, one may gather millions of tokens from home and start exploring the corpus.
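To make the idea of "texts and relevant metadata" more concrete, here is a minimal sketch in Python of how a title, an author and a date could be read from a page. It is not the code of the released crawlers. The function name `extract_metadata`, the class `MetadataParser` and the assumption that author and date sit in `<meta name="...">` tags are mine; a real newspaper site will need its own, site-specific rules.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and a few <meta> tags from one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        # Assumption: author and date are exposed as <meta name="..."> tags.
        self.meta = {"author": "", "date": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") in self.meta:
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_metadata(html, url):
    """Return a dict with the keys title, author, date and url."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "author": parser.meta["author"],
        "date": parser.meta["date"],
        "url": url,
    }
```

Keeping the URL in the same record as the other fields makes it possible to trace every text in the corpus back to its source.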
The HTML code as well as the superfluous text are stripped in order to save disk space. Scripts to convert the raw data into XML are included for further use with natural language processing tools.
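The two steps mentioned here, stripping the markup and converting to XML, could look roughly like the following Python sketch. It only illustrates the principle and is not the conversion script shipped with the crawlers: the names `strip_html` and `to_xml` as well as the `<text>` element with its attributes are hypothetical, and the whitespace handling is deliberately crude.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep the visible text and drop tags, scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only chunks and anything inside script/style.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the text of a page, with chunks joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_xml(record, text):
    """Serialize one article; the element and attribute names are hypothetical."""
    keys = ("title", "author", "date", "url")
    doc = ET.Element("text", attrib={key: record[key] for key in keys})
    doc.text = text
    return ET.tostring(doc, encoding="unicode")
```

Building the output with an XML library rather than by string concatenation means that characters such as `&` in titles or URLs are escaped automatically.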
Sources
The two crawlers and corpus builders use a similar engine. They address the following newspapers:
- The German national weekly newspaper Die Zeit, regarded for its quality (see on Wikipedia)
- The French nationwide daily newspaper devoted to sports L’Équipe (see on Wikipedia)
The open-source projects are hosted on GitHub, and the names should be no mystery:
- Gather more than 130,000 articles and 100 million tokens thanks to the zeitcrawler
- Build a French corpus of more than 40,000 articles related to sports with the equipe-crawler
Both are available under the GNU GPL v3 license.
The texts gathered using this software are for personal or academic use only, as no republication of any kind (other than quotations) is authorized. So far, crawling is not explicitly forbidden by the rights holders.
As the corpora are used internally at the ENS Lyon, I released a technical report on the work on the Zeitcrawler. It is available online: Two comparable corpora of German newspaper text gathered on the web: Bild & Die Zeit.
The corpora are available upon request.