I already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I went on work on this issue, I decided to release a few useful scripts under an open-source license.
The crawlers are not just mere link-harvesters, they are designed to be used as corpus-builders. As one cannot republish anything but quotations of the texts, the purpose is to enable others to make their own version of the corpora. Since the newspapers are updated quite often, it is not imaginable to create exact duplicates, that said the majority of the articles will be the same.
The interesting facts are that the crawlers are relatively fast (even if they were not set up for speed) and do not need a lot of computational resources. They may be run on a personal computer.
Due to their specialization, they are able to build a reliable corpus consisting of texts and relevant metadata (e.g. title, author, date and url). Thus, one may gather millions of tokens from home and start exploring the corpus.
The HTML code as well as the superfluous text are stripped in …more ...