Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of a focused web corpus built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience of “web site popularity”. The whole process, as well as the target websites, is described as follows:

“We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal step is specially crafted for each target; the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.
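
The paper mentions these scripts only in general terms, so the following is merely a sketch of what a hand-written, per-site extraction rule could look like; the site names and CSS selectors are invented for illustration and the authors’ actual scripts may work quite differently.

```python
# Illustrative sketch of per-site boilerplate removal (not the authors' code).
# Each target site gets a hand-written rule saying where the main text lives
# and which elements (ads, navigation, signatures) to strip beforehand.
from bs4 import BeautifulSoup

SITE_RULES = {
    "example-news.cz":  {"content": "div.article-body", "strip": ["div.ads", "nav", "div.related"]},
    "example-forum.cz": {"content": "div.post-text",    "strip": ["div.signature"]},
}

def extract_text(html: str, site: str) -> str:
    rule = SITE_RULES[site]
    soup = BeautifulSoup(html, "html.parser")
    for selector in rule["strip"]:
        for node in soup.select(selector):
            node.decompose()                      # drop boilerplate elements
    parts = soup.select(rule["content"])
    return "\n".join(p.get_text(" ", strip=True) for p in parts)
```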

Workflow

Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally language identification, which is not concerned with filtering out English text but rather with distinguishing Czech from Slovak (illustrative sketches of steps 3 and 4 follow below).
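
The paper does not detail the near-duplicate removal in step 3. A common baseline, shown here purely as an illustration, compares word n-gram shingles with the Jaccard coefficient and discards documents above a similarity threshold; the n-gram size and threshold below are arbitrary choices, not taken from the paper.

```python
# Illustrative near-duplicate check via word shingles and Jaccard similarity.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

At corpus scale one would not compare every pair of documents directly; hashing the shingles (e.g. with MinHash and locality-sensitive hashing) is the usual way to find candidate pairs first.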

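Step 4 can likewise be illustrated with a deliberately naive heuristic: Czech and Slovak differ in a handful of letters, so counting language-specific diacritics already separates many documents. The paper does not say how the authors’ detector works; this is only a sketch.

```python
# Hypothetical character-based heuristic for telling Czech from Slovak text.
CZECH_ONLY = set("řěů")     # letters used in Czech but not in Slovak
SLOVAK_ONLY = set("ľĺŕôä")  # letters used in Slovak but not in Czech

def guess_czech_or_slovak(text: str) -> str:
    """Return 'cs', 'sk' or 'unknown' based on language-specific letters."""
    lowered = text.lower()
    cz = sum(lowered.count(c) for c in CZECH_ONLY)
    sk = sum(lowered.count(c) for c in SLOVAK_ONLY)
    if cz > sk:
        return "cs"
    if sk > cz:
        return "sk"
    return "unknown"

print(guess_czech_or_slovak("Peřina a můj měkký polštář"))    # -> 'cs'
print(guess_czech_or_slovak("Mäkké ľahké vankúše a periny"))  # -> 'sk'
```
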
Finally, they divide the corpus into three parts: articles, discussions and blogs. How they handle mixed content is not clear:

“Encouraged by the size, and also by the quality of the texts acquired from the web, we decided to compile the whole corpus only from particular, carefully selected sites, to proceed the cleaning part in the same, sophisticated manner, and to divide the corpus into three parts – articles (from news, magazines etc.), discussions (mainly standalone discussion fora, but also some comments to the articles in acceptable quality) and blogs (also diaries, stories, poetry, user film reviews).” (p. 312)

Review

There are indeed articles and blog posts which, due to long comment threads, are likelier to fall into the discussion category. On so-called “pure players” or “netzines” the distinction between an article and a blog post is not clear either, both because of the content and for technical reasons related to the publishing software, such as the content management system WordPress, which is very popular among bloggers but is also sometimes used to power static websites.

It is interesting to see that “classical” approaches to corpus construction still seem to be considered valid for web texts in the corpus linguistics community, in a shift that could be associated with the “web for corpus” or “corpora from the web” approach.

The workflow replicates steps that are useful for scanned texts, with boilerplate removal somehow taking the place of OCR correction. One clear advantage is the availability and quantity of the texts; another is the speed of processing. Both are mentioned by the authors, who are convinced that their approach can lead to further text collections. A downside is the lack of information about the decisions made during the process, which ought to be encoded as metadata and exported with the corpus, so that the boilerplate removal or the text classification, for example, can be evaluated or redesigned using other tools.
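
To make the last point concrete, the kind of per-document provenance record meant here could look roughly as follows; the field names and values are invented for illustration, not taken from the paper.

```python
# Sketch of a provenance record exported alongside each corpus document.
import json

record = {
    "url": "https://example-news.cz/clanek/12345",   # hypothetical source URL
    "crawl_date": "2011-06-15",
    "extraction_script": "example-news.cz@r42",      # which hand-written cleaner ran
    "near_duplicate_of": None,                       # or the id of the retained document
    "language": "cs",
    "category": "articles",                          # articles / discussions / blogs
    "category_rule": "site-level assignment",        # how the label was decided
}

with open("doc_12345.meta.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, ensure_ascii=False, indent=2)
```

With such records shipped with the texts, a later user could, for instance, re-process only the documents cleaned by a given script version, or re-evaluate the category assignment with other tools.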

Reference

Johanka Spoustová and Miroslav Spousta, “A High-Quality Web Corpus of Czech”, in Proceedings of LREC, pp. 311-315, 2012.