Web for “old school” balanced corpus
The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.
The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:
“We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)
The boilerplate removal part is specially crafted for each target, the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.
Their workflow includes:
- download of the pages,
- HTML and boilerplate removal,
- near-duplicate removal,
- and finally a language detection, which does not deal with English text but ...