Review of the Czech internet corpus
Web for “old school” balanced corpus
The Czech internet corpus (Spoustová and Spousta 2012) is a good example of a focused web corpus, built to obtain an “old school” balanced corpus encompassing different genres and several text types.
The crawled websites are selected neither automatically nor at random, but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience with “web site popularity”. The whole process and the target websites are described as follows:
“We have chosen to begin with manually selecting, crawling and cleaning particular web sites with …