I am now part of the COW project (COrpora on the Web). The project has been run by Roland Schäfer and Felix Bildhauer, among others, at the FU Berlin for about two years. Work has already been done, especially on long-haul crawls in several languages.
A few resources have already been made available: software, n-gram models, and web-crawled corpora. For copyright reasons, the corpora cannot be downloaded as a whole. They may be accessed through a special interface (COLiBrI – COW’s Light Browsing Interface) or downloaded upon request in a scrambled form (all sentences randomly reordered).
This is a heavy limitation, but it is still better than no corpus at all if one’s research interest does not rely too closely on features above sentence level. This example shows that legal matters ought to be addressed when it comes to collecting texts, and that web corpora are, as such, not easy research objects to deal with. In the end, making reliable tools public is more important than giving access to a particular corpus.
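To make the scrambling idea concrete, here is a minimal sketch of what "all sentences randomly reordered" could look like. It is not the COW project's actual procedure: the function name `scramble_sentences`, the input format (each document as a list of sentence strings) and the `seed` parameter are assumptions made for illustration.

```python
import random


def scramble_sentences(documents, seed=None):
    """Pool the sentences of all documents and return them in random order.

    `documents` is assumed to be a list of documents, each already split
    into a list of sentence strings (hypothetical input format).
    """
    # Flatten: document boundaries are deliberately discarded here.
    pool = [sentence for doc in documents for sentence in doc]
    # A dedicated generator keeps the shuffle reproducible when seeded.
    rng = random.Random(seed)
    rng.shuffle(pool)
    return pool


docs = [
    ["First sentence of text A.", "Second sentence of text A."],
    ["Only sentence of text B."],
]
scrambled = scramble_sentences(docs, seed=0)
```

Every sentence survives intact, so word- and sentence-level counts are unaffected, but document boundaries and sentence order are gone. That is why features above sentence level can no longer be studied on such a release.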
The goal is to perform language-focused (and thus maybe language-aware) crawls and to gather relevant resources for (corpus) linguists, with a particular interest …