I am now part of the COW project (COrpora on the Web). The project has been run by Roland Schäfer and Felix Bildhauer, among others, at FU Berlin for about two years. Work has already been done, especially on long-haul crawls in several languages.
A few resources have already been made available: software, n-gram models, and web-crawled corpora. For copyright reasons, the corpora cannot be downloaded as a whole. They may be accessed through a special interface (COLiBrI, COW’s Light Browsing Interface) or downloaded upon request in a scrambled form, with all sentences randomly reordered.