I am now part of the COW project (COrpora on the Web). The project has been run by (among others) Roland Schäfer and Felix Bildhauer at the FU Berlin for about two years. A good deal of work has already been done, especially on long-haul crawls in several languages.
A few resources have already been made available: software, n-gram models and web-crawled corpora. For copyright reasons, the corpora cannot be downloaded as a whole. They may be accessed through a special interface (COLiBrI, COW’s Light Browsing Interface) or downloaded upon request in a scrambled form (all sentences randomly reordered).
This is a heavy limitation, but it is still better than no corpus at all if one’s research interest does not rely too closely on features above sentence level. This example shows that legal matters ought to be addressed when it comes to collecting texts, and that web corpora are as such not easy research objects to deal with. In the end, making reliable tools public is more important than giving access to a particular corpus.
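To make the effect of the scrambled form concrete, here is a minimal sketch in Python. It is not the COW project's actual code: the function name `scramble_corpus`, the representation of a corpus as a list of documents (each a list of sentence strings) and the `seed` parameter are all assumptions made for illustration.

```python
import random


def scramble_corpus(documents, seed=None):
    """Pool the sentences of all documents and return them in random order.

    `documents` is assumed to be a list of documents, each document being
    a list of sentence strings (a hypothetical representation).
    """
    # Flatten: document boundaries are deliberately dropped here.
    sentences = [sentence for doc in documents for sentence in doc]
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


documents = [
    ["The cow is in the field.", "It eats grass."],
    ["Web corpora are large.", "They are also noisy.", "Cleaning helps."],
]
scrambled = scramble_corpus(documents, seed=42)
```

Every sentence survives intact, so word- and sentence-level statistics are unaffected, but the order of sentences and the document they came from can no longer be recovered from the output. That is why anything above sentence level is out of reach.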
The goal is to perform language-focused (and thus maybe language-aware) crawls and to gather relevant resources for (corpus) linguists, with a particular interest in lesser-known languages. The material resources at the HPSG lab (where COW is hosted) are in line with what web mining requires. There definitely are interesting projects to start on these servers, and a few ideas are already being tested. If they prove to be fruitful, I will report on the results when/if they get published.
At the moment, I work on two different ends of the tool chain: on the one hand at the very beginning, i.e. the quest for ‘good’ (relevant for linguistic purposes and spam-free) URL or word seeds, and on the other hand at the very end, i.e. the effort to properly qualify and classify the texts that were crawled and filtered.
The first task is not easy, but the second one is really a challenge, as texts coming from the web cover a wide spectrum. There is a lot of metadata that can be added and a lot of labels to choose from. For the work on text complexity/readability/comprehensibility there are also many options to choose from, but aiming for a cross-linguistic approach reduces the number of research publications and software tools one needs to know about. In fact, tools that are mature for English may simply not exist for lesser-known languages.
I recently took part in the creation of a Twitter account used by machines only: @cowmunist (COW Machine-Updated Notification and Information System for Twitter). A few scripts tweet now and then about what they are doing.