What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them concerns corpus construction: web documents can be very heterogeneous, and even after proper cleaning it is not rare to find material that can hardly be qualified as text. While there are indisputably clear cases, such as lists of addresses or tag clouds, it is not always obvious how far the notions of text and corpus extend. What’s more, a certain number of documents end up too close to call. Nonetheless, the issue has to be addressed, since even a no-decision policy has consequences: certain linguistic phenomena become more or less accidentally over- or underrepresented in the final corpus. That is why we believe that linguists and “end users” in general should be aware of these technicalities.
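To give an idea of what such a decision involves, here is a minimal sketch of the kind of heuristic filter one might apply; the function name and all thresholds are invented for illustration and are not the criteria discussed in our work.

    import re

    def looks_like_running_text(doc, min_tokens=100, min_punct_ratio=0.5):
        # Crude illustration: tag clouds and address lists rarely end
        # their lines with sentence-final punctuation; running text does.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if not lines or len(doc.split()) < min_tokens:
            return False
        punct_final = sum(1 for line in lines if re.search(r"[.!?]$", line))
        return punct_final / len(lines) >= min_punct_ratio

Real classifiers use many more signals, and the hard part is precisely the documents for which no single threshold gives a clear answer.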

The Good, the Bad, and the Hazy

In an article to be published in the proceedings of the 8th Web as Corpus Workshop, “The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction”, we show that text quality is not always easy to assess …


Feeding the COW at the FU Berlin

I am now part of the COW project (COrpora on the Web). The project has been carried out by (amongst others) Roland Schäfer and Felix Bildhauer at the FU Berlin for about two years. A lot of work has already been done, especially on long-haul crawls in several languages.

Resources

A few resources have already been made available: software, n-gram models, and web-crawled corpora, which for copyright reasons cannot be downloaded as a whole. They may be accessed through a dedicated interface (COLiBrI – COW’s Light Browsing Interface) or downloaded upon request in scrambled form (all sentences randomly reordered).
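Conceptually, the scrambling step amounts to something like the following sketch, assuming a one-sentence-per-line file; the actual pipeline is of course more involved:

    import random

    def scramble(infile, outfile, seed=42):
        # Destroy all document- and paragraph-level structure by
        # shuffling sentences globally; assumes one sentence per line.
        with open(infile, encoding="utf-8") as f:
            sentences = f.readlines()
        random.Random(seed).shuffle(sentences)
        with open(outfile, "w", encoding="utf-8") as f:
            f.writelines(sentences)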

This scrambled-only access is a heavy limitation, but it is still better than no corpus at all, provided one’s research interest does not depend too closely on features above sentence level. This example shows that legal matters have to be addressed when it comes to collecting texts, and that web corpora are, as such, not easy research objects to deal with. In the end, making reliable tools public is more important than giving access to a particular corpus.

Research aim

The goal is to perform language-focused (and thus maybe language-aware) crawls and to gather relevant resources for (corpus) linguists, with a particular interest …
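As a rough illustration of what “language-aware” can mean in practice, a crawler may filter each fetched page with an off-the-shelf language identifier; the sketch below uses the langid library, but the confidence threshold and target language are my own assumptions, not the COW settings.

    from langid.langid import LanguageIdentifier, model  # pip install langid

    # Normalized probabilities make the confidence threshold interpretable.
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

    def keep_page(text, target="de", min_prob=0.9):
        # Keep a fetched page only if it is confidently in the target language.
        lang, prob = identifier.classify(text)
        return lang == target and prob >= min_prob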
