I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them concerns corpus construction: web documents can differ enormously, and even after proper cleaning it is not rare to come across material that could hardly be called text. While there are indubitably clear cases such as lists of addresses or tag clouds, it is not always obvious how far the notions of text and corpus extend. What’s more, a certain number of documents simply end up too close to call. Nonetheless, this issue has to be addressed, since even a no-decision policy has consequences: certain linguistic phenomena become more or less accidentally over- or underrepresented in the final corpus. That is why we believe that linguists and “end users” in general should be aware of these technicalities.

The Good, the Bad, and the Hazy

In an article to be published in the proceedings of the 8th Web as Corpus Workshop, The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, we show that text quality is not always easy to assess. Our primary goal was to find out whether corpus designers have clear intuitions about the text quality of web documents, and whether they could operationalize them.

Since we eat our own dog food, we decided to look at a thousand English texts that had been gathered by the crawler and cleaned by an appropriate tool chain. More precisely, one half of the sample came from the beginning of a crawl, where text quality is allegedly at its highest (that is a different matter; we give a few insights about it in the article), whereas the other half came from the end of a crawl. Mostly in the latter, we saw strange, uncanny “documents” whose form was quite unexpected and which confirm that web corpus linguistics still holds potential for discoveries, be it in text linguistics or document classification.

Oceanographers probably tackle similar issues when they encounter a new species that is neither quite an invertebrate nor quite a fish. The interesting point of this paper is that we do not try to establish a typology; we merely try to classify the texts as “good” or “bad” with respect to web corpus construction.

The haziness and how to cope with it

To see how much we agree on these terms, we performed a manual classification with three coders, according to criteria we had discussed and agreed upon together. The results were sobering: inter-coder agreement was low, indicating that even for humans this task is far from trivial. There is still work to do on this topic, among other things with machine learning techniques that may be able to pick out the relevant factors better than we can, but such approaches still need a baseline to start from.
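For illustration, here is a minimal sketch of how such inter-coder agreement could be quantified, using Fleiss’ kappa over the “good”/“bad” labels of three coders. The function and the toy ratings are purely illustrative and not the exact evaluation reported in the paper.

```python
from collections import Counter

def fleiss_kappa(ratings, categories=("good", "bad")):
    """Fleiss' kappa for items rated by the same number of coders.
    `ratings` is a list of label lists, e.g. [["good", "good", "bad"], ...]."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # Per-item agreement: proportion of coder pairs that assign the same label.
    p_items = []
    category_counts = Counter()
    for item in ratings:
        counts = Counter(item)
        category_counts.update(counts)
        agree = sum(c * (c - 1) for c in counts.values())
        p_items.append(agree / (n_raters * (n_raters - 1)))

    p_observed = sum(p_items) / n_items
    # Expected agreement from the overall label distribution.
    total = n_items * n_raters
    p_expected = sum((category_counts[c] / total) ** 2 for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example: three coders, four documents.
print(fleiss_kappa([
    ["good", "good", "bad"],
    ["bad", "bad", "bad"],
    ["good", "bad", "bad"],
    ["good", "good", "good"],
]))
```

Values near 1 indicate near-perfect agreement, while values close to 0 mean the coders hardly agree beyond chance.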

Next, we introduce and evaluate an unsupervised method to classify documents. We show that type profiles could be a way to address this problem: the lack of highly frequent words turns out to be a reasonably robust measure. We also found that, for reasons yet to be determined, the undecidable zone seems to be larger in the English corpora than in the German ones, for instance.
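To make this more concrete, here is a rough sketch of a frequency-based check of that kind: it measures the share of a document’s tokens that belong to a small list of very frequent words and flags documents where that share is low. The word list, the thresholds, and the three-way split are placeholder assumptions, not the actual profiles or cut-offs used in the paper.

```python
import re

# A handful of very frequent English function words; a real type
# profile would be derived from a reference corpus, not hard-coded.
HIGH_FREQUENCY_TYPES = {
    "the", "of", "and", "to", "a", "in", "is", "it", "that", "was",
}

def high_frequency_share(text):
    """Share of tokens that belong to the high-frequency word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in HIGH_FREQUENCY_TYPES)
    return hits / len(tokens)

def classify(text, lower=0.15, upper=0.25):
    """Three-way decision: 'bad', 'undecidable', or 'good'.
    The thresholds are illustrative only."""
    share = high_frequency_share(text)
    if share < lower:
        return "bad"
    if share < upper:
        return "undecidable"
    return "good"

print(classify("home | products | contact | imprint | sitemap"))
print(classify("It was the best of times, it was the worst of times."))
```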

Last, the relativity of such design decisions illustrates the importance of non-destructive annotation during corpus construction, so that users can decide for themselves how they want to filter out noise.
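As a small illustration of what non-destructive annotation might look like, the sketch below attaches a quality score and label to each document as metadata instead of discarding anything, so that filtering can happen at query time. The field names and the made-up scores are assumptions, not an actual corpus format.

```python
import json

def annotate(doc, quality_score):
    """Attach a quality judgement as metadata instead of deleting
    the document; downstream users can filter on it later."""
    doc = dict(doc)
    doc["quality_score"] = quality_score
    doc["quality_label"] = (
        "bad" if quality_score < 0.15
        else "undecidable" if quality_score < 0.25
        else "good"
    )
    return doc

corpus = [
    {"url": "http://example.org/a", "text": "home | products | contact"},
    {"url": "http://example.org/b", "text": "It was the best of times."},
]

# Scores would come from a measure like the one sketched above;
# here they are simply made up for the example.
annotated = [annotate(d, s) for d, s in zip(corpus, [0.02, 0.4])]

# Users decide at query time how strict they want to be.
clean_subcorpus = [d for d in annotated if d["quality_label"] == "good"]
print(json.dumps(clean_subcorpus, indent=2))
```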

Reference

R. Schäfer, A. Barbaresi, and F. Bildhauer, “The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction”, in Proceedings of the 8th Web as Corpus Workshop (WAC8), 2013, pp. 7-15.