Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:

We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal part is specially crafted for each target, the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.


Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally a language detection, which does not deal with English text but …
more ...

Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic by itself, all the more since large URL lists are considered as being part of the big data paradigm. Due to the quantity of available web pages and the costs of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account and without fetching the documents they link to.

Why is that so and what can be taken away from these methods?

Interest and objectives

Obviously, the URLs contain clues regarding the ressource they point to. The URL analysis is about getting as much information as possible to try to predict several characteristics of a web page. The results may influence the way the URL is processed: prioritization, delay, building of focused URL groups, etc.

The main goal seems to be to save crawling time, bandwidth and disk space, which are issues everyone confronted to web-scale crawling has to deal with.

However, one could also argue that it is sometimes hard to figure out what hides behind a URL. Kan & Thi (2005) tackle this issue under the assumption that there …

more ...

Ludovic Tanguy on Visual Analysis of Linguistic Data

In his professorial thesis (or habilitation thesis), which is about to be made public (the defence takes place next week), Ludovic Tanguy explains why and on what conditions data visualization could help linguists. In a previous post, I showed a few examples of visualization applied to the field of readability assessment. Tanguy’s questioning is more general, it has to do with what is to include in the disciplinary field of linguistics.

He gives a few reasons to use the methods from the emerging field of visual analytics and mentions some of its upholders (like Daniel Keim or Jean-Daniel Fekete). But he also states that they are not well adapted to the prevailing models of scientific evaluation.

Why use visual analytics in linguistics ?

His main point is the (fast) growing size and complexity of linguistic data. Visualization comes at hand when selecting, listing or counting phenomena does not prove useful anymore. There is evidence from the field of cognitive psychology that an approach based on form recognition may lead to an interpretation. Briefly, new needs come forth when calculations come short.

Tanguy gives to main examples of cases where it is obvious : firstly the analysis of networks, which can be …

more ...

XML standards for language corpora (review)

Document-driven and data-driven, standoff and inline

First of all, the intention of the encoding can be different. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the first uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. […] The order of elements often is meaningless. » (Eckart 2008 p. 3)

In fact, several choices of architecture depend on the goal of an annotation using XML. The main division regards standoff and inline XML (also : stand-off and in-line).

The Paula format (“Potsdamer Austauschformat für linguistische Annotation”, ‘Potsdam Interchange Format for Linguistic Annotation’) chose both approaches. So did Nancy Ide for the ANC Project, a series of tools enable the users to convert the data between well-known formats (GrAF standoff, GrAF inline, GATE or UIMA). This versatility seems to be a good point, since you cannot expect corpus users to change their habits just because of one single corpus. Regarding the way standoff and inline annotation compare, (Dipper et al. 2007) found that the inline format (with pointers) performs better.

A few trends in linguistic research

Speaking about trends in the German …

more ...

Quick review of the Falko Project

The Falko Project is an error-annotated corpus of German as a foreign language, maintained by the Humboldt Universität Berlin who made it publicly accessible.

Recently a new search engine was made available, practically replacing the old CQP interface. This tool is named ANNIS2 and can handle complex queries on the corpus.


There are several subcorpora, and apparently more to come. The texts were written by advanced learners of German. There are most notably summaries (with the original texts and a comparable corpus of summaries written by native-speakers), essays who come from different locations (with the same type of comparable corpus) and a ‘longitudinal’ corpus coming from students of the Georgetown-University of Washington.

The corpora are annotated by a part-of-speech tagger (the TreeTagger) so that word types and lemmas are known but most of all the mistakes can be found, with several hypotheses at different levels (mainly what the correct sentence would be and what might be the reason of the mistake).


The engine (ANNIS2) has a good tutorial (in English by the way) so that it is not that difficult to search for complex patterns across the subcorpora. It seems also efficient in terms of speed. You may …

more ...

On Text Linguistics

Talking about text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well-known in Germany but is not really meant as a topic by itself elsewhere. Most of the time, no one makes a distinction between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week, who describes himself as the “last of the Mohicans” to use this framework in French research. He drew a comprehensive picture of its origin and its developments which I am going to try and sum up.

This field started to become popular in the ‘70s with books by Eugenio Coseriu, Harald Weinrich (in Germany), Frantisek Danek (and the Functional Sentence Perspective Framework) or MAK Halliday who was a lot more read in English-speaking countries. Text linguistics is not a grammatical description of language, nor is it bound to a particular language. It is a science of the texts, a theory which comes on top of several levels such as semantics or structure analysis. It enables to distinguish several classes of texts at a global …

more ...