Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:

“We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal step is specially crafted for each target: the authors speak of “manually written scripts”. Texts are picked within each website according to the authors’ knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.


Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally a language detection, which does not deal with English text but ...
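
The excerpt does not say which method the authors use for near-duplicate removal, so the following is only a minimal sketch of one common technique: comparing documents by their word shingles and Jaccard similarity. The function names, the shingle size and the threshold are my own assumptions, not taken from the paper.

```python
import re


def shingles(text, size=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short texts: use the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(documents, threshold=0.8, size=3):
    """Keep a document only if it is not too similar to one already kept.

    The threshold of 0.8 is an arbitrary, hypothetical value.
    """
    kept, kept_shingles = [], []
    for doc in documents:
        current = shingles(doc, size)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic in the number of documents, so for a real crawl one would probably switch to a hashing scheme, but it shows the principle.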

Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic by itself, all the more since large URL lists are considered as being part of the big data paradigm. Due to the quantity of available web pages and the costs of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account and without fetching the documents they link to.

Why is that so, and what can be taken away from these methods?

Interest and objectives

Obviously, URLs contain clues regarding the resource they point to. URL analysis is about getting as much information as possible in order to predict several characteristics of a web page. The results may influence the way the URL is processed: prioritization, delay, building of focused URL groups, etc.

The main goal seems to be to save crawling time, bandwidth and disk space, which are issues everyone confronted with web-scale crawling has to deal with.
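
To make this more concrete, here is a rough sketch of what extracting clues from a URL without fetching the page could look like. It is not taken from any of the papers discussed: the feature set, the function names `url_features` and `priority`, the default token lists and the weights are all hypothetical choices for illustration.

```python
import re
from urllib.parse import urlsplit


def url_features(url):
    """Extract a few surface features from a URL without fetching it."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", parts.path) if t]
    return {
        "host": host,
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "depth": len([s for s in parts.path.split("/") if s]),
        "path_tokens": [t.lower() for t in path_tokens],
        "has_query": bool(parts.query),
    }


def priority(url, wanted_tlds=("cz",), wanted_tokens=("blog", "forum", "news")):
    """Toy score: higher means 'fetch earlier'. All weights are arbitrary."""
    features = url_features(url)
    score = 0
    if features["tld"] in wanted_tlds:
        score += 2
    score += sum(1 for t in features["path_tokens"] if t in wanted_tokens)
    if features["has_query"]:
        score -= 1  # assumption: query strings often signal dynamic pages
    return score
```

A crawler frontier could then sort its queue by such a score, which is one way to read "prioritization" above. A trained classifier would replace the hand-written weights.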

However, one could also argue that it is sometimes hard to figure out what hides behind a URL. Kan & Thi (2005) tackle this issue under the assumption that there ...

A note on Computational Models of Psycholinguistics

I would like to sum up a clear synthesis of the state of the art concerning scientific traditions and ways to deal with language features as a whole. In a chapter entitled ‘Computational Models of Psycholinguistics’, published in the Cambridge Handbook of Psycholinguistics, Nick Chater and Morten H. Christiansen distinguish three main traditions in psycholinguistic language modeling:

  • a symbolic (Chomskyan) tradition
  • connectionist psycholinguistics
  • probabilistic models

They state that, until recently, the Chomskyan approach (as well as nativist theories of language in general) far outweighed any other, setting the ground for cognitive science:

“Chomsky’s arguments concerning the formal and computational properties of human language were one of the strongest and most influential lines of argument behind the development of the field of cognitive science, in opposition to behaviorism.” (p. 477)

The Symbolic Tradition

They describe the derivational theory of complexity (the hypothesis that the number and complexity of transformations correlate with processing time and difficulty) as proving ‘a poor computational model when compared with empirical data’ (p. 479). Further work on generative grammar considered the relationship between linguistic theory and processing as indirect. This is how they explain that this Chomskyan tradition progressively disengaged from work on computational modeling ...

Feeding the COW at the FU Berlin

I am now part of the COW project (COrpora on the Web). The project has been run by (amongst others) Roland Schäfer and Felix Bildhauer at the FU Berlin for about two years. Work has already been done, especially concerning long-haul crawls in several languages.


A few resources have already been made available: software, n-gram models, as well as web-crawled corpora, which for copyright reasons are not downloadable as a whole. They may be accessed through a special interface (COLiBrI – COW’s Light Browsing Interface) or downloaded upon request in a scrambled form (all sentences randomly reordered).

This is a heavy limitation, but it is still better than no corpus at all if one’s research interest does not rely too closely on features above sentence level. This example shows that legal matters ought to be addressed when it comes to collecting texts, and that web corpora are as such not easy research objects to deal with. In the end, making reliable tools public is more important than giving access to a particular corpus.

Research aim

The goal is to perform language-focused (and thus maybe language-aware) crawls and to gather relevant resources for (corpus) linguists, with a particular interest ...

Ludovic Tanguy on Visual Analysis of Linguistic Data

In his professorial thesis (or habilitation thesis), which is about to be made public (the defence takes place next week), Ludovic Tanguy explains why and under what conditions data visualization could help linguists. In a previous post, I showed a few examples of visualization applied to the field of readability assessment. Tanguy’s questioning is more general: it has to do with what should be included in the disciplinary field of linguistics.

He gives a few reasons to use methods from the emerging field of visual analytics and mentions some of its proponents (like Daniel Keim or Jean-Daniel Fekete). But he also states that these methods are not well adapted to the prevailing models of scientific evaluation.

Why use visual analytics in linguistics?

His main point is the (fast) growing size and complexity of linguistic data. Visualization comes in handy when selecting, listing or counting phenomena does not prove useful anymore. There is evidence from the field of cognitive psychology that an approach based on form recognition may lead to an interpretation. Briefly, new needs come forth when calculations fall short.

Tanguy gives two main examples of cases where it is obvious: firstly the analysis of networks, which can be ...

Review of the readability checker DeLite

Continuing a series of reviews on readability assessment, I would like to describe a tool which is close to what I intend to do. It is called DeLite and is described as a ‘readability checker’. It has been developed at the IICS research center of the FernUniversität Hagen.

From my point of view, its main feature is that it has not been made publicly available: it is based on software one has to buy, and I did not manage to find even a demo version, although its designers claim to have been publicly (i.e. EU-)funded. Thus, my description is based on what they mention in the articles quoted below.


The article by Glöckner et al. (2006) offers a description of the fundamentals of the software, as well as an interesting summary of research on readability. They depict the ‘classical’ pattern used to come to a readability formula:

  • ‘select elements in a text that are related to readability’,
  • then ‘correlate element occurrences with text readability (measured by established comprehension tests)’,
  • and finally ‘combine the variables into a regression equation’ (p. 32).
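
The three steps above can be sketched in a few lines. This is only a toy illustration of the general pattern, not DeLite’s method or the procedure of Glöckner et al.: the two features, the function names and the reduction of the regression to a single variable are my own simplifying assumptions.

```python
import re


def text_features(text):
    """Step 1: surface elements assumed to relate to readability.

    Assumes a non-empty text with at least one sentence.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }


def pearson(xs, ys):
    """Step 2: correlation between a feature and measured comprehension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Step 3, reduced to one variable: least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```

In practice the comprehension scores would come from established tests on real readers, and several variables would be combined in a multiple regression.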

This is the approach that led to a preponderance of criteria like word and sentence length, because they ...

On global vs. local visualization of readability

It is not only a matter of scale: the perspective one chooses is crucial when it comes to visualizing how difficult a text is. Two main options can be taken into consideration:

  • An overview in the form of a summary, which makes it possible to compare a series of phenomena for the whole text.
  • A visualization which takes the course of the text into account, as well as the possible evolution of parameters.
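
As a rough illustration of the difference between the two options, one can compute the same indicator either once for the whole text or along its course. The sketch below uses sentence length in words as the indicator, with a moving average for the local view. The choice of indicator, the window size and the function names are assumptions of mine.

```python
import re


def sentence_lengths(text):
    """Number of words in each sentence, in text order."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(re.findall(r"\w+", s)) for s in sentences]


def global_summary(values):
    """Global view: a few numbers describing the whole text."""
    return {
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


def local_profile(values, window=3):
    """Local view: moving average over consecutive sentences."""
    if len(values) < window:
        return [sum(values) / len(values)] if values else []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

The global summary fits in a table cell, whereas the local profile is a series that calls for a curve or a heat map.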

I already dealt with the first type of visualization on this blog when I discussed Amazon’s text stats. To sum up, their simplicity is also their main problem: they are easy to read and provide users with a first glimpse of a book, but the kind of information they deliver is not always reliable.

Sooner or later, one has to deal with multidimensional representations as the number of monitored phenomena keeps increasing. That is where real reflection is needed in order to find a visualization that is faithful and clear at the same time. I would like to introduce two examples of recent research that I find to be relevant to this issue.

An approach inspired by computer science

The first one is taken from an article by Oelke et al. (2010 ...

“Gerolinguistics” and text comprehension

The field of “gerolinguistics” is becoming more and more important. The word was first coined by G. Cohen in 1979 and it has been regularly used ever since.

How do older people read? How do they perform when trying to understand difficult sentences? These were the questions I had in mind when I recently decided to read a few papers about linguistic abilities and aging. As I work on different reader profiles, I thought it would be an interesting starting point.

The fact is that I did not find what I was looking for, but I was not disappointed, since the assumptions I had made on this matter were proved wrong by recent research. Here is what I learned.

Interindividual variability increases with age

First of all, it is difficult to build a specific profile that would address ‘older people’, as this is a vast category which is merely a subclass of ‘readers’ and which, like them, shows a great deal of individual variation. Very old people (and not necessarily old people) do have more difficulty reading, but this can be caused by very different factors. Most of all, age is not a useful predictor:

Many aspects of language comprehension remain ...

Microsoft to analyze social networks to determine comprehension level

I recently read that Microsoft was planning to analyze several social networks in order to know more about users, so that the search engine could deliver more appropriate results. See this article: Microsoft idea: Analyze social networks posts to deduce mood, interests, education.

Among the variables that are considered, the ‘sophistication and education level’ of the posts is mentioned. This is highly interesting, because it assumes a double readability assessment, on the reader’s side and on the side of the search engine. More precisely, this could refer to a classification task.

Here is an extract of a patent describing how this is supposed to work.

[0117] In addition to skewing the search results to the user’s inferred interests, the user-following engine 112 may further tailor the search results to a user’s comprehension level. For example, an intelligent processing module 156 may be directed to discerning the sophistication and education level of the posts of a user 102. Based on that inference, the customization engine may vary the sophistication level of the customized search result 510. The user-following engine 112 is able to make determinations about comprehension level several ways, including from a user’s ...

XML standards for language corpora (review)

Document-driven and data-driven, standoff and inline

First of all, the intention of the encoding can be different. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the first uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. […] The order of elements often is meaningless. » (Eckart 2008 p. 3)

In fact, several choices of architecture depend on the goal of an annotation using XML. The main division regards standoff and inline XML (also: stand-off and in-line).

The Paula format (“Potsdamer Austauschformat für linguistische Annotation”, ‘Potsdam Interchange Format for Linguistic Annotation’) chose both approaches. So did Nancy Ide for the ANC Project: a series of tools enables users to convert the data between well-known formats (GrAF standoff, GrAF inline, GATE or UIMA). This versatility seems to be a good point, since you cannot expect corpus users to change their habits just because of one single corpus. Regarding the way standoff and inline annotation compare, Dipper et al. (2007) found that the inline format (with pointers) performs better.
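
To make the inline/standoff distinction tangible, here is a small sketch that encodes the same part-of-speech annotation in both ways. The element and attribute names (`s`, `w`, `mark`, `start`, `end`, `pos`) are invented for the example and do not follow PAULA, GrAF or any other standard mentioned here.

```python
import xml.etree.ElementTree as ET

# Toy sentence with illustrative part-of-speech tags.
tokens = ["Das", "ist", "gut"]
pos_tags = ["PDS", "VAFIN", "ADJD"]


def inline_xml(tokens, tags):
    """Inline: the annotation wraps the text inside a single document."""
    sentence = ET.Element("s")
    for tok, tag in zip(tokens, tags):
        word = ET.SubElement(sentence, "w", pos=tag)
        word.text = tok
    return ET.tostring(sentence, encoding="unicode")


def standoff_xml(tokens, tags):
    """Standoff: raw text is kept apart, annotations point to character offsets."""
    text = " ".join(tokens)
    annotations = ET.Element("annotations")
    offset = 0
    for tok, tag in zip(tokens, tags):
        ET.SubElement(
            annotations, "mark",
            start=str(offset), end=str(offset + len(tok)), pos=tag,
        )
        offset += len(tok) + 1  # skip the separating space
    return text, ET.tostring(annotations, encoding="unicode")


print(inline_xml(tokens, pos_tags))
raw_text, layer = standoff_xml(tokens, pos_tags)
print(raw_text)
print(layer)
```

The inline version stays readable as a text, while the standoff version leaves the primary data untouched and lets several annotation layers coexist, which matches the two orientations described by Eckart.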

A few trends in linguistic research

Speaking about trends in the German ...

