XML standards for language corpora (review)

Document-driven and data-driven, standoff and inline

First of all, the purpose of the encoding can differ. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the first uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. […] The order of elements often is meaningless. » (Eckart 2008 p. 3)

In fact, several architectural choices depend on the goal of an XML annotation. The main division is between standoff and inline XML (also written stand-off and in-line).

The Paula format (“Potsdamer Austauschformat für linguistische Annotation”, ‘Potsdam Interchange Format for Linguistic Annotation’) chose both approaches. So did Nancy Ide for the ANC Project, where a series of tools enables users to convert the data between well-known formats (GrAF standoff, GrAF inline, GATE or UIMA). This versatility seems to be a good point, since one cannot expect corpus users to change their habits just because of one single corpus. As for how standoff and inline annotation compare, Dipper et al. (2007) found that the inline format (with pointers) performs better.
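To make the difference between the two architectures concrete, here is a minimal sketch that turns standoff annotations (character offsets stored apart from the text) into inline markup. The function name, the annotation fields (start, end, tag) and the tag names are assumptions made for illustration; they do not reproduce the Paula or GrAF schemas. The sketch also assumes that the spans do not overlap and that the text contains no characters that need XML escaping.

```javascript
// Hypothetical helper: converts standoff annotations into inline tags.
// Assumes non-overlapping spans and text without <, > or &.
function standoffToInline(text, annotations) {
    // Work on a sorted copy so the caller's array is left untouched
    var sorted = annotations.slice().sort(function (a, b) {
        return a.start - b.start;
    });
    var out = "";
    var pos = 0;
    sorted.forEach(function (ann) {
        // Copy the unannotated text before the span, then wrap the span
        out += text.slice(pos, ann.start);
        out += "<" + ann.tag + ">" + text.slice(ann.start, ann.end) + "</" + ann.tag + ">";
        pos = ann.end;
    });
    // Append whatever follows the last annotated span
    return out + text.slice(pos);
}

// Standoff: the primary text stays untouched, annotations point into it
var primaryText = "Der Hund bellt";
var standoff = [
    { start: 4, end: 8, tag: "noun" },
    { start: 0, end: 3, tag: "det" }
];
var inline = standoffToInline(primaryText, standoff);
```

In the standoff version the primary text can be shared by any number of annotation layers, whereas the inline version is easier to read but ties the annotation to one copy of the text.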

A few trends in linguistic research

Speaking about trends in the German ...

more ...

Completing web pages on the fly with JavaScript

As I am working on a new release of the German Political Speeches Corpus, I looked for a way to make web pages lighter. I have lots of repetitive information, so a little JavaScript is helpful when it comes to saving on file size. Provided that the DOM structure is available, there are elements that may be completed on load.

For example, there are span elements which include specific text. By catching them and testing them against a regular expression the script is able to add attributes (like a class) to the right ones. Without activating JavaScript one still sees the contents of the page, and with it the page appears as I intended. In fact, the attributes match properties defined in a separate CSS file.
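A minimal sketch of this idea, under stated assumptions: the helper name, the regular expression (a four-digit year) and the class name are hypothetical and only serve as an illustration, they are not the ones used for the corpus pages.

```javascript
// Hypothetical helper: adds a class to every span whose text matches a pattern.
// The pattern must not carry the "g" flag, since test() would then keep state.
function tagMatchingSpans(spans, pattern, className) {
    var count = 0;
    for (var i = 0; i < spans.length; i++) {
        if (pattern.test(spans[i].textContent)) {
            spans[i].setAttribute("class", className);
            count++;
        }
    }
    return count;
}

// In a browser, complete the page once the DOM structure is available
if (typeof document !== "undefined") {
    document.addEventListener("DOMContentLoaded", function () {
        // Assumption: spans containing only a four-digit year get the class "year"
        tagMatchingSpans(document.getElementsByTagName("span"), /^\d{4}$/, "year");
    });
}
```

The class then only needs a matching rule in the separate CSS file, and the page content stays readable when JavaScript is disabled.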

I had to look for several JavaScript commands across many websites, which is why I decided to summarize what I found in a post.

First example: append text and a target to a link

These lines of code match all the links that don’t already have an href attribute, and append to them a modified destination as well as a target attribute.

function modLink(txt){ 
    // Get all the links

var list = document ...
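The excerpt above is cut off. As an independent illustration of the technique it describes (not a reconstruction of the original function), here is a minimal sketch. The helper name, the way the destination is derived from the link text, the ".html" suffix and the "_blank" target are all assumptions.

```javascript
// Hypothetical helper: completes links that have no href attribute yet.
// Assumption: the destination is built from a prefix plus the link text.
function completeLinks(links, prefix, target) {
    var changed = 0;
    for (var i = 0; i < links.length; i++) {
        var link = links[i];
        if (!link.hasAttribute("href")) {
            // Turn the link text into a simple file name, e.g. "Rede 1" -> "rede-1"
            var slug = link.textContent.trim().toLowerCase().replace(/\s+/g, "-");
            link.setAttribute("href", prefix + slug + ".html");
            link.setAttribute("target", target);
            changed++;
        }
    }
    return changed;
}

// In a browser, run once the DOM structure is available
if (typeof document !== "undefined") {
    document.addEventListener("DOMContentLoaded", function () {
        completeLinks(document.getElementsByTagName("a"), "speeches/", "_blank");
    });
}
```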

more ...

Canadian research on readability in the ‘90s

I would like to write a word about the beginnings of computer-aided readability assessment research in Canada during the ‘90s, as they show interesting ways of thinking about and measuring the complexity of texts.


Daoust, Laroche and Ouellet (1997) start from research on readability as it prevailed in the United States: they aim to find a way of assigning a level to texts by linking them to a school grade. They assume that the discourses of the school institutions are coherent and that they can be examined as a whole. Among other things, their criteria concern lexical, dictionary-based data and grammatical observations, such as the number of proper nouns, of relative pronouns and of finite verb forms.

Several variables measure comparable aspects of text complexity and the authors wish to avoid redundancy, so they use factor analysis and multiple regression to group the variables and try to explain why a text targets a given school grade. They managed to narrow down the observations to thirty variables, whose impact on readability assessment is known. This is an interesting approach. The fact that they chose to keep about thirty variables in their study shows that readability formulas lack ...

more ...

Word lists, word frequency and contextual diversity

How does one build an efficient word list? What are the limits of word frequency measures? These issues are relevant to readability.

First, a word about the context: word lists are used to find difficulties and to try to improve teaching material, whereas word frequency is used in psycholinguistics to measure cognitive processing. Thus, this topic deals with education science, psycholinguistics and corpus linguistics.
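To make the two measures named in the title concrete, here is a minimal sketch that counts, for each word, its frequency (total number of occurrences) and its contextual diversity, understood here as the number of documents in which the word occurs. The function name, the crude tokenization and the sample documents are assumptions made for illustration.

```javascript
// Hypothetical helper: frequency and contextual diversity per word.
// Assumption: "contextual diversity" = number of documents containing the word.
function wordStats(documents) {
    // Object.create(null) avoids clashes with built-in keys such as "constructor"
    var stats = Object.create(null);
    documents.forEach(function (text) {
        // Crude tokenization: lowercase, split on anything that is not a letter
        var words = text.toLowerCase().split(/[^a-zäöüß]+/).filter(Boolean);
        var seen = Object.create(null);
        words.forEach(function (word) {
            if (!stats[word]) {
                stats[word] = { frequency: 0, diversity: 0 };
            }
            stats[word].frequency += 1;
            // Count each document only once per word
            if (!seen[word]) {
                seen[word] = true;
                stats[word].diversity += 1;
            }
        });
    });
    return stats;
}

var sampleStats = wordStats(["the cat and the dog", "the bird", "a cat"]);
```

In this toy sample "the" occurs three times but only in two documents, which shows how the two measures can diverge for the same word.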

Coxhead’s Academic Word List

The academic word list by Averil Coxhead is a good example of this approach. She finds that students are not generally familiar with academic vocabulary, giving the following examples: substitute, underlie, establish and inherent (p. 214). According to her, words of this kind are “supportive” but not “central” (these adjectives could be good examples as well).

She starts from principles of corpus linguistics and states that “a register such as academic texts encompasses a variety of subregisters”, so the corpus has to be balanced.

Coxhead’s methodology is interesting. As one can see, she has probably read the works of Douglas Biber or John Sinclair, just to name a few. (AWL stands for Academic Word List.)

« To establish whether the AWL maintains high coverage over academic texts other than those in ...

more ...

Interview with children’s books author Sabine Ludwig

Last week I had the chance to talk about complexity and readability with an experienced children’s books author, Sabine Ludwig (see also the page on the German Wikipedia). She has published around 30 books so far, as well as a dozen books which she translated from English to German. Some of them have won awards. The most successful one, Die schrecklichsten Mütter der Welt, had sold about 65,000 copies by the end of 2011 (although a few booksellers first thought it was unsuitable for children). I was able to record the interview so that I could take extensive notes afterward, which I am going to summarize.

Sabine Ludwig writes in an intuitive way, which means that she does not pay attention to the complexity of the sentences she creates. She tries to see the world through a child’s eyes, and she claims that the (inevitable) adaptation of both content and style takes place this way. She does not shorten sentences, nor does she avoid particular words. In fact, she does not want to be perfectly clear and readable for children. She does not find it to be a reasonable goal, because children can progressively learn words from a ...

more ...

My contribution to the Anglicism of the Year award

I contributed to the Anglicism of the Year award nominations. This is the second edition; the first was rather low-key but still got mentioned by the English-speaking press (e.g. by The Guardian). The jury is once again chaired by Anatol Stefanowitsch, a professor in linguistics at Hamburg University. The selection of the final nominees will be relayed by a few German bloggers specialized in linguistics.
My suggestions made it into the first round of nominees, but no selection has been made so far; this phase runs until January 7th. News can be found on the official blog.

My suggestions are:

  • das Handyticketsystem
  • whistleblowen
  • der Occupist, die Occupisten
  • die Post-Privacy

In my opinion, the latter two have good chances of advancing to the final stage. Among the other nominees I like die Fazialpalmierung (facepalm) and die Liquid Democracy. But there are not that many interesting ones, which may be one reason why the deadline was postponed by a week.

I will keep this post up to date.

Updates:

more ...

Tendencies in research on readability

In a recent article about a readability checker prototype for Italian, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi provide a good overview of current research on readability. Starting from the end of the article, I must say the bibliography is quite up-to-date and the authors offer an extensive review of criteria used by other researchers.

Tendencies in research

First of all, there is a growing tendency towards statistical language models. In fact, language models are used by Thomas François (2009) for example, who considers them a more efficient replacement for the vocabulary lists used in readability formulas.

Secondly, readability assessment at a lexical or syntactic level has been explored, but factors at a higher level still need to be taken into account. It has been acknowledged since the 1980s that the structure of texts and the development of discourse play a major role in making a text more complex. Still, it is harder to focus on discourse features than on syntactic ones.

« Over the last ten years, work on readability deployed sophisticated NLP techniques, such as syntactic parsing and statistical language modeling, to capture more complex linguistic features and used statistical machine learning to build readability assessment tools. […] Yet ...

more ...

Bibliography and links updates

As I try to put my notes in order by the end of this year, I changed a series of references, most notably in the bibliography and in the links sections.


I just updated the bibliography, using new categories. I divided the references into two main sections:

Corpus Linguistics, Complexity and Readability Assessment



First of all, I updated the links section using the W3C Link Validator. It is very useful, as it points out dead links and moved pages.

Resources for German

This is a new subsection:

Other links

I added a subsection to the links about LaTeX: LaTeX for Humanities (and Linguists).

I also added new tools and new Perl links.

more ...

A note on Amazon’s text readability stats

Recently, Jean-Philippe Magué told me about the newly introduced text stats on Amazon. A good summary by Gabe Habash on the news blog of Publishers Weekly describes the perspectives and the potential interest of this new software: Book Lies: Readability is Impossible to Measure. The stats seem to have been available since last summer. I decided to contribute to the discussion on Amazon’s text readability statistics: to what extent are they reliable and useful?


Gabe Habash compares several well-known books and concludes that sentence length is the determining factor in the readability measures used by Amazon. In fact, the readability formulas (Fog Index, Flesch Index and Flesch-Kincaid Index, for an explanation see Amazon’s text readability help) are centered on word length and sentence length, which is convenient but far from always appropriate.
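To show how strongly these formulas depend on sentence length and word length, here is a minimal sketch using the commonly published versions of the three formulas. Whether Amazon computes them in exactly this way is an assumption, and the function names and sample counts are made up for illustration. The functions take precomputed counts, since counting syllables is a separate problem.

```javascript
// Commonly published readability formulas, computed from raw counts.
// Assumption: Amazon's implementation may differ in details.
function fleschReadingEase(words, sentences, syllables) {
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
}

function fleschKincaidGrade(words, sentences, syllables) {
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

// complexWords: number of words with three or more syllables
function gunningFog(words, sentences, complexWords) {
    return 0.4 * ((words / sentences) + 100 * (complexWords / words));
}

// Same words and syllables, but written as 5 long or as 10 short sentences
var easeLongSentences = fleschReadingEase(100, 5, 150);
var easeShortSentences = fleschReadingEase(100, 10, 150);
```

With identical vocabulary, merely halving the sentence length raises the Flesch score by about ten points, which illustrates why sentence length weighs so heavily in these statistics.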

There is another metric named ‘word complexity’, which Amazon defines as follows: ‘A word is considered “complex” if it has three or more syllables’ (source: complexity help). I wonder what happens in the case of proper nouns like (again…) Schwarzenegger. There are cases where the syllable recognition is not that easy for an algorithm that was programmed and tested to perform well on English words ...

more ...

Using a rule-based tokenizer for German

In order to solve a few tokenization problems and to delimit the sentences properly, I decided not to fight with the tokenization anymore and to use an efficient script that would do it for me. There are taggers which integrate a tokenization process of their own, but that’s precisely why I need an independent one, so that the various taggers downstream can work on an equal basis.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.


  • It’s written in Perl.
  • It performs a tokenization and a sentence boundary detection.
  • It can output the result in text mode as well as in XML format, including a detailed version where all the space types are qualified.
  • It was created to perform well on German.
    • It comes with an abbreviation list which fits German standards (e.g. the street names like Hauptstr.)
    • It tries to address the problem of the dates in German, which are often written using dots (e.g. 01.01.12), using a “hard-wired list of German date expressions” according to its author.
  • The code is clear and well documented ...
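
To illustrate why an abbreviation list and date rules matter for sentence boundary detection in German, here is a minimal sketch. It is not the Perl script described above and does not reproduce its rules: the function name, the three-entry abbreviation list and the date pattern are assumptions chosen for illustration.

```javascript
// Illustrative abbreviation list (assumption, far from complete)
var ABBREVIATIONS = ["Hauptstr.", "z.B.", "Dr."];
// Dates written with dots and a trailing dot, e.g. "01.01." (assumption)
var DATE_PATTERN = /^\d{1,2}\.\d{1,2}\.(\d{2}|\d{4})?$/;

// Hypothetical helper: splits a text into sentences on whitespace-separated
// tokens ending in ., ! or ?, unless the token is an abbreviation or a date.
function splitSentences(text) {
    var tokens = text.split(/\s+/).filter(Boolean);
    var sentences = [];
    var current = [];
    tokens.forEach(function (token) {
        current.push(token);
        var endsWithMark = /[.!?]$/.test(token);
        var isProtected = ABBREVIATIONS.indexOf(token) !== -1 || DATE_PATTERN.test(token);
        if (endsWithMark && !isProtected) {
            sentences.push(current.join(" "));
            current = [];
        }
    });
    // Keep trailing material that has no final punctuation
    if (current.length > 0) {
        sentences.push(current.join(" "));
    }
    return sentences;
}

var sampleSentences = splitSentences(
    "Er wohnt in der Hauptstr. 5 in Bochum. Am 01.01. war er in Berlin. Das war am 01.01.12."
);
```

Without the two protective rules, the dots in "Hauptstr." and "01.01." would be taken for sentence ends, which is exactly the kind of error a tokenizer tuned for English tends to make on German text.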
more ...