A one-pass valency-oriented chunker for German

At the LTC’13 conference, I recently introduced a tool I developed to help perform fast text analysis on web corpora: a one-pass valency-oriented chunker for German.


“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.”
E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.


Parsers that go beyond finite-state methods provide fine-grained information, but they are computationally demanding, so it is interesting to see how far a shallow parsing approach can go.

The transducer described here performs pattern-based matching on POS tags using regular expressions and takes advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn enables an estimation of the actual valency of a given verb.

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.
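To make the idea more concrete, here is a minimal sketch in Python of one-pass chunking over POS tags. It is not the chunker presented at the conference: the patterns, the function names `chunk` and `count_np_arguments`, and the choice of STTS tags are assumptions made for illustration only. A single compiled alternation is scanned once over the tag sequence instead of applying several passes in a cascade.

```python
import re

# Hypothetical chunk patterns over STTS tags (an assumption, not the actual
# rules of the chunker). Every tag is followed by a space, and the lookbehind
# makes sure a match can only start at the beginning of a tag.
CHUNK_PATTERN = re.compile(
    r"(?<![^ ])(?:"
    r"(?P<PP>(?:APPR |APPRART )(?:ART )?(?:ADJA )*(?:NN |NE )+)"
    r"|(?P<NP>(?:ART )?(?:ADJA )*(?:NN |NE )+)"
    r"|(?P<VP>(?:VVFIN |VAFIN |VMFIN ))"
    r")"
)


def chunk(tagged):
    """Return (label, words) chunks from a list of (word, tag) pairs.

    The tag string is scanned once from left to right.
    """
    tag_string = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in CHUNK_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting spaces.
        start = tag_string.count(" ", 0, match.start())
        end = tag_string.count(" ", 0, match.end())
        words = [word for word, _ in tagged[start:end]]
        chunks.append((match.lastgroup, words))
    return chunks


def count_np_arguments(chunks):
    """Very crude valency hint: the number of NP chunks in the clause.

    PP chunks are left out here because they may be arguments or adjuncts.
    """
    return sum(1 for label, _ in chunks if label == "NP")
```

For a tagged sentence such as "Der alte Mann gibt dem Kind ein Buch in der Schule", this sketch groups the determiners, adjectives and nouns into NP chunks, the finite verb into a VP chunk and the final prepositional group into a PP chunk. A real system would need many more patterns and some handling of clause boundaries.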

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include ...


Review of the readability checker DeLite

Continuing a series of reviews on readability assessment, I would like to describe a tool which is close to what I intend to do. It is called DeLite and is described as a ‘readability checker’. It has been developed at the IICS research center of the FernUniversität Hagen.

From my point of view, its main feature is that it has not been made publicly available: it is based on software one has to buy, and I did not manage to find even a demo version, although its developers claim to have been publicly (i.e. EU-)funded. Thus, my description is based on what its designers mention in the articles quoted below.


The article by Glöckner et al. (2006) offers a description of the fundamentals of the software, as well as an interesting summary of research on readability. They describe the ‘classical’ procedure used to arrive at a readability formula:

  • ‘select elements in a text that are related to readability’,
  • then ‘correlate element occurrences with text readability (measured by established comprehension tests)’,
  • and finally ‘combine the variables into a regression equation’ (p. 32).
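The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not what DeLite or any published formula does: the two surface features, the single-predictor regression and the function names `text_features`, `pearson` and `fit_line` are all hypothetical choices.

```python
import math
import re


def text_features(text):
    """Step 1: return (mean sentence length in words, mean word length in characters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return 0.0, 0.0
    mean_sentence_length = len(words) / len(sentences)
    mean_word_length = sum(len(w) for w in words) / len(words)
    return mean_sentence_length, mean_word_length


def pearson(xs, ys):
    """Step 2: Pearson correlation between a feature and comprehension scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Step 3: least-squares fit for one predictor, returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With a feature value per text and a comprehension score per text, `fit_line` yields the coefficients of a one-variable formula of the form score = intercept + slope × feature. Classical formulas combine several such variables in the same way.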

This is the approach that led to a preponderance of criteria like word and sentence length, because they ...


Two open-source corpus-builders for German and French


I already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I went on working on this issue, I decided to release a few useful scripts under an open-source license.

The crawlers are not mere link-harvesters: they are designed to be used as corpus-builders. As one cannot republish anything but quotations of the texts, the purpose is to enable others to make their own version of the corpora. Since the newspapers are updated quite often, it is not possible to create exact duplicates. That said, the majority of the articles will be the same.

Interesting features

The crawlers are relatively fast (even though they were not set up for speed) and do not need a lot of computational resources. They may be run on a personal computer.

Due to their specialization, they are able to build a reliable corpus consisting of texts and relevant metadata (e.g. title, author, date and url). Thus, one may gather millions of tokens from home and start exploring the corpus.
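As an illustration of what such a record of text plus metadata may look like, here is a minimal sketch using only the Python standard library. It is not the released code: the `Article` record, the `ArticleParser` class and the rule "keep the title and the paragraph text, drop everything else" are assumptions. Author and date are left out because their markup differs from one newspaper to the next.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Article:
    url: str
    title: str
    text: str


class ArticleParser(HTMLParser):
    """Collect the <title> and the text of <p> elements, ignoring the rest."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_paragraph:
            self.paragraphs[-1] += data


def extract_article(url, html):
    """Strip the markup and return the title and paragraph text as an Article."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    # Normalize whitespace in the title and in every paragraph.
    title = " ".join("".join(parser.title_parts).split())
    paragraphs = [" ".join(p.split()) for p in parser.paragraphs]
    text = "\n".join(p for p in paragraphs if p)
    return Article(url=url, title=title, text=text)
```

A specialized crawler can afford much stricter rules than this, since it only has to deal with the markup of one website.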

The HTML code as well as the superfluous text are stripped in ...


2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version was made available last summer and mentioned on this blog, cf. this post: Introducing the German Political Speeches Corpus and Visualization Tool.

The resource still uses this permanent redirection: http://purl.org/corpus/german-speeches


If you don’t remember it or have never heard of it, here is a brief description:

The resource presented here consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources. It provides raw data, metadata and tokenized text with part-of-speech tagging and lemmas in XML TEI format for researchers that are able to use it and a simple visualization interface for those who want to get a glimpse of what is in the corpus before downloading it or thinking about using more complete tools.
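For readers who want to work with the annotated files directly, here is a small sketch of how tokens could be read with Python’s standard library. The exact layout of the corpus files is not described here, so the markup is an assumption: the sketch supposes TEI `w` elements in the TEI namespace carrying `pos` and `lemma` attributes, and the function name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the attribute names below are assumptions
# about the corpus files and may have to be adapted.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_tokens(xml_string):
    """Return (form, pos, lemma) triples for every <w> element in the document."""
    root = ET.fromstring(xml_string)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        tokens.append((w.text or "", w.get("pos"), w.get("lemma")))
    return tokens
```

Once the tokens are available as plain tuples, frequency counts per speaker or per year are a matter of a few more lines.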

The visualization output is valid CSS/XHTML and takes advantage of recent standards. The purpose is to give a sort of Zeitgeist, an insight into the topics developed by a government official and on ...


Bibliography and links updates

As I am trying to put my notes in order by the end of this year, I have changed a series of references, most notably in the bibliography and in the links sections.


I just updated the bibliography, using new categories. I divided the references into two main sections:

Corpus Linguistics, Complexity and Readability Assessment



First of all, I updated the links section using the W3C Link Validator. It is very useful, as it points out dead links and moved pages.

Resources for German

This is a new subsection:

Other links

I added a subsection to the links about LaTeX: LaTeX for Humanities (and Linguists).

I also added new tools and new Perl links.


Using a rule-based tokenizer for German

In order to solve a few tokenization problems and to delimit the sentences properly, I decided not to fight with the tokenization anymore and to use an efficient script that would do it for me. There are taggers which integrate a tokenization process of their own, but that’s precisely why I need an independent one, so that the several taggers downstream can work on an equal basis.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.


  • It’s written in Perl.
  • It performs a tokenization and a sentence boundary detection.
  • It can output the result in text mode as well as in XML format, including a detailed version where all the space types are qualified.
  • It was created to perform well on German.
    • It comes with an abbreviation list which fits German standards (e.g. the street names like Hauptstr.)
    • It tries to address the problem of the dates in German, which are often written using dots (e.g. 01.01.12), using a “hard-wired list of German date expressions” according to its author.
  • The code is clear and well documented ...
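To show why an abbreviation list and date patterns matter for German, here is a minimal sketch in Python. It is not a port of the Perl script: the tiny abbreviation set, the date pattern and the function names `tokenize` and `split_sentences` are assumptions made for illustration.

```python
import re

# Illustrative abbreviation list (an assumption; the real tokenizer ships its own).
ABBREVIATIONS = {"z.B.", "usw.", "Dr.", "Nr.", "Hauptstr."}

# Dates written with dots, e.g. 01.01.12 or 1.1.2012.
DATE = re.compile(r"\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})")

PUNCTUATION = set(".,;:!?")
SENTENCE_END = {".", "!", "?"}


def is_protected(chunk):
    """Abbreviations and dates keep their dots and are never split."""
    return chunk in ABBREVIATIONS or DATE.fullmatch(chunk) is not None


def tokenize(text):
    """Split on whitespace, then peel trailing punctuation off unprotected chunks."""
    tokens = []
    for chunk in text.split():
        trailing = []
        while chunk and not is_protected(chunk) and chunk[-1] in PUNCTUATION:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens


def split_sentences(tokens):
    """Start a new sentence after every standalone sentence-final mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Because the dots of "Hauptstr." and "01.01.12" stay attached to their tokens, they do not trigger a sentence break. An abbreviation that happens to end a sentence remains ambiguous in this sketch.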

Quick review of the Falko Project

The Falko Project is an error-annotated corpus of German as a foreign language, maintained by the Humboldt-Universität zu Berlin, which made it publicly accessible.

Recently, a new search engine was made available, practically replacing the old CQP interface. This tool is named ANNIS2 and can handle complex queries on the corpus.


There are several subcorpora, and apparently more to come. The texts were written by advanced learners of German. Most notably, there are summaries (with the original texts and a comparable corpus of summaries written by native speakers), essays that come from different locations (with the same type of comparable corpus) and a ‘longitudinal’ corpus coming from students of Georgetown University in Washington.

The corpora are annotated by a part-of-speech tagger (the TreeTagger), so that word types and lemmas are known. Most importantly, the mistakes can be found, with several hypotheses at different levels (mainly what the correct sentence would be and what the reason for the mistake might be).


The engine (ANNIS2) has a good tutorial (in English, by the way), so it is not that difficult to search for complex patterns across the subcorpora. It also seems efficient in terms of speed. You may ...
