My contribution to the Anglicism of the Year award

I contributed to the Anglicism of the Year award nominations. This is the second edition; the first was rather confidential but still got mentioned by the English-speaking press (e.g. by The Guardian).

The jury is once again chaired by Anatol Stefanowitsch, a professor of linguistics at Hamburg University. The selection of the final nominees will be relayed by a few German bloggers who specialize in linguistics. My suggestions made it onto the first list of nominees, but no selection has been made so far; this phase runs until January 7th. News can be found on the official blog.

My suggestions are:

  • das Handyticketsystem
  • whistleblowen …
more ...

Tendencies in research on readability

In a recent article about a readability checker prototype for Italian, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi provide a good overview of current research on readability. Starting from the end of the article, I must say the bibliography is quite up-to-date and the authors offer an extensive review of the criteria used by other researchers.

Tendencies in research

First of all, there is a growing tendency towards statistical language models. Language models are used by Thomas François (2009), for example, who considers them a more efficient replacement for the vocabulary lists used in readability formulas.

Secondly …

more ...

Bibliography and links updates

As I am trying to put my notes in order before the end of the year, I have changed a series of references, most notably in the bibliography and links sections.

Bibliography

I just updated the bibliography, using new categories. I divided the references into two main sections:

Corpus Linguistics, Complexity and Readability Assessment

Background

Links

First of all, I updated the links section …

more ...

A note on Amazon’s text readability stats

Recently, Jean-Philippe Magué told me about the newly introduced text stats on Amazon. A good summary by Gabe Habash on the news blog of Publishers Weekly describes the prospects and the potential interest of this new software: Book Lies: Readability is Impossible to Measure. The stats seem to have been available since last summer. I decided to contribute to the discussion on Amazon’s text readability statistics: to what extent are they reliable and useful?

Discussion

Gabe Habash compares several well-known books and concludes that sentence length is the determining factor in the readability measures used by Amazon. In fact, the …

more ...

Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:

  • on the one hand the TreeTagger, a hidden Markov model tagger which uses statistical rules and decision trees,
  • on the other the Stanford POS-Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know if any of the problems I encountered would be different with a newer or still-to-come version) and the basic …

more ...

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to go through directories recursively so that it would also find any old files and delete them. I also wanted to do it from the command-line interface and to integrate it into a bash script.

As I didn’t find this bash snippet as such, i.e. adapted to the LaTeX-generated files, I post it here:

find . -type f -regex ".*\.\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix …
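
A more portable variant, as a minimal sketch: `-regex` is an extension that POSIX `find` does not define, so the sketch below uses `-name` patterns instead. The function name `list_latex_tmp` is hypothetical and not part of the snippet above; it only lists the matching files so that they can be checked before anything is deleted.

```shell
# Hypothetical helper (an assumption, not from the original post):
# list LaTeX temporary files below a directory without deleting anything.
list_latex_tmp() {
    # $1: directory to search (defaults to the current directory)
    find "${1:-.}" -type f \
        \( -name '*.aux' -o -name '*.bbl' -o -name '*.blg' -o -name '*.log' \
           -o -name '*.nav' -o -name '*.out' -o -name '*.snm' -o -name '*.toc' \)
}
```

Calling `list_latex_tmp .` prints the candidates. Once the list looks right, `-exec rm -i {} \;` can be appended to the `find` call inside the function to delete the files interactively, as in the command above.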

more ...

Selected recent discoveries

Here are a few links about interesting things that I recently read.

more ...

Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability); you can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size …

more ...

Introducing the German Political Speeches Corpus and Visualization Tool

I am currently working on a resource I would like to introduce: the German Political Speeches Corpus (no acronym apart from GPS). It consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources.

As far as I know, no such corpus was publicly available for German. Until today, most of the speeches could not be found on Google (which is bound to change). The corpus can be freely republished.

The two main corpora (Presidency and Chancellery) are released in XML format, based on raw text and metadata.

There is a series of …

more ...

About Google Reading Level

Jean-Philippe Magué told me there was a Google advanced search filter that checks the result pages to give a readability estimate. In fact, it was introduced about seven months ago and, to my knowledge, works only for English (that’s also why I didn’t notice it).

Description

For more information, you can read the official help page. I also found two convincing blog posts showing how it works, one by the Unofficial Google System Blog and the other by Daniel M. Russell.

The most interesting bits of information I was able to find consist of a brief …

more ...