A note on Amazon’s text readability stats

Recently, Jean-Philippe Magué pointed me to the newly introduced text stats on Amazon. A good summary by Gabe Habash on the news blog of Publishers Weekly describes the perspectives and potential interest of this new feature: Book Lies: Readability is Impossible to Measure. The stats seem to have been available since last summer. I decided to contribute to the discussion on Amazon’s text readability statistics: to what extent are they reliable and useful?

Discussion

Gabe Habash compares several well-known books and concludes that sentence length is the determining factor in the readability measures used by Amazon. Indeed, the readability formulas (the Fog Index, the Flesch Index and the Flesch-Kincaid Index; for an explanation see Amazon’s text readability help) are centered on word length and sentence length, which is convenient but by no means always appropriate.
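For reference, these formulas reduce to simple arithmetic over a handful of counts. Here is a minimal sketch (my own illustration, not Amazon’s implementation) of the Flesch-Kincaid Grade Level and the Fog Index:

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid Grade Level: higher means harder to read.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    # Fog Index: 'complex' words are those with three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# A 100-word sample with 10 sentences and 130 syllables:
print(round(flesch_kincaid_grade(100, 10, 130), 2))  # → 3.65
```

Both formulas only see counts, which is why two very different books can end up with similar scores.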

There is another metric named ‘word complexity’, which Amazon defines as follows: ‘A word is considered “complex” if it has three or more syllables’ (source: complexity help). I wonder what happens in the case of proper nouns like (again…) Schwarzenegger. There are cases where syllable recognition is not that easy for an algorithm that was programmed and tested to perform well on English words ...
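To see why, consider the naive vowel-group heuristic that syllable counters often start from (my own sketch, not Amazon’s algorithm): it happens to handle ‘Schwarzenegger’ plausibly, but over-counts silent-e words like ‘care’.

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels.
    # No handling of silent 'e', morpheme boundaries, loanwords, etc.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def is_complex(word):
    # Amazon's stated criterion: three or more syllables.
    return count_syllables(word) >= 3

print(count_syllables("Schwarzenegger"))  # → 4 (a-e-e-e)
print(count_syllables("care"))            # → 2, though it has only one syllable
```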

more ...

Using a rule-based tokenizer for German

In order to solve a few tokenization problems and to delimit sentences properly, I decided not to fight with tokenization anymore and to use an efficient script that would do it for me. Some taggers integrate a tokenization process of their own, but that is precisely why I need an independent one: it lets the several taggers downstream work on an equal basis.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.

Features

  • It’s written in Perl.
  • It performs a tokenization and a sentence boundary detection.
  • It can output the result in text mode as well as in XML format, including a detailed version where all the space types are qualified.
  • It was created to perform well on German.
    • It comes with an abbreviation list which fits German conventions (e.g. street names like Hauptstr.)
    • It tries to address the problem of dates in German, which are often written with dots (e.g. 01.01.12), using a “hard-wired list of German date expressions”, according to its author.
  • The code is clear and well documented ...
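The date problem the script addresses can be illustrated with a small sketch of my own (a simplification, not Dipper’s actual rule): numeric German dates contain dots that a naive splitter would read as sentence boundaries.

```python
import re

# Hypothetical simplification: protect dd.mm.yy(yy) dates before
# splitting on sentence-final punctuation.
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{2,4})\b")

def split_sentences(text):
    # Shield the dots inside dates, split, then restore them.
    shielded = DATE.sub(lambda m: m.group(0).replace(".", "<DOT>"), text)
    sentences = re.split(r"(?<=[.!?])\s+", shielded)
    return [s.replace("<DOT>", ".") for s in sentences]

print(split_sentences("Die Rede vom 01.01.12 war kurz. Sie dauerte zehn Minuten."))
```

A hard-wired list, as in the actual script, additionally covers spelled-out forms like “1. Januar”, which no simple numeric pattern catches.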
more ...

Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:

  • on the one hand the TreeTagger, a Markov-model tagger which uses decision trees to estimate its transition probabilities,
  • on the other the Stanford POS-Tagger, a bidirectional maximum-entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know if any of the problems I encountered would be different with a newer or still-to-come version) and the basic version of the TreeTagger with the standard model for German.

A few issues

  • The Stanford Tagger does not recognize the € symbol; as in similar cases, it adds a word and a tag indicating that the symbol is unknown.
  • There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from strange publication software. All these hyphens appear as white spaces, even in the HTML source, but in fact they are a single Unicode character. The TreeTagger treats them as spaces, while the Stanford Tagger spits out an error, marks ...
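As a workaround, I can normalize the text before it reaches either tagger. A minimal sketch, assuming the culprit is U+2011 (the Unicode non-breaking hyphen):

```python
def normalize_hyphens(text):
    # U+2011 (non-breaking hyphen) can render like a space in some
    # fonts but is a distinct character; map it to a plain ASCII
    # hyphen so both taggers see the same token boundaries.
    return text.replace("\u2011", "-")

print(normalize_hyphens("Euro\u2011Rettungsschirm"))  # → Euro-Rettungsschirm
```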
more ...

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files (among others) that a pdflatex compilation generates. I wanted it to recurse through directories so that it would also find old files and delete them. I also wanted to do it from the command-line interface and to integrate it into a bash script.

As I didn’t find this bash snippet as such, i.e. adapted to LaTeX-generated files, I post it here:

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.

Remarks

  • Here, find walks through all directories starting from the current one (.); it could also go through absolutely all directories (/) or search your Desktop, for instance (something like $HOME/Desktop/).
  • The regular expression captures files ending with the (expandable) series of letters, but also extensionless files which merely end with those letters (like test-aux).
    If you want it to stick to file extensions, you may prefer this variant:
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part really removes the files that ...
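For use inside a larger script, a Python equivalent of the extension-anchored variant could look like this (my own sketch; the extension list mirrors the find command above, and extensionless files like test-aux are left alone):

```python
from pathlib import Path

# Temporary files produced by (pdf)latex and beamer runs.
TEMP_EXTS = {".aux", ".bbl", ".blg", ".log", ".nav", ".out", ".snm", ".toc"}

def clean_latex(root="."):
    removed = []
    for path in Path(root).rglob("*"):
        # suffix only matches a real extension, so 'test-aux' is safe.
        if path.is_file() and path.suffix in TEMP_EXTS:
            path.unlink()
            removed.append(path.name)
    return removed
```

Unlike rm -i, this deletes without prompting, so it is best tried first on a test directory.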
more ...

Selected recent discoveries

Here are a few links about interesting things that I recently read.

Links section updates

Out:
The Noisy Channel
LingPipe

In:
doink.ch
internetactu.net

more ...

Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability), you can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size and possibly font-family are set for the whole page:

    html {
        margin-left: 10%;
        margin-right: 10%;
        font-family: sans-serif;
        font-size: 10pt;
    }
    
  • Two frames, one for the main content and one for the footer, both marked as div in the XHTML file.

    div.framed {
        padding-top: 1em;
        padding-bottom: 1em;
        padding-left: 7%;
        padding-right: 7%; 
        border: 1px solid #736F6E;
        margin-bottom: 10px;
    }
    

    NB: I know there is a shorter way to set the padding, but I like to keep things clear.

  • I chose to use the good old separation rule, hr in XHTML, with custom (adaptable) spacing in the CSS:

    hr {
        margin-top: 2.5em;
        margin-bottom: 2.5em;
    }

    This ...

more ...

Introducing the German Political Speeches Corpus and Visualization Tool

I am currently working on a resource I would like to introduce: the German Political Speeches Corpus (no acronym apart from GPS). It consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources.

As far as I know, no such corpus was publicly available for German. Most speeches could not be found on Google until today (which is bound to change). The corpus can be freely republished.

The two main corpora (Presidency and Chancellery) are released in XML format, based on raw text enriched with metadata.

There are several improvements that I plan, among them better tokenization and POS tags.

I am also working on a basic visualization tool enabling users to get a first glimpse of the resource, using simple text statistics in the form of XHTML pages (a sort of Zeitgeist). For now it is static and I still need to brush up the CSS, but it is functional.

I think I could benefit from the corpus and the statistics display in my research on complexity levels.

Here is the permanent URL of the resource: http://purl.org/corpus/german-speeches
Additional information and downloads are available there.

This ...

more ...

About Google Reading Level

Jean-Philippe Magué told me there was a Google advanced search filter that checks result pages to give a readability estimate. In fact, it was introduced about seven months ago and, to my knowledge, works only for the English language (which is also why I had not noticed it).

Description

For more information, you can read the official help page. I also found two convincing blog posts showing how it works, one by the Unofficial Google System Blog and the other by Daniel M. Russell.

The most interesting bits of information I was able to find consist of a brief explanation by a product manager at Google, who created the following topic on the help forum: New Feature: Filter your results by reading level.
Note that this does not seem to have ever been a hot topic!

Apparently, it was designed as an “annotation” based on a statistical model developed using real-world data (i.e. pages that were “manually” classified by teachers). The engine works by performing a word comparison, using the model as well as articles found by Google Scholar.

In the original text:

The feature is based primarily on statistical models we built with the help of ...

more ...

A few links on producing posters using LaTeX

As I had to make a poster for the TALN 2011 conference to illustrate my short paper (PDF, in French), I decided to use LaTeX, even if it was not the easiest way. I am quite happy with the result (PDF).

I gathered a few links that helped me out. My impression is that there are two common models, and as a matter of fact I saw both of them at the conference. The one that I used, Beamerposter, was “made in Germany” by Philippe Dreuw, from the Informatics Department of the University of Aachen. I only had to adapt the model to fit my needs, which is done by editing the .sty file (it is self-explanatory).

The other one, BA Poster, was “made in Switzerland” by Brian Amberg, from the Computer Science Department of the University of Basel.

And here are the links:

more ...

Lord Kelvin, Bachelard and Dilbert on Measurement

Lord Kelvin

Here is what William Thomson, better known as Lord Kelvin, once said about measurement:

« I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be. »
William Thomson, Lecture on “Electrical Units of Measurement” (3 May 1883)

Bachelard

I found this quote in an early essay by the French philosopher Gaston Bachelard on what he calls approximate knowledge (Essai sur la connaissance approchée, 1927). For him, measures cannot be considered in themselves, and he disagrees with Thomson on this point: according to him, the fact that a measure is precise enough gives us the illusion that something exists or has just become real.

I quote in French, as I could not find an English edition; the page numbers refer to the book published by Vrin.

« Et pourtant, que ce soit dans la mesure ou dans une comparaison qualitative, il ...

more ...