Bibliography and links updates

As I try to put my notes in order before the end of the year, I have updated a series of references, most notably in the bibliography and in the links section.

Bibliography

I just updated the bibliography, using new categories. I divided the references into two main sections:

Corpus Linguistics, Complexity and Readability Assessment

Background

Links

First of all, I checked the links section using the W3C Link Validator. It is very useful, as it points out dead links and moved pages.

Resources for German

This is a new subsection:

Other links

I added a subsection to the links about LaTeX: LaTeX for Humanities (and Linguists).

I also added new tools and new Perl links.

more ...

A note on Amazon’s text readability stats

Recently, Jean-Philippe Magué pointed me to the newly introduced text stats on Amazon. A good summary by Gabe Habash on the Publishers Weekly news blog, Book Lies: Readability is Impossible to Measure, describes the prospects and potential interest of this new feature. The stats seem to have been available since last summer. I decided to contribute to the discussion on Amazon’s text readability statistics: to what extent are they reliable and useful?

Discussion

Gabe Habash compares several well-known books and concludes that sentence length is the decisive factor in the readability measures used by Amazon. Indeed, the readability formulas (Fog Index, Flesch Index and Flesch-Kincaid Index; for an explanation see Amazon’s text readability help) are centered on word length and sentence length, which is convenient but far from always appropriate.
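
For reference, here is a minimal sketch of these formulas (the standard textbook definitions; the counts in the toy example are invented, and whether Amazon uses exactly these constants or how it obtains its counts is not documented here):

    # Standard readability formulas, given precomputed counts.
    def flesch_reading_ease(words, sentences, syllables):
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

    def flesch_kincaid_grade(words, sentences, syllables):
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    def gunning_fog(words, sentences, complex_words):
        # "complex" words = words with three or more syllables
        return 0.4 * ((words / sentences) + 100 * (complex_words / words))

    # Toy example: 1000 words, 50 sentences, 1500 syllables, 120 complex words
    print(flesch_kincaid_grade(1000, 50, 1500))  # about 9.9
    print(gunning_fog(1000, 50, 120))            # 12.8

As the formulas show, only sentence length (words per sentence) and word length (syllables per word, or the share of long words) enter the computation; nothing about vocabulary, syntax or discourse does.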

There is another metric named ‘word complexity’, which Amazon defines as follows: ‘A word is considered “complex” if it has three or more syllables’ (source: complexity help). I wonder what happens in the case of proper nouns like (again…) Schwarzenegger. There are cases where syllable recognition is not that easy for an algorithm that was programmed and tested to perform well on English words ...
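
To illustrate the kind of heuristic involved (this is my own toy sketch, not Amazon’s implementation), here is a naive vowel-group counter with an English silent-e rule; it happens to get Schwarzenegger right, but the silent-e rule misfires on a German name like Goethe:

    import re

    def count_syllables(word):
        # Toy heuristic: count vowel groups, subtract one for a final silent e.
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and count > 1:
            count -= 1
        return max(count, 1)

    def is_complex(word):
        # Amazon's stated criterion: three or more syllables
        return count_syllables(word) >= 3

    print(count_syllables("Schwarzenegger"), is_complex("Schwarzenegger"))  # 4 True
    print(count_syllables("Goethe"))  # 1, although the name has two syllables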

more ...

Using a rule-based tokenizer for German

In order to solve a few tokenization problems and to delimit sentences properly, I decided to stop fighting with tokenization myself and to use an efficient script that would do it for me. Some taggers integrate a tokenization step of their own, but that is precisely why I need an independent one, so that the various taggers downstream can work on an equal basis.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.

Features

  • It’s written in Perl.
  • It performs tokenization and sentence boundary detection.
  • It can output the result as plain text as well as in XML format, including a detailed version in which all whitespace types are specified.
  • It was created to perform well on German.
    • It comes with an abbreviation list that fits German standards (e.g. street names like Hauptstr.).
    • It tries to address the problem of dates in German, which are often written with dots (e.g. 01.01.12), using a “hard-wired list of German date expressions”, according to its author (the sketch after this list shows why this matters).
  • The code is clear and well documented ...
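
To see what the abbreviation and date handling buys, here is a naive, punctuation-only sentence splitter (a toy counter-example of my own, not Dipper’s script) applied to a German sentence; it wrongly breaks after the date and after the abbreviated street name:

    import re

    # Naive splitter: break at whitespace preceded by ., ! or ?
    def naive_split(text):
        return re.split(r"(?<=[.!?])\s+", text)

    text = "Die Feier findet am 01.01. in der Hauptstr. 5 statt. Alle sind herzlich eingeladen."
    for segment in naive_split(text):
        print(segment)
    # Die Feier findet am 01.01.
    # in der Hauptstr.
    # 5 statt.
    # Alle sind herzlich eingeladen.

A rule-based tokenizer with abbreviation and date lists avoids exactly these spurious boundaries.
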
more ...

Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:

  • on the one hand, the TreeTagger, a Markov model tagger that uses decision trees to estimate transition probabilities,
  • on the other hand, the Stanford POS Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know whether any of the problems I encountered would disappear in a newer or forthcoming version) and the basic version of the TreeTagger with the standard model for German.

A few issues

  • The Stanford Tagger does not recognize the € symbol; as in similar cases, it adds a word and a tag indicating that the symbol is unknown.
  • There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from odd publishing software. These hyphens all display as white space, including in the HTML source, but they are in fact a distinct Unicode character (a possible pre-processing workaround is sketched below this list). The TreeTagger treats them as spaces, while the Stanford Tagger throws an error, marks ...
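
As a minimal pre-processing sketch (assuming the culprit is U+2011, the Unicode non-breaking hyphen; the file names are hypothetical), such characters can be normalized before the text is fed to either tagger, so that both work on the same input:

    # Replace non-breaking hyphens (U+2011) with plain hyphens before tagging.
    # Adjust the mapping to whatever actually occurs in the corpus.
    def normalize(text):
        return text.replace("\u2011", "-")

    with open("speech.txt", encoding="utf-8") as infile:        # hypothetical input
        cleaned = normalize(infile.read())
    with open("speech.clean.txt", "w", encoding="utf-8") as outfile:
        outfile.write(cleaned)
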
more ...

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to recurse through directories, so that it would also find and delete old files along the way. I also wanted to do it from the command line and to be able to integrate it into a bash script.

As I didn’t find this bash snippet as such, i.e. adapted to the LaTeX-generated files, I post it here :

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.

Remarks

  • Here find goes through all the directories starting from the current one (.); it could also go through absolutely all directories (/) or search your desktop, for instance (something like $HOME/Desktop/).
  • The regular expression captures files ending with the (extendable) series of letters given, but also files without an extension that merely end in those letters (like test-aux).
    If you want it to match file extensions only, you may prefer this variant:
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part really removes the files that ...
more ...

Selected recent discoveries

Here are a few links about interesting things that I recently read.

Links section updates

Out:
The Noisy Channel
LingPipe

In:
doink.ch
internetactu.net

more ...

Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability); you can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size and possibly font-family are set for the whole page:

    html {
        margin-left: 10%;
        margin-right: 10%;
        font-family: sans-serif;
        font-size: 10pt;
    }
    
  • Two frames, one for the main content and one for the footer, marked up as div elements in the XHTML file:

    div.framed {
        padding-top: 1em;
        padding-bottom: 1em;
        padding-left: 7%;
        padding-right: 7%; 
        border: 1px solid #736F6E;
        margin-bottom: 10px;
    }
    

    NB: I know there is a faster way to set the padding, but I like to keep things clear.

  • I chose to use the good old separation rule, hr in XHTML, with custom (adaptable) spacing in the CSS:

    hr {
        margin-top: 2.5em;
        margin-bottom: 2.5em;
    }

    This ...

more ...

Introducing the German Political Speeches Corpus and Visualization Tool

I am currently working on a resource I would like to introduce: the German Political Speeches Corpus (no acronym apart from GPS). It consists of speeches by the last German Presidents and Chancellors, as well as a few ministers, all gathered from official sources.

As far as I know, no such corpus was publicly available for German. Most of the speeches could not be found on Google until today (which is bound to change). The corpus can be freely republished.

The two main corpora (Presidency and Chancellery) are released in XML format, combining raw text and metadata.

I plan a series of improvements, among them a better tokenization and POS tags.

I am also working on a basic visualization tool that enables users to get a first glimpse of the resource, using simple text statistics in the form of XHTML pages (a sort of Zeitgeist). For now it is static and I still need to polish the CSS, but it is functional.

I think I could benefit from the corpus and the statistics display in my research on complexity levels.

Here is the permanent URL of the resource: http://purl.org/corpus/german-speeches
Additional information and downloads are available there.

This ...

more ...

About Google Reading Level

Jean-Philippe Magué told me there was a Google advanced search filter that examines the result pages to give a readability estimate. In fact, it was introduced about seven months ago and, to my knowledge, works only for the English language (which is also why I had not noticed it).

Description

For more information, you can read the official help page. I also found two convincing blog posts showing how it works, one by the Unofficial Google System Blog and the other by Daniel M. Russell.

The most interesting bits of information I was able to find consist of a brief explanation by a product manager at Google, who created the following topic on the help forum: New Feature: Filter your results by reading level.
Note that this never seems to have been a hot topic!

Apparently, it was designed as an “annotation” based on a statistical model developed using real-world data (i.e. pages that were “manually” classified by teachers). The engine works by performing a word comparison, using the model as well as articles found by Google Scholar.

In the original text:

The feature is based primarily on statistical models we built with the help of ...

more ...

A few links on producing posters using LaTeX

As I had to make a poster for the TALN 2011 conference to illustrate my short paper (PDF, in French), I decided to use LaTeX, even though it was not the easiest way. I am quite happy with the result (PDF).

I gathered a few links that helped me out. My impression is that there are two common models, and as a matter of fact I saw both of them at the conference. The one that I used, Beamerposter, was “made in Germany” by Philippe Dreuw, from the Informatics Department of the University of Aachen. I only had to adapt the model to fit my needs, which is done by editing the .sty file (it is self-explanatory).

The other one, BA Poster, was “made in Switzerland” by Brian Amberg, from the Computer Science Department of the University of Basel.

And here are the links:

more ...