Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:

  • on the one hand the TreeTagger, a Markov model tagger which uses decision trees to estimate transition probabilities,
  • on the other hand the Stanford POS Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know if any of the problems I encountered would be different with a newer or still-to-come version) and the basic version of the TreeTagger with the standard model for German.

A few issues

  • The Stanford Tagger does not recognize the € symbol; as in similar cases, it adds a word and a tag indicating that the symbol is unknown.
  • There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from odd publication software. These hyphens display as white space, including in the HTML source, but they are in fact a distinct Unicode character (see the normalization sketch below). The TreeTagger treats them as spaces; the Stanford Tagger spits out an error, marks ...
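As a workaround, such characters can be normalized before tagging. Here is a minimal sketch, assuming the character at stake is U+2011 (NON-BREAKING HYPHEN) and that the input files are UTF-8; the file names are placeholders, and the code point should be adjusted to whatever actually occurs in the corpus:

# replace non-breaking hyphens (U+2011) by plain hyphens before feeding the taggers
perl -CSD -pe 's/\x{2011}/-/g' speech.txt > speech-clean.txt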
more ...

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to recurse through directories so that it would also find old files and delete them. I also wanted to do it from the command-line interface and to integrate it into a bash script.

As I didn’t find this bash snippet as such, i.e. adapted to LaTeX-generated files, I post it here:

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.
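Since one of the goals was to integrate the command into a bash script, here is a minimal wrapper sketch; the script name and the argument handling are my own illustration, not part of the original snippet:

#!/bin/bash
# cleanup.sh (hypothetical name): remove LaTeX temporary files below a given directory
# the optional first argument sets the starting point, the default is the current directory
TARGET="${1:-.}"
find "$TARGET" -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;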

Remarks

  • Here find goes through all directories starting from the current one (.); it could also go through absolutely all directories (/) or search your desktop, for instance (something like $HOME/Desktop/).
  • The regular expression captures files ending with the given (extensible) series of letters, but also files with no extension that end the same way (like test-aux).
    If you want it to stick to file extensions, you may prefer this variant:
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part really removes the files that ...
more ...

Selected recent discoveries

Here are a few links about interesting things that I recently read.

Links section updates

Out:
The Noisy Channel
LingPipe

In:
doink.ch
internetactu.net

more ...

Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability); you can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size and optionally font-family are set for the whole page:

    html {
        margin-left: 10%;
        margin-right: 10%;
        font-family: sans-serif;
        font-size: 10pt;
    }
    
  • Two frames, one for the main content and one for the footer, marked up as div elements in the XHTML file:

    div.framed {
        padding-top: 1em;
        padding-bottom: 1em;
        padding-left: 7%;
        padding-right: 7%; 
        border: 1px solid #736F6E;
        margin-bottom: 10px;
    }
    

    NB: I know there is a faster way to set the padding, but I like to keep things clear.

  • I chose to use the good old separation rule, hr in XHTML, with custom (adaptable) spacing in the CSS:

    hr {
        margin-top: 2.5em;
        margin-bottom: 2.5em;
    }

    This ...
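As a side note (my addition, not from the original post): the faster way to set the padding mentioned in the NB above is the shorthand property, which collapses the four declarations of div.framed into one:

    padding: 1em 7%; /* 1em top and bottom, 7% left and right */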

more ...

Introducing the German Political Speeches Corpus and Visualization Tool

I am currently working on a resource I would like to introduce: the German Political Speeches Corpus (no acronym apart from GPS). It consists of speeches by recent German Presidents and Chancellors, as well as a few ministers, all gathered from official sources.

As far as I know, no such corpus was publicly available for German. Most speeches could not be found on Google until today (which is bound to change). The corpus can be freely republished.

The two main corpora (Presidency and Chancellery) are released in XML format, based on raw text and metadata.

I plan a series of improvements, among them better tokenization and POS tags.

I am also working on a basic visualization tool enabling users to get a first glimpse of the resource, using simple text statistics in the form of XHTML pages (a sort of Zeitgeist). For now it is static and I still need to brush up the CSS, but it is functional.

I think I could benefit from the corpus and the statistics display in my research on complexity levels.

Here is the permanent URL of the resource: http://purl.org/corpus/german-speeches
Additional information and downloads are available there.

This ...

more ...

About Google Reading Level

Jean-Philippe Magué told me there was a Google advanced search filter that checks result pages to give a readability estimate. In fact, it was introduced about seven months ago and, to my knowledge, works only for the English language (which is also why I hadn’t noticed it).

Description

For more information, you can read the official help page. I also found two convincing blog posts showing how it works, one by the Unofficial Google System Blog and the other by Daniel M. Russell.

The most interesting bits of information I was able to find come from a brief explanation by a product manager at Google, who created the following topic on the help forum: New Feature: Filter your results by reading level.
Note that this does not seem to have ever been a hot topic!

Apparently, it was designed as an “annotation” based on a statistical model developed using real-world data (i.e. pages that were “manually” classified by teachers). The engine works by performing a word comparison, using the model as well as articles found by Google Scholar.

In the original text:

The feature is based primarily on statistical models we built with the help of ...

more ...

A few links on producing posters using LaTeX

As I had to make a poster for the TALN 2011 conference to illustrate my short paper (PDF, in French), I decided to use LaTeX, even if it was not the easiest way. I am quite happy with the result (PDF).

I gathered a few links that helped me out. My impression is that there are two common models, and as a matter of fact I saw both of them at the conference. The one that I used, Beamerposter, was “made in Germany” by Philippe Dreuw, from the Informatics Department of the University of Aachen. I only had to adapt the model to fit my needs, which is done by editing the .sty file (it is self-explanatory).

The other one, BA Poster, was “made in Switzerland” by Brian Amberg, from the Computer Science Department of the University of Basel.

And here are the links:

more ...

Lord Kelvin, Bachelard and Dilbert on Measurement

Lord Kelvin

Here is what William Thomson, better known as Lord Kelvin, once said about measurement:

« I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be. »
William Thomson, Lecture on “Electrical Units of Measurement” (3 May 1883)

Bachelard

I found this quote in an early essay by the French philosopher Gaston Bachelard on what he calls “approximate knowledge” (Essai sur la connaissance approchée, 1927). For him, measurements cannot be considered in themselves, and he does not agree with Thomson on this point. According to him, the mere fact that a measurement is precise enough gives us the illusion that something exists or has just become real.

I quote in French, as I could not find an English edition at hand; the page numbers refer to the edition published by Vrin.

« Et pourtant, que ce soit dans la mesure ou dans une comparaison qualitative, il ...

more ...

Crawling a newspaper website to build a corpus

Building on my previous post about specialized crawlers, I will show how to crawl the website of the French sports newspaper L’Equipe using scripts written in Perl, which is what I did lately. This is for educational purposes: it works for now, but it is bound to stop being effective as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl #assuming you're using a UNIX-based system...
use strict; #because it gets messy without, and because Perl is faster that way
use Encode; #you have to get the correct encoding settings of the pages
use LWP::Simple; #to get the webpages
use Digest::MD5 qw(md5_hex);

Just a word about the last line: we are going to use a hash function to shorten the links and make sure we fetch each page only once.

my $url = "http://www.lequipe.fr/"; #the starting point

$page = get $url; #the variables ought to be defined somewhere before
$page = encode("iso-8859-1", $page); #because the pages are not in Unicode format
push (@done_md5, substr(md5_hex($url), 0, 8 ...
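To make the deduplication idea concrete, here is a hedged sketch of how the fingerprint can be used to skip pages that have already been fetched; apart from @done_md5 and $url, the variable names are mine and not necessarily those of the full script:

my $fingerprint = substr(md5_hex($url), 0, 8); #shortened hash of the link
unless (grep { $_ eq $fingerprint } @done_md5) { #only proceed if the link was not seen yet
    my $page = get $url; #fetch the page
    push (@done_md5, $fingerprint); #remember the link as done
}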

more ...

Building a basic specialized crawler

As I went on crawling again in the last few days, I thought it could be helpful to describe how I proceed.

Note that this is for educational purposes only (I am not claiming that I built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. That implies I can tell which links I want to follow using regular expressions alone, because I have observed how a given website is organized.

I see two (or possibly three) steps in the process, which I will go through, giving a few hints in pseudocode.

A shell script

You might want to write a shell script to fire the two main phases automatically and/or to save your results on a regular basis (if something goes wrong after a reasonable number of explored pages, you don’t want to lose all the work, even if it’s mainly CPU time and electricity).
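Here is a minimal sketch of what such a shell script could look like; the script and file names are placeholders of my own, not those of an actual implementation:

#!/bin/bash
# phase 1: gather the list of links to follow
perl gather_links.pl > links.txt
# save the intermediate result so that it is not lost if the second phase fails
cp links.txt links-backup.txt
# phase 2: fetch and process the pages of interest
perl fetch_pages.pl links.txt > results.txt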

A list of links

If the website has an archive, a sitemap or a general list of its contents, you can save time by picking the interesting links once and for all.

going through a shortlist of archives
DO {
    fetch page
    find ...

more ...