Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:

on the one hand the TreeTagger, a hidden Markov model tagger that uses decision trees to estimate its transition probabilities,
on the other the Stanford POS Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German, and the basic version of the TreeTagger with the standard model for German. I don’t know whether any of the problems I encountered would be different with a newer version.

A few issues

The Stanford Tagger does not recognize the € symbol; as in similar cases, it adds a word and a tag explaining that the symbol is unknown.
There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from odd publishing software. All these hyphens appear as white spaces, including in the HTML source, but they are in fact a distinct Unicode character. The TreeTagger treats them as spaces; the Stanford Tagger reports an error, marks the character as unknown and continues.
Although the texts are in German (and the parameter files are set accordingly), the Stanford Tagger surprisingly applies English rules to the quotation below: it splits “cannot” into “can” and “not”, whereas the TreeTagger treats “cannot” as a single word. This explained why a word had been added somewhere around this sentence. A workaround I found was to alter the word “cannot” into “can”, which might be a problem if the corpus is to be republished (which it is bound to be).
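For the first two issues (the € symbol and the non-breaking hyphens), one possible workaround is to normalize the characters before tagging while keeping a log of what was changed, so that the original text can be restored for republication. The following is only a minimal sketch: it assumes the hyphens are U+2011 (NON-BREAKING HYPHEN), and both the function name `normalize_for_tagging` and the choice of the replacement word "Euro" are hypothetical.

```python
# Hypothetical pre-tagging cleanup; the replacement choices are assumptions.
REPLACEMENTS = {
    "\u2011": "-",     # NON-BREAKING HYPHEN -> plain hyphen-minus (assumed code point)
    "\u20ac": "Euro",  # EURO SIGN -> spelled-out word (assumed acceptable for tagging)
}


def normalize_for_tagging(text):
    """Return (cleaned_text, log).

    log is a list of (offset_in_original, original_char) pairs, so every
    substitution can be traced back to the untouched source text.
    """
    out = []
    log = []
    for offset, char in enumerate(text):
        if char in REPLACEMENTS:
            log.append((offset, char))
            out.append(REPLACEMENTS[char])
        else:
            out.append(char)
    return "".join(out), log


sample = "Die deutsch\u2011franz\u00f6sische Zusammenarbeit kostet 5 \u20ac."
cleaned, log = normalize_for_tagging(sample)
print(cleaned)
print(log)
```

Keeping the original files untouched and tagging only the normalized copy avoids altering the corpus itself.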

« The world we have created is a product of our thinking; it cannot be changed without changing thinking. »

Here is the result:

The_NE world_NE we_FM have_FM created_FM is_FM a_XY product_FM of_FM our_FM thinking_FM ;_$. it_FM can_FM not_ADJD be_ADJA changed_NN without_VVFIN changing_ADJD thinking_VVFIN ._$.
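Because the two taggers do not always tokenize the same way, their outputs have to be aligned before the tags can be compared. Here is a rough sketch of how this could be done with Python's standard `difflib` module. It assumes that the Stanford output is in the `word_TAG` format shown above and that the TreeTagger output has one token per line with tab-separated fields (token, tag, optionally lemma). The function names are hypothetical, and the `?` tags in the TreeTagger sample are placeholders, not real tagger output.

```python
from difflib import SequenceMatcher


def parse_stanford(line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST underscore, so words that
        # themselves contain an underscore are kept intact.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


def parse_treetagger(text):
    """Split 'token<TAB>tag[<TAB>lemma]' lines into (word, tag) pairs."""
    pairs = []
    for row in text.splitlines():
        fields = row.split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def align(stanford, treetagger):
    """Yield (op, stanford_pairs, treetagger_pairs) for each aligned span.

    op is 'equal' where both tokenizations agree, and 'replace',
    'insert' or 'delete' where they differ.
    """
    a = [word for word, _ in stanford]
    b = [word for word, _ in treetagger]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        yield op, stanford[i1:i2], treetagger[j1:j2]


# Excerpt of the Stanford output from this post.
stanford = parse_stanford("it_FM can_FM not_ADJD be_ADJA")
# Placeholder rows: the tokens follow the behaviour described above,
# the '?' tags are NOT real TreeTagger output.
treetagger = parse_treetagger("it\t?\ncannot\t?\nbe\t?")

ops = list(align(stanford, treetagger))
for op, left, right in ops:
    print(op, left, right)
```

Spans marked `equal` can be compared tag by tag; the other spans show where the tokenization diverges and need to be inspected by hand.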

My opinion at this stage

A comparative study of the speed and accuracy of several taggers (reference below) confirms my impression: the Stanford Tagger is only slightly more accurate, at a substantial computational cost.

Moreover, the TreeTagger does not raise as many issues.

E. Giesbrecht and S. Evert, “Part-of-Speech (POS) Tagging – a solved task? An Evaluation of POS Taggers for the German Web as Corpus”, in Proceedings of the Fifth Web as Corpus Workshop (WAC5), San Sebastian, 2009, pp. 27-35.
