I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:
- on the one hand the TreeTagger, a hidden Markov model tagger which uses decision trees to estimate its transition probabilities,
- on the other the Stanford POS Tagger, a bidirectional maximum entropy tagger.
This is easier said than done.
I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German, and the basic version of the TreeTagger with the standard model for German. I don’t know whether any of the problems I encountered would be different with a newer or still-to-come version of the Stanford Tagger.
A few issues
- The Stanford Tagger does not recognize the € symbol and, as in similar cases, it adds a word and a tag explaining that the symbol is unknown.
- There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from strange publication software. All these hyphens appear as white spaces, including in the HTML source, but they are in fact a Unicode character. The TreeTagger treats them as spaces; the Stanford Tagger emits an error, marks the character as unknown and continues.
- Although the texts are in German (and the parameter files are set accordingly), the Stanford Tagger surprisingly applies English rules to the quotation below. I wondered why a word had been added somewhere around this sentence: the Stanford Tagger splits “cannot” into “can” and “not”, whereas the TreeTagger treats “cannot” as a single word. A workaround I found was to change the word “cannot” to “can”, which might be a problem if the corpus is to be republished (which it is bound to be).
« The world we have created is a product of our thinking; it cannot be changed without changing thinking. »
Here is the result:
The_NE world_NE we_FM have_FM created_FM is_FM a_XY product_FM of_FM our_FM thinking_FM ;_$. it_FM can_FM not_ADJD be_ADJA changed_NN without_VVFIN changing_ADJD thinking_VVFIN ._$.
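To locate where a tagger has added or merged a word, the tagged output can be compared with the tokens one expects. The following is only a minimal sketch in Python: the function names `parse_tagged` and `token_differences` are made up for this illustration, and the "expected" token list is an assumption (a plain whitespace split with “cannot” kept whole, as the TreeTagger does).

```python
import difflib

def parse_tagged(line, sep="_"):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains an underscore keeps it.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs

def token_differences(expected_tokens, tagged_tokens):
    """Return the spans where the two token sequences disagree."""
    matcher = difflib.SequenceMatcher(
        a=expected_tokens, b=tagged_tokens, autojunk=False
    )
    return [
        (expected_tokens[i1:i2], tagged_tokens[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

tagged = (
    "The_NE world_NE we_FM have_FM created_FM is_FM a_XY product_FM "
    "of_FM our_FM thinking_FM ;_$. it_FM can_FM not_ADJD be_ADJA "
    "changed_NN without_VVFIN changing_ADJD thinking_VVFIN ._$."
)
# Assumed reference tokenization, with "cannot" as a single token.
expected = (
    "The world we have created is a product of our thinking ; it "
    "cannot be changed without changing thinking ."
).split()

pairs = parse_tagged(tagged)
diffs = token_differences(expected, [word for word, _ in pairs])
print(diffs)  # [(['cannot'], ['can', 'not'])]
```

Run over a whole file, a check like this points to the sentences where the two tokenizations drift apart, without having to alter the corpus itself.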
My opinion at this stage
A comparative study of the speed and accuracy of several taggers (reference below) confirms my impression: it finds that the Stanford Tagger is (only) a little more accurate, at a substantial computational cost.
Moreover, the TreeTagger does not raise as many issues.
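Several of the issues listed above come from characters that a tagger does not expect (the € symbol, the non-breaking hyphens). One way to spot them before tagging is to scan the texts for unusual characters. This is a rough sketch in Python, not part of either tagger: the function name `unexpected_characters`, the set of "expected" characters and the sample sentence are assumptions chosen for the illustration.

```python
import unicodedata
from collections import Counter

# Assumption: which non-ASCII characters are "expected" depends on the
# corpus; this set is only an illustration for German text.
EXPECTED_EXTRA = set("äöüÄÖÜß„“”‚‘’–«»")

def unexpected_characters(text):
    """Count non-ASCII characters outside the expected set.

    Keys combine the code point and the Unicode name, so that characters
    which look like a space or a plain hyphen can still be identified.
    """
    counts = Counter(
        ch for ch in text
        if ord(ch) > 127 and ch not in EXPECTED_EXTRA
    )
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}": n
        for ch, n in counts.items()
    }

# Made-up sample sentence containing a euro sign and a non-breaking
# hyphen (U+2011).
sample = "Das kostet 5 € in der Europa\u2011Politik."
report = unexpected_characters(sample)
for description, n in report.items():
    print(n, description)
```

Whether the reported characters should then be replaced or left alone is an editorial decision, especially if the corpus is to be republished.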
E. Giesbrecht and S. Evert, “Part-of-Speech (POS) Tagging – a solved task? An Evaluation of POS Taggers for the German Web as Corpus”, in Proceedings of the Fifth Web as Corpus Workshop (WAC5), San Sebastian, 2009, pp. 27-35.