I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS taggers:
- on the one hand the TreeTagger, a Markov model tagger which estimates its transition probabilities using decision trees,
- on the other hand the Stanford POS Tagger, a bidirectional maximum entropy tagger.
This is easier said than done.
I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German, and the basic version of the TreeTagger with the standard model for German. I don’t know whether any of the problems I encountered would be different with a newer or still-to-come version.
A few issues
- The Stanford Tagger does not recognize the € symbol and, as in similar cases, it adds a word and a tag explaining that the symbol is unknown.
- There are non-breaking hyphens in my corpus, which (in my opinion) result from an overly hasty cleaning of the texts before they were published, or from strange publishing software. All the hyphens appear as white spaces, including in the HTML source, but in fact they are a Unicode character. The TreeTagger treats them as spaces, the Stanford Tagger spits out an error, marks …
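
One possible way around both character issues is to inspect and normalize the text before handing it to either tagger. The following is only a minimal sketch in Python, not what I actually ran: it assumes that the problematic hyphen is U+2011 (NON-BREAKING HYPHEN) and that replacing € with the word "Euro" is acceptable for the analysis. The function names are made up for the illustration.

```python
import unicodedata
from collections import Counter

# Assumption: the hyphens in the corpus are U+2011 (NON-BREAKING HYPHEN).
# Assumption: writing out the euro sign as "Euro" is fine for tagging purposes.
REPLACEMENTS = {
    "\u2011": "-",     # non-breaking hyphen -> plain ASCII hyphen
    "\u20ac": "Euro",  # euro sign -> a token the tagger model is likely to know
}


def report_unusual_characters(text):
    """Count characters outside Latin-1, keyed by their Unicode name.

    German umlauts and sharp s are inside Latin-1, so they are not reported.
    """
    counts = Counter()
    for char in text:
        if ord(char) > 0xFF:
            counts[unicodedata.name(char, "UNKNOWN")] += 1
    return counts


def normalize_for_tagging(text):
    """Apply the replacements above (hypothetical helper, not part of any tagger)."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    return text


if __name__ == "__main__":
    sample = "Die Ost\u2011West\u2011Beziehungen kosten 5 \u20ac."
    print(report_unusual_characters(sample))
    print(normalize_for_tagging(sample))
```

Running the report first shows which code points are really in the texts, so the replacement table can be adjusted if the assumption about U+2011 turns out to be wrong.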