I have been working with TreeTagger, the part-of-speech tagger developed at the IMS Stuttgart, since my master's thesis. It performs well on German texts, as one would expect, since that was one of its primary purposes. One major problem is that it is poorly documented, so I would like to share the way I found to pass text to TreeTagger through a pipe.
The first thing to know is that TreeTagger doesn’t take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to ISO-8859-1, which iconv does very well with the TRANSLIT option set. Here the option means “find an equivalent if the character cannot be translated exactly”.
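As a small illustration of the conversion step, here is a sketch of the two iconv calls wrapped in functions. The names `to_latin1` and `to_utf8` are my own and purely illustrative, and the `//TRANSLIT` suffix assumes an iconv that supports it (GNU iconv does).

```bash
#!/bin/bash
# Hypothetical helper names; only iconv itself comes from the post.
# Assumes an iconv implementation that understands //TRANSLIT.
to_latin1() { iconv -f UTF-8 -t ISO-8859-1//TRANSLIT; }
to_utf8()   { iconv -f ISO-8859-1 -t UTF-8; }

# 'f\303\274r' is the UTF-8 byte sequence for "für" (ü = 0xC3 0xBC).
# In Latin-1 the word takes 3 bytes instead of 4.
printf 'f\303\274r' | to_latin1 | wc -c

# Converting back gives the original UTF-8 text again.
printf 'f\303\274r\n' | to_latin1 | to_utf8
```

Characters that have no Latin-1 equivalent are where TRANSLIT matters; what they get replaced with depends on the iconv implementation and the locale.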
Then you have to define the options that you want to use. I put the most frequent ones in the example.
The advantage of a pipe is that you can clean the text while passing it to the tagger. Here is one way of doing it, using the stream editor sed to: 1. remove empty lines, 2. replace every run of whitespace with a single space, and 3. replace spaces with new lines.
This way TreeTagger gets one word per line, as required, which I find very convenient.
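To see the three cleaning steps in isolation, here is a minimal sketch that wraps them in a function and runs them on a sample sentence. The function name `tokenize` is hypothetical, and the expressions assume GNU sed, since `\s`, `\+` and `\n` in the replacement are GNU extensions.

```bash
#!/bin/bash
# Hypothetical wrapper around the three cleaning steps; assumes GNU sed.
# 1. delete empty lines
# 2. squeeze every run of whitespace into a single space
# 3. turn each space into a newline, giving one word per line
tokenize() {
  sed -e '/^$/d' \
      -e 's/\s\+/ /g' \
      -e 's/ /\n/g'
}

printf 'Das ist  ein   Test.\n\nNoch einer.\n' | tokenize
```

Note that this only splits on whitespace, so punctuation stays attached to the preceding word ("Test." is passed to the tagger as a single token).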
Starting from a text file, you get one line per word, containing the word, its tag and (with the -lemma option) its lemma.
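Since the output is line-based, it is easy to process further in the same pipe. The sketch below counts how often each tag occurs, assuming tab-separated columns (word, tag, lemma) as produced with the `-token -lemma` options. The function name `tag_counts` is hypothetical, and the sample lines are hand-written for illustration, not real tagger output.

```bash
#!/bin/bash
# Hypothetical helper: count tag frequencies in tab-separated tagger output.
# Column 2 is assumed to hold the tag (layout: word, tag, lemma).
tag_counts() {
  awk -F '\t' 'NF >= 2 { n[$2]++ } END { for (t in n) print t "\t" n[t] }' |
    LC_ALL=C sort
}

# Hand-written sample in the same three-column layout, not real tagger output.
printf 'Hund\tNN\tHund\nund\tKON\tund\nKatze\tNN\tKatze\n' | tag_counts
```

In the real pipeline, the function would simply be appended after the final iconv call.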
Here is the code that I use:
#!/bin/bash

INPUT=~/file.txt
TAGGER=~/something/TreeTagger/bin/tree-tagger
OPTIONS="-token -lemma"
PARMFILE=~/something/TreeTagger/lib/german.par

# Clean the text (GNU sed), convert it to Latin-1, tag it, convert the result back to UTF-8.
< "$INPUT" sed -e '/^$/d' -e 's/\s\+/ /g' -e 's/ /\n/g' \
  | iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT \
  | $TAGGER $OPTIONS $PARMFILE \
  | iconv --from-code=ISO-8859-1 --to-code=UTF-8//TRANSLIT \
  | ... >
That’s all! Please let me know if these lines prove useful.
Update: You can also directly use the UTF-8-encoded model for German, which was apparently trained on the same texts and should behave the same way (although I have no evidence for that).