Update 2018-01-19: I would now use the SoMaJo tokenizer for Python, which is both convenient and efficient, especially on newer text types. For the set of rules it applies, see its source code.
In order to solve a few tokenization problems and to delimit sentences properly, I decided to stop fighting with tokenization and to use an efficient script that would do it for me. Some taggers integrate a tokenization step of their own, but that is precisely why I need an independent one: so that the various taggers downstream all work on the same basis. I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.
- It’s written in Perl.
- It performs tokenization and sentence boundary detection.
- It can output the result as plain text as well as in XML format, including a detailed version in which all whitespace types are qualified.
- It was created to perform well on German.
- It comes with an abbreviation list that fits German conventions (e.g. street names like Hauptstr.).
- It tries to address the problem of dates in German …
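To illustrate the kind of rules such a tokenizer relies on, here is a minimal Python sketch (not the original Perl script) of sentence boundary detection with an abbreviation list and a rule for German ordinal dates; the abbreviation set and the example sentence are my own illustrative choices:

```python
import re

# Illustrative German abbreviation list; real tokenizers ship a much longer one.
ABBREVIATIONS = {"Hauptstr.", "Dr.", "Nr.", "bzw.", "z.B."}

# German dates use ordinal numbers with a period, e.g. "3. Oktober";
# such a period must not be treated as a sentence boundary.
ORDINAL_DATE = re.compile(r"^\d{1,2}\.$")

def split_sentences(text):
    """Split whitespace-tokenized text into sentences by rule."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            # Suppress the boundary for known abbreviations and date ordinals.
            if tok in ABBREVIATIONS or ORDINAL_DATE.match(tok):
                continue
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Sie wohnt in der Hauptstr. 5 in Bochum. Am 3. Oktober fahren wir los."
print(split_sentences(text))
# → ['Sie wohnt in der Hauptstr. 5 in Bochum.', 'Am 3. Oktober fahren wir los.']
```

A naive splitter on every period would cut after "Hauptstr." and after "3.", producing four fragments instead of two sentences, which is exactly the kind of error the abbreviation list and date rule are there to prevent.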