Using a rule-based tokenizer for German

Tokenization is a text segmentation process that divides written text into meaningful units. This post introduces two rule-based methods for tokenizing text in German, English and beyond.
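The post's own tokenizer rules are not reproduced here, but as a minimal sketch of the rule-based idea, an ordered alternation of regular expressions (using only Python's standard `re` module) already segments text: rules for multi-character units such as abbreviations and numbers are tried before the generic word and punctuation rules. The abbreviation list below is illustrative, not taken from the post.

```python
import re

# Ordered rule-based tokenizer sketch: alternatives are tried left to
# right, so rules for longer units (abbreviations, numbers) must come
# before the generic word and punctuation rules.
TOKEN_PATTERN = re.compile(
    r"""
    (?:z\.B\.|d\.h\.|usw\.|etc\.)   # a few German abbreviations, kept whole
    |\d+(?:[.,]\d+)*                # numbers, incl. decimal/thousands marks
    |\w+(?:-\w+)*                   # words, incl. hyphenated compounds
    |[^\w\s]                        # any single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Split text into tokens by scanning for the first matching rule."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Das kostet z.B. 3,50 Euro – nicht schlecht!"))
# ['Das', 'kostet', 'z.B.', '3,50', 'Euro', '–', 'nicht', 'schlecht', '!']
```

Because the rules are plain regular expressions, extending the tokenizer to another language mostly means adding entries to the abbreviation rule rather than changing the scanning logic.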


A simple multilingual lemmatizer for Python

Task at hand: lemmatization ≠ stemming

In computer science, canonicalization (also known as standardization or normalization) is a process for converting data that has more than one possible representation into a standard, normal, or canonical form. In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran and running are forms of the same lexeme, and run is the lemma chosen to represent all of them.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by …
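To make this grouping concrete, here is a toy dictionary-lookup lemmatizer, a minimal sketch rather than the post's actual implementation; the `LEMMA_TABLE` entries and the `lemmatize` helper are purely illustrative.

```python
# Toy dictionary-based lemmatizer: inflected forms are mapped back to
# their lemma by lookup, falling back to the token itself when unknown.
# The table is illustrative; a real lemmatizer ships large per-language
# dictionaries.
LEMMA_TABLE = {
    "en": {"runs": "run", "ran": "run", "running": "run"},
    "de": {"läuft": "laufen", "lief": "laufen", "gelaufen": "laufen"},
}

def lemmatize(token: str, lang: str = "en") -> str:
    """Return the canonical form of a token, or the token unchanged."""
    return LEMMA_TABLE.get(lang, {}).get(token.lower(), token)

print([lemmatize(t) for t in ["run", "runs", "ran", "running"]])
# ['run', 'run', 'run', 'run'] – all inflected forms grouped under one lemma
print(lemmatize("läuft", lang="de"))
# 'laufen'
```

A suffix-stripping stemmer could reduce running to run, but only a lookup (or more elaborate rules) recovers run from the irregular form ran, which is the practical difference behind the heading above.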
