Task at hand: lemmatization ≠ stemming

In computer science, canonicalization (also known as standardization or normalization) is a process for converting data that has more than one possible representation into a standard, normal, or canonical form. In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran and running are forms of the same lexeme, with run as the lemma that can be selected to represent all of its possible forms.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma or dictionary form. Unlike stemming, which reduces word forms to stems that are not necessarily valid roots, lemmatization outputs word units that are still valid linguistic forms.
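
To make the difference concrete, here is a minimal sketch using the Simplemma library introduced below. The naive_stem() function is a deliberately crude stand-in for a stemmer (not a real stemming algorithm), and the Simplemma output is given as the expected value:

# toy contrast between suffix stripping (stemming-like) and lemmatization
import simplemma

def naive_stem(word):
    # chop off a few common English suffixes, whether or not a valid word remains
    for suffix in ('ies', 'ing', 'ed', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(naive_stem('studies'))                      # 'stud', not a valid word form
print(simplemma.lemmatize('studies', lang='en'))  # expected: 'study', a dictionary form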

In modern natural language processing, this task is often tackled only indirectly, by larger systems encompassing a whole processing pipeline. However, there seems to be no straightforward way to address lemmatization on its own in Python, even though the task is useful in information retrieval and language processing.

Context

Web text collections are ideal for tracking new language trends. They are of particular interest for rare, non-standard or new forms, that is mostly adjectives, nouns and verbs. Since we are dealing with new or rare phenomena, and with texts that are sometimes not completely clean, errors will happen. Still, grouping forms under their lemmata makes frequency calculations and various word searches (including queries on a database) more reliable.

So I started looking for a way to quickly reduce known and unknown word forms to a lemma or dictionary form. Most morphological analysis systems for German are not completely open-source, so critical components have to be installed separately, which can be cumbersome. I also wondered how efficient a generic approach could be.

Introducing the Simplemma library

The Python library Simplemma provides a simple and multilingual approach to the search for base forms or lemmata; it currently supports 35 languages. It may not be as powerful as full-fledged solutions, but it is generic, easy to install and straightforward to use. By design it should be reasonably fast and work in a large majority of cases.

To this day, the library partly or fully supports Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Finnish, French, Gaelic, Galician, Georgian, German, Hungarian, Indonesian, Irish, Italian, Latin, Latvian, Lithuanian, Luxembourgish, Manx, Persian, Portuguese, Romanian, Russian, Slovak, Slovene, Spanish, Swedish, Turkish, Ukrainian, Urdu, and Welsh. For a detailed list, please refer to the homepage as well as its credits.

This kind of rule-based approach, relying on flexion/lemmatization dictionaries and rules, is still used in popular, state-of-the-art libraries such as spaCy. With its comparatively small footprint, it is especially useful when speed and simplicity matter, for educational purposes, or as a baseline system for lemmatization and morphological analysis.
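
As an illustration of the general principle only (this is not Simplemma's actual code, and the dictionary entries below are made up), such a dictionary-plus-rules lemmatizer boils down to something like this:

# toy dictionary-plus-rules lemmatizer, for illustration only
LEMMA_TABLE = {'ran': 'run', 'running': 'run', 'mice': 'mouse'}  # made-up flexion entries

def toy_lemmatize(word):
    # 1. look the form up in the flexion dictionary
    if word in LEMMA_TABLE:
        return LEMMA_TABLE[word]
    # 2. fall back to a simple rule over known lemmata
    if word.endswith('s') and word[:-1] in LEMMA_TABLE.values():
        return word[:-1]
    # 3. otherwise return the form unchanged
    return word

print(toy_lemmatize('ran'))   # 'run'
print(toy_lemmatize('runs'))  # 'run' (via the fallback rule)
print(toy_lemmatize('xyz'))   # 'xyz'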

Installation

The package is written in pure Python with no dependencies; it can be installed and used quite easily:

pip install simplemma (or pip3 where applicable)

For a tutorial on the installation of Python libraries see Installing Packages with pip.
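
To check that the installation worked, one can for instance print the installed version using Python's standard importlib.metadata module (Python 3.8 or later):

python3 -c "from importlib.metadata import version; print(version('simplemma'))"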

Usage with Python

Three steps are required to use the library:

  1. import the package
  2. decide which language data to use
  3. apply word-by-word or on a text

Word-by-word

Simplemma is used by selecting a language of interest and then applying the corresponding data to a word or a list of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language data to load
# and apply it to a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'

Lists of tokens

It can be more convenient to use it on a list of tokens:

>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> for token in mytokens:
...     simplemma.lemmatize(token, lang='de')
...
'hier'
'sein'
'Vaccines'

It is even more convenient and generally faster to use list comprehensions:

>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']

Tokenization

A simple tokenization function is included for convenience only; it isn't especially sophisticated, but it covers most cases:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy and silent as arguments:

>>> from simplemma import text_lemmatizer
# caveat: desejo is also a noun, should be desejar here
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', lang='pt')
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
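
Assuming text_lemmatizer() indeed just chains the two functions shown above, the following is roughly equivalent:

>>> from simplemma import simple_tokenizer, lemmatize
>>> [lemmatize(t, lang='pt') for t in simple_tokenizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.')]
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']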

Chaining languages

With its multilingual capacity, Simplemma can be configured to tackle several languages of interest. Chaining several languages can indeed improve coverage:

>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'

Caveats

# don't expect too much though
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> simplemma.lemmatize('son', lang='es')
'son' # a valid common noun, but what about the verb form (from ser)?

As the focus lies on overall coverage of more rarely seen forms, some short frequent words (typically pronouns) can need post-processing; this generally concerns at most a few dozen tokens per language.
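
A lightweight way to deal with this is a small post-processing table applied after lemmatization. The sketch below uses a made-up override entry, not a correction taken from the Simplemma data:

# hypothetical post-processing step for a handful of short, frequent forms
import simplemma

OVERRIDES = {'den': 'der'}  # made-up example entry for a German article form

def lemmatize_with_overrides(token, lang='de'):
    # prefer the override where one exists, otherwise fall back to Simplemma
    if token in OVERRIDES:
        return OVERRIDES[token]
    return simplemma.lemmatize(token, lang=lang)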

The greedy algorithm can lead to forms that are not valid. It is mainly useful for long words and neologisms, or for morphologically rich languages.
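
To gauge the impact on a given dataset, a simple check is to run both modes and inspect the tokens on which they diverge, for instance:

# compare default and greedy modes to spot potentially problematic analyses
import simplemma

tokens = ['spaghettis', 'Vaccines']  # example tokens from above
for t in tokens:
    default_lemma = simplemma.lemmatize(t, lang=('it', 'fr'))
    greedy_lemma = simplemma.lemmatize(t, lang=('it', 'fr'), greedy=True)
    if default_lemma != greedy_lemma:
        print(t, default_lemma, greedy_lemma)  # inspect divergences manually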

As Simplemma mostly acts as a wrapper for lemmatization lists and rules, in some cases the original lists are wrong and need to be rectified.

For all the caveats above, as well as other problems, bug reports on the issues page are welcome!