Language detection, or language identification, is the task of recognizing the natural language an input text is written in. Computational approaches usually treat it as a text categorization problem. The language detector langid.py by Marco Lui and Tim Baldwin has become a standard in its field. It relies on statistical methods and a series of features covering 97 languages.

Using the modernized fork I have been working on (py3langid) as an example, I show how to maintain and optimize a Python package in three practical steps.

Pickling the model

The model langid.py uses to classify texts is the first part of the code I looked into. It is the engine supporting the package and a critical component. It was shipped as a compressed string which was decoded on import and again every time the identifier was initialized.
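For context, the original packaging followed roughly this pattern: a base64-encoded, bz2-compressed pickle embedded as a string constant and unpacked at load time. Here is a simplified sketch with made-up names, not the actual langid.py code:

# standard library modules
import base64
import bz2
import pickle

# hypothetical placeholder for the embedded model string
model_string = b"…"

def load_embedded_model(string):
    # decode, decompress and unpickle the embedded model
    return pickle.loads(bz2.decompress(base64.b64decode(string)))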

It turned out this string could be pickled and compressed more efficiently to make it available more quickly. Here is how to make pickling and compression work:

# standard compression module
import lzma
# potentially faster than pickle
import _pickle as cpickle

# The extension doesn't matter, let's say "pickled LZMA"
filename = "model.plzma"
mystring = "…"  # to be replaced by the actual model

# pickle, compress and write to the file in one go
with lzma.open(filename, "wb") as filehandle:
    cpickle.dump(mystring, filehandle)

The dump() function also takes an optional protocol argument: protocol 4 is supported by Python 3.4+ and protocol 5 by Python 3.8+. One can simply default to the highest available protocol with pickle.HIGHEST_PROTOCOL, so that the last line becomes cpickle.dump(mystring, filehandle, protocol=cpickle.HIGHEST_PROTOCOL).
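Loading the model back works the other way around. Here is a minimal sketch, reusing the placeholder filename from above rather than the package's actual file path:

import lzma
import pickle

# read, decompress and unpickle in one go
with lzma.open("model.plzma", "rb") as filehandle:
    model = pickle.load(filehandle)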

All the changes discussed here are summarized in this commit diff. In addition to the method described above, the new model packaging had to be made available to relevant package functions.

These steps led to a major improvement: loading the model now runs 10x faster.
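To verify such a speedup on your own setup, the loading step can be wrapped in a function and timed with the standard timeit module. This is a hypothetical benchmark sketch, not the exact measurement behind the figure above:

import lzma
import pickle
import timeit

def load_model():
    # load the pickled, LZMA-compressed model written earlier
    with lzma.open("model.plzma", "rb") as filehandle:
        return pickle.load(filehandle)

# average loading time in seconds over 100 runs
print(timeit.timeit(load_model, number=100) / 100)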

Feature extraction loop

A significant amount of time was spent extracting features from each text in order to apply the classification model to them. This code plays a central role as it runs every time the language detector sees a text, so even small improvements add up in the end.

Old code

In a nutshell, here is how the old code worked:

from collections import defaultdict
import numpy as np

arr = np.zeros((self.nb_numfeats,), dtype='uint32')

# use a default dict to count state occurrences
state = 0
statecount = defaultdict(int)
for letter in text:
    # get the next state from the model's transition table
    state = self.tk_nextmove[(state << 8) + letter]
    statecount[state] += 1

# update all the productions corresponding to each state
for state in statecount:
    for index in self.tk_output.get(state, []):
        arr[index] += statecount[state]

The array was thus updated entry by entry, with arr[index] += value operations derived from the states gathered in the statecount dictionary.

New code

The new code optimizes these two steps by merging them: it first gathers the series of array indexes to update, then applies the counts to the array in one go:

from collections import Counter
import numpy as np

arr = np.zeros(self.nb_numfeats, dtype='uint32')

# store the indexes to update as a flat list
state = 0
indexes = []
for letter in text:
    state = self.tk_nextmove[(state << 8) + letter]
    # directly gather the corresponding productions
    indexes.extend(self.tk_output.get(state, []))

# use a counter and write the values into the array in one pass
for index, value in Counter(indexes).items():
    arr[index] = value

It turned out that skipping a step and not updating the array repeatedly improved performance: the code now runs 2-3x faster.
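The equivalence of the two strategies can be checked on toy data. The transition and output tables below are made up for illustration and have nothing to do with the real model:

from collections import Counter, defaultdict
import numpy as np

# made-up tables: state transitions and productions per state
tk_nextmove = defaultdict(int, {(0 << 8) + 97: 1, (1 << 8) + 98: 2})
tk_output = {1: [0], 2: [1, 2]}
text = b"abab"

# old strategy: count states, then update the array state by state
arr_old = np.zeros(4, dtype='uint32')
state, statecount = 0, defaultdict(int)
for letter in text:
    state = tk_nextmove[(state << 8) + letter]
    statecount[state] += 1
for state in statecount:
    for index in tk_output.get(state, []):
        arr_old[index] += statecount[state]

# new strategy: gather indexes, then write the counts in one pass
arr_new = np.zeros(4, dtype='uint32')
state, indexes = 0, []
for letter in text:
    state = tk_nextmove[(state << 8) + letter]
    indexes.extend(tk_output.get(state, []))
for index, value in Counter(indexes).items():
    arr_new[index] = value

print(np.array_equal(arr_old, arr_new))  # True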

This commit diff contains all changes described above.

Vectors and data types

Finally, there was something to be optimized in the array itself. NumPy is a library for Python which provides support for large, multi-dimensional arrays and matrices, along with efficient high-level mathematical functions to operate on them.

NumPy arrays are initialized using specified data types. Switching from the uint32 (unsigned integer type, compatible with C unsigned int) to the uint16 (unsigned integer type, compatible with C unsigned short) data type proved to be beneficial here.

The change happened at array initialization, as the feature vector is first filled with zeros:

# old
arr = np.zeros((self.nb_numfeats,), dtype='uint32')
# new, also: simpler syntax
arr = np.zeros(self.nb_numfeats, dtype='uint16')

It may seem simple, but the package performs a series of operations on matrices (e.g. matrix multiplication), and it turns out that these operations are much faster when the arrays take up less memory.
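Concretely, the memory footprint of the feature vector is halved, which can be checked directly. The vector length below is an arbitrary placeholder, not the actual number of features:

import numpy as np

# same length, half the memory per entry
print(np.zeros(10000, dtype='uint32').nbytes)  # 40000 bytes
print(np.zeros(10000, dtype='uint16').nbytes)  # 20000 bytes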

The change could have an influence if the number of occurrences gets really high. On overflow, there would be no error: the feature counts would simply be reduced to values fitting the chosen data type. This ceiling (65535 for uint16) is high enough that it should not affect the results, as the relative differences between features would remain (e.g. a count capped at 65535 vs a count of 5).

Once again, the code runs 2-3x faster thanks to this change. In total, it runs about five times faster by combining both optimizations of these very frequently used code sections.

Summing up

Making the right changes at the right spots can make Python code run much faster. Pickling critical components, optimizing loops and choosing the right data types are ideas to start from.

Py3langid has been introduced here as an example because it is based on a popular, robust and reasonably fast library. However, it is not the only option for the task. There are other good language identification libraries, and their usefulness can vary depending on the use case and especially on the languages of interest.

There is also a much faster C library by the same author (langid.c); however, pure Python code using NumPy is more portable and easier to maintain. Another option would be to write Cython code for the critical NumPy parts and embed it into the package, which is doable but requires a lot of scaffolding to make sure the code runs on all platforms.

Links