Language detection, or language identification, consists of recognizing the natural language an input text is written in. Computational approaches usually treat this problem as a text categorization task. The language detector
langid.py by Marco Lui and Tim Baldwin has become a standard in its field. It uses statistical methods based on a series of features covering 97 languages.
Using the modernized fork I have been working on (py3langid) as an example, I show how to maintain and optimize a Python package in three practical steps.
Pickling the model
The model which langid.py uses to classify texts is the first part of the code I looked into. It is the engine supporting the package and a critical component. It was provided as a compressed string which was loaded on import and initialized anew each time.
It turned out this string variable could be further compressed to make it available more quickly. Here is how to make pickling and compression work:
# standard compression module
import lzma
# potentially faster than pickle
import _pickle as cpickle

# the extension doesn't matter, let's say "pickled LZMA"
filename = "model.plzma"
mystring = "…"  # to be replaced by the actual model

# pickle, compress and write to the file in one go
with lzma.open(filename, "w") as filehandle:
    cpickle.dump(mystring, filehandle)
The dump() function also takes an optional protocol argument: protocol 4 is supported by Python 3.4+, protocol 5 by Python 3.8+. One can simply default to the highest available protocol, pickle.HIGHEST_PROTOCOL, so that the last line becomes:

cpickle.dump(mystring, filehandle, protocol=cpickle.HIGHEST_PROTOCOL)
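Loading the model back works the same way in reverse: lzma.open decompresses the file and pickle.load restores the object in one go. Here is a minimal sketch using a small placeholder object instead of the actual model (the file name is the hypothetical one from the example above):

```python
import lzma
import pickle

filename = "model.plzma"  # hypothetical name, as in the example above
data = {"model": [1, 2, 3]}  # small placeholder instead of the real model

# pickle and compress in one go, using the highest available protocol
with lzma.open(filename, "wb") as filehandle:
    pickle.dump(data, filehandle, protocol=pickle.HIGHEST_PROTOCOL)

# decompress and unpickle in one go
with lzma.open(filename, "rb") as filehandle:
    model = pickle.load(filehandle)

print(model == data)  # True
```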
All the changes discussed here are summarized in this commit diff. In addition to the method described above, the new model packaging had to be made available to relevant package functions.
These steps led to a major improvement: loading the model now runs 10x faster.
Feature extraction loop
A significant amount of time was spent extracting features from each text in order to apply the classification model to them. This function is central as it runs each time the language detector sees a text, so even small improvements are meaningful in the end.
In a nutshell, here is how the old code worked:
from collections import defaultdict
import numpy as np

arr = np.zeros((self.nb_numfeats,), dtype='uint32')
# use a default dict to count occurrences
statecount = defaultdict(int)
for letter in text:
    # get the next state from the model
    state = self.tk_nextmove[(state << 8) + letter]
    statecount[state] += 1
# update all the productions corresponding to each state
for state in statecount:
    for index in self.tk_output.get(state, []):
        arr[index] += statecount[state]
The array was thus updated with operations like arr[index] += value, using values derived from the states found in the text.
The new code optimizes these two steps by merging them to get a series of array indexes to update and then applying them to the array in one go:
from collections import Counter
import numpy as np

arr = np.zeros(self.nb_numfeats, dtype='uint32')
# store indexes in a list
indexes = []
for letter in text:
    state = self.tk_nextmove[(state << 8) + letter]
    # directly gather the corresponding productions
    indexes.extend(self.tk_output.get(state, []))
# use a counter and update each value in the array only once
for index, value in Counter(indexes).items():
    arr[index] = value
It turned out that skipping a step and refraining from updating the array multiple times led to improved performance, with the code running 2-3x faster.
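Both versions produce the same feature counts. The following toy sketch uses made-up stand-ins for the tk_nextmove and tk_output tables (tuple keys here purely for readability, not the package's actual encoding) and checks that the merged loop matches the old two-step version:

```python
from collections import Counter, defaultdict
import numpy as np

# hypothetical, tiny stand-ins for the model tables
tk_nextmove = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
tk_output = {1: [0, 3], 2: [3]}
text = "aba"

# old approach: count states first, then update the array state by state
arr_old = np.zeros(5, dtype="uint32")
state = 0
statecount = defaultdict(int)
for letter in text:
    state = tk_nextmove[(state, letter)]
    statecount[state] += 1
for state, count in statecount.items():
    for index in tk_output.get(state, []):
        arr_old[index] += count

# new approach: collect all indexes, then write each array cell once
arr_new = np.zeros(5, dtype="uint32")
state = 0
indexes = []
for letter in text:
    state = tk_nextmove[(state, letter)]
    indexes.extend(tk_output.get(state, []))
for index, value in Counter(indexes).items():
    arr_new[index] = value

print(np.array_equal(arr_old, arr_new))  # True
```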
This commit diff contains all changes described above.
Vectors and data types
Finally, there was something to be optimized in the array in itself. NumPy is a library for Python which provides support for large, multi-dimensional arrays and matrices, along with efficient high-level mathematical functions to operate on these arrays.
NumPy arrays are initialized using specified data types. Switching from the
uint32 (unsigned integer type, compatible with C unsigned int) to the
uint16 (unsigned integer type, compatible with C unsigned short) data type proved to be beneficial here.
The change happened on array initialization, as the features are padded with zeros first:
# old
arr = np.zeros((self.nb_numfeats,), dtype='uint32')
# new, also: simpler syntax
arr = np.zeros(self.nb_numfeats, dtype='uint16')
It may seem simple, but the package performs a series of operations on matrices (e.g. matrix multiplication) and it turns out that these are much faster with smaller arrays.
The change could have an influence if the number of occurrences gets really high. On overflow, no error is raised: the feature counts simply wrap around to values fitting the smaller data type. This ceiling being high enough, it should not affect the results in practice, as the differences between features generally remain (e.g. a count in the tens of thousands vs a count of 5).
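The wrap-around behavior can be checked directly: NumPy integer arrays overflow silently instead of raising an error, and the smaller type also halves the memory footprint.

```python
import numpy as np

# uint16 holds values from 0 to 65535
arr = np.array([65535, 5], dtype="uint16")
arr += 1  # no error: values wrap around modulo 65536
print(arr)  # [0 6]

# side benefit: the memory footprint is halved compared to uint32
print(np.zeros(1000, dtype="uint16").nbytes)  # 2000
print(np.zeros(1000, dtype="uint32").nbytes)  # 4000
```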
Once again, the code was made to run 2-3x faster by implementing these changes. In total, it runs about five times faster by combining both changes to very frequently used code sections.
Making the right changes at the right spots can make Python code run much faster. Pickling critical components, optimizing loops and choosing the right data types are ideas to start from.
Py3langid has been introduced here as an example as it is based on a popular library, robust and reasonably fast. However, it is not the only option for the task at hand: there are other good language identification libraries, whose usefulness can vary depending on the use case and especially on the languages of interest.
There is also a much faster C library by the same author (langid.c), but pure Python code using NumPy is more portable and easier to maintain. Another option would be to write Cython code for critical NumPy parts and to embed it into the package, which is doable but needs a lot of scaffolding to make sure the code can run on all platforms.