Tokenization is a text segmentation process whose objective resides in dividing written text into meaningful units. A token is a sequence of characters that are grouped together as a useful semantic unit for processing. Tokens can be words, numbers or punctuation marks, while certain characters such as punctuation can be thrown away depending on particular goals.
Typical natural language processing tasks feature tokenization as a first step: given a character sequence and a defined document unit, smaller units are created by locating word boundaries. Methods used to identify tokens include regular expressions, specific sequences of characters termed a flag, specific separating characters called delimiters, and explicit definition by a dictionary.
Word delimiters and tokenization rules are language-specific: punctuation and space use differs from one language to another. A frequent approach also resides in language-aware lists of abbreviations. Tokenization is loosely related to sentence segmentation in the sense that wrongly tokenizing text can also negatively impact sentence boundary detection, for instance by wrongly segmenting “1.5” into “1⎵.⎵5”.
Rule-based tokenization
There are NLP toolchains which integrate a tokenization process of their own, but despite the growing relevant of deep learning frameworks it can still be useful to perform a tokenization step on its own, notably to compare the output of different tools. The original post on this topic was ten years old: I first found an interesting Perl script written by Stefanie Dipper of the University of Bochum, Germany. It is no longer online and there are now other alternatives. In this revised post, I show two examples of rule-based tokenization:
- The first is fast and useful when speed and simplicity matters
- The second has been shown to be very precise on German and English benchmarks
Both are based on regular expression patterns, that is a corresponding to a particular kind of mathematical model of computation called finite state automatons (FSA). In fact, there is a FSA for every regular expression, see this proof: From Regular Expressions to FSAs.
Baseline tokenization
Simplemma is a lemmatizer with a comparatively small footprint including a tokenizer. The library is also described in this post: A simple multilingual lemmatizer for Python . The following examples focus on its simple regex-based tokenizer.
Installation is straightforward through Python package managers:
$ pip install simplemma # or pip3 where applicable
This simple tokenization isn’t especially good but should cover most cases quite efficiently:
>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
>>> simple_tokenizer('I envy paranoids; they actually feel people are paying attention to them. "Susan Sontag Finds Romance," interview by Leslie Garis, The New York Times (2 August 1992)')
['I', 'envy', 'paranoids', ';', 'they', 'actually', 'feel', 'people', 'are', 'paying', 'attention', 'to', 'them', '.', '"', 'Susan', 'Sontag', 'Finds', 'Romance', ',', '"', 'interview', 'by', 'Leslie', 'Garis', ',', 'The', 'New', 'York', 'Times', '(', '2', 'August', ' 1992', ')']
Precise tokenization
I would now use the SoMaJo tokenizer for Python, which is both convenient and efficient on newer texts. For the series of rules used, see the source code.
SoMaJo is a state-of-the-art tokenizer and sentence splitter for German and English web and social media texts. In addition to tokenizing the input text, SoMaJo can also output token class information for each token, i.e. if it is a number, an emoticon, an abbreviation, etc.
Installation through a package manager:
pip install somajo # or pip3 where applicable
Here is a simple example of usage with Python:
>>> from somajo import SoMaJo
>>> paragraphs = ["That aint bad!:D"]
>>> tokenizer = SoMaJo(language="en_PTB")
>>> sentences = tokenizer.tokenize_text(paragraphs)
>>> for sentence in sentences:
>>> for token in sentence:
>>> print(token.text)
That
ai
nt
bad
!
:D
Additional methods
For a comparison of different tokenizers from different backgrounds see Evaluating Off-the-Shelf NLP Tools for German. For other open approaches to German see also this German-NLP list.
Beyond rule-based methods it would be interesting to know more about the efficiency of segmenters use in deep-learning approaches, especially unsupervised sub-word approaches like SentencePiece, byte-pair-encoding (BPE) and unigram language models. To understand how segmentation can be done and benchmark approaches on different datasets, here are 6 methods to perform tokenization in Python, including NLTK, Spacy, Keras and Gensim.