Using a rule-based tokenizer for German

Tokenization is a text segmentation process that divides written text into meaningful units. This post introduces two rule-based methods for tokenizing text in German, English and beyond.
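The post's own tokenizer rules are not reproduced here, but as a minimal sketch of the rule-based idea, an ordered alternation of regular expressions (using only Python's standard `re` module) already segments text: rules for multi-character units such as abbreviations and numbers are tried before the generic word and punctuation rules. The abbreviation list below is illustrative, not taken from the post.

```python
import re

# Ordered rule-based tokenizer sketch: alternatives are tried left to
# right, so rules for longer units (abbreviations, numbers) must come
# before the generic word and punctuation rules.
TOKEN_PATTERN = re.compile(
    r"""
    (?:z\.B\.|d\.h\.|usw\.|etc\.)   # a few German abbreviations, kept whole
    |\d+(?:[.,]\d+)*                # numbers, incl. decimal/thousands marks
    |\w+(?:-\w+)*                   # words, incl. hyphenated compounds
    |[^\w\s]                        # any single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Split text into tokens by scanning for the first matching rule."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Das kostet z.B. 3,50 Euro – nicht schlecht!"))
# ['Das', 'kostet', 'z.B.', '3,50', 'Euro', '–', 'nicht', 'schlecht', '!']
```

Because the rules are plain regular expressions, extending the tokenizer to another language mostly means adding entries to the abbreviation rule rather than changing the scanning logic.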


A simple multilingual lemmatizer for Python

Task at hand: lemmatization ≠ stemming

In computer science, canonicalization (also known as standardization or normalization) is a process for converting data that has more than one possible representation into a standard, normal, or canonical form. In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran and running are forms of the same lexeme, and run is the lemma chosen to represent all of them.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by …
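To make this grouping concrete, here is a toy dictionary-lookup lemmatizer, a minimal sketch rather than the post's actual implementation; the `LEMMA_TABLE` entries and the `lemmatize` helper are purely illustrative.

```python
# Toy dictionary-based lemmatizer: inflected forms are mapped back to
# their lemma by lookup, falling back to the token itself when unknown.
# The table is illustrative; a real lemmatizer ships large per-language
# dictionaries.
LEMMA_TABLE = {
    "en": {"runs": "run", "ran": "run", "running": "run"},
    "de": {"läuft": "laufen", "lief": "laufen", "gelaufen": "laufen"},
}

def lemmatize(token: str, lang: str = "en") -> str:
    """Return the canonical form of a token, or the token unchanged."""
    return LEMMA_TABLE.get(lang, {}).get(token.lower(), token)

print([lemmatize(t) for t in ["run", "runs", "ran", "running"]])
# ['run', 'run', 'run', 'run'] – all inflected forms grouped under one lemma
print(lemmatize("läuft", lang="de"))
# 'laufen'
```

A suffix-stripping stemmer could reduce running to run, but only a lookup (or more elaborate rules) recovers run from the irregular form ran, which is the practical difference behind the heading above.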
