Bits of Language: corpus linguistics, NLP and text analytics

How to make language detection with langid.py faster

The language detector langid.py has become quite popular. Using the modernized fork py3langid as an example, I show how to maintain and optimize a Python package.

How to download web pages in parallel and follow politeness rules in Python

Optimizing downloads is crucial to gather data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.
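
As a rough illustration of the politeness side only, here is a minimal sketch using the standard library. It is not the code from the post: the class `PolitenessTracker`, the helper `next_ready`, and the five-second delay are all hypothetical names and values chosen for the example.

```python
import time
from urllib.parse import urlsplit


class PolitenessTracker:
    """Remember when each host was last contacted (hypothetical helper)."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed pause in seconds between hits on one host
        self.clock = clock          # injectable clock, handy for testing
        self.last_hit = {}          # host -> time of the last request

    def wait_time(self, url):
        """Return how many seconds remain before `url` may be fetched."""
        host = urlsplit(url).netloc
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that `url` has just been requested."""
        self.last_hit[urlsplit(url).netloc] = self.clock()


def next_ready(urls, tracker):
    """Return the first URL whose host may be contacted now, else None."""
    for url in urls:
        if tracker.wait_time(url) == 0.0:
            return url
    return None
```

A downloader could call `next_ready` to pick work for its threads and `record` right before each request, so that parallel downloads are spread across hosts rather than hitting a single one.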

Web scraping with Trafilatura just got faster

HTML-to-text extraction just got faster with Trafilatura, as measured on the benchmark available in the repository. The speedup follows from two major changes in the package dependencies charset_normalizer and jusText.

Using RSS and Atom feeds to collect web pages with Python

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.
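
To give an idea of what link discovery in a feed involves, here is a minimal standard-library sketch. It is not the approach from the post: the function `feed_links` is a hypothetical helper, and it only covers the simplest RSS 2.0 and Atom layouts.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_links(xml_text):
    """Return article links found in an RSS 2.0 or Atom feed, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    # RSS 2.0 stores the URL as text: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom stores the URL in an attribute: <entry><link href="URL"/></entry>
    for entry in root.iter(f"{ATOM_NS}entry"):
        for link in entry.findall(f"{ATOM_NS}link"):
            href = link.get("href")
            # a missing rel attribute is treated as "alternate"
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href.strip())
    return links
```

Feeds found in the wild are often malformed or use other variants, so a dedicated library is likely to be more robust than this sketch.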

A simple multilingual lemmatizer for Python

Grouping together the inflected forms of a word allows for analyzing them as a single item, identified by the dictionary form. The Python library Simplemma provides a simple and multilingual approach to look for base forms or lemmata.
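
The idea of grouping inflected forms can be shown with a toy lookup table. This sketch does not use Simplemma and does not reflect how it works internally: `LEMMA_TABLE`, `lemmatize`, and `count_lemmas` are hypothetical names, and the four-entry table merely stands in for real language data.

```python
from collections import Counter

# Hypothetical lookup table: inflected form -> dictionary form
LEMMA_TABLE = {"mice": "mouse", "ran": "run", "runs": "run", "running": "run"}


def lemmatize(token, table=LEMMA_TABLE):
    """Return the dictionary form of `token`, or the token itself if unknown."""
    return table.get(token.lower(), token)


def count_lemmas(tokens, table=LEMMA_TABLE):
    """Count tokens after mapping each one to its base form."""
    return Counter(lemmatize(token, table) for token in tokens)
```

With such a mapping, "runs", "ran", and "Running" are all counted under the single item "run".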

Using sitemaps to crawl websites on the command-line

Sitemaps are particularly useful for web crawling, as they let machines crawl a site more intelligently. This post contains all the code snippets needed to optimize link discovery and filtering and to work with sitemaps on the command-line.

Filtering links to gather texts on the web

Courlan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to optimize web crawling by focusing on web pages containing primarily spam-free text in a target language.

Evaluating scraping and text extraction tools for Python

Python packages are compared with respect to robustness and speed. Raw text extraction of boilerplate and content segments reveals which web scraping tool is better suited to the html2text challenge.

Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and validate XML documents, either using a shortcut or detailing each step.

Extracting the main text content from web pages using Python

Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that can easily help with website navigation and extraction of links from sitemaps and feeds.
