Bits of Language: corpus linguistics, NLP and text analytics

Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative (TEI). The tutorial shows how to parse and validate XML documents, either by using a shortcut or by detailing each step.
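As a rough illustration of the step-by-step route, the sketch below parses a document and validates it against a RELAX NG schema with lxml. The schema is a deliberately tiny stand-in and not the real TEI schema, and the `validate` helper and sample documents are hypothetical names chosen for this example; in practice the schema file published by the TEI would be loaded instead.

```python
from io import StringIO

from lxml import etree

# Toy RELAX NG schema (assumption): a stand-in for the much larger TEI schema.
TOY_SCHEMA = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="TEI">
      <element name="teiHeader"><text/></element>
      <element name="text"><text/></element>
    </element>
  </start>
</grammar>
"""


def validate(xml_string, schema_string=TOY_SCHEMA):
    """Parse an XML string and validate it; return (is_valid, messages)."""
    # Step 1: parse the schema and build a validator from it
    relaxng = etree.RelaxNG(etree.parse(StringIO(schema_string)))
    # Step 2: parse the document to check
    document = etree.parse(StringIO(xml_string))
    # Step 3: validate and collect any messages from the validator's log
    is_valid = relaxng.validate(document)
    messages = [entry.message for entry in relaxng.error_log]
    return is_valid, messages


# Hypothetical sample documents: the second one lacks the header element
valid_doc = "<TEI><teiHeader>metadata</teiHeader><text>content</text></TEI>"
invalid_doc = "<TEI><text>content</text></TEI>"

for doc in (valid_doc, invalid_doc):
    ok, messages = validate(doc)
    print(ok, messages)
```

The error log is what makes the detailed route useful: it reports why a document fails, not just that it does.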

more ...

Extracting the main text content from web pages using Python

Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that help with website navigation and with the extraction of links from sitemaps and feeds.

more ...

A module to extract date information from web pages

Dates are a critical component of web archives, since they are among the few metadata that are relevant to philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.

more ...

Ad hoc and general-purpose corpus construction from web sources

The diversity and quantity of texts available on the Internet need to be better assessed in order to describe language in all its variety and change. Focusing on the actual construction process leads to better corpus design, beyond simple collections or heterogeneous resources.

more ...