Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and to validate XML documents, using a shortcut or detailing each step.

more ...


A module to extract date information from web pages

Date is a critical component for web archives since it is one of the few metadata that are relevant for philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.

more ...

Ad hoc and general-purpose corpus construction from web sources

The diversity and quantity of texts present on the Internet have to be better assessed to allow for the description of language with its diversity and change. Focusing on actual construction processes leads to better corpus design, beyond simple collections or heterogeneous resources.

more ...