Using a rule-based tokenizer for German

Tokenization is a text segmentation process that divides written text into meaningful units. This post introduces two rule-based methods for tokenizing German, English, and other languages.

more ...


Using sitemaps to crawl websites on the command-line

Sitemaps are particularly useful for web crawling because they let machines crawl a site more intelligently. The post contains all the code snippets needed to optimize link discovery and filtering and to work with sitemaps on the command-line.

more ...

Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and validate XML documents, either using a shortcut or detailing each step.

more ...


A module to extract date information from web pages

The date is a critical component of web archives, since it is one of the few pieces of metadata relevant to philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.

more ...