Filtering links to gather texts on the web

Courlan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to optimize web crawling by focusing on web pages containing primarily spam-free text in a target language.
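
To give a rough idea of what cleaning, normalizing, and filtering URLs can involve, here is a minimal sketch that uses only the Python standard library. It is not Courlan's implementation or API: the function names, the list of tracking parameters, and the list of skipped file extensions are all assumptions made for illustration, and language-based filtering and sampling are left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters to drop (not Courlan's actual list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}
# Hypothetical file extensions that are unlikely to hold running text.
SKIPPED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".mp4")


def normalize_url(url):
    """Lowercase the host, drop the fragment and known tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def is_candidate(url):
    """Keep only HTTP(S) links that do not point to an obvious media file."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return not parts.path.lower().endswith(SKIPPED_EXTENSIONS)


def filter_urls(urls):
    """Normalize, filter, and deduplicate URLs while preserving their order."""
    seen, kept = set(), []
    for url in urls:
        normalized = normalize_url(url)
        if is_candidate(normalized) and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

Called on a list of raw links, `filter_urls` returns the normalized candidates in their original order, so two links that differ only by a fragment or a tracking parameter collapse into a single entry.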


Evaluation of date extraction tools for Python

htmldate performs better than the other Python solutions and is also noticeably faster. It greatly extends date extraction coverage without sacrificing precision, especially for smaller news outlets, websites, and blogs, as well as for pages written in languages other than English.
