Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That being said, URLs cannot be expected to be entirely reliable. Especially as part of the Web 2.0 content …

more ...

Evaluation of date extraction tools for Python

htmldate performs better than the other Python solutions, it is also noticeably faster. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English, it greatly extends date extraction coverage without sacrificing precision.

more ...