Evaluation of date extraction tools for Python

htmldate performs better than the other Python solutions and is also noticeably faster. It greatly extends date extraction coverage without sacrificing precision, especially for smaller news outlets, websites and blogs, and for pages written in languages other than English.

A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Qualifying the contents more precisely allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), and optimized indexing (e.g. language-based heuristics, LRU cache, etc.).
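
To make the idea of date metadata concrete, here is a minimal sketch that reads a publication or modification date from `<meta>` tags using only the Python standard library. It is not how htmldate works internally: the names `MetaDateParser`, `extract_meta_date` and `DATE_KEYS`, as well as the choice of meta keys, are assumptions made for illustration, and real pages often need far more robust handling.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical selection of meta keys, checked in this order of preference.
DATE_KEYS = ("article:modified_time", "article:published_time", "date")


class MetaDateParser(HTMLParser):
    """Collect the content of <meta> tags whose name or property is a date key."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("property") or attr.get("name") or "").lower()
        if key in DATE_KEYS and attr.get("content"):
            # Keep the first value seen for each key.
            self.found.setdefault(key, attr["content"])


def extract_meta_date(html):
    """Return a YYYY-MM-DD string from known meta tags, or None if absent."""
    parser = MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS:
        raw = parser.found.get(key)
        if raw is None:
            continue
        try:
            # Only the leading date part is parsed; time and zone are ignored.
            return datetime.fromisoformat(raw[:10]).date().isoformat()
        except ValueError:
            continue
    return None


sample = (
    '<html><head><meta property="article:published_time" '
    'content="2020-03-14T09:30:00+00:00"></head><body></body></html>'
)
print(extract_meta_date(sample))
```

Pages without such tags return `None` here, which is the case where a dedicated tool has to fall back on other signals in the document.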

In short, metadata extraction serves purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust …
