Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust scraping patterns: for some webpages, neither the URL nor the server response provides a reliable way to date the document, that is, to find out when it was written.

Context

I regularly work on improving the extraction methods for the web collections at my home institutions. These collections are unique in that they combine the quantity resulting from broad web crawling with the quality obtained by carefully extracting text and metadata and by rejecting documents that do not match certain criteria. In that sense, I have already published work on methods to derive metadata from web documents in order to build text corpora for (computational) linguistic analysis, see for example "Efficient construction of metadata-enhanced web corpora" (2016).

The date, however, is a critical component, since it is one of the few metadata fields that are relevant both from a philological standpoint and in the context of information extraction for use in digital databases. It is crucial for my fellow lexicographers at the language center of the Berlin-Brandenburg Academy of Sciences to be able to determine with precision when a given word was first used and how its use evolves over time.

Module

There is an existing codebase to extract or derive metadata from web pages, and these methods generally work well for articles or blog posts. I took inspiration from goose, newspaper and articleDateExtractor (all for Python) as well as metascraper, a complete suite for JavaScript focusing on metadata. With the exception of the latter, I could not find any functional and actively maintained module, let alone for Python.

That is why I am releasing the code I am working on under an open-source license on GitHub and as a Python package.

htmldate provides a simple and convenient way to extract the creation or modification date of web pages, based on HTML parsing and scraping functions:

  1. Starting from the header of the page, it uses common patterns to identify date fields.
  2. If this is not successful, it scans the whole document looking for structural markers.
  3. If no date cue could be found, it finally runs a series of heuristics on the content (text and markup).
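The three steps above can be illustrated with a minimal sketch that uses only the Python 3 standard library. It is not htmldate's actual implementation: the names `META_KEYS`, `CueCollector` and `sketch_find_date` are hypothetical, the list of metadata keys is an assumption, and only dates already written as YYYY-MM-DD are recognized.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Assumed examples of metadata keys; real pages use many more variants.
META_KEYS = {"date", "article:published_time", "og:updated_time", "dc.date"}
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def to_iso(text):
    """Return the first valid YYYY-MM-DD date found in text, else None."""
    for match in ISO_DATE.finditer(text or ""):
        year, month, day = map(int, match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            continue  # e.g. 2016-13-45 is not a real calendar date
    return None


class CueCollector(HTMLParser):
    """Collect date cues from <meta> tags, <time> elements and plain text."""

    def __init__(self):
        super().__init__()
        self.meta_cues, self.markup_cues, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key in META_KEYS:
                self.meta_cues.append(attrs.get("content"))
        elif tag == "time" and attrs.get("datetime"):
            self.markup_cues.append(attrs["datetime"])

    def handle_data(self, data):
        self.text.append(data)


def sketch_find_date(html):
    """Try header metadata first, then markup, then the text itself."""
    collector = CueCollector()
    collector.feed(html)
    stages = (
        collector.meta_cues,          # step 1: header fields
        collector.markup_cues,        # step 2: structural markers
        ["".join(collector.text)],    # step 3: content heuristics
    )
    for cues in stages:
        for cue in cues:
            result = to_iso(cue)
            if result:
                return result
    return None
```

The point of the cascade is that cheaper and more reliable cues are consulted first, so the text scan only runs when nothing better is available.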

Usage

The module is compatible with most versions of Python 2 and 3. It takes the HTML document as input (string format) and returns a date if a valid cue could be found in the document. The output string currently defaults to the ISO 8601 YMD format (YYYY-MM-DD).

Installation

Installation from the package repository: pip install htmldate. For complete instructions, see the documentation.

The latest version can also be installed directly from the repository with pip:

pip3 install git+https://github.com/adbar/htmldate.git

Unit tests with corresponding web page samples are available in the tests/ directory.

Command-line

A basic command-line interface is included:

$ htmldate -u "http://blog.python.org/2016/12/python-360-is-now-available.html"
2016-12-23

For more information, type htmldate --help.

Within Python

In case the web page features clear metadata, the extraction is straightforward:

from htmldate import find_date
find_date('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
'2016-02-17'

A more advanced analysis is sometimes needed:

find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'
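To make the step from the date header to the final result more concrete, here is a small sketch of how a human-readable string such as the one in the debug output could be normalized to the YMD format. The function `normalize_date` and the `PATTERNS` tuple are hypothetical and not part of htmldate; the sketch also assumes English month and weekday names, which is what `strptime` expects under Python's default locale.

```python
from datetime import datetime

# Assumed examples of date layouts; real pages use many more.
PATTERNS = ("%A, %B %d, %Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(text):
    """Return text as YYYY-MM-DD if it matches a known layout, else None."""
    for pattern in PATTERNS:
        try:
            parsed = datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue  # layout does not fit, try the next one
        return parsed.strftime("%Y-%m-%d")
    return None


print(normalize_date("Friday, December 23, 2016"))
```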

In the worst case, the module resorts to a series of heuristics and still tries to output a date; these heuristics can be deactivated (“safe mode”). The trade-off is a loss of granularity: current tests indicate a margin of error of 2-3 months.

find_date('https://github.com/adbar/htmldate')
'2017-09-01' # may have changed
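The granularity trade-off can be illustrated with a deliberately coarse heuristic. The sketch below is an assumption for illustration only and does not reproduce the heuristics used by the module: the function `coarse_guess` and its `safe_mode` parameter are hypothetical. It picks the year and month mentioned most often in the text and reports the first day of that month, so the day is never more than a placeholder.

```python
import re
from collections import Counter

# Matches year-month pairs such as 2017-09, 2017/09 or 2017.09.
YEAR_MONTH = re.compile(r"\b(20\d{2})[-/.](0[1-9]|1[0-2])\b")


def coarse_guess(text, safe_mode=False):
    """Guess a date from the most frequent year-month pair in text."""
    if safe_mode:
        return None  # heuristics deactivated: prefer no answer to a guess
    counts = Counter(YEAR_MONTH.findall(text))
    if not counts:
        return None
    (year, month), _ = counts.most_common(1)[0]
    # The day is unknown, so the first of the month stands in for it.
    return "{}-{}-01".format(year, month)
```

A result obtained this way is only reliable to within the month at best, which is why a strict mode that returns nothing can be preferable when precision matters.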

For more information see the readme page on GitHub.

Feedback and pull requests are welcome!

Update: the library has gotten better over time and now runs on all common platforms. The documentation is available on htmldate.readthedocs.io.