Evaluation of date extraction tools for Python

Introduction

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult, and an important problem remains as to the most efficient way to gather language data. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, immediately accessible data on retrieved webpages do not carry substantial or accurate information: neither the URL nor the server response provide a reliable way to date a web document, i.e. to find …

more ...

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. They come in different shapes and sizes mostly because of the wide variety of platforms and content management systems, and not least because of varying reasons and diverging goals followed during web publication.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools) or should the extraction be as generic as …

more ...