Although text is ubiquitous on the Web, extracting information from web pages can prove difficult. Pages come in different shapes and sizes, mostly because of the wide variety of platforms and content management systems, and not least because of the varying reasons for publishing and the diverging goals pursued during web publication.
This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to the particular news outlets or blogs being targeted (which often amounts to developing web scraping tools), or should the extraction be as generic as possible to provide opportunistic ways of gathering information? Because time is a scarce resource in academia and elsewhere, the second option is often best.
Consequently, an important problem remains as to the most efficient way to gather language data. Between CMS idiosyncrasies, bulky pages and malformed HTML, the chosen solution has to be precise, robust and fast at the same time. The purpose of this evaluation is to test currently available alternatives with respect to particular needs for coverage and speed.
The current benchmark focuses on the Python programming language, which is reportedly the most popular …