Web corpus construction involves a significant number of design decisions and turning points in data processing. Depending on the purpose of data collection, it may also require substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, uses requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements with respect to the building and processing of the dataset.
Because of the rapidly increasing variety of corpora, text types and use cases, it becomes more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which has many names referring to particular subtasks or to the whole: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction (for a recent overview, see Lejeune & Zhu, 2018).
Recently, approaches using the CommonCrawl have flourished, as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. While I think that finding one’s “own” way through the Web is quite relevant for certain usage scenarios, it is clear that the CommonCrawl data should not be used without some filtering. It could also benefit from more refined metadata.
I already wrote about ongoing work on date extraction in HTML pages with the Python module htmldate. I will now introduce a second component of my processing chain: trafilatura, a Python library for text extraction. It focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments.
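To make the notion of main content more concrete, here is a deliberately naive sketch that only uses the Python standard library. It is not how trafilatura works: it simply assumes that text nested inside a handful of HTML5 elements (`header`, `footer`, `nav`, `aside`, `script`, `style`) is boilerplate and keeps everything else. The names `BOILERPLATE_TAGS`, `MainTextParser` and `naive_main_text` are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: these elements usually wrap boilerplate rather than main content.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def naive_main_text(html):
    """Return the text found outside of the assumed boilerplate elements."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Real pages rarely mark up their sections this cleanly, which is why a dedicated extraction library is needed in practice.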
Distinguishing between the whole page and the main text content can help alleviate many quality problems related to web texts: if the main text is too short or redundant, it may not be necessary to use it. While it is useful for de-duplicating web documents, other tasks related to content extraction also profit from a cleaner text base, as it makes work on the “real” content possible. In the concrete case of linguistic and lexicographic research, it allows for running content checks (such as language detection) on the only portion of the document that really counts.
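As a rough illustration of such a check on the main text, the following sketch discards texts that are too short or too redundant. The function name `is_worth_keeping` and both thresholds are assumptions chosen for the example, not values taken from an actual processing chain.

```python
def is_worth_keeping(text, min_chars=200, max_repeat_ratio=0.5):
    """Heuristic filter; both thresholds are illustrative assumptions."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Reject empty or very short texts.
    if not lines or sum(len(line) for line in lines) < min_chars:
        return False
    # Share of lines that repeat an earlier line of the same text.
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) <= max_repeat_ratio
```

Further checks such as language detection could then be run only on the texts that pass this filter.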
Although most corresponding Python modules are not actively maintained, a few libraries already perform similar tasks:
- dragnet features combined and machine-learning approaches, but requires many dependencies as well as extensive tuning
- python-readability cleans the page and preserves some markup but is mostly geared towards news texts
- html2text converts HTML pages to Markdown and thus keeps the structure, but doesn’t focus on main text extraction
Another problem I encountered was the lack of output formats corresponding to my needs for document storage and processing: XML and possibly XML TEI.
From a linguistic standpoint and especially in comparison with “pre-web” and general-purpose corpora, challenges of web corpus construction reside in the ability to extract and pre-process resulting web texts and ultimately to make them available in clearly describable and coherent collections.
The trafilatura library scrapes the main text of web pages while preserving some structure, a task which is also known as boilerplate removal, DOM-based content extraction, main content identification, or HTML text cleaning. The purpose is to find relevant and original text sections of a web page and also to remove the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.). It has to be precise enough not to miss texts or discard valid documents. It also has to be reasonably fast, as it is expected to run in production on millions of pages. The result of processing can be in plain text or XML format. In the latter case, basic formatting elements such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists) are preserved and can then be used for further processing.
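To give an idea of what preserving the page structure in XML can look like, here is a minimal sketch based on the standard library. The function `to_xml`, the input format (a list of `(kind, content)` pairs) and the element names `doc`, `main`, `head`, `p`, `list` and `item` are assumptions made for this example, they should not be read as the exact output schema of the library.

```python
import xml.etree.ElementTree as ET


def to_xml(blocks):
    """Serialize (kind, content) pairs; element names are illustrative."""
    root = ET.Element("doc")
    main = ET.SubElement(root, "main")
    for kind, content in blocks:
        if kind == "title":
            ET.SubElement(main, "head").text = content
        elif kind == "paragraph":
            ET.SubElement(main, "p").text = content
        elif kind == "list":
            # A list is given as a sequence of strings, one per item.
            list_elem = ET.SubElement(main, "list")
            for entry in content:
                ET.SubElement(list_elem, "item").text = entry
    return ET.tostring(root, encoding="unicode")
```

Keeping titles, paragraphs and lists as separate elements means that later steps can, for example, treat headings differently from running text.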
This is a work in progress. Experimental features currently include the extraction of comments (separated from the rest), duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache, XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI), and language detection on the extracted content.
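The idea behind duplicate detection with an LRU cache can be sketched as follows. This is a simplified illustration and not the code used in the library: the class name `LRUDeduplicator`, the whitespace and case normalization, and the default cache size are all assumptions.

```python
from collections import OrderedDict


class LRUDeduplicator:
    """Remember recently seen text segments in a bounded LRU cache."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.seen = OrderedDict()

    def is_duplicate(self, segment):
        # Assumption: segments differing only in case or spacing are duplicates.
        key = " ".join(segment.split()).lower()
        if key in self.seen:
            self.seen.move_to_end(key)  # mark as most recently used
            return True
        self.seen[key] = True
        if len(self.seen) > self.max_size:
            self.seen.popitem(last=False)  # evict the least recently used entry
        return False
```

The same object can be fed sentences, paragraphs or whole documents. The bounded size keeps memory usage constant over millions of pages, at the cost of forgetting segments that have not been seen for a while.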
The library works with Python 3 and can be installed as follows using pip or pip3 (depending on the system):
pip install trafilatura
Direct installation of the latest development version is also possible:
pip install git+https://github.com/adbar/trafilatura.git
In a nutshell, from Python:
import trafilatura
downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
trafilatura.extract(downloaded)
# outputs main content and comments as plain text ...
From the command-line:
$ trafilatura -u https://www.scientificamerican.com/article/statistically-speaking-elephants-by-the-numbers/
'Post updated 8/13/2013, 11:18 a.m. It’s World Elephant Day. (Who knew?!) Here’s a sober update on the ongoing saga of the proboscidian we call elephants. ...'
For more details, please refer to the documentation and the following references:
- Barbaresi, A. (2019). The Vast and the Focused: On the need for domain-focused web corpora. In Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC-7), Corpus Linguistics 2019, Cardiff, pp. 29-32.
- Lejeune, G., & Zhu, L. (2018). A New Proposal for Evaluating Web Page Cleaning Tools. Computación y Sistemas, 22(4).
- Barbaresi, A. (2016). Efficient construction of metadata-enhanced web corpora. In Proceedings of the 10th Web as Corpus Workshop, ACL, pp. 7-16.
- Barbaresi, A. (2015). Ad hoc and general-purpose corpus construction from web sources. PhD thesis, École Normale Supérieure de Lyon.