Web data mining involves a significant number of design decisions and turning points in data processing. Depending on the purpose of data collection, it may also require substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, use cases requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements in the building and processing of the dataset.

Distinguishing between the whole page and the main text content can help alleviate many quality problems related to web texts: if the main text is too short or redundant, it may not be necessary to use the page at all. Beyond its usefulness for de-duplicating web documents, other tasks related to content extraction also profit from a cleaner text base, as it makes work on the “real” content possible. In the concrete case of linguistic and lexicographic research, it allows for running content checks (such as language detection) on the only portion of the document that really counts.

Challenges in web content extraction

Because of the vastly increasing variety of text corpora, text types and use cases, it is becoming more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which has many names referring to particular subtasks or to the process as a whole: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction – for a recent overview see Lejeune & Zhu (2018).

Recently, approaches using the CommonCrawl have flourished, as they allow for faster download and processing by skipping (or, more precisely, outsourcing) the crawling phase. While I think that finding one’s “own” way through the Web is quite relevant for certain usage scenarios, it is clear that the CommonCrawl data should not be used without some filtering; it could also benefit from more refined metadata.

I have already written about ongoing work on date extraction in HTML pages with the Python module htmldate; I will now introduce a second component of my processing chain: trafilatura, a Python library for text extraction. It focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments.

Introducing text scraping with Trafilatura

Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that can easily help with website navigation and extraction of links from sitemaps and feeds.

Its main purpose is to find relevant and original text sections of a web page and to remove the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.). It has to be precise enough not to miss texts or discard valid documents; it also has to be reasonably fast, as it is expected to run in production on millions of pages.

Trafilatura scrapes the main text of web pages while preserving some structure, a task which is also known as boilerplate removal, DOM-based content extraction, main content identification, or HTML text cleaning. The result of processing can be output in TXT, CSV, JSON & XML formats. In the latter case, basic formatting elements such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists, links, images, etc.) are preserved, which can then be used for further processing.

The library is primarily geared towards linguistic analysis but can serve a lot of different purposes. From a linguistic standpoint and especially in comparison with “pre-web” and general-purpose corpora, challenges of web corpus construction reside in the ability to extract and pre-process resulting web texts and ultimately to make them available in clearly describable and coherent collections.

As such, trafilatura features comment extraction (separated from the rest of the text), duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache, XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI), and language detection on the extracted content.

The library works with all common versions of Python and can be installed as follows:

$ pip install trafilatura # pip3 where applicable

Usage with Python

The library provides a series of Python functions which can easily be re-used and adapted to various development settings:

>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> trafilatura.extract(downloaded)
# outputs main content and comments as plain text ...
>>> trafilatura.extract(downloaded, xml_output=True, include_comments=False)
# outputs main content without comments as XML ...
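The deduplication, language detection and TEI output mentioned above can be toggled through additional parameters of the extract function. The following is only a sketch: the parameter names deduplicate, target_language and tei_output are assumptions based on the library’s documentation at the time of writing and may differ in other releases.

>>> # assumed parameters: deduplicate, target_language and tei_output
>>> trafilatura.extract(downloaded, deduplicate=True, target_language='en', tei_output=True)
# outputs deduplicated main content as XML TEI, or None if the page fails the language check ...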

The following combination of parameters probably provides the fastest execution times, but it doesn’t necessarily include all the available text segments:

>>> from trafilatura import extract
>>> result = extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)
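To give an idea of how this could be used at scale, here is a minimal sketch of a loop over several pages; the URLs are mere placeholders and failed downloads are simply skipped:

>>> # minimal sketch: applying the same fast settings to a list of pages
>>> urls = ['https://example.org/page-1', 'https://example.org/page-2']  # placeholder URLs
>>> results = []
>>> for url in urls:
...     downloaded = trafilatura.fetch_url(url)
...     if downloaded is not None:  # fetch_url returns None on failed downloads
...         results.append(extract(downloaded, include_comments=False, include_tables=False, no_fallback=True))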

The input can also consist of a previously parsed tree (i.e. an lxml.html object), which is then handled seamlessly:

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> extract(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'

The function bare_extraction can be used to bypass conversion and directly use and transform the raw output: it returns Python variables for metadata (as a dictionary) as well as main text and comments (both as LXML objects).

>>> from trafilatura import bare_extraction
>>> bare_extraction(downloaded)
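The link discovery in sitemaps and feeds mentioned earlier can be sketched as follows; the helper functions sitemap_search and find_feed_urls as well as the example domain are assumptions drawn from the documentation and may vary across versions:

>>> from trafilatura import sitemaps, feeds
>>> sitemaps.sitemap_search('https://www.example.org')  # assumed helper, returns a list of page URLs found in sitemaps
>>> feeds.find_feed_urls('https://www.example.org')  # assumed helper, returns a list of URLs listed in feeds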

Usage on the command-line

Trafilatura includes a command-line interface and can be conveniently used without writing code.

$ trafilatura -u "https://www.scientificamerican.com/article/statistically-speaking-elephants-by-the-numbers/"
'Post updated 8/13/2013, 11:18 a.m.
It’s World Elephant Day. (Who knew?!) Here’s a sober update on the ongoing saga of the proboscidian we call elephants. ...'
$ trafilatura -h
# displays all the available options

The following argument combination allows for bulk downloads (URLs listed in links.txt), backup of HTML sources in a separate directory, and conversion and storage of the extracted texts as XML. This can be especially useful for archival and further processing:

$ trafilatura --inputfile links.txt --outputdir converted/ --backup-dir html-sources/ --xml

Potential alternatives

Although a few of the corresponding Python packages are not actively maintained, the following alternatives for web text extraction can be distinguished:

  1. Libraries that keep the textual structure intact but don’t focus on main texts
  2. Libraries that focus on main text extraction
  3. Libraries that extract main texts while also extracting document metadata

Trafilatura features many useful functions like metadata, text and comment extraction, as well as link discovery in feeds and sitemaps. On text extraction alone, it already fares significantly better than the available alternatives, as scraping quality comparisons show.

Another frequent problem resides in the lack of output formats corresponding to common needs for document storage and processing: trafilatura can convert the result to CSV, JSON, XML & XML TEI.

Further information

Post last updated on 2021-02-23.