Why choose between R and Python?

R is a free software environment for statistical computing and graphics. Together with Python, it ranks among the most popular languages with (data) scientists. Although both environments are similar, most people feel they face a choice between the two; the question “R vs Python, what should I learn?” resonates across the Internet. But why choose between them when you can choose both?

The reticulate package provides a comprehensive set of tools for seamless interoperability between Python and R. In essence, it allows Python code to be executed inside an R session, so that Python packages can be used with minimal adaptation. This is ideal for those who would rather operate from R than go back and forth between languages and environments.

The package provides several ways to integrate Python code into R projects: Python in R Markdown, importing Python modules, sourcing Python scripts, and an interactive Python console within R. Here is the complete vignette with thorough documentation on Calling Python from R.
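As a quick illustration, here is a minimal sketch of three of these entry points (the file name script.py is a placeholder):

> library(reticulate)
> os <- import("os")          # import a Python module as an R object
> os$getcwd()                 # call its functions with the $ operator
> source_python("script.py")  # run a Python script and expose its objects in R
> repl_python()               # open an interactive Python console within R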

In the tutorial below, we are going to import a Python scraper straight from R and use the results directly with the usual R syntax, thus harnessing its functions for data mining: content discovery and main text extraction.

Scraping web pages with Trafilatura

Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure.

Its features include seamless parallelized online and offline processing; extraction of main text, comments and metadata with several output formats; and link discovery starting from the homepage of a website. The extractor aims to be precise enough not to miss texts or discard valid documents, while remaining robust and reasonably fast.

The library outperforms similar software in a text extraction benchmark and in an external evaluation, ScrapingHub’s article extraction benchmark.

Using it from R allows for tapping into its potential while seamlessly operating from one’s environment of choice.

Installation

Reticulate

The reticulate package can be easily installed from CRAN as follows:

> install.packages("reticulate")

Python

Then you need a version of Python to interact with, as well as the Python packages needed for the task. A recent version of Python 3 is necessary. Some systems already have such an environment installed; to check, just run the following command in a terminal window:

$ python3 --version
Python 3.8.6 # version 3.6 or higher is fine

In case Python is not installed, please refer to the excellent Djangogirls tutorial: Python installation.
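From within R, you can also check which Python installation reticulate detects:

> library(reticulate)
> py_config()  # prints the Python version and paths reticulate will use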

Trafilatura

The Trafilatura package itself can be installed with pip, the Python package manager which comes along with Python. You can find out how on the Trafilatura installation page.
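In a terminal window, the installation typically boils down to:

$ pip install trafilatura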

The conda package manager also works, as does the py_install() function included in reticulate. Here is a simple example using the latter:

> library(reticulate)
> py_install("trafilatura")

Alternatively, reticulate includes a set of functions for managing and installing packages within virtualenvs and Conda environments; see the article on Installing Python Packages for additional details. (You can skip the installation of Miniconda if it doesn’t seem necessary; you should only be prompted once.)
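For instance, a dedicated virtual environment can be created and used as follows (the environment name r-trafilatura is just an example):

# create a virtual environment, install the package, and activate it
> virtualenv_create("r-trafilatura")
> virtualenv_install("r-trafilatura", "trafilatura")
> use_virtualenv("r-trafilatura")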

Web page download and text extraction

Getting started

Loading/importing the necessary packages is easy. Type in the following commands and you are good to go:

> library(reticulate)
> trafilatura <- import("trafilatura")

Downloads

Trafilatura can take already downloaded web pages as input, but you can also make use of its download capability to fetch a web page in a straightforward way:

# fetch an HTML document and store it as a variable
> url <- "https://example.org/"
> downloaded <- trafilatura$fetch_url(url)

Text extraction

The web page (i.e. the body of the HTTP response) is now stored in the variable downloaded. Content extraction can then be performed using the extract() function:

> trafilatura$extract(downloaded)
[1] "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\nMore information..."

The extraction function takes arguments affecting the output, for instance its format. Instead of plain text as in the example above, you can opt for CSV, JSON, XML or XML-TEI. For a full list of arguments see the extraction documentation. The arguments can be combined as follows:

# extraction with arguments
> trafilatura$extract(downloaded, output_format="xml", url=url)
[1] "<doc sitename=\"example.org\" title=\"Example Domain\" source=\"https://example.org/\" hostname=\"example.org\" categories=\"\" tags=\"\" fingerprint=\"lxZaiIwoxp80+AXA2PtCBnJJDok=\">\n  <main>\n    <div>\n      <head>Example Domain</head>\n      <p>This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.</p>\n      <p>More information...</p>\n    </div>\n  </main>\n  <comments/>\n</doc>"

Storing the information in files and opening them with R

Already stored documents can also be read directly from R, for example Trafilatura’s TSV output with readr’s read_tsv() (or read_delim() as a fallback):

# repeat the steps above (or download a series of URLs)
output <- trafilatura$extract(downloaded, output_format="csv", url=url)
# store the output in a file, e.g. with writeLines
writeLines(output, "myfile.csv")
# open it with the readr package
library(readr)
mycorpus <- read_tsv("myfile.csv")

For more see this page on data import in R.

Importing other functions: sitemaps and metadata

Specific parts of the package can also be imported on demand, which provides access to functions not directly exported by the package. For a list of relevant functions and arguments see core functions.

Listing links from sitemaps

The code snippet below shows how to list all pages of a website by fetching its sitemap. In practice, it demonstrates how to access the sitemap_search() function using reticulate’s py_run_string():

# using the code for link discovery in sitemaps
> sitemapsfunc <- py_run_string("from trafilatura.sitemaps import sitemap_search")
> sitemapsfunc$sitemap_search("https://www.sitemaps.org/")
[1] "https://www.sitemaps.org"
[2] "https://www.sitemaps.org/protocol.html"
[3] "https://www.sitemaps.org/faq.html"
[4] "https://www.sitemaps.org/terms.html"
...
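The same function can also be reached by importing the submodule with import(), which may read more naturally in R:

# equivalent approach using import() on the submodule
> sitemaps <- import("trafilatura.sitemaps")
> sitemaps$sitemap_search("https://www.sitemaps.org/")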

Scraping page metadata

In the following, metadata extraction is performed; the information comes back as a named list whose elements can be used directly with R for further analysis:

# import the metadata part of the package as a function
> metadatafunc <- py_run_string("from trafilatura.metadata import extract_metadata")
> downloaded <- trafilatura$fetch_url("https://github.com/rstudio/reticulate")
> metadatafunc$extract_metadata(downloaded)
$title
[1] "rstudio/reticulate"

$author
[1] "Rstudio"

$url
[1] "https://github.com/rstudio/reticulate"

$hostname
[1] "github.com"
...
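Since the result behaves like a named R list, individual fields can be stored and reused directly:

# keep the result and access single fields
> metadata <- metadatafunc$extract_metadata(downloaded)
> metadata$title
[1] "rstudio/reticulate"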

Going further

Using the scraping package allows for further text processing within R:
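For instance, here is a minimal sketch of a word-frequency count on the extracted text, using base R only:

# extract the text and count word frequencies
> text <- trafilatura$extract(downloaded)
> words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
> head(sort(table(words), decreasing=TRUE))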