Using sitemaps to crawl websites (updated)

To gather web documents, it can be useful to download portions of a website programmatically, mostly to save time and resources. The retrieval and download of documents within a website is often called web crawling or web spidering. This post describes practical ways to find URLs within a website and to work with URL lists on the command line. It contains all the code snippets needed to optimize link discovery and filtering.

Getting started

Why sitemaps are useful

A sitemap is a file that lists the visible or whitelisted URLs for a given site, the main goal being to reveal where machines can look for content. Web crawlers usually discover pages from links within the site and from other sites, following a series of rules and protocols. Sitemaps supplement this data: crawlers that support the protocol can pick up all the URLs listed in a sitemap and learn about them through the associated metadata.

The sitemaps protocol primarily allows webmasters to inform search engines about pages on their sites that are available for crawling. Crawlers can use it to pick up all URLs in the sitemap and learn about those URLs using the associated metadata. Sitemaps follow the XML format …
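
Since sitemaps are XML files, the listed URLs can be read with a standard XML parser. The following is a minimal sketch using only the Python standard library. The sample sitemap and the function name `extract_sitemap_entries` are made up for illustration, and a real sitemap would have to be downloaded first.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used here instead of an actual download.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.org/</loc>
    <lastmod>2019-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.org/blog/post-1.html</loc>
  </url>
</urlset>"""


def extract_sitemap_entries(xml_text):
    """Return a list of (url, lastmod) tuples; lastmod is None if absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        if loc is None:
            # Skip entries without a location.
            continue
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod))
    return entries


entries = extract_sitemap_entries(SAMPLE_SITEMAP)
for url, lastmod in entries:
    print(url, lastmod)
```

The optional `lastmod` value is an example of the metadata mentioned above: it can help decide which pages are worth downloading again.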

Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That being said, URLs cannot be expected to be entirely reliable. Especially since the advent of Web 2.0, content on the Web has been changing faster than ever before: it can be tailored to a particular geographic or linguistic profile and is not stable over time.

Although URLs cannot be expected to be perfect predictors of the content that gets downloaded, they are often the only indication on which crawling strategies can be based. It can be really useful to identify and discard redundant URIs, that is, different URIs leading to similar text, which can also be called DUST (Schonfeld et al. 2006). Refining and filtering steps rely on URL components such as host/domain name, path and parameters/query …
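
To give an idea of how URL components can be used for filtering, here is a minimal sketch based on `urllib.parse` from the standard library. The normalization rules, the list of ignored parameters and the function names are assumptions chosen for illustration. They are simple hand-written rules, not the DUST detection method of the cited work.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to be irrelevant for the content (illustrative list).
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url):
    """Return a simplified form of the URL to be used as a deduplication key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Keep only meaningful parameters, sorted to neutralize their order.
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    # The fragment is dropped since it is not sent to the server.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(params), ""))


def filter_urls(urls, allowed_host):
    """Keep one URL per normalized form, restricted to a given host."""
    seen, kept = set(), []
    for url in urls:
        normalized = normalize_url(url)
        if urlsplit(normalized).netloc != allowed_host:
            continue
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept


candidates = [
    "https://Example.org/blog/?utm_source=feed",
    "https://example.org/blog#comments",
    "https://example.org/blog?page=2",
    "https://other.example.com/blog",
]
result = filter_urls(candidates, "example.org")
print(result)
```

In this example the first two candidates collapse into the same key, the third is kept because its query changes the content, and the last one is discarded because it belongs to another host.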

Evaluation of date extraction tools for Python


Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult, and an important problem remains as to the most efficient way to gather language data. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, immediately accessible data on retrieved webpages do not carry substantial or accurate information: neither the URL nor the server response provides a reliable way to date a web document, i.e. to find when it was written or modified. In such cases it is necessary to fully parse the document or to apply robust scraping patterns to it.
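
As a rough illustration of what a scraping pattern can look like, the sketch below searches raw HTML for ISO-like dates with a regular expression and discards impossible values. The pattern, the sample page and the function name `find_dates` are assumptions. Real pages use many more date formats, and the sketch does not decide which of the dates found is the publication date.

```python
import re
from datetime import date

# ISO-like dates (YYYY-MM-DD) that are not embedded in longer digit sequences.
DATE_PATTERN = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


def find_dates(html_text):
    """Return all valid calendar dates matching the pattern, in order of appearance."""
    found = []
    for year, month, day in DATE_PATTERN.findall(html_text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Discard impossible dates such as 2019-13-45.
            continue
    return found


# Hypothetical page content.
SAMPLE_HTML = """<html><head>
<meta property="article:published_time" content="2019-03-14T08:00:00+01:00">
</head><body><p>Updated on 2019-04-02. Build 2019-13-45.</p></body></html>"""

dates = find_dates(SAMPLE_HTML)
print(dates)
```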

State of the art

Diverse extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Content extraction mostly draws on Document Object Model (DOM) examination, that is on considering a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Less thorough and not necessarily faster alternatives use superficial search patterns such as regular expressions in order to capture desirable excerpts …

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. Pages come in different shapes and sizes, mostly because of the wide variety of platforms and content management systems, and not least because of the varying reasons for and diverging goals of web publication.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to the particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools), or should the extraction be as generic as possible to provide opportunistic ways of gathering information? Given the limited time and resources available in academia and elsewhere, the second option is often best.

Consequently, an important problem remains as to the most efficient way to gather language data. Between CMS idiosyncrasies, bulky pages and malformed HTML, the chosen solution has to be precise, robust and fast at the same time. The purpose of this evaluation is to test currently available alternatives with respect to particular needs for coverage and speed.

The current benchmark focuses on Python, reportedly the most popular programming language in academia and one of …

Validating TEI-XML documents with Python

This post introduces two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.

Both build on lxml, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document in the same directory as the terminal window or Python console.
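
Before turning to TEI itself, the general mechanism can be sketched with lxml and a schema in RELAX NG format. The toy schema below is an invented stand-in for the much larger TEI schema, and `is_valid` is a hypothetical helper name. Only the lxml calls (`etree.parse`, `etree.fromstring`, `etree.RelaxNG` and its `validate` method) are part of the library.

```python
from io import BytesIO

from lxml import etree

# Toy RELAX NG schema standing in for the TEI schema (assumption):
# a <text> element containing one or more <p> elements.
TOY_SCHEMA = b"""<element name="text" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="p"><text/></element>
  </oneOrMore>
</element>"""


def is_valid(xml_bytes, schema_bytes=TOY_SCHEMA):
    """Parse the document and check it against the RELAX NG schema."""
    relaxng = etree.RelaxNG(etree.parse(BytesIO(schema_bytes)))
    try:
        document = etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError:
        # A document that cannot be parsed cannot be valid.
        return False
    return relaxng.validate(document)


valid = is_valid(b"<text><p>Hello</p></text>")
invalid = is_valid(b"<text><div>Hello</div></text>")
print(valid, invalid)
```

The same pattern should apply to a real schema file loaded from disk, as long as it is available in RELAX NG format.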

Shortcut with the trafilatura library

I am currently using this web scraping library to download web pages, find the main text and the comments while preserving some structure, and convert the output to TXT, XML & TEI-XML. As such, I recently added a way to systematically check if the TEI-XML documents produced by the library are valid.

The library can be installed with pip or pip3 (depending on the system): pip install lxml trafilatura. As this functionality is new, please update trafilatura if you have already installed it: pip install -U trafilatura.

Trafilatura will seamlessly download the schema on the first call and then return True if a document is valid or …

Extracting the main text content from web pages using Python

Web corpus construction involves a significant number of design decisions and turning points in data processing. Depending on the purpose of data collection, it may also require substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, uses requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements with respect to the building and processing of the dataset.


Because of the vastly increasing variety of corpora, text types and use cases, it becomes more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which has many names referring to particular subtasks or to the whole: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction – for a recent overview see Lejeune & Zhu (2018).
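
To make the idea of retaining the desired content while discarding the rest more concrete, here is a deliberately naive sketch using the standard library's `html.parser`. It simply ignores text located inside a few elements assumed to contain boilerplate. The tag list, class and function names are illustrative assumptions, and actual extraction tools rely on far more elaborate heuristics.

```python
from html.parser import HTMLParser

# Elements whose content is assumed to be boilerplate (illustrative choice).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class NaiveTextExtractor(HTMLParser):
    """Collect text found outside of elements considered boilerplate."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when no skipped element is open.
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html_text):
    """Return the retained text chunks, one per line."""
    parser = NaiveTextExtractor()
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page content.
SAMPLE = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>A title</h1><p>The main text.</p></article>
<footer>Copyright notice</footer>
<script>var x = 1;</script>
</body></html>"""

main_text = extract_text(SAMPLE)
print(main_text)
```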

Recently, approaches using the CommonCrawl have flourished, as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. While I think that finding one’s …

A module to extract date information from web pages


Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or to apply robust scraping patterns: there are, for example, webpages for which neither the URL nor the server response provides a reliable way to date the document, that is, to find when it was written.
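
One common source of such metadata is the `<meta>` elements in the header of a page. The following sketch collects a few of them with the standard library's `html.parser`. The selection of keys, the sample page and the names `MetaCollector` and `extract_metadata` are assumptions made for the example. It does not describe how the module presented in this post works internally.

```python
from html.parser import HTMLParser

# Keys assumed to carry useful metadata (illustrative, not exhaustive).
TARGET_NAMES = {"author", "description", "article:published_time", "og:type"}


class MetaCollector(HTMLParser):
    """Gather the content of selected <meta> elements in a dictionary."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The key can be stored in "name" (HTML) or "property" (Open Graph).
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key in TARGET_NAMES and content:
            # Keep the first value found for each key.
            self.metadata.setdefault(key, content.strip())


def extract_metadata(html_text):
    """Return a dictionary of the selected metadata found in the page."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.metadata


# Hypothetical page content.
SAMPLE_PAGE = """<html><head>
<meta name="viewport" content="width=device-width">
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2019-03-14">
</head><body><p>Text</p></body></html>"""

metadata = extract_metadata(SAMPLE_PAGE)
print(metadata)
```

Pages without such markup return an empty dictionary, which is where text-based patterns have to take over.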


I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique as they combine both the quantity resulting from broad web crawling and the quality obtained by carefully extracting text and metadata as well as rejecting documents that do not match certain criteria. In that sense, I have already published work on methods to derive metadata from web documents in order …

Indexing text with Elasticsearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so it suits even the most demanding configurations, including large text corpora. I use it daily with tweets and have begun to release the scripts I use to do so. In this post, I give concrete tips for indexation of text and linguistic analysis.


You do not need to define a type for the indexed fields, as the database can guess it for you. However, using a mapping speeds up the process and gives more control. The official documentation is extensive, and it is sometimes difficult to get a general idea of how to parametrize indexation:

Interesting options which are better specified before indexation include similarity scoring as well as term frequencies and positions.
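
To show where such options could go, here is a sketch that builds a mapping as a Python dictionary and serializes it to JSON. It follows the older "string" field syntax used in this post (recent Elasticsearch versions use the "text" type instead). The type name "tweet", the field names and the chosen values are assumptions for illustration, not the settings I actually use.

```python
import json

# Settings for a full-text field, in the older "string" syntax (assumption:
# an Elasticsearch version that still accepts this syntax).
text_field = {
    "type": "string",
    "index": "analyzed",
    # Similarity scoring has to be chosen before indexation.
    "similarity": "BM25",
    # Store term positions and offsets along with the terms.
    "term_vector": "with_positions_offsets",
}

# "tweet", "text" and "lang" are hypothetical names.
mapping = {
    "tweet": {
        "properties": {
            "text": text_field,
            # Short codes do not need any linguistic analysis.
            "lang": {"type": "string", "index": "not_analyzed"},
        }
    }
}

mapping_json = json.dumps(mapping, indent=2, sort_keys=True)
print(mapping_json)
```

The resulting JSON document would then be sent to the server when creating the index, before any document is indexed.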

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analysers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

  "text": {
    "type" : "string",
    "index" : "analyzed",
    "analyzer" : "german",

Beyond that, it is possible to write …

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.

Update: I released a Python module that includes all operations described here and more: trafilatura


A convenient way to install it is apt-get install python-lxml on Debian/Ubuntu, but the packaged version may be old. The more Pythonic way is to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python-dev), and then to use a package manager such as pip: pip install lxml.

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function  # only needed on Python 2
from io import StringIO  # replaces the Python 2-only StringIO module
from lxml import etree, html

And here is how to read a file, supposing it is valid Unicode (it is not necessarily …

Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform at the beginning of July 2015 and claimed that it contained every publicly available comment.

Corpus construction

To focus on German comments, I use a two-tiered filter designed to deliver a hopefully well-balanced performance between speed and accuracy. The first filter uses a spell-checking algorithm (delivered by the enchant library), and the second resides in my language identification tool of choice,

The corpus is comparatively small (566,362 tokens), due to the fact that Reddit is almost exclusively an English-speaking platform. The number of tokens tagged as proper nouns (NE) is particularly high (14.4%), which exemplifies the perplexity of the tool itself, for example because the redditors refer to trending and possibly short-lived notions and celebrities …
