Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users-customers are collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make current text corpora available, allowing for the description of the variety of language uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation of new tools and new observables. We must therefore provide the necessary theoretical and practical background to establish scientific criteria for research on these texts.

This is the subject of my PhD work which has been performed under the supervision of Benoît Habert and which led to a thesis entitled Ad hoc and general-purpose corpus construction from web sources, defended on June 19th 2015 at the École Normale Supérieure de Lyon to obtain the degree of doctor of philosophy in linguistics.

Methodological considerations

At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics ...

more ...

Tips and tricks for indexing text with ElasticSearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so that it suits most demanding configurations, including large text corpora. I use it daily with tweets and began to release the scripts I use to do so. In this post, I give concrete tips for indexation of text and linguistic analysis.


You do not need to define a type for the indexed fields, since the database can guess it for you. However, using a mapping speeds up the process and gives more control. The official documentation is extensive, and it is sometimes difficult to get a general idea of how to parametrize indexation.

Interesting options which are better specified before indexation include similarity scoring as well as term frequencies and positions.

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analysers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

  "text": {
    "type" : "string",
    "index" : "analyzed",
    "analyzer" : "german",
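
As a rough illustration, the fragment above could sit inside a complete mapping body along the following lines. This is only a sketch in Python that builds and prints the JSON; the document type name `tweet` and the helper `build_mapping` are hypothetical, and the `string`/`analyzed` syntax is the legacy one used in this post (more recent Elasticsearch versions use a `text` type instead).

```python
import json


def build_mapping(doc_type="tweet", field="text", analyzer="german"):
    """Return a mapping body applying a language analyzer to one field."""
    return {
        "mappings": {
            doc_type: {  # hypothetical type name, not taken from the post
                "properties": {
                    field: {
                        "type": "string",      # legacy type, as in the fragment
                        "index": "analyzed",   # run the analyzer at index time
                        "analyzer": analyzer,  # built-in language analyzer
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be used as the request body when creating an index.
    print(json.dumps(build_mapping(), indent=2))
```

Building the mapping as a dictionary first makes it easy to reuse the same structure for other languages by changing the `analyzer` argument.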

Beyond that, it is possible to write ...

more ...

Collecting and indexing tweets with a geographical focus

This paper introduces a Twitter corpus focused geographically in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.

Why Twitter?

“To do linguistics on texts is to do botanics on a herbarium, and zoology on remains of more or less well-preserved animals.”

“Faire de la linguistique sur des textes, c’est faire de la botanique sur un herbier, de la zoologie sur des dépouilles d’animaux plus ou moins conservées.”
Inaugural speech from Charles de Tourtoulon at the Académie des sciences, agriculture, arts et belles lettres, Aix-en-Provence, 1897. (For a detailed interpretation see the introduction of my PhD thesis)

Practical reasons

  • (Lui & Baldwin 2014)

    • A frontier area due to their dissimilarity with existing corpora
  • (Krishnamurthy et al. 2008)

    • Availability and ease of use
    • Immediacy of the information presented
    • Volume and variability of the data contained
    • Presence of geolocated messages

My study of 2013 concerning other social networks (Crawling microblogging services to gather language-classified URLs ...

more ...

Distant reading and text visualization

A new paradigm in “digital humanities” – you know, that Silicon Valley of textual studies geared towards neoliberal narrowing of research (highly provocative but interesting read nonetheless)… A new paradigm resides in the belief that understanding language (e.g. literature) is not accomplished by studying individual texts, but by aggregating and analyzing massive amounts of data (Jockers 2013). Because it is impossible for individuals to “read” everything in a large corpus, advocates of distant reading employ computational techniques to “mine” the texts for significant patterns and then use statistical analysis to make statements about those patterns (Wulfman 2014).

One of the first attempts to apply visualization techniques to texts was the “shape of Shakespeare” by Rohrer (1998). Clustering methods were used to let sets emerge among textual data as well as metadata, not only in the humanities but also in the case of Web genres (Bretan, Dewe, Hallberg, Wolkert, & Karlgren, 1998). It may seem rudimentary by today’s standards or far from being a sophisticated “view” on literature, but the “distant reading” approach is precisely about seeing the texts from another perspective and exploring the corpus interactively. Other examples of text mining approaches enriching visualization techniques include the document atlas of ...

more ...

Foucault and the spatial turn

I would like to share a crucial text by Michel Foucault which I discovered through a recent article by Marko Juvan on geographical information systems (GIS) and literary analysis:

  • Juvan, Marko (2015). From Spatial Turn to GIS-Mapping of Literary Cultures. European Review, 23(1), pp. 81-96.
  • Foucault, Michel (1984). Des espaces autres. Hétérotopies. Architecture, Mouvement, Continuité, 5, pp. 46-49. Originally: Conférence au Cercle d’études architecturales, 14 mars 1967.

The full text, including the translation I am quoting from, is available online. It is also available somewhere in Dits et écrits in paper form. If I understand correctly, the translation is by Jay Miskowiec (see this website). It is an absolute bootleg, since it originates from a lecture and was not officially planned for publication. Still, Foucault’s prose is as usual really dense and there is much to learn from it. Over time, it has become a central text of the so-called “spatial turn”, which was admittedly introduced by Foucault and Lefebvre in the 1960s and 70s.

In the opening of the text, comparing the 20th with the 19th century, Foucault comes to the idea that our time is one of ...

more ...

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.


A comfortable installation is apt-get install python-lxml on Debian/Ubuntu, but the underlying packages may be old. The more pythonic way would be to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python-dev), and then to use a package manager such as pip: pip install lxml.

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function
from lxml import etree, html
# io.StringIO exists on Python 2.6+ and Python 3 (the StringIO module is Python 2 only)
from io import StringIO

And here is how to read a file, supposing it is valid Unicode (it is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found ...
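
To give an idea of the overall workflow, here is a minimal sketch of parsing an HTML string through a StringIO buffer and copying its paragraphs into a bare TEI-like tree. The function name `html_to_tei` and the sample markup are made up for illustration, and the output lacks the namespace and header that a valid TEI document requires.

```python
from io import StringIO

from lxml import etree, html


def html_to_tei(html_text):
    """Parse an HTML string and copy its paragraphs into a TEI-like tree."""
    # Parse the buffered text with lxml's HTML parser
    tree = html.parse(StringIO(html_text))
    # Build a skeleton: <TEI><text><body>...</body></text></TEI>
    tei = etree.Element("TEI")
    body = etree.SubElement(etree.SubElement(tei, "text"), "body")
    for paragraph in tree.xpath("//p"):
        p = etree.SubElement(body, "p")
        # text_content() flattens inline markup such as <b> or <a>
        p.text = paragraph.text_content().strip()
    return tei


if __name__ == "__main__":
    sample = u"<html><body><h1>Title</h1><p>First <b>paragraph</b>.</p></body></html>"
    print(etree.tostring(html_to_tei(sample), pretty_print=True, encoding="unicode"))
```

Working on the parsed tree rather than on the raw string means that nested inline tags are handled by the parser instead of by hand-written patterns.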

more ...

Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform at the beginning of July 2015 and claimed to have every publicly available comment.

Corpus construction

To focus on German comments, I use a two-tiered filter in order to deliver a hopefully well-balanced performance between speed and accuracy. The first filter uses a spell-checking algorithm (delivered by the enchant library), and the second relies on my language identification tool of choice.
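
The idea of a two-tiered filter can be sketched as follows. This is not the code used for the corpus: a small hard-coded word set stands in for the spell checker, the language identifier is passed in as a plain function, and the names `looks_german`, `two_tier_filter` and the 0.5 threshold are assumptions made for the example.

```python
def looks_german(text, lexicon, threshold=0.5):
    """First tier: share of tokens found in a word list (spell-checker stand-in)."""
    tokens = [t.strip('.,;:!?"()').lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in lexicon)
    return known / len(tokens) >= threshold


def two_tier_filter(comments, lexicon, identify_language, target="de"):
    """Keep comments passing the cheap check and then the costlier identifier."""
    kept = []
    for comment in comments:
        if not looks_german(comment, lexicon):
            continue  # discarded early, the slower second tier is never called
        if identify_language(comment) == target:
            kept.append(comment)
    return kept


if __name__ == "__main__":
    toy_lexicon = {"das", "ist", "ein", "guter", "beitrag"}
    comments = ["Das ist ein guter Beitrag.", "This is an English comment."]
    # A dummy identifier that always answers "de" replaces a real tool here
    print(two_tier_filter(comments, toy_lexicon, lambda text: "de"))
```

The point of the ordering is that the cheap first tier discards most of the (predominantly English) comments, so that the slower identifier only runs on plausible candidates.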

The corpus is comparatively small (566,362 tokens), due to the fact that Reddit is almost exclusively an English-speaking platform. The number of tokens tagged as proper nouns (NE) is particularly high (14.4%), which exemplifies the perplexity of the tool itself, for example because the redditors refer to trending and possibly short-lived notions and celebrities ...

more ...

Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script which I made available under an open-source license as part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that the URL was properly parsed before making it into the list. The very first step is to check whether it starts with the right protocol (ftp is deemed irrelevant in my case).

protocol = re.compile(r'^http', re.IGNORECASE)

Then, it is necessary to find and extract URLs nested inside of a URL: referrer URLs, links which were not properly parsed, etc.

match = re.search(r'^http.+?(https?://.+?$)', line)
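
The two checks above can be combined into one small function. This is a minimal sketch rather than the actual script: the function name `normalize` and the example URLs are hypothetical, while both patterns are the ones shown in this post.

```python
import re

# Keep only lines starting with http or https (ftp and others are dropped)
protocol = re.compile(r'^http', re.IGNORECASE)
# Capture a second URL embedded in the first one (referrers, parsing errors)
nested = re.compile(r'^http.+?(https?://.+?$)')


def normalize(line):
    """Return a cleaned URL, or None if the line should be discarded."""
    line = line.strip()
    if not protocol.match(line):
        return None
    match = nested.search(line)
    if match:
        # Prefer the embedded URL over the wrapper around it
        return match.group(1)
    return line


if __name__ == "__main__":
    for candidate in ("ftp://example.org/file",
                      "http://example.org/page",
                      "http://example.org/out?url=https://target.example/doc"):
        print(candidate, "->", normalize(candidate))
```

Returning None for rejected lines keeps the decision in one place, so that later filters only have to deal with URLs which passed these first checks.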

After that, I get rid of URLs pointing to files which are frequent but obviously not text-based, looking both at the end of the URL and inside it:

# obvious extensions
extensions ...
more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I gave a talk on a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and another with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.


The comparison I did started from web crawling experiments I performed at the FU Berlin. The fact is that the conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this allowance, meaning that one now has to pay and/or to wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I take four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different if not opposed characteristics.

My results show no clear winner: complementary approaches are called for, and it seems possible to replace or at least to complement the existing BootCaT approach. I think that crawling problems such as link/host diversity have not ...

more ...

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META_NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).


The state-of-the-art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process became very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that diverse sources of URL seeds could at least ensure that there is not a single bias, but several. Last, the evolving web document structure and a shift from “web AS corpus” to “web FOR corpus” (increasing number of web pages and the necessity to use sampling methods) complete what I call the post-BootCaT world in web corpus construction.

Study: What are viable alternative data sources for lesser-known languages?

Trying to find reliable data sources for Indonesian, the language of a country with a population of 237,424,363 of which 25.90 % are internet users (2011, official Indonesian statistics institute), I performed a case study of different kinds of URL sources and crawling strategies.

First, I classified URLs extracted ...

more ...