Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users and customers are collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make current text corpora available, allowing for the description of the variety of language uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation of new tools and new observables. We must therefore provide the necessary theoretical and practical background to establish scientific criteria for research on these texts.

This is the subject of my PhD work, which was carried out under the supervision of Benoît Habert and led to a thesis entitled Ad hoc and general-purpose corpus construction from web sources, defended on June 19th, 2015 at the École Normale Supérieure de Lyon to obtain the degree of Doctor of Philosophy in linguistics.

Methodological considerations

At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics ...

Collecting and indexing tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
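
As a rough illustration of what selecting tweets for a given region can look like, here is a minimal sketch that keeps only geolocated messages falling inside a bounding box. It is not the method used in the paper: the record layout (a `coordinates` pair of longitude and latitude), the function names, and the bounding box values are all assumptions made for the example.

```python
# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat).
# The values are illustrative only and do not come from the paper.
BBOX = (5.9, 45.8, 17.2, 55.1)


def in_bbox(lon, lat, bbox=BBOX):
    """Return True if the point lies inside the bounding box (borders included)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def filter_by_region(tweets, bbox=BBOX):
    """Yield the tweets that carry coordinates located inside the bounding box."""
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if not coords:
            # Messages without geolocation cannot be assigned to a region here.
            continue
        lon, lat = coords
        if in_bbox(lon, lat, bbox):
            yield tweet


# Made-up sample records, assuming (longitude, latitude) order.
sample = [
    {"id": 1, "text": "Guten Morgen", "coordinates": (13.4, 52.5)},
    {"id": 2, "text": "Good morning", "coordinates": (-0.1, 51.5)},
    {"id": 3, "text": "no location", "coordinates": None},
]

selected = [t["id"] for t in filter_by_region(sample)]
print(selected)
```

A bounding box is only a coarse approximation of a region; a real collection process would probably need finer geographic criteria as well as a strategy for messages that are not geolocated.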

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on remains of more or less well-preserved animals.”

“Faire de la linguistique sur des textes, c’est faire de la botanique sur un herbier, de la zoologie sur des dépouilles d’animaux plus ou moins conservées.”
Inaugural speech by Charles de Tourtoulon at the Académie des sciences, agriculture, arts et belles lettres, Aix-en-Provence, 1897. (For a detailed interpretation, see the introduction of my PhD thesis.)

Practical reasons

  • (Lui & Baldwin 2014)

    • A frontier area due to their dissimilarity with existing corpora
  • (Krishnamurthy et al. 2008)

    • Availability and ease of use
    • Immediacy of the information presented
    • Volume and variability of the data contained
    • Presence of geolocated messages

My 2013 study of other social networks (Crawling microblogging services to gather language-classified URLs ...

2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version was made available last summer and mentioned on this blog, cf. this post: Introducing the German Political Speeches Corpus and Visualization Tool.

The resource still uses this permanent redirection:


If you don’t remember it or have never heard of it, here is a brief description:

The resource presented here consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources. It provides raw data, metadata, and tokenized text with part-of-speech tagging and lemmas in XML TEI format for researchers who are able to use it, and a simple visualization interface for those who want to get a glimpse of what is in the corpus before downloading it or thinking about using more complete tools.
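
For readers who want to work with the annotated text directly, the following minimal sketch shows one way to read tokens, part-of-speech tags, and lemmas from a TEI-like file with the Python standard library. The markup is an assumption made for the example (`w` elements in the TEI namespace with `lemma` and `type` attributes), as are the sample sentence and the function name; the actual encoding of the corpus may differ and should be checked against the files themselves.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample: the element and attribute names are assumptions,
# not the documented encoding of the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w lemma="wir" type="PPER">Wir</w>
    <w lemma="danken" type="VVFIN">danken</w>
    <w lemma="heute" type="ADV">heute</w>
  </p></body></text>
</TEI>"""


def read_tokens(xml_string):
    """Return a list of (token, pos_tag, lemma) tuples found in the document."""
    root = ET.fromstring(xml_string)
    return [
        (w.text, w.get("type"), w.get("lemma"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
for token, pos, lemma in tokens:
    print(token, pos, lemma, sep="\t")
```

Adapting the sketch to the real files should mostly be a matter of changing the element and attribute names in `read_tokens`.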

The visualization output is in valid CSS/XHTML format and takes advantage of recent standards. The purpose is to give a sort of Zeitgeist, an insight into the topics developed by a government official and on ...

