On the interest of social media corpora


The need to study language use in computer-mediated communication (CMC) appears to be a matter of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.


I recently attended a workshop on this topic organized by the H2020 project CLARIN-PLUS, and I wrote a post about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”.

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which enfold ...

more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I gave a talk on a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and another with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.


The comparison grew out of web crawling experiments I performed at the FU Berlin. The conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this access, meaning that one now has to pay and/or wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I took four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different if not opposed characteristics.

My results show no clear winner: complementary approaches are called for, and it seems possible to replace or at least to complement the existing BootCaT approach. I think that crawling problems such as link/host diversity have not ...
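As a rough illustration of what host diversity can mean for a list of seed URLs, here is a minimal sketch in Python. It is not the method used in the paper: the function name `host_diversity`, the two measures it returns and the example URLs are all assumptions made for the sake of the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_diversity(urls):
    """Return (number of distinct hosts, share of URLs on the most frequent host).

    Hypothetical helper for illustration only; URLs without a parseable
    host are ignored.
    """
    hosts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lowercased host name, or None
        if host:
            hosts[host] += 1
    total = sum(hosts.values())
    if total == 0:
        return 0, 0.0
    top_share = hosts.most_common(1)[0][1] / total
    return len(hosts), top_share


# Made-up seed list: two URLs share a host, two others stand alone.
seeds = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.com/",
    "https://www.example.net/page",
]
print(host_diversity(seeds))  # (3, 0.5)
```

A seed list that yields few distinct hosts, or one dominant host, would be a hint that the source is less suited to broad crawling, although a real comparison would need more than these two figures.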

more ...

A few links on producing posters using LaTeX

As I had to make a poster for the TALN 2011 conference to illustrate my short paper (PDF, in French), I decided to use LaTeX, even if it was not the easiest way. I am quite happy with the result (PDF).

I gathered a few links that helped me out. My impression is that there are two common models, and as a matter of fact I saw both of them at the conference. The one that I used, Beamerposter, was “made in Germany” by Philippe Dreuw, from the Informatics Department of the University of Aachen. I only had to adapt the model to fit my needs, which is done by editing the .sty file (it is self-explanatory).

The other one, BA Poster, was “made in Switzerland” by Brian Amberg, from the Computer Science Department of the University of Basel.

And here are the links:

more ...

Workshop on Complexity in Language – Day 2 (report)

I could not follow the whole second day of the Workshop on Complexity in Language (see previous post), but here is what I heard in the morning.

Salikoko Mufwene talked about the emergence of complexity, which he sees as a self-organization process: we don’t plan the way we are going to speak.

He adopts a relativistic perspective, speaking of a multi-agent system and asking whether the agents are really agentive or whether there are triggers of particular behaviors. He likes to consider language as a technology that evolved. At the end of the talk he also tackled the notion of communal complexity and communal patterns used by speakers (also known as norms).

Luc Steels explained his understanding of language complexity and how he simulates communication with robots. He thinks there is an alternative to the evolutionary framework: according to him, grammar is functional and not superficial, and complexity has grown step by step in a cultural evolution rather than a biological one.

His perception of self-organization is based most notably on alignment, structural coupling and linguistic selection. That’s what he builds models for by letting robots find common words to describe a situation (for example the fact that a given ...

more ...

Workshop on Complexity in Language – Day 1 (report)

Yesterday I attended the first day of a workshop organized by Salikoko Mufwene and held at the ENS Lyon. This “Workshop on Complexity in Language: Developmental and Evolutionary Perspectives” lasts two days: HTML version of the program.

Here is my personal report on what I heard during the first day and on what I found interesting.

Complexity and complexity science

First of all, William S.-Y. Wang referred to Herbert Simon and Melanie Mitchell in particular to define complexity, two approaches that I described on this blog.

Tom Schoenemann talked about the increasing richness, subtlety and complexity of hominin conceptual understanding, which created a need for syntax and grammar as characteristics resulting from it. In the course of history, brain areas appear less directly connected: they process information more independently. What he calls “conceptual complexity” is based on the idea of “grounded cognition” developed by Lawrence W. Barsalou.

Barbara L. Davis said of complexity science that it was another paradigm. Indeed, most of the debate took place on an abstract level, with many different (and not really compatible) notions of language and complexity. William Croft for instance said the whole context of language needed to be taken into account, and ...

more ...

On Text Linguistics

Talking about text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well known in Germany but is not really treated as a topic in its own right elsewhere. Most of the time, no one makes a distinction between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week, who describes himself as the “last of the Mohicans” to use this framework in French research. He drew a comprehensive picture of its origin and its developments which I am going to try and sum up.

This field started to become popular in the ’70s with books by Eugenio Coseriu, Harald Weinrich (in Germany), František Daneš (and the Functional Sentence Perspective framework) or M.A.K. Halliday, who was much more widely read in English-speaking countries. Text linguistics is not a grammatical description of language, nor is it bound to a particular language. It is a science of texts, a theory which comes on top of several levels such as semantics or structure analysis. It makes it possible to distinguish several classes of texts at a global ...

more ...