Distant reading and text visualization

A new paradigm in “digital humanities” – you know, that Silicon Valley of textual studies geared towards neoliberal narrowing of research (a highly provocative but interesting read nonetheless)… This new paradigm rests on the belief that understanding language (e.g. literature) is accomplished not by studying individual texts, but by aggregating and analyzing massive amounts of data (Jockers 2013). Because it is impossible for individuals to “read” everything in a large corpus, advocates of distant reading employ computational techniques to “mine” the texts for significant patterns and then use statistical analysis to make statements about those patterns (Wulfman 2014).

One of the first attempts to apply visualization techniques to texts was the “shape of Shakespeare” by Rohrer (1998). Clustering methods were used to let sets emerge among textual data as well as metadata, not only in the humanities but also in the case of Web genres (Bretan, Dewe, Hallberg, Wolkert, & Karlgren, 1998). It may seem rudimentary by today’s standards or far from being a sophisticated “view” on literature, but the “distant reading” approach is precisely about seeing the texts from another perspective and exploring the corpus interactively. Other examples of text mining approaches enriching visualization techniques include the document atlas of ...


Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform archive.org at the beginning of July 2015 and claimed that it contains every publicly available comment.

Corpus construction

To focus on German comments, I use a two-tiered filter that should hopefully strike a good balance between speed and accuracy. The first filter uses a spell-checking algorithm (delivered by the enchant library), and the second relies on my language identification tool of choice, langid.py.
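
The two-tiered idea can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the article: the word lists, the 0.5 threshold and the function names are all hypothetical, and the two stages are passed in as plain functions so that the sketch runs without external libraries. With the real tools, the first stage could be the `check` method of an `enchant.Dict("de_DE")` object and the second could wrap `langid.classify(text)[0]`.

```python
from typing import Callable, Iterable, List

# Toy word lists standing in for real resources (assumptions for this sketch).
TOY_GERMAN = {"ich", "das", "ist", "nicht", "und", "der", "die", "ein", "gut", "auch"}
TOY_ENGLISH = {"the", "is", "this", "not", "and", "good", "all", "at"}
PUNCTUATION = ".,;:!?\"'()"


def tokenize(text: str) -> List[str]:
    """Lowercased alphabetic tokens, with surrounding punctuation stripped."""
    tokens = [t.strip(PUNCTUATION).lower() for t in text.split()]
    return [t for t in tokens if t.isalpha()]


def toy_is_known(word: str) -> bool:
    """Stand-in for a spell-checker lookup (stage 1)."""
    return word in TOY_GERMAN


def toy_identify(text: str) -> str:
    """Stand-in for a language identifier (stage 2); returns 'de' or 'en'."""
    tokens = tokenize(text)
    german = sum(1 for t in tokens if t in TOY_GERMAN)
    english = sum(1 for t in tokens if t in TOY_ENGLISH)
    return "de" if german > english else "en"


def known_word_ratio(text: str, is_known: Callable[[str], bool]) -> float:
    """Share of tokens accepted by the dictionary check."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_known(t)) / len(tokens)


def filter_german(comments: Iterable[str],
                  is_known: Callable[[str], bool] = toy_is_known,
                  identify: Callable[[str], str] = toy_identify,
                  min_ratio: float = 0.5) -> List[str]:
    """Keep comments that pass the cheap check first, then the slower one."""
    kept = []
    for comment in comments:
        # Stage 1: fast dictionary lookup discards obviously non-German text.
        if known_word_ratio(comment, is_known) < min_ratio:
            continue
        # Stage 2: only the survivors reach the more expensive identifier.
        if identify(comment) == "de":
            kept.append(comment)
    return kept


sample = ["Das ist nicht gut und auch nicht neu.",
          "This is not good at all.",
          "Ich und der Hund"]
print(filter_german(sample))
```

The point of the ordering is that the second stage only runs on comments that already look German, which is where the speed gain would come from on a corpus that is mostly English.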

The corpus is comparatively small (566,362 tokens), because Reddit is almost exclusively an English-speaking platform. The proportion of tokens tagged as proper nouns (NE) is particularly high (14.4%), which illustrates the perplexity of the tool itself, for example because redditors refer to trending and possibly short-lived notions and celebrities ...


Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (software for statistical computing), you have a text collection you would like to analyze or classify, and you have some time to spare. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet to remember a set of useful tricks and packages I recently gathered, which I thought could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each letter from a to z (all lowercase). By the way, it is as simple as this using Perl and regular expressions (provided you have a $text variable):

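# print the frequency (in percent of the text length) of each letter
my @letters = ("a" .. "z");
foreach my $letter (@letters) {
    # count the case-insensitive matches of the letter in $text
    my $letter_count = () = $text =~ /$letter/gi;
    printf "%.3f", (($letter_count/length($text))*100);
}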
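
For readers who do not use Perl, the same counting can be sketched in Python. The function names are hypothetical, and the tab-separated layout with a header row of letters is an assumption about how one might store the features; the counting itself mirrors the case-insensitive match above.

```python
import string
from typing import Dict, Iterable, List


def letter_frequencies(text: str) -> Dict[str, float]:
    """Frequency of each letter a-z as a percentage of the text length.

    Upper and lower case are counted together, like the /gi match in Perl.
    """
    total = len(text)
    if total == 0:
        return {letter: 0.0 for letter in string.ascii_lowercase}
    lowered = text.lower()
    return {letter: lowered.count(letter) / total * 100
            for letter in string.ascii_lowercase}


def tsv_rows(texts: Iterable[str]) -> List[str]:
    """One header row (a to z), then one row of frequencies per text."""
    rows = ["\t".join(string.ascii_lowercase)]
    for text in texts:
        freqs = letter_frequencies(text)
        rows.append("\t".join(f"{freqs[letter]:.3f}"
                              for letter in string.ascii_lowercase))
    return rows


for row in tsv_rows(["Abba", "Hello world"]):
    print(row)
```

Each row has 26 tab-separated values, so a file written this way could be loaded as a table with one column per letter.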

First tests in R

After starting R (‘R’ command), one usually wants to import data. In this case, my file type is TSV (Tab-Separated Values) and the first row contains only column labels (from ‘a’ to ‘z’), which comes in handy later. This is done using the read.table command.

alpha <- read.table("letters_frequency ...

Ludovic Tanguy on Visual Analysis of Linguistic Data

In his professorial thesis (or habilitation thesis), which is about to be made public (the defence takes place next week), Ludovic Tanguy explains why and under what conditions data visualization could help linguists. In a previous post, I showed a few examples of visualization applied to the field of readability assessment. Tanguy’s question is more general: it has to do with what should be included in the disciplinary field of linguistics.

He gives a few reasons to use the methods from the emerging field of visual analytics and mentions some of its proponents (like Daniel Keim or Jean-Daniel Fekete). But he also states that they are not well adapted to the prevailing models of scientific evaluation.

Why use visual analytics in linguistics?

His main point is the (fast) growing size and complexity of linguistic data. Visualization comes in handy when selecting, listing or counting phenomena no longer proves useful. There is evidence from the field of cognitive psychology that an approach based on pattern recognition may lead to an interpretation. In short, new needs arise when calculations fall short.

Tanguy gives two main examples of cases where this is obvious: first, the analysis of networks, which can be ...


On global vs. local visualization of readability

It is not only a matter of scale: the perspective one chooses is crucial when it comes to visualizing how difficult a text is. Two main options can be taken into consideration:

  • An overview in the form of a summary, which makes it possible to compare a series of phenomena for the whole text.
  • A visualization which takes the course of the text into account, as well as the possible evolution of parameters.
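
The difference between the two options can be sketched with a single parameter. Sentence length in words is chosen here purely as an assumption for illustration, with a naive sentence splitter and hypothetical function names: the global view reduces the text to one number, while the local view keeps one value per window of consecutive sentences, which is what one would plot along the course of the text.

```python
import re
from statistics import mean
from typing import List


def sentence_lengths(text: str) -> List[int]:
    """Number of whitespace-separated words per sentence (naive split)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def global_summary(lengths: List[int]) -> float:
    """Global view: a single average for the whole text."""
    return mean(lengths) if lengths else 0.0


def local_profile(lengths: List[int], window: int = 3) -> List[float]:
    """Local view: average over each sliding window of sentences."""
    if not lengths:
        return []
    if len(lengths) < window:
        return [mean(lengths)]
    return [mean(lengths[i:i + window])
            for i in range(len(lengths) - window + 1)]


text = "A b. C d e f. G. H i j k l m. N o."
lengths = sentence_lengths(text)
print(global_summary(lengths))
print(local_profile(lengths, window=2))
```

A text that alternates between very short and very long sentences can have the same global average as a perfectly regular one, and only the local profile shows the difference.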

I already dealt with the first type of visualization on this blog when I discussed Amazon’s text stats. To sum up, their simplicity is also their main problem: they are easy to read and provide users with a first glimpse of a book, but the kind of information they deliver is not always reliable.

Sooner or later, one has to deal with multidimensional representations as the number of monitored phenomena keeps increasing. That is where real reflection is needed on finding a visualization that is faithful and clear at the same time. I would like to introduce two examples of recent research that I find relevant to this issue.

An approach inspired by computer science

The first one is taken from an article by Oelke et al. (2010 ...


Amazon’s readability statistics by example

I already mentioned Amazon’s text stats in a post where I tried to explain why they were far from being useful in every situation: A note on Amazon’s text readability stats, published last December.

I found an example which shows particularly well why you cannot rely on these statistics when it comes to getting a precise picture of a text’s readability. Here are the screenshots of text statistics describing two different books:

Comparison of two books on Amazon

The two books look quite similar, except for the length of the second one, which seems to contain significantly more words and sentences.

The first book (on the left) is Pippi Longstocking, by Astrid Lindgren, whereas the second is The Sound and The Fury, by William Faulkner… The writing styles could not be more different. However, the text statistics make them appear quite close to each other.

The criteria used by Amazon are too simplistic, even if they usually perform acceptably on all kinds of texts. The readability formulas that output the first series of results only take the length of words and sentences into account, and their scale is designed for the US school system. In ...
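
To make the limitation concrete, here is a minimal sketch of one length-based formula, the Automated Readability Index, whose result is meant to approximate a US grade level. It is used as an assumption to illustrate this family of formulas; I do not claim that it is among the ones Amazon displays. The tokenization is deliberately naive and the function name is hypothetical.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

    Characters are letters and digits only; words are whitespace-separated;
    sentences are split on runs of '.', '!' or '?'.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(1 for ch in text if ch.isalnum())
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


plain = "The cat sat. The dog ran."
odd = "Why not die? Sad old sun!"
print(automated_readability_index(plain))
print(automated_readability_index(odd))
```

Both sample texts have six three-letter words in two sentences, so the formula gives them exactly the same score although they read very differently. Nothing but the two length ratios enters the computation.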


2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version was made available last summer and mentioned on this blog, cf. this post: Introducing the German Political Speeches Corpus and Visualization Tool.

The resource still uses this permanent redirection: http://purl.org/corpus/german-speeches


If you don’t remember it or never heard of it, here is a brief description:

The resource presented here consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources. It provides raw data, metadata and tokenized text with part-of-speech tagging and lemmas in XML TEI format for researchers that are able to use it and a simple visualization interface for those who want to get a glimpse of what is in the corpus before downloading it or thinking about using more complete tools.

The visualization output is in valid CSS/XHTML format and takes advantage of recent standards. The purpose is to give a sort of Zeitgeist, an insight into the topics developed by a government official and on ...


Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability). You can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size and optionally font-family are set for the whole page:

    html {
        margin-left: 10%;
        margin-right: 10%;
        font-family: sans-serif;
        font-size: 10pt;
    }

  • Two frames, one for the main content and one for the footer, denoted as div in the XHTML file.

    div.framed {
        padding-top: 1em;
        padding-bottom: 1em;
        padding-left: 7%;
        padding-right: 7%;
        border: 1px solid #736F6E;
        margin-bottom: 10px;
    }

    NB: I know there is a faster way to set the padding but I like to keep things clear.

  • I chose to use the good old separation rule, hr in XHTML, with custom (adaptable) spacing in the CSS:

    hr {
        margin-top: 2.5em;
        margin-bottom: 2.5em;
    }

    This ...
