Renate Bartsch on linguistic complexity

I just found a seminal article on complexity by Renate Bartsch, written in German in 1973. It is a very good summary of the state of this topic at the beginning of the 1970s. The generative grammar research on language was starting to be criticized, but it still served as a landmark and a framework (most notably the reflection on surface and deep structure).

R. Bartsch, “Gibt es einen sinnvollen Begriff von linguistischer Komplexität?”, Zeitschrift für Germanistische Linguistik, vol. 1, iss. 1, pp. 6-31, 1973.

Bartsch focuses on three main aspects of the problem in order to answer the question: does the idea of linguistic complexity make sense?


The framework of transformational grammar alone cannot be trusted when it comes to measuring complexity, because surface complexity does not account for a potential underlying complexity.
Bartsch quotes the interviews conducted by Labov and his conclusion that dialect differences are to be found at the surface level, without having anything to do with the logic of a sentence.


This is by far the most interesting part of the article: many criteria for linguistic complexity are analyzed with examples (some in German).
Bartsch also writes about complexity metrics and claims ...

more ...

Philosophy of technology, how things started: a typology

In my previous post, I presented a few references. I went on reading books and articles on this topic, and I am now able to sort them into several kinds of approaches.

This is mostly thanks to these books in French on philosophy of technology:

  • G. Simondon, L’invention dans les techniques : cours et conférences, Paris: Seuil, 2005.
  • G. Hottois, Philosophies des sciences, philosophies des techniques, Paris: Odile Jacob, 2004.
  • J. Goffi, La philosophie de la technique, Presses Universitaires de France, 1988.
  • G. Hottois, Le signe et la technique : la philosophie à l’épreuve de la technique, Paris: Aubier, 1984.

In his second lesson at the Collège de France (Philosophies des sciences, philosophies des techniques, pp. 94-118), Gilbert Hottois tries to provide a state of the art in the philosophy of technology: he describes several traditions and backgrounds. Here is how things started:

  1. A German origin of the reflection on technology (Ernst Kapp, Friedrich Dessauer), pursued mostly by engineers who shed new light on the topic and try to think of it as a system. The VDI (Verein Deutscher Ingenieure) continues this tradition: from 1956 onwards, the association has organized a series of meetings entitled Man and Technology, which notably sees the question ...
more ...

Philosophy of technology: a few resources

As I once studied philosophy (back in the classes préparatoires), I like to keep in touch with this kind of reflection. Moreover, in a research field where everything moves very fast, it is a way to find a few continuities and to ground the particular questions regarding the analysis of language in a more conceptual framework.

Here is a list of texts available on the Internet (some of them only partly) that seem important to me. Some are in English, some in French or German, as I chose the original versions.

It does not pretend to be complete! Other references may follow.

  • Denis Diderot wrote the article Art in the Encyclopédie. It is a state of the art introducing the word and its different meanings (which at that time included arts, techniques, and technology). Diderot speaks in favor of the techniques developed by craftsmen and gives an account of the ideas of the time about liberal arts, theory, and usage.
    The whole text was made available by the ARTFL Encyclopédie Project.

    Craftsmen have believed themselves contemptible because they have been despised; let us teach them to think better of themselves: that is the ...

more ...

Binary search to find words in a list: Perl tutorial

Given a dictionary, say one of the frequent word lists of the University of Leipzig, and given a series of words: how can you check which ones belong to the list?

Another option would be to use the smartmatch operator available since Perl 5.10: `if ($word ~~ @list) {…}`. But this gets very slow as the size of the list increases. I wrote a naive implementation of the binary search algorithm in Perl that I would like to share. It is not that fast, but it is basic and it works.

First of all the wordlist gets read:

my $dict = 'leipzig10000';
open (DICTIONARY, $dict) or die "Error...: $!\n";
my @list = <DICTIONARY>;
close (DICTIONARY) or die "Error...: $!\n";
my $number = scalar @list;

Then you have to initialize the values (given a list @words) and the upper bound:

my $start = scalar @words;
my $log2 = 0;
my $n = $number;
my ($begin, $end, $middle, $i, $word);
my $a = 0;
# compute the number of halving steps (the base-2 logarithm of the list size)
while ($n > 1) {
    $n = $n / 2;
    $log2 = $log2 + 1;
}

Then the binary search can start:

foreach $word (@words) {
    $begin = 0;
    $end = $number - 1;
    $middle = int($number/2);
    $word =~ tr/A-Z/a-z/;
    $i = 0;
    while($i < $log2 + 1){
        if ($word eq lc($list[$middle])){
        elsif ($word ...
more ...
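The Perl loop above is truncated, but the underlying algorithm is the standard binary search: keep halving the interval until the word is found or the interval is empty. Since the original code cannot be tested here, the following is a compact, self-contained sketch of the same idea in Python (the word list is illustrative):

```python
def binary_search(sorted_list, word):
    """Return True if word occurs in sorted_list, using binary search."""
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_list[middle] == word:
            return True
        elif sorted_list[middle] < word:
            low = middle + 1   # search the upper half
        else:
            high = middle - 1  # search the lower half
    return False

# toy dictionary standing in for the Leipzig word list
dictionary = sorted(["apple", "banana", "cherry", "date", "fig"])
print(binary_search(dictionary, "cherry"))  # True
print(binary_search(dictionary, "grape"))   # False
```

As in the Perl version, the words should be case-normalized (e.g. with `str.lower()`) before the comparison, and the list must be sorted for the search to be correct.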

Resource links update

I recently updated the blogroll section and I also would like to share a few links:

As I will be teaching LaTeX soon, the LaTeX links section of the blog has expanded.

Last but not least, here is an e-book, Mining of Massive Datasets, by A. Rajaraman and J. D. Ullman. It grew out of classes taught at Stanford and is now free to use (available chapter by chapter or as a whole); it is very up-to-date and informative on this hot topic. It seems to be a good introduction as well. That said, I cannot really review it since I am not an expert in this research field.

Here is the reference:

  • A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, Stanford, Palo Alto, CA: e-book, 2010.
more ...

Quick review of the Falko Project

The Falko Project is an error-annotated corpus of German as a foreign language, maintained by the Humboldt-Universität zu Berlin, which has made it publicly accessible.

Recently a new search engine was made available, practically replacing the old CQP interface. This tool is named ANNIS2 and can handle complex queries on the corpus.


There are several subcorpora, and apparently more to come. The texts were written by advanced learners of German. Most notably, there are summaries (with the original texts and a comparable corpus of summaries written by native speakers), essays collected at different locations (with the same type of comparable corpus), and a ‘longitudinal’ corpus from students of Georgetown University in Washington.

The corpora are annotated by a part-of-speech tagger (the TreeTagger), so that word types and lemmas are known; above all, the mistakes are annotated, with several hypotheses at different levels (mainly what the correct sentence would be and what the reason for the mistake might be).


The engine (ANNIS2) has a good tutorial (in English, by the way), so it is not that difficult to search for complex patterns across the subcorpora. It also seems efficient in terms of speed. You may ...

more ...

Having fun and making money doing research

What do people look for? A few years ago it would have been difficult to gather information on a large scale and analyze it with a powerful, yet more or less objective tool. Nowadays a single company is able to know what you want, what you buy, or what you just did. And sometimes it shares a little bit of the data.

So, the end of the year gives me an occasion to try and discover changes in the mentalities using the ready-to-use Google Trends. Just for fun…

How does research compare with other interests ?

First of all, research is no fun: it used to be searched for more often than money and was at the level of work, but things have changed. It still outnumbers fun in the news, though.

Figure: a few trends regarding research (“Research is no fun”…). Source: Google Trends, worldwide.

People seem to look for money more often than a few years ago; it is the only term that has become more popular, while even work merely remains stable.

A remark: I think the search volume is much bigger now than it was back in 2004; there are also more languages available, and probably more search terms (since the users may ...

more ...

Three series of recorded lectures

Here is my selection of introductory courses given by well-known specialists in Computer Science or Natural Language Processing and recorded so that they can be followed at home.

1. Artificial Intelligence | Natural Language Processing, Christopher D. Manning, Stanford University.
More than 20 hours, 18 lectures.
Introduction to the key topics of NLP, summary of existing models.
Lecture 12: Dan Jurafsky as a guest lecturer.
Requires the Silverlight plugin (no comment). Transcripts available.

2. Bits, Harry R. Lewis, Harvard University.
A general overview of information as quantity and quantitative methods.
A very comprehensive lecture series (data theories, internet protocols, encryption, copyright issues, laws…), cut into small pieces so you can pick a focused topic.
Several formats available, links to blog posts.

3. Search Engines: Technology, Society, and Business, various lecturers, UC Berkeley.
Fall 2007, 13 lectures.
Overview of the topic.
Requires iTunes (no comment).

more ...

On Text Linguistics

Talking about text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well known in Germany but is not really treated as a topic in its own right elsewhere. Most of the time, no distinction is made between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week; he describes himself as the “last of the Mohicans” using this framework in French research. He drew a comprehensive picture of its origins and developments, which I am going to try to sum up.

This field started to become popular in the 1970s with books by Eugenio Coseriu, Harald Weinrich (in Germany), František Daneš (and the Functional Sentence Perspective framework), and M.A.K. Halliday, who was much more widely read in English-speaking countries. Text linguistics is not a grammatical description of language, nor is it bound to a particular language. It is a science of texts, a theory that comes on top of several levels such as semantics or structure analysis. It makes it possible to distinguish several classes of texts at a global ...

more ...

E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Here is what I retain from my reading of this book: E. Castello, Text Complexity and Reading Comprehension Tests, Bern: Peter Lang, 2008.

Notional framework

To begin with, Castello identifies two types of complexity and states that research in this field attempts to quantify inherent complexity and receiver-oriented complexity, i.e. complexity or difficulty per se on one side, and difficulty in terms of the reader and the text on the other.

He cites C. J. Alderson and L. Merlini Barbaresi (strangely enough, we are not related, as far as I know) for their definitions of linguistic complexity, M. Halliday and T. Gibson regarding lexical information, and S. Urquhart and C. Weir for their work on different types of reading.


Erik Castello uses a series of measures, most notably:

  1. word-related:
     - type/token ratio
     - word frequency lists
     - lexical density: the difference between lexical and grammatical words, multi-word units, research on word families
     - lexical variation
  2. clause-related:
     - clause type ratio, meaning for instance the ratio between hypotactic clauses and clause complexes
     - grammatical intricacy
  3. sentence-related:
     - readability formulas
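Two of the word-level measures above are easy to operationalize. The following is a minimal Python sketch, not Castello's own procedure: the tokenizer and the tiny stopword list standing in for "grammatical words" are simplifying assumptions (a real study would use a proper tokenizer and a full closed-class word list):

```python
import re

def tokens(text):
    """Crude tokenizer: lowercase alphabetic word forms."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def type_token_ratio(text):
    """Distinct word forms (types) divided by running words (tokens)."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

# Toy list of grammatical (closed-class) words -- an assumption for
# illustration, not a linguistically complete inventory.
GRAMMATICAL = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "that"}

def lexical_density(text):
    """Share of lexical (content) words among all tokens."""
    toks = tokens(text)
    lexical = [t for t in toks if t not in GRAMMATICAL]
    return len(lexical) / len(toks) if toks else 0.0

sample = "The cat sat on the mat and the cat slept."
print(round(type_token_ratio(sample), 2))  # 0.7 (7 types / 10 tokens)
print(round(lexical_density(sample), 2))   # 0.5 (5 content words / 10 tokens)
```

Both measures are sensitive to text length, which is one reason such counts are usually complemented by the clause- and sentence-level measures listed above.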

He mentions an interesting idea: trying to capture the intention of the writer in a given situation, which can be compared with measures at the discourse level (see ...

more ...