Using and parsing the hCard microformat, an introduction

Recently, as I decided to get involved in the design of my personal page, I learned how to represent semantic markup on a web page. I would like to share a few things about writing and parsing semantic information in this format. I have the intuition that this is only the beginning: there will be more and more formats to describe who you are, what you do, whom you are related to, where you link to, as well as engines that gather this information.
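As a minimal sketch of what such markup looks like: the class names `vcard`, `fn` and `url` come from the hCard spec, while the file name and the sed-based extraction are only mine (a real parser would be far more robust):

```shell
# a minimal hCard: the class names carry the semantics
cat > card.html <<'EOF'
<div class="vcard">
  <a class="url fn" href="http://example.org">Jane Doe</a>
</div>
EOF
# crude extraction of the formatted name ("fn") with sed;
# good enough for a demo, not for real-world parsing
sed -n 's/.*class="url fn"[^>]*>\([^<]*\)<.*/\1/p' card.html
```

This prints the formatted name, Jane Doe.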

First of all, the hCard microformat points to a standard, hCard 1.0.1. For …

more ...

Commented bibliography on readability assessment

I have selected a few papers on readability published in the last years, all available online (for instance using a specialized search engine, see previous post):

  1. First of all, the article I reviewed last week, a very up-to-date piece: L. Feng, M. Jansche, M. Huenerfauth, and N. Elhadad, “A Comparison of Features for Automatic Readability Assessment”, 2010, pp. 276-284.
  2. The seminal paper to which Feng et al. often refer, as they combine several approaches, especially statistical language models, support vector machines and more traditional criteria. It features a comprehensive bibliography. S. E. Schwarm and M. Ostendorf, “Reading level assessment using …
more ...

Comparison of Features for Automatic Readability Assessment: review

I read an interesting article, “featuring” an up-to-date comparison of what is being done in the field of readability assessment:

“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284.

I am interested in the features they use. Let me summarize them in a quick review:

Corpus and tools

  • Corpus: a sample from the Weekly Reader
  • OpenNLP to extract named entities and resolve co-references
  • the Weka learning toolkit for machine learning


  • Four subsets of discourse features:
    1. entity-density …
more ...

A short bibliography on Latent Semantic Analysis and Indexing

To go a bit further than my previous post, here are a few references that I recently found to be interesting.

For a definition and/or other short bibliographies, see Wikipedia or, this time, something else: Scholarpedia, with an article “curated” by T.K. Landauer and S.T. Dumais.

U. Mortensen, Einführung in die Korrespondenzanalyse, Universität Münster, 2009.

G. Gorrell and B. Webb, “Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis,” in Ninth European Conference on Speech Communication and Technology, 2005.

P. Cibois, Les méthodes d’analyse d’enquêtes, Que sais-je ?, 2004.

B. Pincombe, Comparison of Human and Latent Semantic …

more ...

Building a topic-specific corpus out of two different corpora

I have two corpora (say, I crawled two websites and got hold of them) which sometimes focus on the same topics. I would like to try and merge them in order to build a balanced and coherent corpus. As this is a highly discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research, and as I don’t know how far I will go with both corpora, I want to keep it simple.
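A simple first pass, assuming each corpus is stored one text per line (the file names and contents below are made up for the example), could be plain concatenation plus exact-duplicate removal:

```shell
# toy corpora, one document per line (hypothetical files)
printf 'text on syntax\ntext on corpora\n' > corpus_a.txt
printf 'text on corpora\ntext on tagging\n' > corpus_b.txt
# the simplest possible merge: concatenate, then drop exact duplicates
cat corpus_a.txt corpus_b.txt | sort -u > merged.txt
wc -l < merged.txt
```

The duplicate line survives only once, so the merged corpus counts 3 texts. This obviously ignores near-duplicates and topic balance, but it is a baseline to refine later.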

One of the appropriate techniques (if not the best)

I could do …

more ...

Collecting academic papers

I would like to build a corpus from a variety of scientific papers of a given field in a given language (German).

Crawling problems aside, I wonder whether there is a way to do this automatically. All the papers I have read deal with hand-collected corpora.

The Open Archives format might be a good workaround (see The Open Archives Initiative Protocol for Metadata Harvesting). As far as I know it is widespread, and there are search engines that look for academic papers and use this metadata.
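For illustration, an OAI-PMH request is just a URL with a `verb` parameter; the endpoint below is only an example of a repository address, and real harvesting would also need the `resumptionToken` handling that the protocol defines:

```shell
# base URL of an OAI-PMH repository (example endpoint; any repository works)
BASE='http://export.arxiv.org/oai2'
# ListRecords with Dublin Core metadata: two standard parameters of the protocol
REQUEST="${BASE}?verb=ListRecords&metadataPrefix=oai_dc"
echo "$REQUEST"
# the actual harvesting step would then be, e.g.: curl -s "$REQUEST"
```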

Among the most popular ones (Google Scholar, Scirus, OAIster), a few seem …

more ...


Here is the beginning of a bibliography generated from my Master’s thesis, converted between different formats, and parked here for further reference.

Complexity and Readability Assessment


Complexity and Linguistic Complexity Theory

  • S. T. Piantadosi, H. Tily, and E. Gibson, “Word lengths are optimized for efficient communication”, Proceedings of the National Academy of Sciences, vol. 108, iss. 9, pp. 3526-3529, 2011.
  • L. Maurits …
more ...

Why I don’t blog on and why I might do so (someday…)

People around me at the lab keep talking about a French institutional blog platform. It is in fact well-known, but no one here is using it. The website is still a bit new; according to them, it currently hosts about a hundred blogs.

The main benefits are visibility and durability as it is institutional, well-referenced and competently maintained.

It is what it claims to be, which is also why I hesitated and finally chose to set up a basic personal website.

  • First you need to fill out a form to get a registration, which is good in terms of …
more ...

A fast bash pipe for TreeTagger

I have been working with TreeTagger, the part-of-speech tagger developed at the IMS Stuttgart, since my master’s thesis. It performs well on German texts, as one could easily suppose, since that was one of its primary purposes. One major problem is that it is poorly documented, so I would like to share the way I found to pass things to TreeTagger through a pipe.

The first thing to know is that TreeTagger doesn’t take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to it to ISO-8859-1, which the iconv software with the …
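Putting it together, here is a sketch of such a pipe. The install path and the `tree-tagger-german` wrapper script name are assumptions about a standard TreeTagger setup; when the tagger is absent, the snippet only demonstrates the encoding round-trip:

```shell
TAGDIR="$HOME/treetagger"          # assumed install location
TEXT='Das ist ein Käse-Test .'
if [ -x "$TAGDIR/cmd/tree-tagger-german" ]; then
  # convert to Latin-1 on the way in, back to UTF-8 on the way out
  printf '%s\n' "$TEXT" \
    | iconv -f UTF-8 -t ISO-8859-1 \
    | "$TAGDIR/cmd/tree-tagger-german" \
    | iconv -f ISO-8859-1 -t UTF-8
else
  # without the tagger installed, show only the encoding round-trip
  printf '%s\n' "$TEXT" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f ISO-8859-1 -t UTF-8
fi
```

The two iconv calls wrap the tagger so that the rest of the toolchain can stay in UTF-8.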

more ...

Resources and links of interest