Using and parsing the hCard microformat, an introduction

Recently, as I decided to get involved in the design of my personal page, I learned how to represent semantic markup on a web page. I would like to share a few things about writing and parsing semantic information in this format. My intuition is that this is only the beginning: there will be more and more formats to describe who you are, what you do, who you are related to and where you link to, together with engines that gather this information.

First of all, the hCard microformat refers to this standard: hCard 1.0.1. For an explanation of what it is, see here; for a more general article on microformats, see also Wikipedia.

The information displayed is useful because it provides a way to mark up semantic relations, so that named entities are correctly identified, by search engines for instance: Google supports several formats, including hCard, and there are more specific search engines which aim at gathering information such as contact details or product lists from this kind of markup. For a comprehensive list, see here.
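To make the markup concrete, here is a minimal hCard and a hand-rolled extraction in Python with BeautifulSoup. It is only an illustrative sketch (the snippet and the class lookups are my own, not taken from any particular tool), and dedicated microformat parsers do much more than this.

```python
# A tiny hCard snippet and a naive extraction based on the hCard class names
# (vcard, fn, org, url). Illustrative only.
from bs4 import BeautifulSoup

html = """
<div class="vcard">
  <a class="fn url" href="http://example.org">Jane Doe</a>
  <span class="org">Example University</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for card in soup.find_all(class_="vcard"):
    name = card.find(class_="fn")      # formatted name
    org = card.find(class_="org")      # organization
    link = card.find(class_="url")     # associated URL
    print(name.get_text(strip=True),
          org.get_text(strip=True) if org else "",
          link["href"] if link else "")
```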

Now, if you are interested in parsing microformats, there are several tools. Among them, my pick ...

more ...

Commented bibliography on readability assessment

I have selected a few papers on readability published in recent years, all of them available online (for instance via a specialized search engine, see my previous post):

  1. First of all, the paper I reviewed last week; it is a very up-to-date article: L. Feng, M. Jansche, M. Huenerfauth, and N. Elhadad, “A Comparison of Features for Automatic Readability Assessment”, 2010, pp. 276-284.
  2. The seminal paper to which Feng et al. often refer; its authors combine several approaches, especially statistical language models, support vector machines and more traditional criteria, and provide a comprehensive bibliography. S. E. Schwarm and M. Ostendorf, “Reading level assessment using support vector machines and statistical language models”, in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005, pp. 523-530.
  3. A complementary approach, also based on a combination of features, this time mainly lexical and grammatical ones, with a focus on the latter, as the authors use parse trees and subtrees (i.e. «relative frequencies of partial syntactic derivations») at three different levels. I found this convincing. It compares three statistical models: linear regression, the proportional odds model and multi-class logistic regression. M. Heilman, K. Collins-Thompson, and M. Eskenazi, “An analysis of statistical models and features for reading difficulty ...
more ...

Comparison of Features for Automatic Readability Assessment: review

I read an interesting article, “featuring” an up-to-date comparison of what is being done in the field of readability assessment:

“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284.

I am interested in the features they use, so let me summarize them in a quick review:

Corpus and tools

  • Corpus: a sample from the Weekly Reader
  • OpenNLP to extract named entities and resolve co-references
  • the Weka toolkit for machine learning


Features

  • Four subsets of discourse features:
    1. entity-density features
    2. lexical-chain features (the chains rely on automatically detected semantic relations)
    3. co-reference inference features (a research novelty)
    4. entity grid features (transition patterns according to the grammatical roles of the words)
  • Language Modeling Features, i.e. trained language models
  • Parsed Syntactic Features, such as parse tree height
  • POS-based Features
  • Shallow Features, i.e. traditional readability metrics (see the sketch after this list)
  • Other features, mainly “perplexity features” following Schwarm and Ostendorf (2005), see below
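As a quick illustration of what “shallow features” mean in the traditional sense, here is a toy sketch (my own, not the authors’ implementation) computing average sentence length, average word length and the Flesch Reading Ease score with a crude syllable count:

```python
import re

def shallow_features(text):
    """Toy shallow readability features: average sentence length,
    average word length and Flesch Reading Ease (crude syllable count)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    asl = len(words) / len(sentences)               # average sentence length
    awl = sum(len(w) for w in words) / len(words)   # average word length
    flesch = 206.835 - 1.015 * asl - 84.6 * (syllables / len(words))
    return {"avg_sentence_len": asl, "avg_word_len": awl, "flesch": flesch}

print(shallow_features("This is a short text. It is quite easy to read."))
```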


Results

  • Combining discourse features does not significantly improve accuracy; discourse features do not seem to be useful.
  • Language models trained with information gain outperform those trained ...
more ...

A short bibliography on Latent Semantic Analysis and Indexing

To go a bit further than my previous post, here are a few references that I recently found to be interesting.

For a definition and/or other short bibliographies, see Wikipedia or, for a change, Scholarpedia, which features an article “curated” by T.K. Landauer and S.T. Dumais.

U. Mortensen, Einführung in die Korrespondenzanalyse, Universität Münster, 2009.

G. Gorrell and B. Webb, “Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis,” in Ninth European Conference on Speech Communication and Technology, 2005.

P. Cibois, Les méthodes d’analyse d’enquêtes, Que sais-je ?, 2004.

B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004.

M. W. Berry, S. T. Dumais, and G. W. O’Brien, “Using Linear Algebra for Intelligent Information Retrieval,” SIAM Review, vol. 37, iss. 4, pp. 573-595, 1995.

S. Dumais, Enhancing performance in latent semantic indexing (LSI) retrieval, Bellcore, 1992.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis”, Journal of the American society for information science, vol. 41, iss. 6, pp. 391-407, 1990.

G. Salton, A. Wong, and C. S. Yang, “A vector ...

more ...

Building a topic-specific corpus out of two different corpora

I have two corpora (say, I crawled two websites and got hold of their content) which sometimes focus on the same topics. I would like to try and merge them in order to build a balanced and coherent corpus. As this is a widely discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and as I don’t know how far I am going to go with both corpora I want to keep it simple.

One of the appropriate techniques (if not the best)

I could do it using LSA (in this particular case Latent semantic analysis, and not Lysergic acid amide!) or, to be more precise, Latent semantic indexing.

As this technical report shows, it can perform well in that kind of case: Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus,  B. Pincombe, Australian Department of Defence, 2004. (full text available here or through any good search engine, see previous post)
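To give an idea of what an LSI-based comparison might look like, here is a minimal sketch using scikit-learn. The toy documents and the number of latent dimensions are placeholders; this is an illustration, not the pipeline I actually use.

```python
# Toy documents stand in for the two crawled corpora; on real data one would
# use a few hundred latent dimensions rather than two.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus_a = ["the cat sat on the mat", "dogs and cats are common pets"]
corpus_b = ["a text about pets and domestic animals",
            "an unrelated text about stock markets"]

# Build a common term-document matrix, then project it into a latent space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_a + corpus_b)
lsi = TruncatedSVD(n_components=2)
X_latent = lsi.fit_transform(X)

# Pairwise similarities between documents of corpus A and corpus B.
sim = cosine_similarity(X_latent[:len(corpus_a)], X_latent[len(corpus_a):])
print(sim)
```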

This could be a topic for later research.

The approach that I am working on (not quick and dirty but simpler and hopefully robust)

For exercise ...

more ...

Collecting academic papers

I would like to build a corpus from a variety of scientific papers of a given field in a given language (German).

Leaving the problems of crawling aside, I wonder whether there is a way to do this automatically. All the papers I have read deal with hand-collected corpora.

The Open Archives format might be a good workaround (see the Open Archives Initiative Protocol for Metadata Harvesting). As far as I know it is widespread, and there are search engines that look for academic papers and use this metadata.
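To illustrate, here is a minimal harvesting sketch in Python using only the standard OAI-PMH verbs. The endpoint URL is a placeholder, and a real harvester would also have to handle resumption tokens for large result sets.

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

base_url = "https://example.org/oai"   # placeholder repository endpoint
url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Print title and language of each harvested record (Dublin Core metadata).
for record in tree.iter(OAI + "record"):
    title = record.find(".//" + DC + "title")
    lang = record.find(".//" + DC + "language")
    print(title.text if title is not None else "(no title)",
          "-", lang.text if lang is not None else "?")
```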

Among the most popular ones (Google Scholar, Scirus, OAIster), a few seem to deal with a lot of German texts: Scientific Commons (St. Gallen, CH) and BASE (Bielefeld).

I read an interesting article today about search engines in this particular field, by Dirk Pieper and Sebastian Wolf from the University Library of Bielefeld: “Wissenschaftliche Dokumente in Suchmaschinen”, in Handbuch Internet-Suchmaschinen, D. Lewandowski (ed.), Heidelberg, 2009. A PDF version is available here.

I could crawl the result pages of a given website and see what I get. I’ll see what I can do.

more ...


Here is the beginning of a bibliography generated from my Master’s thesis, converted between different formats, and parked here for further reference.

Complexity and Readability Assessment


Complexity and Linguistic Complexity Theory

  • S. T. Piantadosi, H. Tily, and E. Gibson, “Word lengths are optimized for efficient communication”, Proceedings of the National Academy of Sciences, vol. 108, iss. 9, pp. 3526-3529, 2011.
  • L. Maurits, A. Perfors, and D. Navarro, “Why are some word orders more common than others? A uniform information density account”, in Proceedings of NIPS, 2010.
  • P. Blache, “Un modèle de caractérisation de la complexité syntaxique”, in TALN 2010, Montréal, 2010.
  • T. Givon, The Genesis of Syntactic Complexity: diachrony, ontogeny, neuro-cognition, evolution, Amsterdam, New York: John Benjamins Publishing Co., 2009.
  • M. Mitchell, Complexity: A Guided Tour, Oxford, New York: Oxford University Press, 2009.
  • C. Beckner, N. C. Ellis, R. Blythe, J. Holland, J. Bybee, J. Ke, M. H. Christiansen, D. Larsen-Freeman, W. Croft, and T. Schoenemann, “Language Is a Complex Adaptive System ...
more ...

Why I don’t blog on and why I might do so (someday…)

People around me at the lab keep talking about a French institutional blog platform. In fact it is well known, but no one is using it. The website is still fairly new; according to them, it currently hosts a hundred blogs.

The main benefits are visibility and durability as it is institutional, well-referenced and competently maintained.

It is what it claims to be, which is also why I hesitated and finally chose to set up a basic personal website.

  • First, you need to fill out a form to register, which is good in terms of quality control, but I don’t know how long or how often I am going to blog, and I don’t want to request a service I might end up not using.
  • The second reason is that it is very useful for people who do not want to deal with layout issues: all the pages look much the same, apart from background colors and a few images. I think this may be meant to maintain a global coherence across the website.
  • It’s not that international, but then it’s not meant to be. Most of the articles are in French, and I ...
more ...

A fast bash pipe for TreeTagger

I have been working with TreeTagger, the part-of-speech tagger developed at the IMS Stuttgart, since my Master’s thesis. It performs well on German texts, as one might expect, since that was one of its primary purposes. One major problem is that it is poorly documented, so I would like to share the way I found to pass things to TreeTagger through a pipe.

The first thing is that TreeTagger doesn’t take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to it to ISO-8859-1, which the iconv tool, with the TRANSLIT option set, does very well; here it means “find an equivalent if the character cannot be exactly translated”.

Then you have to define the options that you want to use. I put the most frequent ones in the example.


The advantage of a pipe is that you can clean the text while passing it to the tagger. Here is one way of doing it, using the stream editor sed to: 1. remove trailing blank lines, 2. collapse anything longer than one space into a single space, and 3. replace spaces with newlines.
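Since the original one-liner is not reproduced in this excerpt, here is a rough Python equivalent of the pipeline described above. The installation paths are placeholders, and Python’s errors="replace" is only a crude stand-in for iconv’s //TRANSLIT.

```python
import re
import subprocess

# Placeholders: adapt to your own TreeTagger installation.
TAGGER = "/opt/treetagger/bin/tree-tagger"
PARAMS = "/opt/treetagger/lib/german.par"   # Latin-1 German parameter file

def tag(text):
    """Clean the text and pipe it to TreeTagger, one token per line."""
    text = text.strip()                       # 1. drop leading/trailing blank lines
    text = re.sub(r"\s+", " ", text)          # 2. collapse whitespace runs
    tokens = text.replace(" ", "\n")          # 3. one token per line
    result = subprocess.run(
        [TAGGER, "-token", "-lemma", "-sgml", PARAMS],
        # errors="replace" is only a crude stand-in for iconv's //TRANSLIT
        input=tokens.encode("iso-8859-1", errors="replace"),
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("iso-8859-1")

print(tag("Dies ist   ein kleiner  Test."))
```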

This way ...

more ...

Resources and links of interest

Archive of links gathered during my PhD thesis: 1. Linguistics and NLP 2. Corpus Linguistics 3. Perl 4. LaTeX 5. R 6. PhD related 7. Misc.

1 – Linguistics and NLP

General Linguistics

Computational Linguistics

Online Articles and Conferences

Lists of CL Blogs

Resources for German

Computer Science

more ...