In a recent article about a readability checker prototype for Italian, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi provide a good overview of current research on readability. Starting from the end of the article, I must say the bibliography is quite up to date, and the authors offer an extensive review of the criteria used by other researchers.

Tendencies in research

First of all, there is a growing tendency towards statistical language models. Thomas François (2009), for example, uses language models and considers them a more efficient replacement for the vocabulary lists used in readability formulas.
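
To make the contrast with vocabulary lists concrete, here is a minimal sketch of a language-model-based lexical score. It is not the model used by François (2009): it assumes a plain unigram model with add-one smoothing, and the reference corpus, the example sentences, and the function names (`train_unigram`, `avg_log_prob`) are all made up for illustration. A word list can only say whether a word is known or not, whereas the model gives every word a graded probability.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a function giving add-one smoothed unigram probabilities."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def avg_log_prob(text, prob):
    """Mean log-probability per token; lower values suggest rarer vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("empty text")
    return sum(math.log(prob(t)) for t in tokens) / len(tokens)


# Toy reference corpus (hypothetical); a real one would be far larger.
reference = "the cat sat on the mat the dog sat on the rug".split()
prob = train_unigram(reference)

easy = "the cat sat on the rug"
hard = "the feline reclined upon the tapestry"

print(avg_log_prob(easy, prob))
print(avg_log_prob(hard, prob))
```

The sentence built from frequent words gets the higher (less negative) average log-probability, which is the kind of graded signal a readability model could use as a lexical feature.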

Secondly, readability assessment at the lexical and syntactic levels has been explored, but higher-level factors still need to be taken into account. It has been acknowledged since the 1980s that the structure of texts and the development of discourse play a major role in making a text more complex. Still, it is harder to focus on discourse features than on syntactic ones.

« Over the last ten years, work on readability deployed sophisticated NLP techniques, such as syntactic parsing and statistical language modeling, to capture more complex linguistic features and used statistical machine learning to build readability assessment tools. […] Yet, besides lexical and syntactic complexity features there are other important factors, such as the structure of the text, the definition of discourse topic, discourse cohesion and coherence and so on, playing a central role in determining the reading difficulty of a text. » (Dell’Orletta et al., p. 74.)

As a matter of fact, the prototype introduced by Dell’Orletta et al. (named READ-IT) does not deal with discourse features.

Combination of factors and adaptation

The authors underline the importance of combination:

« The last few years have been characterised by approaches based on the combination of features ranging over different linguistic levels, namely lexical, syntactic and discourse. » (Dell’Orletta et al., p. 75.)

Combining features also calls for adaptability, which is a key concept, since one has to bear in mind that « reading ease does not follow from intrinsic text properties alone, but it is also affected by the expected audience » (ibid., p. 75).
The authors cite Pitler and Nenkova (2008) as an example of this approach and refer to their conclusions on the adaptability of criteria: « When readability is targeted towards adult competent language users a more prominent role is played by discourse features. » (ibid., p. 75.)
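
As a rough illustration of what combining features from different linguistic levels can look like in practice, the sketch below builds a small feature vector from raw text. It is not READ-IT's feature set: the three features, the naive sentence splitter, and the function names are assumptions chosen for brevity. Mean sentence length is only a crude stand-in for syntactic complexity, and discourse features are left out entirely because they cannot be approximated this simply.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def features(text):
    """Return a small feature vector mixing lexical and syntactic proxies."""
    sentences = split_sentences(text)
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        raise ValueError("text has no words")
    return {
        # lexical level
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        # crude proxy for the syntactic level
        "mean_sentence_length": len(words) / len(sentences),
    }


sample = "The cat sat. The cat ran away."
vector = features(sample)
print(vector)
```

A vector like this would typically be fed to a statistical classifier trained on texts of known difficulty. Training on a different target audience changes how much each feature weighs, which is one way to read the adaptability point above.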

Remarks on the article

The corpus consists of newspaper articles and comparable articles from an easy-to-read newspaper. I am also working on this kind of parallel approach. Newspaper corpora are hard to republish due to copyright issues.
I discovered an interesting criterion, first summarized by Miller and Weinert (1998): « sentences containing subordinate clauses in post-verbal rather than in pre-verbal position are easier to read » (Dell’Orletta et al., p. 77).

References

  • F. Dell’Orletta, S. Montemagni, and G. Venturi, “READ-IT: Assessing Readability of Italian Texts with a View to Text Simplification”, in Proceedings of the 2nd Workshop on Speech and Language Processing for Assistive Technologies, Edinburgh, Scotland, UK, 2011, pp. 73-83.
  • T. François, “Modèles statistiques pour l’estimation automatique de la difficulté de textes de FLE”, in Actes TALN/RECITAL, Senlis, 2009.
  • J. Miller and R. Weinert, Spontaneous spoken language. Syntax and discourse, Oxford, Clarendon Press, 1998.
  • E. Pitler and A. Nenkova, “Revisiting readability: A unified framework for predicting text quality”, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008, pp. 186-195.