I have selected a few papers on readability published in recent years, all of which are available online (for instance through a specialized search engine, see previous post):
- First, a very up-to-date article that I reviewed last week. L. Feng, M. Jansche, M. Huenerfauth, and N. Elhadad, “A Comparison of Features for Automatic Readability Assessment”, 2010, pp. 276-284.
- The seminal paper to which Feng et al. often refer. It combines several approaches, notably statistical language models, support vector machines and more traditional criteria, and it includes a comprehensive bibliography. S. E. Schwarm and M. Ostendorf, “Reading level assessment using support vector machines and statistical language models”, in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, pp. 523-530.
- A complementary approach that also combines features, this time mainly lexical and grammatical ones. The focus is on the latter, as the authors use parse trees and subtrees (i.e., «relative frequencies of partial syntactic derivations») at three different levels. I found this convincing. The paper also compares three statistical models: Linear Regression, Proportional Odds Model and Multi-class Logistic Regression. M. Heilman, K. Collins-Thompson, and M. Eskenazi, “An analysis of statistical models and features for reading difficulty prediction”, in Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, 2008, pp. 71-79.
- A paper worth mentioning, in my view, as it deals with German. The authors first give a good summary of what has been done on readability since the very beginning. They then introduce a software tool called DeLite, which rates the readability of German texts and returns a report in XML format. I. Glöckner, S. Hartrumpf, H. Helbig, J. Leveling, and R. Osswald, “An architecture for rating and controlling text readability”, Proceedings of KONVENS 2006, pp. 32-35, 2006.
- Another approach to the linguistic phenomenon of readability: the coherence of a text. Starting from the framework of Centering Theory, where the text is represented as a grid of syntactic information, R. Barzilay and M. Lapata treat coherence assessment as a ranking task in which texts are ordered. R. Barzilay and M. Lapata, “Modeling local coherence: An entity-based approach”, Computational Linguistics, vol. 34, iss. 1, pp. 1-34, 2008.
- To me the most comprehensive combination of features, because of the emphasis on discourse features: lexical cohesion, entity coherence (citing the previous paper), and discourse relations (text as a «bag of words»). A good problematization of the concept of readability. Statistical measures are computed on two corpora: the entire Wall Street Journal corpus and a collection of general AP news. The factors are combined using the R package leaps. E. Pitler and A. Nenkova, “Revisiting readability: A unified framework for predicting text quality”, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008, pp. 186-195.
- Lastly, an interesting paper about simplification, with a focus on Brazilian Portuguese and literacy levels. S. Aluisio, L. Specia, C. Gasperin, and C. Scarton, “Readability assessment for text simplification”, in Fifth Workshop on Innovative Use of NLP for Building Educational Applications, 2010, p. 1.
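
To give a rough idea of what the grid in Barzilay and Lapata's entity-based approach looks like, here is a toy sketch in Python. It is only an illustration under strong assumptions: the sentences are annotated by hand, the role labels (S for subject, O for object, X for other, - for absent) are used as a simplification, and the function names `build_grid` and `transition_probabilities` are my own, not taken from the paper. A real system would obtain the roles from a parser and coreference resolution, and the ranking step is not shown.

```python
from collections import Counter
from itertools import product

# Assumed role inventory: subject, object, other, absent.
ROLES = ("S", "O", "X", "-")


def build_grid(sentences):
    """Map each entity to its role in every sentence ('-' if absent).

    `sentences` is a list of dicts {entity: role}, annotated by hand here.
    """
    entities = sorted({entity for sent in sentences for entity in sent})
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}


def transition_probabilities(grid, n=2):
    """Relative frequency of every role sequence of length n in the grid."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - n + 1):
            counts[tuple(roles[i:i + n])] += 1
    total = sum(counts.values())
    return {
        seq: (counts[seq] / total if total else 0.0)
        for seq in product(ROLES, repeat=n)
    }


# Hypothetical three-sentence text; entities and roles are made up.
sentences = [
    {"Microsoft": "S", "market": "X"},
    {"Microsoft": "O", "evidence": "S"},
    {"evidence": "X"},
]

grid = build_grid(sentences)
features = transition_probabilities(grid)

for entity, roles in grid.items():
    print(f"{entity:10s} {' '.join(roles)}")
```

Each text is thus turned into a fixed-length vector of transition frequencies (16 values for sequences of length 2), which is the kind of representation a ranking model can then compare across texts.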