I read an interesting article “featuring” an up-to-date comparison of what is being done in the field of readability assessment:

“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276–284.

I am interested in the features they use, so here is a quick review:

Corpus and tools

  • Corpus: a sample from the Weekly Reader, an educational newspaper whose texts are targeted at different grade levels
  • OpenNLP to extract named entities and resolve co-references
  • the Weka learning toolkit for machine learning
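
As an aside, Weka reads its input from ARFF files, so the extracted features end up serialized in that format at some point. Here is a minimal Python sketch of such an export; the feature names, values and grade set are my own illustration, not the paper’s actual (much larger) feature set:

```python
# Minimal sketch: serialize extracted features for Weka.
# Feature names, values and the grade set are illustrative only.

def write_arff(rows, path="readability.arff"):
    """rows: iterable of (avg_sentence_length, entity_density, grade)."""
    with open(path, "w") as f:
        f.write("@RELATION readability\n\n")
        f.write("@ATTRIBUTE avg_sentence_length NUMERIC\n")
        f.write("@ATTRIBUTE entity_density NUMERIC\n")
        f.write("@ATTRIBUTE grade {2,3,4,5}\n\n")  # assumed grade levels
        f.write("@DATA\n")
        for asl, ed, grade in rows:
            f.write(f"{asl:.2f},{ed:.3f},{grade}\n")

write_arff([(12.4, 0.210, 3), (18.9, 0.350, 5)])
```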

Features

  • Four subsets of discourse features:
      1. entity-density features
      2. lexical-chain features (the chains rely on automatically detected semantic relations)
      3. co-reference inference features (a research novelty)
      4. entity-grid features (transition patterns according to the grammatical roles of the words)
  • Language Modeling Features, i.e. features obtained by training language models
  • Parsed Syntactic Features, such as parse tree height
  • POS-based Features
  • Shallow Features, i.e. traditional readability metrics
  • Other features, mainly “perplexity features” following Schwarm and Ostendorf (2005) (full reference below); see the sketch after this list
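
To make the perplexity features concrete: in the spirit of Schwarm and Ostendorf (2005), one can train one n-gram language model per grade level and use the perplexity each model assigns to a new text as one feature. A minimal sketch with NLTK’s lm module (my choice for brevity; the toy corpora and model settings are assumptions, not the paper’s setup):

```python
# Minimal sketch of perplexity features: one n-gram model per grade
# level, one perplexity score per model as a feature. Toy corpora;
# assumes NLTK and its 'punkt' tokenizer data are installed.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams
from nltk.tokenize import sent_tokenize, word_tokenize

N = 2  # bigram models, for brevity

def train_lm(texts):
    """Train one Laplace-smoothed n-gram model on a list of raw texts."""
    sents = [word_tokenize(s.lower()) for t in texts for s in sent_tokenize(t)]
    train, vocab = padded_everygram_pipeline(N, sents)
    lm = Laplace(N)
    lm.fit(train, vocab)
    return lm

def perplexity_features(text, models):
    """One feature per grade-level model: the perplexity it assigns to the text."""
    sents = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
    ngrams = [ng for sent in sents for ng in padded_everygrams(N, sent)]
    return {grade: lm.perplexity(ngrams) for grade, lm in models.items()}

# Toy corpora standing in for the graded Weekly Reader sample.
corpus_by_grade = {
    2: ["The cat sat on the mat. The dog ran fast."],
    5: ["Photosynthesis converts light energy into chemical energy for the plant."],
}
models = {g: train_lm(ts) for g, ts in corpus_by_grade.items()}
print(perplexity_features("The dog sat on the mat.", models))
```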

Results

  • Combining discourse features does not significantly improve accuracy; discourse features do not seem to be useful.
  • Language models trained with information gain outperform those trained with POS labels, as well as those trained on words and/or tags alone.
  • Verb phrases appear to be more closely correlated with text complexity than other types of phrases.
  • Noun-based features generate the highest classification accuracy.
  • Average sentence length has dominating predictive power over all other shallow features (see the sketch after this list).
  • The clause-based criteria did not perform well; the authors plan to work on them.
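
Given how much weight average sentence length carries, it is worth recalling how cheap shallow features are to compute. A rough sketch with naive sentence splitting and syllable counting, just to show the idea (the Flesch-Kincaid grade is one of the traditional formulas built on exactly these counts):

```python
import re

def shallow_features(text):
    """Average sentence length plus the Flesch-Kincaid grade built on it.
    Naive splitting and syllable counting, purely to show the idea."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    asl = len(words) / len(sentences)  # the dominant shallow feature
    fk = 0.39 * asl + 11.8 * (syllables / len(words)) - 15.59
    return {"avg_sentence_length": asl, "flesch_kincaid_grade": fk}

print(shallow_features("The cat sat on the mat. It was warm."))
```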

My remarks

No wonder that the criteria that are simple to implement perform well. On the other hand, I cannot believe that the discourse features are of no use. More fine-grained features such as these require more accurate models, which in the end means more complex models…

“In general, our selected POS features appear to be more correlated to text complexity than syntactic features, shallow features and most discourse features.”

Alas, the POS-based features do not go into detail (I would rather call them basic POS features). The authors did not focus on this kind of feature, although the simple approach apparently captures relevant information.

“A judicious combination of features examined here results in a significant improvement over the state of the art.”

That leads to another problem: how is the combination to be balanced? In this study all the features seem to have been weighted equally, but in practice some metrics are always privileged, for instance as more discourse criteria or more word-level criteria are taken into account.

Reference

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.