I read an interesting article “featuring” an up-to-date comparison of what is being done in the field of readability assessment:
“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284.
I am interested in the features they use, so here is a quick summary and review:
Corpus and tools
- Corpus: a sample from the Weekly Reader
- OpenNLP to extract named entities and resolve co-references
- the Weka toolkit for machine learning
Features
- Four subsets of discourse features:
  1. entity-density features (see the sketch after this list)
  2. lexical-chain features (the chains rely on automatically detected semantic relations)
  3. co-reference inference features (a research novelty)
  4. entity-grid features (transition patterns based on the grammatical roles of words)
- Language Modeling Features, i.e. features derived from trained language models
- Parsed Syntactic Features, such as parse tree height
- POS-based Features
- Shallow Features, i.e. traditional readability metrics
- Other features, mainly the “perplexity features” of Schwarm and Ostendorf (2005); see the reference below
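To make the first subset (entity-density features) more concrete, here is a minimal Python sketch of what such features could look like. It assumes the named entities have already been extracted (the paper uses OpenNLP for that step); the input format and the exact feature set are my own illustration, not the authors' implementation.

```python
from statistics import mean

def entity_density_features(sentences):
    """Compute simple entity-density features for a document.

    `sentences` is a list of (tokens, entities) pairs, where `entities`
    is the list of named-entity mentions found in that sentence
    (e.g. extracted beforehand with OpenNLP's name finder).
    """
    n_sentences = len(sentences)
    n_tokens = sum(len(tokens) for tokens, _ in sentences)
    all_mentions = [e for _, entities in sentences for e in entities]

    return {
        # average number of entity mentions per sentence
        "mentions_per_sentence": len(all_mentions) / n_sentences,
        # average number of distinct entities per sentence
        "unique_entities_per_sentence": mean(len(set(ents)) for _, ents in sentences),
        # share of tokens that belong to an entity mention (rough proxy)
        "entity_token_ratio": sum(len(e.split()) for e in all_mentions) / n_tokens,
        # total number of distinct entities in the document
        "unique_entities_total": len(set(all_mentions)),
    }

# toy usage with hand-annotated entities
doc = [
    (["Barack", "Obama", "visited", "Berlin", "."], ["Barack Obama", "Berlin"]),
    (["He", "met", "Angela", "Merkel", "there", "."], ["Angela Merkel"]),
]
print(entity_density_features(doc))
```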
Results
- Combining discourse features does not significantly improve accuracy; discourse features do not seem to be useful.
- Language models trained using information gain outperform those trained on POS labels (as opposed to words and/or tags alone).
- Verb phrases appear to be more closely correlated with text complexity than other types of phrases.
- Noun-based features generate the highest classification accuracy.
- Average sentence length has dominant predictive power over all other shallow features (see the sketch after this list).
- The clause-related features did not perform well; the authors plan to work on them further.
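Since average sentence length turns out to be the strongest shallow feature, here is a minimal sketch of how such traditional readability metrics can be computed. The Flesch-Kincaid coefficients are the standard ones; the naive syllable counter and the pre-tokenized input format are my own simplifications, not the paper's code.

```python
import re

def count_syllables(word):
    """Very rough syllable heuristic: count groups of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def shallow_features(sentences):
    """`sentences` is a list of token lists (already sentence-split and tokenized)."""
    words = [tok for sent in sentences for tok in sent if tok.isalpha()]
    n_sents, n_words = len(sentences), len(words)
    n_syllables = sum(count_syllables(w) for w in words)

    return {
        # the feature the paper finds most predictive among shallow ones
        "avg_sentence_length": n_words / n_sents,
        # Flesch-Kincaid grade level, with its standard coefficients
        "flesch_kincaid_grade": 0.39 * (n_words / n_sents)
                                + 11.8 * (n_syllables / n_words) - 15.59,
    }

print(shallow_features([["The", "cat", "sat", "on", "the", "mat", "."],
                        ["It", "was", "quietly", "purring", "."]]))
```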
My remarks
It is no wonder that the features which are simple to implement perform well. On the other hand, I cannot believe that the discourse features are of no use. More fine-grained features like these require more accurate models, which in the end means more complex models…
“In general, our selected POS features appear to be more correlated to text complexity than syntactic features, shallow features and most discourse features.”
Alas, the POS-based features do not go into much detail (I would rather call them POS-basic features). The authors did not focus on this kind of feature, although even this simple approach (roughly the kind of coarse ratios sketched below) apparently captures relevant information.
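To illustrate what I mean by “POS-basic”, here is roughly the level of granularity such features operate at: simple counts and ratios over coarse POS classes. The Penn Treebank tags and the particular ratios are my own illustration, not the paper's feature list; the input is assumed to be tagged beforehand by any POS tagger.

```python
from collections import Counter

def pos_ratio_features(tagged_sentences):
    """`tagged_sentences` is a list of [(token, pos_tag), ...] lists,
    tagged beforehand with Penn Treebank tags."""
    n_sents = len(tagged_sentences)
    tags = [tag for sent in tagged_sentences for _, tag in sent]
    n_tokens = len(tags)
    # collapse fine-grained tags (NN, NNS, NNP, ...) into coarse classes
    coarse = Counter(tag[:2] for tag in tags)

    return {
        "nouns_per_sentence": coarse["NN"] / n_sents,
        "verbs_per_sentence": coarse["VB"] / n_sents,
        "noun_ratio": coarse["NN"] / n_tokens,
        "adjective_ratio": coarse["JJ"] / n_tokens,
    }

print(pos_ratio_features([[("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("barked", "VBD")]]))
```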
“A judicious combination of features examined here results in a significant improvement over the state of the art.”
That leads to another problem: how should the combination be balanced? In this study it seems that all the features were treated equally, but in practice some metrics are always privileged, for instance when more discourse-level or more word-level criteria are taken into account.
Reference
Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.