I read an interesting article “featuring” an up-to-date comparison of what is being done in the field of automatic readability assessment:
“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276–284.
I am interested in the features they use. Let me summarize them in a quick review:
Corpus and tools
- Corpus: a sample from the Weekly Reader, an educational magazine whose texts are graded by level
- OpenNLP to extract named entities and resolve co-references
- the Weka machine learning toolkit
Features
- Four subsets of discourse features (an entity-density sketch follows this list):
  1. entity-density features
  2. lexical-chain features (the chains are detected automatically, based on semantic relations between words)
  3. co-reference inference features (a research novelty)
  4. entity grid features (transition patterns according to the grammatical roles of the words)
- Language Modeling Features, i.e. they train language models and use the resulting scores as features (see the perplexity sketch below)
- Parsed Syntactic Features, such as parse tree height (see the tree-height sketch below)
- POS-based Features (see the POS-ratio sketch below)
- Shallow Features, i.e. traditional readability metrics such as Flesch-Kincaid (see the sketch below)
- Other features, mainly “perplexity features” following Schwarm and Ostendorf (2005); the perplexity sketch below covers these as well
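To make the entity-density idea concrete, here is a minimal sketch. It uses spaCy in place of the OpenNLP pipeline the authors actually used, and the feature names are my own shorthand, not the paper’s exact inventory.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density_features(text):
    """Entity-density features: named-entity mentions and unique
    entities, normalized by the number of sentences."""
    doc = nlp(text)
    n_sents = sum(1 for _ in doc.sents) or 1
    mentions = list(doc.ents)
    unique = {ent.text.lower() for ent in mentions}
    return {
        "mentions_per_sentence": len(mentions) / n_sents,
        "unique_entities_per_sentence": len(unique) / n_sents,
    }

print(entity_density_features("Barack Obama visited Paris. He liked Paris."))
```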
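The language modeling features, and the Schwarm and Ostendorf “perplexity features” in the last item, both boil down to scoring a text with a trained language model. Here is a self-contained sketch of a bigram model with add-one smoothing; the paper’s models are trained on graded corpora and are certainly more elaborate, so treat this only as an illustration of where a perplexity feature comes from.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Train an add-one-smoothed bigram LM from tokenized sentences;
    returns a log-probability function logprob(prev, word)."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len(unigrams)
    def logprob(prev, word):
        # Add-one smoothing: P(w|prev) = (c(prev,w) + 1) / (c(prev) + V)
        return math.log((bigrams[(prev, word)] + 1) /
                        (unigrams[prev] + vocab_size))
    return logprob

def perplexity(logprob, sentences):
    """Per-token perplexity of held-out sentences under the model;
    this number is what gets used as a readability feature."""
    total_logp, n_tokens = 0.0, 0
    for toks in sentences:
        toks = ["<s>"] + toks + ["</s>"]
        for prev, word in zip(toks, toks[1:]):
            total_logp += logprob(prev, word)
            n_tokens += 1
    return math.exp(-total_logp / n_tokens)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lm = train_bigram_lm(train)
print(perplexity(lm, [["the", "cat", "sat"]]))
```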
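Parse tree height is the most concrete of the syntactic features. A quick sketch with NLTK’s Tree class, using a hand-written parse where the paper would use a real parser’s output:

```python
from nltk import Tree

# A hand-written constituency parse; in practice it would come from a parser.
parse = Tree.fromstring(
    "(S (NP (DT the) (NN cat))"
    "   (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

# Tree.height() counts levels from root to leaves (a lone leaf has
# height 1), so it can serve directly as the tree-height feature.
print(parse.height())  # 6 for this tree
```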
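A POS-based feature can be as simple as the proportion of nouns or verbs among all tokens. Below is a sketch using NLTK’s default tagger; the paper’s POS feature set is richer than these two ratios, which are my own shorthand.

```python
# pip install nltk; also run nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") once.
from collections import Counter
import nltk

def pos_ratio_features(sentence):
    """Fraction of tokens tagged as nouns / verbs."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    counts = Counter(tag[:2] for tag in tags)  # collapse e.g. NNS -> NN
    n = len(tags) or 1
    return {"noun_ratio": counts["NN"] / n, "verb_ratio": counts["VB"] / n}

print(pos_ratio_features("The quick brown fox jumps over the lazy dog."))
```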
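The shallow features are the traditional formulas. The Flesch-Kincaid grade level is a representative one; the syllable counter below is a crude vowel-group approximation, good enough for illustration:

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(sentences):
    """Flesch-Kincaid grade level, a classic shallow feature:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    words = [w for s in sentences for w in s.split()]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

print(flesch_kincaid_grade(["The cat sat on the mat.", "It was happy."]))
```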
Findings
- Combining discourse features doesn’t significantly improve accuracy; discourse features do not seem to be very useful.
- Language models trained with information gain outperform those trained ...