Jean-Philippe Magué told me there was a Google advanced search filter that checks the result pages to give a readability estimate. In fact, it was introduced about seven months ago and, to my knowledge, works only for English (which is also why I hadn't noticed it).

Description

For more information, you can read the official help page. I also found two convincing blog posts showing how it works, one by the Unofficial Google System Blog and the other by Daniel M. Russell.

The most interesting bits of information I was able to find come from a brief explanation by a product manager at Google, who created the following topic on the help forum: New Feature: Filter your results by reading level.
Note that this never seems to have been a hot topic!

Apparently, it was designed as an “annotation” based on a statistical model developed using real-world data (i.e. pages that were “manually” classified by teachers). The engine works by comparing the words on a page against the model, as well as against articles found by Google Scholar.

In the original text:

“The feature is based primarily on statistical models we built with the help of teachers. We paid teachers to classify pages for different reading levels, and then took their classifications to build a statistical model. With this model, we can compare the words on any webpage with the words in the model to classify reading levels. We also use data from Google Scholar, since most of the articles in Scholar are advanced.”
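The approach described above — train on teacher-labelled pages, then compare the words of a new page against each level's word distribution — can be sketched as a simple multinomial Naive Bayes classifier over unigrams. This is only a toy illustration under assumed mechanics: Google's actual model and training corpus are not public, and the `LABELLED_PAGES` data below is invented for the example.

```python
import math
from collections import Counter

# Hypothetical stand-in for the teacher-labelled pages mentioned in the quote.
LABELLED_PAGES = {
    "basic":        ["the cat sat on the mat it is a big cat",
                     "we like to play in the park with a ball"],
    "intermediate": ["the committee reviewed the annual budget proposal",
                     "several factors influence regional climate patterns"],
    "advanced":     ["stochastic gradient descent converges under convexity assumptions",
                     "the phenomenological hermeneutics of intersubjective discourse"],
}

def train(pages_by_level):
    """Count word frequencies per reading level (one unigram model per level)."""
    models = {}
    for level, pages in pages_by_level.items():
        counts = Counter(w for page in pages for w in page.split())
        models[level] = (counts, sum(counts.values()))
    return models

def classify(text, models, alpha=1.0):
    """Return the level whose word distribution best matches `text`
    (multinomial Naive Bayes with add-alpha smoothing for unseen words)."""
    vocab = {w for counts, _ in models.values() for w in counts}
    scores = {}
    for level, (counts, total) in models.items():
        score = 0.0
        for w in text.split():
            score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        scores[level] = score
    return max(scores, key=scores.get)

models = train(LABELLED_PAGES)
print(classify("the cat and the ball play on the mat", models))
print(classify("convexity assumptions of stochastic descent", models))
```

Note how the smoothing term is what keeps unseen words (like slang absent from the training pages) from breaking the comparison; they simply contribute a uniform low probability to every level.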

Remarks

  • It seems to be a model of reading complexity based purely on words; it does not rely on readability formulas. By comparing the texts to be assessed against a (gold) standard, it aims to be robust.
  • This model assumes that one doesn’t tackle a simple topic using uncommon or difficult words, and that words alone are a sufficient criterion. This can lead to curious distortions; see the bloggers who started assessing the complexity of “Jesus” against profane words.
  • The Googlers consider scientific articles to lie far outside the linguistic norm. However, most of the time authors aim to write as clearly as possible, apart from technical vocabulary. Identifying such words can be difficult to balance.
  • What about slang? This dimension of language seems to be taken into account: a search for “yo momma” returns mostly results rated as basic, and the same goes for “in my hood”. This is interesting, since the results would probably differ if these words were unknown to the system.
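For contrast with the purely word-based approach, here is what a classical readability formula looks like — the kind of measure the post notes Google's model does not rely on. This sketch implements the standard Flesch Reading Ease formula with a deliberately crude syllable heuristic (counting vowel groups); real implementations use dictionaries or better phonetic rules.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per run of consecutive vowels, minimum 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores mean easier text (90+ very easy, below 30 very difficult)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat."))   # high score: easy
print(flesch_reading_ease(
    "Phenomenological hermeneutics presupposes intersubjective interpretation."
))                                                       # low score: difficult
```

The contrast is the point: such formulas see only sentence and word length, so slang like “yo momma” scores as trivially easy, while a word-based model can, at least in principle, place it relative to the vocabulary it was trained on.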

To conclude, the Simple English Wikipedia seems to be annotated as intermediate or advanced, even though that version is explicitly meant to be simple and, in my opinion, succeeds on the lexical as well as the syntactic and semantic levels. This makes me wonder whether the model is, for now, really efficient in terms of precision.