I recently read that Microsoft was planning to analyze several social networks in order to learn more about users, so that its search engine could deliver more appropriate results. See this article on geekwire.com: Microsoft idea: Analyze social networks posts to deduce mood, interests, education.
Among the variables under consideration, the ‘sophistication and education level’ of the posts is mentioned. This is highly interesting, because it implies a double readability assessment: one on the reader’s side (what the user writes) and one on the side of the search engine (the results it returns). More precisely, this could be framed as a classification task.
Here is an extract of a patent describing how this is supposed to work.
[0117] In addition to skewing the search results to the user’s inferred interests, the user-following engine 112 may further tailor the search results to a user’s comprehension level. For example, an intelligent processing module 156 may be directed to discerning the sophistication and education level of the posts of a user 102. Based on that inference, the customization engine may vary the sophistication level of the customized search result 510. The user-following engine 112 is able to make determinations about comprehension level several ways, including from a user’s posts and from a user’s stored profile. In one example, the user-following engine 112 may discern whether a user is a younger student or an adult professional. In such an example, the user-following engine may tailor the results so that the professional receives results reflecting a higher comprehension level than the results for the student. Any of a wide variety of differentiations may be made. In a further example, the user-following engine may discern a particular specialty of the user, e.g., the user is a marine biologist or an avid cyclist. In such embodiments, a query from a user related to his or her particular area of specialty may return a more sophisticated set of results than the same query from a user not in that area of specialty.
The main drawback I see in this approach is that the profile is determined from communication. First of all, people do not necessarily want to read texts that are as easy (or as difficult) as those they write. Secondly, people make progress in a language by reading words or expressions they do not already know; by filtering these out, Microsoft could prevent young students from developing their language skills. Lastly, communication is an adaptive process: a whole series of adaptations depends on the person or the group one is speaking to, and the ‘sophistication level’ varies accordingly, which is not necessarily correlated with an education level.
A general example would be that people usually try to be (or to seem) cool on Facebook, which involves using shorter sentences and colloquial terms. Another example would be a lack of time, which also results in shorter sentences and messages.
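To make this concrete, here is a minimal sketch in Python of the kind of surface measures a readability-style estimate might rely on. The patent extract does not say which features are actually used, so the two measures (words per sentence, characters per word), the function name `surface_features`, and both sample messages are assumptions made purely for illustration.

```python
import re


def surface_features(text):
    """Return two crude 'sophistication' proxies for a text.

    Hypothetical helper: mean sentence length in words and
    mean word length in characters. Not taken from the patent.
    """
    # Naive sentence split on runs of ., ! or ?; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenization: runs of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"words_per_sentence": 0.0, "chars_per_word": 0.0}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
    }


# Two invented messages that could plausibly come from the same person:
# a hurried status update and a sentence from a work report.
casual = "So tired. Gotta run! See u later?"
formal = ("The committee postponed its decision because several "
          "members requested additional documentation.")

print(surface_features(casual))
print(surface_features(formal))
```

Both measures come out much lower for the hurried message than for the report sentence, although nothing about the writer has changed. Any profile built on features of this kind would therefore capture the situation at least as much as the person.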
It seems that this strategy is based on the false assumption that you can judge a user’s linguistic abilities by starting from a result that is in fact a construct. In other words, it looks like an excessive valuation of performance over competence. There are many reasons why people may speak or write differently in different situations. That is what many sub-disciplines of linguistics are about, and that is what Microsoft is blatantly ignoring in this project. A reasonable explanation would be that the so-called levels are rough estimates and that the profiles are not fine-grained, i.e. that there are only a few of them.