Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer: the code below works, but I would not claim that it is the most efficient way to deal with this kind of data (as a matter of fact, all of these models score slightly below 80% accuracy on the Kaggle datasets). Moreover, the examples are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models

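# Logistic regression: binomial family with a logit link
titanic.glm <- glm(survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link = "logit"))
# Predicted survival probabilities for new data; rows with missing values are kept and yield NA
glm.pred <- predict(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")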
  • ‘cat’ actually prints the output
  • One might want to use the na.action argument to deal with incomplete data (as in the Titanic dataset): na.action = na.pass
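To try the two calls above without the Kaggle files, here is a minimal sketch on simulated data. The data frame, the coefficients used to simulate the outcome and the 0.5 cutoff are all assumptions made for illustration; only the column names follow the Titanic examples.

```r
# Toy stand-in for the Titanic data (simulated, NOT the real Kaggle file).
set.seed(42)
n <- 200
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 1, 70)),
  sibsp  = sample(0:3, n, replace = TRUE)
)
# Simulated outcome (arbitrary coefficients): women and upper classes survive more often.
lin <- 2 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- rbinom(n, 1, plogis(lin))

# New passengers, one of them with a missing age.
titanicnew <- data.frame(
  pclass = c(1, 3, 2),
  sex    = factor(c("female", "male", "male"), levels = levels(titanic$sex)),
  age    = c(30, NA, 45),
  sibsp  = c(0, 1, 0)
)

titanic.glm <- glm(survived ~ pclass + sex + age + sibsp,
                   data = titanic, family = binomial(link = "logit"))

# type = "response" returns probabilities; na.pass keeps the incomplete row
# and yields NA for it instead of dropping it.
glm.pred <- predict(titanic.glm, newdata = titanicnew,
                    na.action = na.pass, type = "response")

# Turn probabilities into 0/1 predictions (the 0.5 cutoff is an arbitrary choice).
glm.class <- ifelse(glm.pred > 0.5, 1, 0)
print(glm.pred)
print(glm.class)
```

The passenger with the missing age gets NA as a prediction, so one still has to decide how to fill such rows before submitting anything.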

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

titanic.gam <- gam (survived ~ pclass ...

Amazon’s readability statistics by example

I already mentioned Amazon’s text stats in a post where I tried to explain why they were far from being useful in every situation : A note on Amazon’s text readability stats, published last December.

I found an example which shows particularly well why you cannot rely on these statistics when it comes to getting a precise picture of a text’s readability. Here are the screenshots of text statistics describing two different books:

Comparison of two books on Amazon

The two books look quite similar, except for the length of the second one, which seems to contain significantly more words and sentences.

The first book (on the left) is Pippi Longstocking, by Astrid Lindgren, whereas the second is The Sound and The Fury, by William Faulkner… The writing styles could not be more different, yet the text statistics make them appear quite close to each other.

The criteria used by Amazon are too simplistic, even if they usually perform acceptably on all kinds of texts. The readability formulas that output the first series of results only take the length of words and sentences into account, and their scale is designed for the US school system. In ...


Word lists, word frequency and contextual diversity

How to build an efficient word list ? What are the limits of word frequency measures ? These issues are relevant to readability.

First, a word about the context : word lists are used to find difficulties and to try to improve the teaching material, whereas word frequency is used in psychological linguistics to measure cognitive processing. Thus, this topic deals with education science, psychological linguistics and corpus linguistics.
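To make the difference between the two measures concrete, here is a small sketch in R. It assumes that contextual diversity means the number of documents in which a word occurs at least once, which is my reading of the term; the three "documents" are invented for the example.

```r
# Toy corpus: three made-up "documents" (illustrative only).
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog chased the cat",
  d3 = "a dog is a dog"
)

# Lowercase and split on anything that is not a letter.
tokens <- lapply(tolower(docs), function(d) {
  words <- strsplit(d, "[^a-z]+")[[1]]
  words[words != ""]
})

# Word frequency: total number of occurrences over the whole corpus.
word.freq <- table(unlist(tokens))

# Contextual diversity (assumed definition): number of documents
# in which a word occurs at least once.
context.div <- table(unlist(lapply(tokens, unique)))

# "a" and "cat" are equally frequent, but "cat" is spread over more documents.
print(word.freq[c("a", "cat")])
print(context.div[c("a", "cat")])
```

Two words with the same raw frequency can thus differ in how widely they are spread over the corpus, which is one of the limits of a pure frequency count.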

Coxhead’s Academic Word List

The academic word list by Averil Coxhead is a good example of this approach. She finds that students are not generally familiar with academic vocabulary, giving the following examples: substitute, underlie, establish and inherent (p. 214). According to her, words of this kind are “supportive” but not “central” (these adjectives could be good examples as well).

She starts from principles of corpus linguistics and states that “a register such as academic texts encompasses a variety of subregisters”, so the corpus has to be balanced.

Coxhead’s methodology is interesting. As one can see, she has probably read the works of Douglas Biber or John Sinclair, to name just a few. (AWL stands for Academic Word List.)

« To establish whether the AWL maintains high coverage over academic texts other than those in ...


A note on Amazon’s text readability stats

Recently, Jean-Philippe Magué told me about the newly introduced text stats on Amazon. A good summary by Gabe Habash on the news blog of Publishers Weekly describes the prospects and the potential interest of this new software: Book Lies: Readability is Impossible to Measure. The stats seem to have been available since last summer. I decided to contribute to the discussion on Amazon’s text readability statistics: to what extent are they reliable and useful?


Gabe Habash compares several well-known books and concludes that sentence length is the determining factor in the readability measures used by Amazon. In fact, the readability formulas (Fog Index, Flesch Index and Flesch-Kincaid Index; for an explanation see Amazon’s text readability help) are centered on word length and sentence length, which is convenient but far from always appropriate.
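To show how little goes into these scores, here is a sketch of the three formulas in R, using the constants as they are commonly published. Whether Amazon uses exactly these variants is an assumption on my part, and the counts in the example are invented.

```r
# Readability scores from raw counts; the counts themselves are assumed given.
readability <- function(n.words, n.sentences, n.syllables, n.complex) {
  wps <- n.words / n.sentences      # average sentence length in words
  spw <- n.syllables / n.words      # average word length in syllables
  c(
    flesch         = 206.835 - 1.015 * wps - 84.6 * spw,     # Flesch Reading Ease
    flesch.kincaid = 0.39 * wps + 11.8 * spw - 15.59,        # Flesch-Kincaid grade level
    fog            = 0.4 * (wps + 100 * n.complex / n.words) # Gunning Fog Index
  )
}

# Hypothetical counts for a short text (invented for illustration).
scores <- readability(n.words = 100, n.sentences = 5,
                      n.syllables = 150, n.complex = 10)
print(round(scores, 2))
```

Nothing but word, sentence, syllable and complex-word counts enters the computation, so two texts with similar averages get similar scores whatever their style.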

There is another metric named ‘word complexity’, which Amazon defines as follows : ‘A word is considered “complex” if it has three or more syllables’ (source : complexity help). I wonder what happens in the case of proper nouns like (again…) Schwarzenegger. There are cases where the syllable recognition is not that easy for an algorithm that was programmed and tested to perform well on English words ...


Having fun and making money doing research

What do people look for ? A few years ago it would have been difficult to gather information at a large scale and grab it with a powerful, yet more or less objective tool. Nowadays a single company is able to know what you want, what you buy or what you just did. And sometimes it shares a little bit of the data.

So, the end of the year gives me the opportunity to try and discover changes in mentalities using the ready-to-use Google Trends. Just for fun…

How does research compare with other interests ?

First of all, research is no fun: it used to be searched for more often than money, at about the same level as work, but things have changed. It still outnumbers fun in the news, though.

A few trends regarding research: “Research is no fun”… (Source: Google, worldwide trends.)

People seem to look for money more often than a few years ago. It is the only term that has become more popular; even work merely remains stable.

A remark: I think the search volume is much bigger now than it was back in 2004, there are also more languages available, and probably more search terms (since the users may ...
