Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer : the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, there are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models

titanic.glm <- glm (survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link=logit))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
  • cat’ actually prints the output
  • One might want to use the na.action switch to be able to deal with incomplete data (as in the Titanic dataset) : na.action=na.pass

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

titanic.gam <- gam (survived ~ pclass ...
more ...

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (a software to do statistical computing), you have a text collection you would like to analyze or classify and some time to lose. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet in order to remember a set of useful tricks and packages I recently gathered and from which I thought they could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each letter from a to z (all lowercase). By the way, it goes as simple as that using Perl and regular expressions (provided you have a $text variable):

my @letters = ("a" .. "z");
foreach my $letter (@letters) {
    my $letter_count = () = $text =~ /$letter/gi;
    printf "%.3f", (($letter_count/length($text))*100);

First tests in R

After having started R (‘R’ command), one usually wants to import data. In this case, my file type is TSV (Tab-Separated Values) and the first row contains only describers (from ‘a’ to ‘z’), which comes at hand later. This is done using the read.table command.

alpha <- read.table("letters_frequency ...
more ...