I recently tried several modeling techniques in R and would like to share some of them, with a focus on generalized linear models.

Disclaimer: the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, they are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models

titanic.glm <- glm(survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link = "logit"))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
cat(glm.pred)
  • cat actually prints the output.
  • The na.action switch helps deal with incomplete data (as in the Titanic dataset): na.action = na.pass.
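
Since type = "response" returns probabilities, getting 0/1 labels requires a cutoff; a minimal sketch (the 0.5 threshold is an assumption, not a tuned value):

summary(titanic.glm)                        # inspect coefficients and significance
glm.labels <- ifelse(glm.pred > 0.5, 1, 0)  # probabilities -> class labels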

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

library(mgcv)
titanic.gam <- gam(survived ~ pclass + sex + age + sibsp, data = titanic, family = quasibinomial(link = "logit"), method = "GCV.Cp")
gam.pred <- predict.gam(titanic.gam, newdata = titanicnew, na.action = na.pass, type = "response")
gam.pred <- ifelse(gam.pred <= 0.5, 0, 1)
  • Quasibinomial, Poisson, Gamma, and Gaussian are usually usable alternative families with both packages.
  • The method argument (GCV.Cp here) selects the smoothing parameters by generalized cross-validation (or Mallows' Cp); others are available.
  • The ifelse threshold is not necessarily 0.5 (with type = "response" the predictions are probabilities, so 0.5 is the natural cutoff). Moreover, this kind of conversion is not always required.
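
Note that the gam call above contains only linear terms, so it behaves much like the glm fit; the additive part only kicks in once continuous covariates are wrapped in s(). A minimal sketch with a smooth of age (same data assumptions as above):

titanic.gam2 <- gam(survived ~ pclass + sex + s(age) + sibsp, data = titanic, family = quasibinomial(link = "logit"), method = "GCV.Cp")
plot(titanic.gam2)   # visualize the fitted smooth of age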

Link to mgcv package manual.

Robust statistics

library(robustbase)
titanic.glmrob <- glmrob(survived ~ pclass + sex + age + sibsp + parch, data = titanic, family = binomial)
  • Use predict.glmrob for prediction, as sketched below.
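
For completeness, prediction mirrors the glm case; a sketch, reusing titanicnew from above:

glmrob.pred <- predict(titanic.glmrob, newdata = titanicnew, type = "response")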

Link to robustbase package manual.

Partition trees

I mentioned partition trees in my last post about R.

library(rpart)
titanic.tree <- rpart(survived ~ pclass + sex + age + sibsp, data = titanic, method = "anova")
tree.pred <- predict(titanic.tree, newdata = titanicnew)
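
Since method = "anova" fits a regression tree, the predictions above are scores rather than class labels. For a 0/1 target, the classification variant may be the more natural choice; a sketch, coercing survived to a factor:

titanic.ctree <- rpart(factor(survived) ~ pclass + sex + age + sibsp, data = titanic, method = "class")
tree.labels <- predict(titanic.ctree, newdata = titanicnew, type = "class")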

Link to rpart package manual.

Support vector machines

There are many other SVM implementations available in R; e1071 wraps the libsvm library.

library(e1071)
titanic.svm <- svm(survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^-1, cost = 10^-1)
pred.svm <- predict(titanic.svm, newdata = titanicnew)
  • Optional: decision.values = TRUE in predict returns the decision values as an attribute.
  • tune.svm can search a parameter grid automatically (certainly not the most accurate grid), as shown below:

    tuned <- tune.svm(survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^(-5:5), cost = 10^(-2:2), kernel = "polynomial")
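
The tuning object then exposes the winning parameter combination and a refitted model; a short sketch of reading them back:

    tuned$best.parameters                 # gamma/cost pair with the lowest cross-validation error
    titanic.svm.best <- tuned$best.model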

Link to e1071 package manual.

Bagging (in this case bagging of a tree)

library(ipred)
titanic.bt <- bagging(survived ~ pclass + sex + age + sibsp, data = titanic, nbagg = 20, coob = TRUE)
bt.pred <- predict(titanic.bt, type = "class")
  • Sampling: nbagg bootstrap samples are drawn and a tree is constructed for each of them. (ipredbagg offers the same fit through a non-formula interface.)
  • coob: an out-of-bag estimate of the error rate is computed, as sketched below.
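
With coob = TRUE, the out-of-bag estimate ends up on the fitted object; a sketch of reading it back (err is, as far as I can tell, the component that print reports):

print(titanic.bt)    # summary including the out-of-bag misclassification estimate
titanic.bt$err       # the raw out-of-bag error rate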

Link to ipred package manual.

Random forests

library(randomForest)
titanic.rf <- randomForest(survived ~ pclass + sex + age + sibsp, data = titanic, importance = TRUE, na.action = NULL)
rf.pred <- predict(titanic.rf, newdata = titanicnew, type = "response")
  • The mtry (variables tried at each split) and ntree (number of trees) options can be useful here, as sketched below.
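
A sketch with those two options set explicitly, plus the importance plot that importance = TRUE enables (the ntree and mtry values here are arbitrary, not tuned):

titanic.rf2 <- randomForest(survived ~ pclass + sex + age + sibsp, data = titanic, ntree = 500, mtry = 2, importance = TRUE, na.action = NULL)
varImpPlot(titanic.rf2)   # variable importance from the fitted forest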

Link to randomForest package manual.