As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.
Disclaimer: the code lines below work, but I would not suggest that they are the best way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, they are not always the most efficient way to implement a given model.
I see it as a way to quickly test several frameworks without going into details.
The column names used in the examples are from the Titanic track on Kaggle.
Generalized linear models
titanic.glm <- glm(survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link = logit))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
cat(glm.pred)
- cat() prints the predicted probabilities to the console.
- The na.action argument helps with incomplete data (as in the Titanic dataset): with na.action = na.pass, rows with missing values are kept and receive an NA prediction.
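To make the effect of na.action = na.pass concrete, here is a minimal sketch in base R. It assumes a small synthetic data frame in place of the Kaggle files (only the column names come from this post; the survival mechanism is made up), and the 0.5 cut-off is just one possible choice.

```r
# Synthetic stand-in for the Kaggle training data (assumption: the real
# files are not loaded here; only the column names come from the post).
set.seed(42)
n <- 400
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
# Made-up survival mechanism, used only to generate a 0/1 response.
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- rbinom(n, 1, plogis(lp))

# Five rows to score, one of them with a missing age.
titanicnew <- titanic[1:5, c("pclass", "sex", "age", "sibsp")]
titanicnew$age[2] <- NA

titanic.glm <- glm(survived ~ pclass + sex + age + sibsp,
                   data = titanic, family = binomial(link = "logit"))

# na.pass keeps the incomplete row; its prediction is NA.
glm.pred <- predict(titanic.glm, newdata = titanicnew,
                    na.action = na.pass, type = "response")

# Turn probabilities into 0/1 labels (NA stays NA).
glm.class <- ifelse(glm.pred > 0.5, 1, 0)
print(data.frame(prob = glm.pred, class = glm.class))
```

The output keeps one row per passenger, which is convenient when a submission file must contain every id, but the NA still has to be filled in somehow (for instance by imputing age beforehand).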
Generalized additive models (mgcv: Mixed GAM Computation Vehicle)
The commands are a little less obvious:
library(mgcv)
titanic.gam <- gam(survived ~ pclass + sex + age + sibsp, data = titanic, family = quasibinomial(link = "logit"), method = "GCV.Cp")
gam.pred <- predict.gam(titanic.gam, newdata = titanicnew, na.action = na.pass, type = "response")
# type = "response" returns probabilities, so cut at 0.5 (a cut at 0 would label every passenger 1)
gam.pred <- ifelse(gam.pred <= 0.5, 0, 1)
- The quasibinomial, poisson, Gamma and gaussian families are usually usable alternatives with both glm and gam.
- method = "GCV.Cp" estimates the smoothing parameters by generalized cross-validation (or Mallows' Cp when the scale parameter is known). Others, such as "REML", are available.
- The ifelse threshold is not necessarily 0.5; on the link scale (type = "link") the equivalent cut is 0. Moreover, this kind of conversion is not always required.
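The link between the threshold and the prediction scale can be checked directly. The sketch below assumes synthetic data again (made-up survival mechanism, column names from this post) and, as my own additions, a smooth term s(age) and method = "REML"; neither is part of the call above.

```r
library(mgcv)

# Synthetic stand-in for the Kaggle data (assumption, not the real file).
set.seed(1)
n <- 400
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- rbinom(n, 1, plogis(lp))

# s(age) lets the effect of age be a smooth curve instead of a straight line.
titanic.gam <- gam(survived ~ pclass + sex + s(age) + sibsp,
                   data = titanic, family = binomial(link = "logit"),
                   method = "REML")

# Same model, two prediction scales.
p.resp <- predict(titanic.gam, newdata = titanic, type = "response")  # probabilities
p.link <- predict(titanic.gam, newdata = titanic, type = "link")      # log-odds

# A cut at 0.5 on the probability scale matches a cut at 0 on the logit scale.
class.resp <- ifelse(p.resp > 0.5, 1, 0)
class.link <- ifelse(p.link > 0, 1, 0)
print(table(class.resp, class.link))
```

Only the diagonal of the table is populated, which is why a threshold of 0 only makes sense for type = "link".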
Robust generalized linear models
library(robustbase)
titanic.glmrob <- glmrob(survived ~ pclass + sex + age + sibsp + parch, data = titanic, family = binomial)
- Use predict() for prediction; it dispatches to predict.glmrob.
Decision trees
I covered this part in my last post about R.
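library(rpart)
# method = "anova" fits a regression tree on a numeric survived, so the predictions are scores between 0 and 1 rather than class labels
titanic.tree <- rpart(survived ~ pclass + sex + age + sibsp, data = titanic, method = "anova")
tree.pred <- predict(titanic.tree, newdata = titanicnew)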
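The call above uses method = "anova". If class labels are wanted directly, one alternative is a classification tree. The following is a minimal sketch under the assumption of synthetic data (made-up survival mechanism) with survived stored as a factor; the "no"/"yes" labels are my own choice.

```r
library(rpart)

# Synthetic stand-in for the Kaggle data (assumption, not the real file).
set.seed(3)
n <- 400
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- factor(rbinom(n, 1, plogis(lp)),
                           levels = c(0, 1), labels = c("no", "yes"))

# method = "class" grows a classification tree on the factor response.
titanic.tree <- rpart(survived ~ pclass + sex + age + sibsp,
                      data = titanic, method = "class")

tree.class <- predict(titanic.tree, newdata = titanic, type = "class")  # labels
tree.prob  <- predict(titanic.tree, newdata = titanic, type = "prob")   # class probabilities

# Confusion table on the training rows (optimistic, shown only as a check).
print(table(observed = titanic$survived, predicted = tree.class))
```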
Support vector machines
There are many other implementations available.
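library(e1071)
titanic.svm <- svm(formula = survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^-1, cost = 10^-1)
# predict() expects newdata; an argument named data is ignored and the fitted values for the training rows are returned instead
pred.svm <- predict(titanic.svm, newdata = titanicnew)
- Optional: pass decision.values = TRUE to predict() to also get the decision values.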
A call that tunes gamma and cost automatically by grid search (certainly not the most accurate):
tuned <- tune.svm(survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^(-5:5), cost = 10^(-2:2), kernel = "polynomial")
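Once the grid search has run, the returned object can be inspected and reused. The sketch below assumes synthetic data (made-up survival mechanism), a factor response, and a much smaller grid than the one above so that it finishes quickly; the default radial kernel is used instead of the polynomial one.

```r
library(e1071)

# Synthetic stand-in for the Kaggle data (assumption, not the real file).
set.seed(5)
n <- 200
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- factor(rbinom(n, 1, plogis(lp)))
titanicnew <- titanic[1:10, c("pclass", "sex", "age", "sibsp")]

# Small grid: 3 values of gamma x 2 values of cost, cross-validated.
tuned <- tune.svm(survived ~ pclass + sex + age + sibsp, data = titanic,
                  gamma = 10^(-2:0), cost = 10^(0:1))

print(tuned$best.parameters)   # gamma and cost with the lowest CV error
print(tuned$best.performance)  # the corresponding cross-validated error

# The model refitted with the best parameters can be used directly.
best.svm <- tuned$best.model
svm.pred <- predict(best.svm, newdata = titanicnew)
print(svm.pred)
```

The grid grows multiplicatively with every tuned parameter, so the 11 x 5 grid above costs considerably more than this 3 x 2 one.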
Bagging (in this case bagging of a tree)
library(ipred)
titanic.bt <- bagging(survived ~ pclass + sex + age + sibsp, data = titanic, nbagg = 20, coob = TRUE)
# type = "class" returns class labels when survived is a factor
bt.pred <- predict(titanic.bt, newdata = titanicnew, type = "class")
- Sampling: nbagg bootstrap samples are drawn and a tree is constructed for each of them.
- coob: when TRUE, an out-of-bag estimate of the error rate is computed.
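The two points above can be sketched by hand with rpart. This is an illustration of the idea under stated assumptions (synthetic data, made-up survival mechanism, majority voting), not ipred's actual implementation.

```r
library(rpart)

# Synthetic stand-in for the Kaggle data (assumption, not the real file).
set.seed(7)
n <- 400
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- factor(rbinom(n, 1, plogis(lp)),
                           levels = c(0, 1), labels = c("no", "yes"))

nbagg <- 20
# One row per passenger, one column per class: out-of-bag vote counts.
votes <- matrix(0, nrow = n, ncol = 2,
                dimnames = list(NULL, levels(titanic$survived)))

for (b in seq_len(nbagg)) {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample
  oob <- setdiff(seq_len(n), idx)       # rows left out of this sample
  tree <- rpart(survived ~ pclass + sex + age + sibsp,
                data = titanic[idx, ], method = "class")
  pred <- predict(tree, newdata = titanic[oob, ], type = "class")
  # Each tree votes only for the rows it did not see during training.
  cell <- cbind(oob, as.integer(pred))
  votes[cell] <- votes[cell] + 1
}

# Majority vote over the trees for which a row was out of bag.
has.vote  <- rowSums(votes) > 0
oob.class <- levels(titanic$survived)[max.col(votes[has.vote, , drop = FALSE],
                                              ties.method = "first")]
oob.error <- mean(oob.class != as.character(titanic$survived[has.vote]))
print(oob.error)
```

Because every row is scored only by trees that never saw it, the out-of-bag error behaves like a validation error without needing a separate hold-out set.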
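Random forests
library(randomForest)
# na.action = na.omit drops rows with missing values (such as age) before fitting
titanic.rf <- randomForest(survived ~ pclass + sex + age + sibsp, data = titanic, importance = TRUE, na.action = na.omit)
# predict() expects newdata; with data = ... the argument is ignored and new rows are not scored
rf.pred <- predict(titanic.rf, newdata = titanicnew, type = "response")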
- The mtry and ntree options can be useful here.
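As a rough illustration of those two options, the sketch below sets them explicitly and reads back the out-of-bag error and the variable importance. The data are synthetic (made-up survival mechanism), and the values ntree = 200 and mtry = 2 are arbitrary choices for the example, not recommendations.

```r
library(randomForest)

# Synthetic stand-in for the Kaggle data (assumption, not the real file).
set.seed(11)
n <- 400
titanic <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = runif(n, 1, 70),
  sibsp  = sample(0:3, n, replace = TRUE)
)
lp <- 2.5 - 0.8 * titanic$pclass - 2 * (titanic$sex == "male") - 0.02 * titanic$age
titanic$survived <- factor(rbinom(n, 1, plogis(lp)))

# ntree: number of trees grown; mtry: predictors tried at each split.
titanic.rf <- randomForest(survived ~ pclass + sex + age + sibsp,
                           data = titanic, ntree = 200, mtry = 2,
                           importance = TRUE)

# Out-of-bag error after the last tree.
oob.err <- titanic.rf$err.rate[titanic.rf$ntree, "OOB"]
print(oob.err)

# One row per predictor; larger values mean the predictor mattered more.
rf.imp <- importance(titanic.rf)
print(rf.imp)
```

Plotting titanic.rf$err.rate[, "OOB"] against the tree index is a quick way to see whether ntree is large enough for the error to have levelled off.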