--- title: "Lecture 10 - Tree Classification" author: "Joseph Haymaker" date: "11/8/2017" output: html_document --- # Tree based methods ### Part II: RandomForest for classifications ##### 1) Single Trees ##### 2) Random Forests (Notice, RF for classification is similar to that of regression RF's. May skip this and please go through the code for 1) and 2)) ##### 3) Appendix 1: Deviance and Gini index in details ##### 4) Appendix 2. Random classifiers Load all packages ```{r} if (!require("pacman")) install.packages("pacman") pacman::p_load(randomForest, tree, rpart, pROC, partykit) #install.packages('rattle') #install.packages('rpart.plot') library(randomForest) library(tree) library(rpart) library(rpart.plot) library(pROC) library(partykit) # library(rattle) #library(rpart.plot) ``` ### Part II: RandomForeset for classifications ### Random Forest for classification is similar to that of regression trees except for the criterion used for growing a tree is different. Algorithm: ``` -For b=1 to B I) Take a bootstrap sample of size n II) Build a tree T_b, using the bootstrap sample recursively until the n_min is met i) Randomly select m variables ii) For each variable, find the best split point such that the misclassification errors is minimized by majority vote iii) Find the best variable and split point, split the node into two iv) The end node will output the majority vote either 0 or 1 -Output the trees {T_1, T_2, ... T_B} -The final aggregated tree will report the prop. of 1's among B trees ``` ##### Remarks on the splitting criterion. + Suppose $(n_1, p_1)$ and $(n_2, p_2)$ are the number of obs'n and sample prop of "1"'s on the left and right split point Three criteria are commonly used i) Mis classification errors for majority vote (RF uses) $ (n_1* min(p_1, 1-p_1) + n_2* min(p_2, 1-p_2))/n $ (MisClassification errors are $min(p_1, 1-p_1)$ in the left region and $min(p_2, 1-p_2)$ on the right side) ii) Deviance: $ -2loglik =-2*(n1*p1*log(p1)+ n1*(1-p1)*log(1-p1) + n2*p2*log(p2)+ n2*(1-p2)*log(1-p2)) $ iii) Gini index $2(p_1 (1-p_1) + p_2( 1-p_2))$ (we skip this) _Gini index_: misclassification errors for a random procedure that assign hat y="1" with prob. p_1 #### Framingham data for classification trees for simplicity ```{r} rm(list=ls()) # Remove all the existing variables data <- read.table("Framingham.dat", sep=",", header=T, as.is=T) data1 <- na.omit(data) # no missing value names(data1)[1] <- "HD" data1$HD <- as.factor(data1$HD) data1$SEX <- as.factor(data1$SEX) table(as.factor(data1$CIG)) # The levels are very unbalanced. We combine those CIG >35 together and # make CIG a categorical variable data1$CIG[data1$CIG > 35] <- 40 data1$CIG <- as.factor(data1$CIG) str(data1) ``` ### 1): A single tree using package tree *Note this package is different from RandomForest ##### a) Use SBP alone to see the effect of two different criteria: deviance and gini ```{r} fit.dev <- tree(HD~SBP, data1, control=tree.control(nrow(data1), mindev = 0.01), split="deviance") #split = c("deviance", "gini"), "deviance" as default. 
#### Framingham data for classification trees

```{r}
rm(list=ls())   # remove all existing variables
data <- read.table("Framingham.dat", sep=",", header=T, as.is=T)
data1 <- na.omit(data)   # no missing values
names(data1)[1] <- "HD"
data1$HD <- as.factor(data1$HD)
data1$SEX <- as.factor(data1$SEX)
table(as.factor(data1$CIG))
# The levels are very unbalanced. We combine those with CIG > 35 and
# make CIG a categorical variable
data1$CIG[data1$CIG > 35] <- 40
data1$CIG <- as.factor(data1$CIG)
str(data1)
```

### 1): A single tree using package `tree`

_Note: this package is different from `randomForest`._

##### a) Use SBP alone to see the effect of two different criteria: deviance and gini

```{r}
fit.dev <- tree(HD~SBP, data1, control=tree.control(nrow(data1), mindev = 0.01), split="deviance")
# split = c("deviance", "gini"); "deviance" is the default
plot(fit.dev)
text(fit.dev, pretty=TRUE)   # plot the labels
fit.dev$frame
```

```
     var    n       dev yval splits.cutleft splits.cutright   yprob.0   yprob.1
1    SBP 1393 1469.3313    0         <154.5          >154.5 0.7796123 0.2203877
2 <leaf>  923  829.0482    0                                0.8342362 0.1657638
3 <leaf>  470  594.5583    0                                0.6723404 0.3276596
```

##### b) Use all predictors

```{r}
fit.tree <- tree(HD~., data1)
plot(fit.tree)
text(fit.tree, pretty=TRUE)   # plot the labels
fit.tree$frame
# yprob.1 = sample proportion of 1's in the terminal node
# yval is the majority-vote class
# we can set our own rules by thresholding yprob
```

__Use ROC/AUC etc. to measure the performance__

```{r}
predict(fit.tree, data1)[1:20, ]
prob.1 <- predict(fit.tree, data1)[, 2]   # hat P(Y=1|x)
roc(data1$HD, prob.1, plot=T)   # note there are only three distinct values of hat P(Y=1|x)
```

##### c) A single tree with categorical predictors

To convince you that trees handle categorical variables, grow a tree with only `CIG` + `SBP`. (Obviously, from b), `CIG` didn't get in.)

```{r}
fit.tree <- tree(HD~CIG+SBP, data1, control=tree.control(nrow(data1), mindev = 0.005))
plot(fit.tree)
text(fit.tree, pretty=TRUE)
fit.tree$frame
unique(as.factor(round(predict(fit.tree)[, 2], 5)))   # check that it agrees with fit.tree$frame
```

##### d) Package rpart

Using package `rpart`, together with `rattle` or `partykit`, we can get better tree displays.

```{r}
fit.tree <- rpart(HD~., data1, minsplit=1, cp=9e-3)
fit.tree <- rpart(HD~CIG+SBP, data1, minsplit=1, cp=9e-3)
# cp plays a role similar to mindev in tree()
# control=rpart.control(cp=-1, minbucket=1), method="class")
plot(as.party(fit.tree), main="Final Tree with Rpart")   # method 1
# fancyRpartPlot(fit.tree)   # method 2 (rattle): shows the splits together with more information
fit.tree$frame
```

### 2): Random Forests

##### 1) First run only a few trees

```{r}
set.seed(1)
fit.rf <- randomForest(HD~., data1, mtry=4, ntree=4)
```

Misclassification rates: the OOB error together with the class-specific errors for "0" and "1".

```{r}
plot(fit.rf)
```

The output:

```{r}
names(fit.rf)
fit.rf$mtry
fit.rf$votes[1:20, ]     # prob of 0 and 1, based on the OOB samples
fit.rf$predicted[1:20]   # labels from the OOB majority vote; some are NA because those observations appear in every bootstrap sample, i.e. they are never out of bag
fit.rf$err.rate          # misclassification errors: OOB / class 0 / class 1
predict(fit.rf, data1)[1:20]   # predictions from the RF applied to all the training data
data.frame(fit.rf$votes[1:20, ], fit.rf$predicted[1:20], predict(fit.rf, data1)[1:20])
```

##### 2) RandomForest, more general

```{r}
fit.rf <- randomForest(HD~., data1, mtry=4, ntree=500)
# fit.rf <- randomForest(HD~., data1, cutoff=c(.5, .5), ntree=500)
```

__Default settings__:

+ `mtry` = sqrt(p) for classification (p/3 for regression)
+ Number of bootstrap samples (trees): B = `ntree` = 500

```{r}
plot(fit.rf)   # three curves: the MCE of the 1's, of the 0's, and the overall OOB error
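# Added sketch (not in the original code): plot.randomForest() draws the
# three error curves but does not label them.  Assuming it uses matplot()'s
# default colours/line types, a legend can be added as follows:
legend("topright", legend = colnames(fit.rf$err.rate), col = 1:3, lty = 1:3, cex = 0.8)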
fit.rf.pred <- predict(fit.rf, type="prob")        # prob of "0" and "1"
fit.rf.pred.y <- predict(fit.rf, type="response")  # class labels
mean(data1$HD != fit.rf.pred.y)   # training misclassification error is about .23
```

##### 3) Using training and testing data

```{r}
n <- nrow(data1)
n1 <- floor((2/3)*n)
train.index <- sample(n, n1, replace=FALSE)
length(train.index)
data.train <- data1[train.index, ]
data.test <- data1[-train.index, ]
fit.rf.train <- randomForest(HD~., data.train)
plot(fit.rf.train)
predict.rf.y <- predict(fit.rf.train, newdata=data.test)              # labels
predict.rf <- predict(fit.rf.train, newdata=data.test, type="prob")   # probabilities
```

##### Testing errors

```{r}
mean(data.test$HD != predict.rf.y)   # about .24
```

##### Testing ROC curve

```{r}
roc(data.test$HD, predict.rf[,2], plot=TRUE)
```

__Comments about RF classification__

__Pros:__

+ pretty accurate; fast; robust to the number of variables;
+ not very sensitive to outliers;
+ handles a multi-category response automatically.

__Cons:__

+ we lose interpretability; it may discriminate against categorical predictors.

Be extremely cautious about:

+ the importance plot;
+ the validity of using a cutoff point other than 1/2, though we can still set our own thresholds.

__We will see how other competitors work.__

## Appendix 1: Deviance and Gini index in detail

Use `SBP` alone to see the effect of two different criteria: _deviance_ and _gini_.

```{r}
fit.dev <- tree(HD~SBP, data1, split="deviance")   # split = c("deviance", "gini"); "deviance" is the default
plot(fit.dev)
text(fit.dev, pretty=TRUE)   # plot the labels
fit.dev$frame
```

```
     var    n       dev yval splits.cutleft splits.cutright   yprob.0   yprob.1
1    SBP 1393 1469.3313    0         <154.5          >154.5 0.7796123 0.2203877
2 <leaf>  923  829.0482    0                                0.8342362 0.1657638
3 <leaf>  470  594.5583    0                                0.6723404 0.3276596
```

```{r}
summary(fit.dev)
```

Residual mean deviance: 1.023 = 1424 / 1391.

Using the output above, we recover the residual mean deviance as $-2\log\text{lik}/(n - \text{number of terminal nodes})$:

```{r}
p1 <- 0.1657638   # sample prop of Y=1 when SBP < 154.5
n1 <- 923
n1*p1       # number of Y=1 when SBP < 154.5
n1*(1-p1)   # number of Y=0 when SBP < 154.5
p2 <- 0.3276596   # sample prop of Y=1 when SBP > 154.5
n2 <- 470
n2*p2       # number of Y=1 when SBP > 154.5
n2*(1-p2)   # number of Y=0 when SBP > 154.5
dev <- -2*(n1*p1*log(p1) + n1*(1-p1)*log(1-p1) + n2*p2*log(p2) + n2*(1-p2)*log(1-p2))
dev/(1393-2)   # [1] 1.023441, agrees with the summary output
```

```{r}
fit.gini <- tree(HD~SBP, data1, split="gini")   # split = c("deviance", "gini")
plot(fit.gini)
text(fit.gini, pretty=TRUE)   # plot the labels
head(fit.gini$frame, 30)
summary(fit.gini)
```

We notice that the two trees are different.

## Appendix 2: Random classifiers

A random classifier works as follows:

+ assign $\hat Y = 1$ with probability $P(Y=1)$;
+ similarly, assign $\hat Y = 0$ with probability $P(Y=0)$.

The misclassification error of this procedure is the same as the Gini index.
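To see why, consider a single node with $p = P(Y=1)$. The random classifier draws its label independently of the true $Y$, so

$$P(\hat Y \neq Y) = P(Y=1)\,P(\hat Y=0) + P(Y=0)\,P(\hat Y=1) = p(1-p) + (1-p)\,p = 2p(1-p),$$

which is exactly the Gini index of that node.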
We use the following tree to demonstrate how to produce such a random procedure and to compare the __misclassification errors (MCE)__ of the majority vote and the random procedure.

##### We use `fit.gini` to construct two sets of classifiers.

```{r}
fit.gini <- tree(HD~SBP, data1, split="gini")   # split = c("deviance", "gini")
phat <- predict(fit.gini, data1)   # estimated probabilities of "0" and "1"
```

__1) Majority vote rule:__

```{r}
y.5 <- ifelse(phat[, 2] > 0.5, "1", "0")   # classify as "1" when hat P(Y=1|x) > 1/2
mean(data1$HD != y.5)   # training MCE of the majority-vote rule
```

__2) Random procedure: assign $\hat y = 1$ with probability `phat`__

```{r}
n <- nrow(data1)
y.rand <- rep(0, n)
for (i in 1:n) {
  y.rand[i] <- rbinom(1, 1, phat[i, 2])
}
mean(data1$HD != y.rand)   # misclassification error
# [1] 0.3201723   # it's about .3
```

## Appendix 3: Tree split in action (a toy example)

```{r}
library(MASS)
data(cats)
help(cats)
plot(cats[2:3], pch = 21, bg = c("red", "green3")[unclass(cats$Sex)])
cats_model <- tree(Sex~., data = cats, control=tree.control(nrow(cats), mindev = 0.05))
plot(cats_model)
text(cats_model)
x1 <- seq(min(cats$Bwt), max(cats$Bwt), length = 50)
x2 <- seq(min(cats$Hwt), max(cats$Hwt), length = 50)
Feature_x1_to_x2 <- expand.grid(Bwt = x1, Hwt = x2)
Feature_x1_to_x2_Class <- apply(predict(cats_model, Feature_x1_to_x2), 1,
                                function(one_row) which(one_row == max(one_row)))
plot(cats[2:3], pch = 21, bg = c("red", "green3")[unclass(cats$Sex)])
contour(x1, x2, matrix(Feature_x1_to_x2_Class, length(x1)), add = TRUE,
        levels = 1.5, drawlabels = FALSE)   # decision boundary
```

Change `mindev` to grow a much deeper tree:

```{r}
cats_model <- tree(Sex~., data = cats, control=tree.control(nrow(cats), mindev = 0.001))
plot(cats_model)
text(cats_model)
Feature_x1_to_x2 <- expand.grid(Bwt = x1, Hwt = x2)
Feature_x1_to_x2_Class <- apply(predict(cats_model, Feature_x1_to_x2), 1,
                                function(one_row) which(one_row == max(one_row)))
plot(cats[2:3], pch = 21, bg = c("red", "green3")[unclass(cats$Sex)])
contour(x1, x2, matrix(Feature_x1_to_x2_Class, length(x1)), add = TRUE,
        levels = 1.5, drawlabels = FALSE)   # decision boundary
```
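As a quick follow-up, here is a minimal added sketch (the object names `cats_shallow` and `cats_deep` are introduced only for this comparison) contrasting the in-sample misclassification errors of the two trees. The `mindev = 0.001` tree fits the training points better, but its jagged boundary above suggests over-fitting; a fair comparison would use a hold-out set or cross-validation.

```{r}
# Minimal added sketch: refit the two trees from above under new names and
# compare their training misclassification errors.
cats_shallow <- tree(Sex ~ ., data = cats,
                     control = tree.control(nrow(cats), mindev = 0.05))
cats_deep    <- tree(Sex ~ ., data = cats,
                     control = tree.control(nrow(cats), mindev = 0.001))
mean(predict(cats_shallow, cats, type = "class") != cats$Sex)   # shallow tree, training MCE
mean(predict(cats_deep,    cats, type = "class") != cats$Sex)   # deep tree, training MCE
```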