F Solutions ch. 9 - Decision trees and random forests
Solutions to the exercises of chapter 9.
F.1 Exercise 1
Load the necessary packages
readr to read in the data
dplyr to process data
party and rpart for the classification tree algorithms
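The loading step for the packages listed above might look like the following. This is a minimal sketch that assumes all four packages are already installed; the order of the calls is a guess based on the startup messages shown in the output.

```r
# Load the packages listed above; each must already be installed.
library(readr)  # read_csv() for reading in the data
library(dplyr)  # select(), mutate() and the %>% pipe
library(party)  # ctree(): conditional inference trees
library(rpart)  # rpart(): recursive partitioning
```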
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
Select features that may explain survival
Each row in the data is a passenger. Columns are features:
survived: 0 if died, 1 if survived
embarked: Port of Embarkation (Cherbourg, Queenstown, Southampton)
sex: Gender
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
fare: Fare Paid
Categorical features should be made into factors
titanic3 <- "https://goo.gl/At238b" %>%
  read_csv() %>%  # read in the data
  select(survived, embarked, sex,
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))
## Parsed with column specification:
## cols(
## pclass = col_character(),
## survived = col_double(),
## name = col_character(),
## sex = col_character(),
## age = col_double(),
## sibsp = col_double(),
## parch = col_double(),
## ticket = col_character(),
## fare = col_double(),
## cabin = col_character(),
## embarked = col_character(),
## boat = col_character(),
## body = col_double(),
## home.dest = col_character()
## )
Split data into training and test sets
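One way to do the split is sketched below. The code for this step is not shown in the text, so everything here is an assumption: the data frame `toy` is a hypothetical stand-in for `titanic3`, the 70/30 ratio and the seed are arbitrary, and the result is named `.data` only so that it matches the `.data$test` used in the ROC code further down.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Hypothetical stand-in for titanic3 with a few of its columns
toy <- data.frame(
  survived = rbinom(300, 1, 0.4),
  sex      = factor(sample(c("female", "male"), 300, replace = TRUE)),
  fare     = runif(300, 5, 100)
)

# Draw a "training"/"test" label for every row (about 70/30, an assumed ratio)
labels <- sample(c("training", "test"), nrow(toy),
                 replace = TRUE, prob = c(0.7, 0.3))

# split() returns a named list of data frames: .data$training and .data$test
.data <- split(toy, labels)
sapply(.data, nrow)
```

Because the labels are sampled independently per row, the split sizes are only approximately 70/30 and vary with the seed.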
Recursive partitioning is implemented in the “rpart” package
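A fit with rpart could be sketched as follows. The data are synthetic (a hypothetical `toy` data frame in which survival depends on `sex`), not the Titanic data, and the formula `survived ~ .` is an assumption about how the model was specified. Since `survived` is numeric rather than a factor, `rpart()` falls back to a regression ("anova") tree, so the predictions are numeric scores rather than class labels.

```r
library(rpart)

set.seed(2)  # assumed seed
n <- 400
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
toy <- data.frame(
  # synthetic outcome: survival made more likely for "female" rows
  survived = rbinom(n, 1, ifelse(sex == "female", 0.75, 0.2)),
  sex      = sex,
  fare     = runif(n, 5, 100)
)
.data <- split(toy, sample(c("training", "test"), n, replace = TRUE))

# survived is numeric, so rpart() defaults to a regression ("anova") tree
tree_fit <- rpart(survived ~ ., data = .data$training)

# Each prediction is the mean of survived in the matching leaf, a score in [0, 1]
tree_pred <- predict(tree_fit, newdata = .data$test)
```

Converting `survived` to a factor would instead give a classification tree, whose `predict()` output is a matrix of class probabilities by default.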
Conditional partitioning (conditional inference trees) is implemented in the ctree() function of the “party” package
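A conditional inference tree can be fitted in much the same way. The sketch below again uses synthetic data (the hypothetical `toy` data frame) and an assumed formula. The result of `predict()` is passed through `as.numeric()` so that the scores are a plain vector whatever shape `predict()` returns.

```r
library(party)

set.seed(3)  # assumed seed
n <- 400
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
toy <- data.frame(
  # synthetic outcome: survival made more likely for "female" rows
  survived = rbinom(n, 1, ifelse(sex == "female", 0.75, 0.2)),
  sex      = sex,
  fare     = runif(n, 5, 100)
)
.data <- split(toy, sample(c("training", "test"), n, replace = TRUE))

# ctree() chooses splits with permutation-based significance tests
ctree_fit <- ctree(survived ~ ., data = .data$training)

# Flatten the predictions to a numeric vector of scores for the test set
ctree_pred <- as.numeric(predict(ctree_fit, newdata = .data$test))
```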
Use the ROCR package to visualize the ROC curve and compare methods
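library(ROCR)  # provides prediction() and performance()
tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%    # scores for the test set
  prediction(.data$test$survived) %>%  # pair scores with the true labels
  performance("tpr", "fpr")            # true vs. false positive rate
plot(tree_roc)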
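To compare the two methods, the ROC curves can be drawn on the same axes and summarized by the area under the curve (AUC). The following is a sketch under stated assumptions: the data are synthetic, `as_pred` is a hypothetical helper name, and the colors and legend are illustrative choices.

```r
library(rpart)
library(party)
library(ROCR)

set.seed(4)  # assumed seed
n <- 400
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
toy <- data.frame(
  # synthetic outcome: survival made more likely for "female" rows
  survived = rbinom(n, 1, ifelse(sex == "female", 0.75, 0.2)),
  sex      = sex,
  fare     = runif(n, 5, 100)
)
.data <- split(toy, sample(c("training", "test"), n, replace = TRUE))

# Fit both tree types on the training part
tree_fit  <- rpart(survived ~ ., data = .data$training)
ctree_fit <- ctree(survived ~ ., data = .data$training)

# Hypothetical helper: wrap test-set scores in a ROCR prediction object
as_pred <- function(fit) {
  scores <- as.numeric(predict(fit, newdata = .data$test))
  prediction(scores, .data$test$survived)
}
tree_pred  <- as_pred(tree_fit)
ctree_pred <- as_pred(ctree_fit)

# Overlay the two ROC curves on one plot
plot(performance(tree_pred, "tpr", "fpr"), col = "blue")
plot(performance(ctree_pred, "tpr", "fpr"), col = "red", add = TRUE)
legend("bottomright", legend = c("rpart", "ctree"),
       col = c("blue", "red"), lty = 1)

# Area under each curve as a single-number summary
auc <- c(
  rpart = performance(tree_pred, "auc")@y.values[[1]],
  ctree = performance(ctree_pred, "auc")@y.values[[1]]
)
print(auc)
```

An AUC near 0.5 means the scores rank survivors no better than chance, while values closer to 1 indicate better separation.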
Acknowledgement: the code for this exercise is from http://bit.ly/2fqWKvK