Python-Random-Forest-Titanic-Survivorship

This project uses a Random Forest to classify each passenger's survivorship as died or survived, using the same engineered data and features as the Logistic Regression Titanic Survivorship project. Bagging (Bootstrap Aggregation) takes multiple samples from the training dataset with replacement, trains a model on each sample, and averages the predictions of all sub-models to obtain a final prediction (Brownlee, 2021, p. 92). Random Forest likewise samples the training dataset with replacement, but constructs its trees in a way that reduces the correlation between the individual classifiers (Brownlee, 2021, p. 93), giving it an improvement over Bagged Trees. The custom function rf_mod_assess() returns a dataframe of the classification report results for each of the 25 trials of Monte Carlo cross-validation.
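A minimal sketch of that evaluation loop is shown below. The helper name rf_mod_assess() comes from this project, but its internals, the feature matrix X, the target y, and the Random Forest hyperparameters shown here are assumptions rather than the notebook's exact code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def rf_mod_assess(X, y, n_trials=25, test_size=0.2):
    """Monte Carlo cross-validation: one random split, fit, and report per trial."""
    rows = []
    for state in range(n_trials):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=state, stratify=y
        )
        model = RandomForestClassifier(n_estimators=100, random_state=state)
        model.fit(X_train, y_train)
        report = classification_report(
            y_test, model.predict(X_test), output_dict=True
        )
        # Flatten the nested classification report into one row per trial.
        row = {"random_state": state, "accuracy": report["accuracy"]}
        for label in ("0", "1"):  # 0 = died, 1 = survived
            for metric in ("precision", "recall", "f1-score"):
                row[f"{label} {metric}"] = report[label][metric]
        rows.append(row)
    return pd.DataFrame(rows)
```

Sorting the returned dataframe by accuracy makes it easy to pick out the median trial used for the feature importance analysis further below.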

See also the KNN Titanic Survivorship project.

The code can be viewed here: Titanic_Survivorship_Random_Forest_02.ipynb.

Random Forest outperformed Logistic Regression and KNN by 2% and 1%, respectively.

To determine feature importance, the middle result (random state 19) is used.

Permutation Importance is used rather than Random Forest Feature Importance because the latter is computed on statistics derived from the training dataset (Scikit-Learn, 2021a). Permutation feature importance is defined as the decrease in a model score when a single feature's values are randomly shuffled; shuffling breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on that feature (Scikit-Learn, 2021b). In the boxplot below, accuracy is used to score the model.
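A hedged sketch of that step follows, assuming the middle trial is rebuilt by refitting the same split and model with random state 19, that X is a pandas DataFrame so its column names can label the plot, and that n_repeats and the hyperparameters are illustrative defaults rather than the notebook's exact settings.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Rebuild the middle trial: same split and model settings, random_state=19.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=19, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=19)
model.fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=30, random_state=19
)

# Boxplot of per-repeat importances, sorted by mean decrease in accuracy.
order = result.importances_mean.argsort()
plt.boxplot(
    result.importances[order].T,
    vert=False,
    labels=X.columns[order],
)
plt.xlabel("Decrease in accuracy when the feature is shuffled")
plt.tight_layout()
plt.show()
```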

As in the Logistic Regression Titanic Survivorship project, Male, PClass and Age are the most important features, and they occupy the same rank positions.

REFERENCES

Brownlee, J. (2021). Machine Learning Mastery with Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End (v1.2 ed.). https://machinelearningmastery.com/machine-learning-with-python/

Scikit-Learn. (2021a, October 17). Permutation Importance vs Random Forest Feature Importance (MDI). https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py

Scikit-Learn. (2021b, October 17). Permutation feature importance. https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance
