Second project of the Data Scientist Nanodegree of Udacity

carogomezt/datascience_project2

Disaster Response Pipeline Project

Table of Contents

  1. Installation
  2. Project Motivation
  3. File Descriptions
  4. Instructions
  5. Results
  6. Licensing, Authors, and Acknowledgements

Installation

  1. Clone the repository.
  2. Create a virtual environment and activate it.
$ virtualenv --python=python3 ds-project2
$ source ds-project2/bin/activate
  3. Go to the project folder (datascience_project2) and run the following command to install all the dependencies:
$ pip install -r requirements.txt

Project Motivation

This project uses data from Figure Eight to build a model for an API that classifies disaster messages. It let me put ETL skills and the creation of ML pipelines into practice. The application can help people and organizations during a disaster: by categorizing the messages that people send, they can put a mitigation plan together faster.

File Descriptions

  1. app: Folder with the HTML files and the code to run the API.
  2. data: Folder with the input data and the script that preprocesses it.
  3. data_analysis: Folder with the Jupyter notebooks used for the initial exploration of the data and models.
  4. img: Folder with images of the results.
  5. models: Folder with the script that builds, trains and evaluates the model.
  6. README.md: File with repository information.
  7. requirements.txt: File with the project's dependencies.

File structure

- app
| - template
| |- master.html # main page of web app
| |- go.html # classification result page of web app
|- run.py # Flask file that runs app
- data
|- disaster_categories.csv # data to process
|- disaster_messages.csv # data to process
|- process_data.py
|- InsertDatabaseName.db # database to save clean data to
- data_analysis
|- ETL Pipeline Preparation.ipynb # first processing of the data
|- ML Pipeline Preparation.ipynb # exploration of models
- img
|- models_accuracy.png
- models
|- train_classifier.py
|- classifier.pkl # saved model
README.md

Instructions

  1. Run the following commands in the project's root directory to set up your database and model.

    • To run the ETL pipeline that cleans the data and stores it in a database:
$ python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
    • To run the ML pipeline that trains the classifier and saves it:
$ python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  2. Run the following command in the app's directory to run your web app:
$ python run.py

  3. Go to http://0.0.0.0:3001/
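
As a rough illustration of what an ETL step like this does, the sketch below joins messages with their categories, expands each category string into 0/1 columns, and stores the result in SQLite. It uses only the standard library and a tiny in-memory sample. The `related-1;request-0` category format, the column names and the `messages` table name are assumptions made for the example, and data/process_data.py may work differently.

```python
import sqlite3

# Hypothetical sample rows; the real CSV files are much larger.
messages = [
    {"id": 1, "message": "We need water and food"},
    {"id": 2, "message": "Road is blocked near the bridge"},
]
categories = [
    {"id": 1, "categories": "related-1;request-1;water-1"},
    {"id": 2, "categories": "related-1;request-0;water-0"},
]


def parse_categories(raw):
    """Turn 'related-1;request-0' into {'related': 1, 'request': 0}."""
    parsed = {}
    for item in raw.split(";"):
        name, _, value = item.rpartition("-")
        parsed[name] = int(value)
    return parsed


def build_rows(messages, categories):
    """Join messages and categories on id and expand the category string."""
    by_id = {row["id"]: parse_categories(row["categories"]) for row in categories}
    rows = []
    for msg in messages:
        if msg["id"] in by_id:  # inner join: skip messages without labels
            rows.append({**msg, **by_id[msg["id"]]})
    return rows


def save_rows(rows, conn, table="messages"):
    """Write the cleaned rows to a SQLite table, replacing any old copy."""
    columns = list(rows[0])
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


rows = build_rows(messages, categories)
conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
save_rows(rows, conn)
print(conn.execute("SELECT id, request, water FROM messages ORDER BY id").fetchall())
```

Replacing the table on every run keeps the script safe to rerun, which is convenient while the cleaning logic is still changing.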

Results

While building the model I tried several classifiers, which gave the following results:

  • RandomForest score: 81%
  • DecisionTrees score: 79%
  • KNeighborsClassifier score: 26%

Figure: models accuracy (img/models_accuracy.png)

You can find more detailed information in the Jupyter notebook.

I chose the RandomForest model because it had the highest score, and I tried to optimize it with GridSearch. The search took more than 10 hours to train and the score dropped to 21%. For that reason I decided to keep the model from before the GridSearch. I couldn't upload the model because the file was around 1 GB.
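
One way to see why a grid search can run this long is to count the model fits it requires: every parameter combination is trained once per cross-validation fold. The sketch below does that arithmetic with the standard library. The parameter names, the values and the fold count are hypothetical, not the grid actually used in this project.

```python
from itertools import product

# Hypothetical grid for a random forest; not the grid used in this project.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 4],
}
cv_folds = 5  # assumed number of cross-validation folds


def count_fits(param_grid, cv_folds):
    """Number of model trainings an exhaustive grid search performs."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * cv_folds


total_fits = count_fits(param_grid, cv_folds)
print(f"{total_fits} fits")  # 3 * 3 * 2 combinations * 5 folds = 90 fits
```

If a single fit takes several minutes, a grid of this size already runs for hours, so trimming the grid or the fold count is usually the first thing to try.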

Some target classes had only a single value (every label was 0), which may have limited how well the model generalizes.
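
A quick check for such targets could look like the sketch below, which lists the categories whose labels never vary. The category names and values are hypothetical toy data, not taken from the project's dataset.

```python
# Hypothetical label matrix: one list of 0/1 values per target category.
targets = {
    "category_a": [1, 0, 1, 1],
    "category_b": [0, 1, 0, 0],
    "category_c": [0, 0, 0, 0],  # never positive in this toy sample
}


def constant_targets(targets):
    """Return the names of categories whose labels never vary."""
    return [name for name, values in targets.items() if len(set(values)) <= 1]


print(constant_targets(targets))
```

Categories flagged this way carry no signal for a classifier, so they are candidates for being dropped or reported separately.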

Licensing, Authors, and Acknowledgements

Credit goes to Figure Eight for the data. You can find the licensing for the data and other descriptive information here. Otherwise, feel free to use the code here as you would like!
