Project: Data Cleaning and Manipulation with Pandas

The goal of this project is to combine data wrangling, cleaning, and manipulation with Pandas in order to see how it all works together.

The dataset we are going to use is: Shark Attacks


Shark Attacks Files

Goal:

In this project our primary goal was to clean the provided dataset using pandas. Along the way we also define some helper functions, among them a predictor that fills two columns based on modal frequencies.
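To make the idea of a predictor based on modal frequencies concrete, here is a minimal sketch. It is not the notebook's actual implementation: the function name `fill_with_group_mode`, its parameters, and the toy data are assumptions made for illustration. It fills each missing value in one column with the most frequent value observed among rows that share the same value in another column.

```python
import pandas as pd


def fill_with_group_mode(df, target, by):
    """Fill NaN in `target` with the modal `target` value of each `by` group.

    Hypothetical helper for illustration; groups with no observed
    `target` value are left as NaN.
    """
    def mode_or_nan(values):
        modes = values.mode()  # Series.mode() ignores NaN by default
        return modes.iloc[0] if not modes.empty else float("nan")

    # One modal value per group, indexed by the grouping key.
    group_modes = df.groupby(by)[target].agg(mode_or_nan)

    out = df.copy()
    # Look up each row's group mode and use it only where `target` is missing.
    out[target] = out[target].fillna(out[by].map(group_modes))
    return out


# Toy data (invented): ages are missing for some rows.
toy = pd.DataFrame({
    "Activity": ["Surfing", "Surfing", "Surfing", "Fishing", "Fishing", "Diving"],
    "Age": [20.0, 20.0, None, 45.0, None, None],
})

filled = fill_with_group_mode(toy, target="Age", by="Activity")
print(filled)
```

In this sketch the "Diving" row stays NaN because no age was observed for that activity, which mirrors the situation where a column has too little information for the predictor to be useful.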

Procedure:

In short, we proceed as follows:

Data exploration: Here we get an overview of the data set by looking at the data types involved, which information is valuable in this context, etc. There are many NaN values in critical fields, such as the shark species. Our focus is to preserve this valuable information.
Plan for deleting duplicated columns and filling NaN values:

The unnamed columns 22 and 23 should be dropped because they are almost entirely null.
Due to the relevance of the "Species" column in the data set, we cannot drop it, even though nearly 50% of its values are NaN.
The columns "Time" and "Year" seem to be irrelevant. However, we will explore them further.
The columns "Country", "Area" and "Location" are related, so it might be possible to infer missing values in one from the others.
The column "Age" can perhaps be inferred from other columns after further inspection, and it also seems relevant in our context, so we decided not to drop it.
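The plan above rests on knowing how much of each column is missing. The following sketch shows one way to measure that and to drop the mostly empty columns. The toy frame, the column labels "Unnamed: 22" and "Unnamed: 23" (the labels pandas typically gives to blank CSV headers), and the 0.7 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (invented) standing in for the raw data set.
toy = pd.DataFrame({
    "Species": ["White shark", np.nan, "Tiger shark", np.nan],
    "Country": ["USA", "AUSTRALIA", np.nan, "USA"],
    "Unnamed: 22": [np.nan, np.nan, np.nan, np.nan],
    "Unnamed: 23": [np.nan, np.nan, np.nan, "x"],
})

# Fraction of missing values per column.
null_share = toy.isna().mean()

# Columns above an assumed threshold are treated as "almost entirely null".
threshold = 0.7
mostly_null = null_share[null_share > threshold].index.tolist()

cleaned = toy.drop(columns=mostly_null)
print(null_share)
print(cleaned.columns.tolist())
```

A column such as "Species" sits at 50% missing in this toy example and is kept, which matches the decision to retain relevant columns even when many of their values are NaN.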

Filling values in "Injury" and "Fatal (Y/N)" heuristically.
Filling values in "Country", "Area" and "Location" heuristically.
Cleaning and parsing the column "Date".
Dropping the columns "Case Number", "Year" and "original order".
Filling values for "Name" and "Investigator or Source".
Tidying the column "Activity" by extracting verbs in gerund form.
Creating a Predictor function and using it to fill NaN values in the columns "Age" and "Activity".
Using this Predictor to fill NaNs in "Area".
Filling values for the column "Species": unfortunately, we could not apply the Predictor in this case because there is not enough information in the data set.
Dropping irrelevant rows based on NaN counts by column.
Inferring values for "Area" and "Location".
Assigning boolean values to the column "Fatal (Y/N)".
Renaming columns and re-indexing.
Exporting the data frame as a .csv file.
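Three of the steps above (tidying "Activity", converting "Fatal (Y/N)" to booleans, and parsing "Date") can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the sample values, the "Reported " prefix, and the day-month-year date format are guesses about what the raw data looks like, and the regular expression is only a heuristic, since not every word ending in "ing" is a gerund.

```python
import pandas as pd

# Toy rows (invented) imitating messy raw values.
toy = pd.DataFrame({
    "Activity": ["Surfing", "Fell overboard while fishing", "Standing in water", None],
    "Fatal (Y/N)": ["Y", " N", "n", "UNKNOWN"],
    "Date": ["25-Jun-2018", "Reported 08-Jan-2017", "1845-1853", None],
})

# Keep the first word ending in "ing" as the activity label (heuristic).
activity = (
    toy["Activity"]
    .str.extract(r"\b(\w+ing)\b", expand=False)
    .str.capitalize()
)

# Normalise whitespace and case, then map to booleans.
# Anything other than Y or N becomes NaN.
fatal = (
    toy["Fatal (Y/N)"]
    .str.strip()
    .str.upper()
    .map({"Y": True, "N": False})
)

# Strip an assumed "Reported " prefix and parse an assumed dd-Mon-YYYY format.
# Values that do not match become NaT instead of raising an error.
dates = pd.to_datetime(
    toy["Date"].str.replace("Reported ", "", regex=False),
    format="%d-%b-%Y",
    errors="coerce",
)

print(pd.DataFrame({"Activity": activity, "Fatal": fatal, "Date": dates}))
```

Passing `errors="coerce"` keeps unparseable entries visible as missing values, so they can be counted and handled in a later step instead of silently stopping the pipeline.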

fabi-cast/Data-wrangling-shark-attacks
