The goal of this project is to combine data wrangling, cleaning, and manipulation with Pandas in order to see how it all works together.
The dataset we are going to use is: Shark Attacks
In this project our primary goal was to clean the provided dataset using pandas. We also define some functions along the way, among them a predictor for two columns based on modal frequencies.
In short, we proceed as follows:
- Data exploration: we get a first look at the data set by checking the data types involved, which information is valuable in this context, etc. There are many `NaN` values in critical fields, such as the shark species. Our focus is to preserve this valuable information.
- Dropping duplicated columns and planning how to fill `NaN` values:
  - The unnamed columns 22 and 23 should be dropped because they are almost entirely null.
  - Because the `"Species"` column is so relevant to the data set, we cannot drop it, even though nearly 50% of its values are `NaN`.
  - The `"Time"` and `"Year"` columns seem irrelevant. However, we will explore them further.
  - The `"Country"`, `"Area"` and `"Location"` columns are related, so it might be possible to infer missing values in one from the others.
  - The `"Age"` column can perhaps be inferred from other columns after further inspection, and it also seems relevant in our context, so we decided not to drop it.
- Filling values in `"Injury"` and `"Fatal (Y/N)"` heuristically.
- Filling values in `"Country"`, `"Area"` and `"Location"` heuristically.
- Cleaning and parsing the `"Date"` column.
- Dropping the `"Case Number"`, `"Year"` and `"original order"` columns.
- Filling values for `"Name"` and `"Investigator or Source"`.
- Tidying the `"Activity"` column by extracting verbs in gerund form.
- Creating a Predictor function and using it to fill `NaN` values in the `"Age"` and `"Activity"` columns.
- Using this Predictor to fill `NaN` values in `"Area"`.
- Filling values for the `"Species"` column: unfortunately, we could not apply the Predictor in this case because there is not enough information in the data set.
- Dropping irrelevant rows based on their count of `NaN` values across columns.
- Inferring values for `"Area"` and `"Location"`.
- Assigning boolean values to the `"Fatal (Y/N)"` column.
- Changing column names and re-indexing.
- Exporting the data frame as a .csv file.
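
Several of the simpler steps listed above (dropping almost-empty columns, dropping rows with too many `NaN` values, and turning `"Fatal (Y/N)"` into booleans) could be sketched roughly as follows. This is only an illustration on invented toy rows, not the project's actual code: the column label `"Unnamed: 22"`, the 0.9 null-share cutoff and the `thresh=2` row rule are all assumptions chosen for the example.

```python
import pandas as pd

# Toy rows invented for illustration; the real data set is much larger.
toy = pd.DataFrame({
    "Country": ["USA", "AUSTRALIA", None, "BRAZIL"],
    "Species": ["White shark", None, None, "Tiger shark"],
    "Fatal (Y/N)": ["N", "Y", None, "n "],
    "Unnamed: 22": [None, None, None, None],  # assumed label for an unnamed column
})

# 1. Drop columns whose share of NaN exceeds a cutoff (0.9 is an assumed value).
null_share = toy.isna().mean()
cleaned = toy.drop(columns=null_share[null_share > 0.9].index)

# 2. Keep only rows with at least 2 non-null values (assumed threshold).
cleaned = cleaned.dropna(thresh=2)

# 3. Normalise the Y/N flag, then map it to booleans.
flag = cleaned["Fatal (Y/N)"].str.strip().str.upper()
cleaned = cleaned.assign(**{"Fatal (Y/N)": flag.map({"Y": True, "N": False})})

print(cleaned)
```

Values other than `Y` or `N` would come out of the `map` call as `NaN`, so they remain visible for a later heuristic fill instead of being silently coerced.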
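
For the `"Activity"` tidying step, one possible way to pull out a gerund is a regular expression that captures the first word ending in "ing". The sketch below assumes that approach; the helper name `extract_gerund`, the pattern and the sample strings are hypothetical, and the project may well do this differently.

```python
import pandas as pd


def extract_gerund(activity: pd.Series) -> pd.Series:
    """Return the first word ending in 'ing' from each entry, capitalised.

    Entries without such a word (or missing entries) come back as NaN.
    """
    # expand=False makes str.extract return a Series for a single capture group.
    gerund = activity.str.extract(r"\b([A-Za-z]+ing)\b", expand=False)
    return gerund.str.capitalize()


# Sample strings invented for illustration.
raw = pd.Series(["Surfing", "fishing for mackerel", "Free diving", "Fell overboard", None])
tidy = extract_gerund(raw)
print(tidy.tolist())
```

A pattern this simple also matches nouns such as "morning", so a real cleaning pass would probably need a short exclusion list.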
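
The Predictor mentioned above is described only as being based on modal frequencies between two columns. A minimal sketch of that idea, assuming it means "fill a missing value with the most frequent value seen among rows that share the same value in another column", might look like this. The function name `fill_with_group_mode`, its parameters and the toy rows are hypothetical.

```python
import pandas as pd


def fill_with_group_mode(df: pd.DataFrame, target: str, given: str) -> pd.DataFrame:
    """Fill NaN in `target` with the modal `target` value among rows
    that share the same `given` value. Returns a new DataFrame."""
    # Only rows where both columns are known can inform the prediction.
    known = df.dropna(subset=[target, given])
    # Most frequent target value per group (ties resolved by sort order).
    modes = known.groupby(given)[target].agg(lambda s: s.mode().iloc[0])
    # Look up each row's predicted value; rows without a known group stay NaN.
    predicted = df[given].map(modes)
    return df.assign(**{target: df[target].fillna(predicted)})


# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "USA", "AUSTRALIA", "AUSTRALIA", None],
    "Area": ["Florida", "Florida", "Hawaii", None, "New South Wales", None, None],
})
filled = fill_with_group_mode(toy, target="Area", given="Country")
print(filled)
```

Rows whose `given` value is itself missing cannot be predicted this way and keep their `NaN`, which is consistent with the note above that the Predictor could not be applied where the data set carries too little information.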