The src folder contains all the code.
Preprocessing_all.ipynb
Contains all the preprocessing analysis: excluding missing values and irrelevant/incomplete information, and adding columns. Also includes the API calls for the stringency data and the one-hot encoding of the data.
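The one-hot encoding step can be sketched roughly as follows with pandas; the column names here are illustrative assumptions, not the actual dataset schema:

```python
import pandas as pd

# Toy frame standing in for the preprocessed crime data;
# "crime_type" and "month" are hypothetical column names.
df = pd.DataFrame({
    "crime_type": ["Burglary", "Drugs", "Burglary"],
    "month": ["2020-01", "2020-02", "2020-03"],
})

# One-hot encode the categorical column, one indicator column per crime type.
encoded = pd.get_dummies(df, columns=["crime_type"], prefix="crime")
print(sorted(encoded.columns))  # month plus one crime_* column per category
```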
EDA_london.ipynb
EDA_london_boroughs.ipynb
EDA_other_cities.ipynb
EDA_maps.ipynb
These files include all the visualizations created for all the cities and both datasets: top crimes, top locations, crime
outcomes, etc. They also include plots comparing crime counts before and after COVID, and more.
Machine_learning.ipynb
Contains all the machine learning methods used. Part 1 computes feature importance
using Random Forest and XGBoost to determine the most important features. Part 2 is the regression for
predicting crime. Part 3 is the classification (using the COVID column as the target). Part 4 is the
clustering, used to determine whether there are distinct clusters before and after COVID. Lastly, regression is run on just the
unique values (45 rows). The file also includes correlation matrices.
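The feature-importance step in part 1 can be sketched as below; the synthetic data and feature names are assumptions standing in for the real crime table:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the monthly crime data; the real features
# (e.g. stringency, unemployment) are assumptions here.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                       # three hypothetical features
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)   # target driven mostly by feature 0

# Fit a Random Forest and read off impurity-based feature importances.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_
print(importances)  # feature 0 should dominate
```

The same pattern applies with XGBoost's `feature_importances_` attribute.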
The data folder has all the raw data: the police data (added per month, since the complete CSV was too large to upload),
the stringency values extracted from the API (per month), and the unemployment rates CSV with unemployment
information for London.
The documents folder contains the project report and the PowerPoint slides.
How to connect to the database:
The SQL database is hosted on AWS RDS.
To connect, create a new connection with the details below:
hostname='database-1.caiikwj2d9fo.us-east-2.rds.amazonaws.com',
database='crimedb',
username='cfg_user',
password='cfg_project2021'
To extract data from the database, import the connect_to_db.ipynb file in your script:
import import_ipynb
from connect_to_db import db_cnx
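Alternatively, the connection can be opened directly. A minimal sketch, assuming the RDS instance runs MySQL and that mysql-connector-python is installed (both assumptions; adjust the driver if the engine differs):

```python
def db_cnx():
    """Open a connection to the project database on AWS RDS.

    Assumes a MySQL engine; swap in another driver if needed.
    """
    # Imported lazily so this sketch loads even without the driver installed.
    import mysql.connector

    return mysql.connector.connect(
        host="database-1.caiikwj2d9fo.us-east-2.rds.amazonaws.com",
        database="crimedb",
        user="cfg_user",
        password="cfg_project2021",
    )
```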
NOTE:
Some Jupyter notebooks may not load on GitHub because they are too large. If that happens, you can view them by
pasting the notebook URL into nbviewer: https://nbviewer.org/