
Multinational-Retail-Data-Centralisation

Project Description

This project mimics an ideal ETL (extract, transform, load) project for a multinational company that sells various goods across the globe.

In this scenario, the company's sales data is spread across many different data sources, making it hard for team members to access and analyse. In an effort to become more data-driven, the organisation would like to make its sales data accessible from one centralised location.

The goal of the project is to produce a system that stores the current company data in a database, so that it can be accessed from one centralised location and acts as a single source of truth for sales data.

The user will then query the database to get up-to-date metrics for the business.

File Structure

Milestone 2 : Code Description

  • data_extraction.py : used to extract data from the different data sources

  • database_utils.py : used to connect with and upload data to the database (a minimal sketch is shown below)

  • data_cleaning.py : used to clean the data from each of the data sources
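
For orientation, here is a minimal sketch of how database_utils.py might initialise its database engine. It assumes credentials live in a local db_creds.yaml file and requires PyYAML; the file name, keys, and method name are illustrative, not the repository's exact API.

import yaml
from sqlalchemy import create_engine

class DatabaseConnector:
    def init_db_engine(self, creds_path="db_creds.yaml"):
        # Load credentials from a YAML file kept out of version control.
        with open(creds_path) as f:
            creds = yaml.safe_load(f)
        # Build a SQLAlchemy engine for the Postgres database.
        return create_engine(
            f"postgresql://{creds['USER']}:{creds['PASSWORD']}"
            f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
        )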

Milestone 3 : Code Description

  • task_1.sql : casts the columns in the orders_table to appropriate datatypes (see the sketch after this list)

  • task_2.sql : casts the columns in the users_table to appropriate datatypes

  • task_3.sql : casts the columns in the dim_stores_table to appropriate datatypes

  • task_4.sql : adds the weight_category column to the dim_products table

  • task_5.sql : casts the columns in dim_products to appropriate datatypes

  • task_6.sql : casts the columns in dim_date_times to appropriate datatypes

  • task_7.sql : casts the columns in dim_card_details to appropriate datatypes

  • task_8.sql : sets primary keys on all of the dimension tables in accordance with the ERD

  • task_9.sql : adds foreign keys to the orders_table
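
To give a flavour of these casting tasks, here is a minimal sketch of a task_1-style migration run through SQLAlchemy. The connection string, column names, and target types are illustrative; the real ones come from the task brief.

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

with engine.begin() as conn:
    # ALTER ... USING casts each column in place to the new datatype.
    conn.execute(text("""
        ALTER TABLE orders_table
            ALTER COLUMN card_number TYPE VARCHAR(19) USING card_number::VARCHAR(19),
            ALTER COLUMN product_quantity TYPE SMALLINT USING product_quantity::SMALLINT;
    """))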

Milestone 4 : Code Description

  • task_1.sql : counts the number of stores in each country (see the sketch after this list)

  • task_2.sql : finds the locations with the most stores

  • task_3.sql : finds which months typically produce the highest sales

  • task_4.sql : compares sales and products sold offline vs online

  • task_5.sql : computes the percentage of sales per store type

  • task_6.sql : finds which months in each year produced the highest sales

  • task_7.sql : finds the overall staff numbers in each location around the world

  • task_8.sql : finds the stores with the highest sales in Germany

  • task_9.sql : computes the average time between sales, grouped by year
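
For example, a task_1-style query can be run directly from Python. The table and column names below are illustrative; the actual ones match the schema built in milestone 3.

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

with engine.connect() as conn:
    # Count the number of stores per country, busiest countries first.
    rows = conn.execute(text("""
        SELECT country_code, COUNT(*) AS total_no_stores
        FROM dim_stores_table
        GROUP BY country_code
        ORDER BY total_no_stores DESC;
    """))
    for country_code, total_no_stores in rows:
        print(country_code, total_no_stores)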

License information

GNU General Public License

Installation Requirements

  • Pandas : pip install pandas
  • SQLAlchemy : pip install SQLAlchemy
  • Psycopg2 : pip install psycopg2
  • Tabula : pip install tabula-py
  • Requests : pip install requests
  • Boto3 : pip install boto3
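
All of the above can be installed in one go:

pip install pandas SQLAlchemy psycopg2 tabula-py requests boto3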

Usage instructions

  • The individual datasets can be obtained and cleaned by instantiating the DataCleaning class as shown:

from data_cleaning import DataCleaning

datacleaner = DataCleaning()

users_df = datacleaner.clean_user_data()
orders_df = datacleaner.clean_orders_data()
cards_df = datacleaner.clean_card_data()
stores_df = datacleaner.called_clean_store_data()
products_df = datacleaner.clean_products_data()
events_df = datacleaner.clean_event_date_data()
  • These can then be uploaded onto your local Postgres instance as follows:
from database_utils import DatabaseConnector
connector = DatabaseConnector()
connector.upload_to_db(<table_name>)
  • Note that you will need credentials for an AWS account and a local Postgres instance, as well as an API key for some of the private, restricted data sources.

Some key lessons learned from this project

  • When listing table names with SQLAlchemy, it is more computationally efficient to use the Inspector object than the MetaData class. The MetaData object holds a collection of full table info (columns, their data types, schema names, etc.), which is more than a simple listing needs (see the sketch after this list).

  • However, MetaData() includes thread safety, meaning it can handle concurrent tasks from multiple threads (computationally efficient when multiple threads need access to the same resource). Reference: https://docs.sqlalchemy.org/en/20/core/metadata.html

  • A FacadeDict is basically like Python's default dict data structure; SQLAlchemy claims it is more efficient for lookup and management. Since there is no resource that compares the two, the claim that FacadeDict is better is, in itself, a facade :)

  • However, SQLAlchemy documents it as publicly immutable, hence it is probably more secure to use. Reference: https://pydoc.dev/sqlalchemy/latest/sqlalchemy.util._collections.FacadeDict.html

  • A column that contains distinctive special characters is ideal for spotting invalid alphanumeric entries, so it was a mistake to remove special symbols first; it is best to leave that processing step until later!

  • It is always wise to check consistency before changing any data format; if the format is already consistent, there is no need to change it.

  • Changing column types with ALTER TABLE directly altered the values of TEXT columns, so a better method was to CAST into a new column, then DROP the old one and upload the new one.

  • In PostgreSQL, you can select columns whose names contain special characters by using double quotes, for example SELECT "price ($)" FROM tableX; this differs from Transact-SQL, where you would ideally use square brackets, for example SELECT [price ($)] FROM tableX.

  • Also in PostgreSQL, when using a CASE statement, double quotes are reserved for column names and single quotes for string/text/varchar output (demonstrated in the sketch after this list).

  • N.B.: On PC, be sure to have the latest Pandas version and sqlalchemy==1.4.46 for a successful connection; on Mac, all versions are acceptable.
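
To make the Inspector/MetaData and quoting notes above concrete, here is a minimal sketch; the database name, table, and columns are illustrative.

from sqlalchemy import MetaData, create_engine, inspect, text

engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

# Inspector queries the catalogue for names only -- cheap when a list is all you need.
print(inspect(engine).get_table_names())

# MetaData.reflect() loads full table definitions (columns, types, constraints);
# the reflected tables live in metadata.tables, which is a FacadeDict.
metadata = MetaData()
metadata.reflect(bind=engine)
print(list(metadata.tables))

# Double quotes for column names, single quotes for string output:
with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT "EAN",
               CASE WHEN weight >= 2 THEN 'heavy' ELSE 'light' END AS weight_label
        FROM dim_products;
    """))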

#TODO : improving the code in future (quality and performance)

  • Re-run task 3 of milestone 2 with the Inspector object and use the %%time magic function to compare performance with the MetaData object in SQLAlchemy
  • Change the ERD to show relationships
  • Remove the credential instantiations within the API setup in dataextractor.py; these should naturally be passed as arguments, for both code reproducibility and security.
