The MovieLens 100K dataset is used for this exercise. It can be browsed at https://grouplens.org/datasets/movielens/100k/ for a quick look.
The dataset itself is available at https://files.grouplens.org/datasets/movielens/ml-100k.zip
All three questions are implemented as three separate functions in a single main.py file. Running the main file writes the generated outputs to the target folder. Unit tests check for data integrity issues, and a file-based logger writes messages to a log file.
- Extract the contents of the zip file to a location of your choice.
- Docker must be installed as a prerequisite.
- Execute the commands below:
cd Take2Project
docker build --rm --pull -t "take2projectfeaturedev:latest" .
- At this point the Docker image is built and tagged take2projectfeaturedev:latest. To verify, run
docker images
- You should see an image named take2projectfeaturedev. Note the IMAGE ID of this image.
- To run a container from the image:
docker run -it <IMAGE ID>
- This spins up a container that runs the ETL job and prints the output files to the terminal.
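The repository's actual Dockerfile is not reproduced here; a minimal sketch of what such an image build typically contains is shown below. The base image tag and the script name in CMD are assumptions.

```dockerfile
# Illustrative Dockerfile sketch (assumed, not the project's actual file)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN mkdir -p target
CMD ["python", "main_etl.py"]
```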
- First, create a virtual environment and activate it:
python3 -m venv venv
venv\Scripts\activate        (Windows)
source venv/bin/activate     (Linux/macOS)
- Next, install the required dependencies:
pip install -r requirements.txt
- Create a directory for storing the output files.
mkdir target
- Download the ml-100k dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip, unzip it, and place the ml-100k folder in the current directory. If all is good, the current directory structure so far should look like this:
Take2Project
├───ml-100k
├───target
└───venv
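Inside ml-100k, the ratings file u.data stores the 100,000 ratings as tab-separated `user_id  item_id  rating  timestamp` rows, as described in the dataset's own README. A minimal sketch of parsing one such row (the sample line is illustrative, in the documented format):

```python
# Parse a MovieLens 100K u.data row: user_id \t item_id \t rating \t timestamp
def parse_rating(line):
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return {
        "user_id": int(user_id),
        "item_id": int(item_id),
        "rating": int(rating),
        "timestamp": int(timestamp),
    }

sample = "196\t242\t3\t881250949"  # illustrative row in the u.data format
row = parse_rating(sample)
```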
- Run the tests:
python -m unittest test_etl.py -v
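The project's actual tests live in test_etl.py; as an illustration only, a data-integrity check of the kind mentioned above could look like this. The helper function and test names are assumptions for the sketch.

```python
import unittest

def ratings_are_valid(ratings):
    """MovieLens 100K ratings are integers from 1 to 5."""
    return all(isinstance(r, int) and 1 <= r <= 5 for r in ratings)

class TestDataIntegrity(unittest.TestCase):
    # hypothetical integrity checks, not the project's real test cases
    def test_valid_ratings_pass(self):
        self.assertTrue(ratings_are_valid([1, 3, 5]))

    def test_out_of_range_rating_fails(self):
        self.assertFalse(ratings_are_valid([0, 6]))
```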
- Run the main script:
python main_etl.py
- The target folder should now contain 3 generated files, one per question.
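To sanity-check the run, you can count the files in the target folder. The snippet below demonstrates the check against a temporary directory with three placeholder files (the file names are assumptions; only the count of three comes from the steps above). In the real project you would point it at ./target after running main_etl.py.

```python
import os
import tempfile

def count_output_files(target_dir):
    # count regular files directly inside the target directory
    return len([
        f for f in os.listdir(target_dir)
        if os.path.isfile(os.path.join(target_dir, f))
    ])

# demo against a temp dir with three placeholder files (assumed names)
with tempfile.TemporaryDirectory() as target:
    for name in ("question_1.csv", "question_2.csv", "question_3.csv"):
        open(os.path.join(target, name), "w").close()
    n = count_output_files(target)
```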