diff --git a/Makefile b/Makefile
deleted file mode 100644
index 7e7f1bd..0000000
--- a/Makefile
+++ /dev/null
@@ -1,26 +0,0 @@
-MYPY_OPTIONS = --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs
-
-.PHONY: integration-test
-integration-test:
-	poetry run pytest tests/integration
-
-.PHONY: unit-test
-unit-test:
-	poetry run pytest tests/unit
-
-.PHONY: lint-check
-lint-check:
-	poetry run pylint data_transformations tests
-
-.PHONY: type-check
-type-check:
-	poetry run mypy ${MYPY_OPTIONS} data_transformations tests
-
-.PHONY: style-checks
-style-checks: lint-check type-check
-
-.PHONY: tests
-tests: unit-test integration-test
-
-requirements.txt:
-	poetry export -f requirements.txt --output requirements.txt --dev
diff --git a/README-DOCKER.md b/README-DOCKER.md
new file mode 100644
index 0000000..9daae3d
--- /dev/null
+++ b/README-DOCKER.md
@@ -0,0 +1,229 @@
+# Data transformations with Python
+This is a collection of _Python_ jobs that are supposed to transform data.
+These jobs use _PySpark_ to process larger volumes of data and are supposed to run on a _Spark_ cluster (via `spark-submit`).
+
+## Pre-requisites
+
+We use [`batect`](https://batect.dev/) to dockerise the tasks in this exercise.
+`batect` is a lightweight wrapper around Docker that helps to ensure tasks run consistently (across Linux, Mac & Windows).
+Similarly, `go.sh` / `go.ps1` keeps commands consistent across Linux, Mac & Windows.
+With `batect`, the only dependencies that need to be installed are Docker and Java >=8. Every other dependency is managed inside Docker containers.
+If Docker Desktop cannot be installed, Colima can be used instead on Mac and Linux.
+
+> **On Windows, Docker Desktop is the only option for running the application in containers;
+> otherwise, the application should be set up locally.**
+
+Please make sure you have the following installed and can run them:
+
+* Docker Desktop or Colima
+* Java (11)
+
+You can use the following instructions as guidelines to install Docker or Colima and Java.
+
+```bash
+# Install pre-requisites needed by batect
+# For Mac users:
+./go.sh install-with-docker-desktop
+# OR
+./go.sh install-with-colima
+
+# For Windows/Linux users:
+# Please ensure Docker and Java >=8 are installed
+scripts\install_choco.ps1
+scripts\install.bat
+
+# For a local laptop setup, ensure that Java 11 with Spark 3.2.1 is available. More details in README.md
+```
+
+> **If you are using Colima, please ensure that you start it first. To start Colima, you can use the following command:**
+
+`./go.sh start-colima`
+
+> **Please install Poetry if you would like to use the lint command. Instructions to install Poetry are in the [README](README.md).**
+
+## List of commands
+
+Apart from installation and starting Colima, the general pattern is:
+
+`./go.sh run-<type>-<action>`
+
+where `type` can be `local`, `colima` or `docker-desktop`,
+and `action` can be `unit-test`, `integration-test` or `job`.
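+
+For example (both commands also appear in the table below):
+
+```bash
+# run the unit tests in a container using Colima
+./go.sh run-colima-unit-test
+
+# run the Word Count job in a container using Docker Desktop
+# (the JOB environment variable selects the job to run)
+JOB=wordcount ./go.sh run-docker-desktop-job
+```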
+
+Full list of commands for Mac and Linux users is as follows:
+
+| S.No. | Command | Action |
+| :---: | :---- | :--- |
+| 1 | ./go.sh lint | Static analysis, code style, etc. (please install Poetry if you would like to use this command) |
+| 2 | ./go.sh linting | Static analysis, code style, etc. (please install Poetry if you would like to use this command) |
+| 3 | ./go.sh install-with-docker-desktop | Install the application requirements along with Docker Desktop |
+| 4 | ./go.sh install-with-colima | Install the application requirements along with Colima |
+| 5 | ./go.sh start-colima | Start Colima |
+| 6 | ./go.sh run-local-unit-test | Run unit tests on the local machine |
+| 7 | ./go.sh run-colima-unit-test | Run unit tests in containers using Colima |
+| 8 | ./go.sh run-docker-desktop-unit-test | Run unit tests in containers using Docker Desktop |
+| 9 | ./go.sh run-local-integration-test | Run integration tests on the local machine |
+| 10 | ./go.sh run-colima-integration-test | Run integration tests in containers using Colima |
+| 11 | ./go.sh run-docker-desktop-integration-test | Run integration tests in containers using Docker Desktop |
+| 12 | ./go.sh run-local-job | Run a job on the local machine |
+| 13 | ./go.sh run-colima-job | Run a job in containers using Colima |
+| 14 | ./go.sh run-docker-desktop-job | Run a job in containers using Docker Desktop |
+| 15 | ./go.sh Usage | Display usage |
+
+Full list of commands for Windows users is as follows:
+
+| S.No. | Command | Action |
+| :---: | :---- | :--- |
+| 1 | .\go.ps1 linting | Static analysis, code style, etc. (please install Poetry if you would like to use this command) |
+| 2 | .\go.ps1 install-with-docker-desktop | Install the application requirements along with Docker Desktop |
+| 3 | .\go.ps1 run-local-unit-test | Run unit tests on the local machine |
+| 4 | .\go.ps1 run-docker-desktop-unit-test | Run unit tests in containers using Docker Desktop |
+| 5 | .\go.ps1 run-local-integration-test | Run integration tests on the local machine |
+| 6 | .\go.ps1 run-docker-desktop-integration-test | Run integration tests in containers using Docker Desktop |
+| 7 | .\go.ps1 run-local-job | Run a job on the local machine |
+| 8 | .\go.ps1 run-docker-desktop-job | Run a job in containers using Docker Desktop |
+| 9 | .\go.ps1 Usage | Display usage |
+
+## Jobs
+
+There are two applications in this repo: Word Count, and Citibike.
+
+Currently, these exist as skeletons, and have some initial test cases which are defined but ignored.
+For each application, please un-ignore the tests and implement the missing logic.
+
+### Word Count
+An NLP model depends on a specific input file. This job is supposed to preprocess a given text file to produce this
+input file for the NLP model (feature engineering). The job counts the occurrences of each word within the given text
+file (corpus).
+
+There is a dump of the datalake for this under `resources/word_count/words.txt` with a text file.
+
+#### Input
+Simple `*.txt` file containing text.
+
+#### Output
+A single `*.csv` file containing data similar to:
+```csv
+"word","count"
+"a","3"
+"an","5"
+...
+```
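+
+The transformation itself can be sketched in a few lines of PySpark. This is only an illustrative approach, not the solution shipped in this repo; the hard-coded paths and the `coalesce(1)` (to produce a single output file) are assumptions:
+
+```python
+from pyspark.sql import DataFrame, SparkSession
+import pyspark.sql.functions as F
+
+
+def count_words(lines: DataFrame) -> DataFrame:
+    # split each line on whitespace, explode to one word per row,
+    # normalise case, drop empty tokens and count occurrences per word
+    words = lines.select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
+    return words.filter(F.col("word") != "").groupBy("word").count()
+
+
+if __name__ == "__main__":
+    spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()
+    lines = spark.read.text("resources/word_count/words.txt")
+    count_words(lines).coalesce(1).write.csv("output/word_count", header=True, mode="overwrite")
+```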
+
+#### Run the job using Docker Desktop on Mac or Linux
+
+```bash
+JOB=wordcount ./go.sh run-docker-desktop-job
+```
+
+#### Run the job using Docker Desktop on Windows
+
+```powershell
+$env:JOB = "wordcount"
+.\go.ps1 run-docker-desktop-job
+```
+
+#### Run the job using Colima
+
+```bash
+JOB=wordcount ./go.sh run-colima-job
+```
+
+### Citibike
+***This problem uses data made publicly available by [Citibike](https://citibikenyc.com/), a New York based bike share company.***
+
+For analytics purposes, the BI department of a hypothetical bike share company would like to present dashboards displaying the
+distance each bike was driven. There is a `*.csv` file that contains historical data of previous bike rides. This input
+file needs to be processed in multiple steps. There is a pipeline running these jobs.
+
+![citibike pipeline](docs/citibike.png)
+
+There is a dump of the datalake for this under `resources/citibike/citibike.csv` with historical data.
+
+#### Ingest
+Reads a `*.csv` file and transforms it to parquet format. The column names will be sanitized (whitespaces replaced with underscores).
+
+##### Input
+Historical bike ride `*.csv` file:
+```csv
+"tripduration","starttime","stoptime","start station id","start station name","start station latitude",...
+364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
+...
+```
+
+##### Output
+`*.parquet` files containing the same content:
+```csv
+"tripduration","starttime","stoptime","start_station_id","start_station_name","start_station_latitude",...
+364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
+...
+```
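+
+A possible shape for this step is sketched below. It is illustrative only, not the repo's implementation, and the input/output paths are assumptions:
+
+```python
+from pyspark.sql import DataFrame, SparkSession
+
+
+def sanitize_columns(df: DataFrame) -> DataFrame:
+    # replace whitespace in column names, e.g. "start station id" -> "start_station_id"
+    for name in df.columns:
+        df = df.withColumnRenamed(name, name.replace(" ", "_"))
+    return df
+
+
+if __name__ == "__main__":
+    spark = SparkSession.builder.appName("citibike_ingest_sketch").getOrCreate()
+    trips = spark.read.csv("resources/citibike/citibike.csv", header=True)
+    sanitize_columns(trips).write.parquet("output/citibike_ingest", mode="overwrite")
+```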
+
+##### Run the job using Docker Desktop on Mac or Linux
+
+```bash
+JOB=citibike_ingest ./go.sh run-docker-desktop-job
+```
+
+##### Run the job using Docker Desktop on Windows
+
+```powershell
+$env:JOB = "citibike_ingest"
+.\go.ps1 run-docker-desktop-job
+```
+
+##### Run the job using Colima
+
+```bash
+JOB=citibike_ingest ./go.sh run-colima-job
+```
+
+#### Distance calculation
+This job takes bike trip information and calculates the "as the crow flies" distance traveled for each trip.
+It reads the previously ingested parquet files.
+
+Hint:
+  - For distance calculation, consider using the [**Haversine formula**](https://en.wikipedia.org/wiki/Haversine_formula) as an option.
+
+##### Input
+Historical bike ride `*.parquet` files:
+```csv
+"tripduration",...
+364,...
+...
+```
+
+##### Outputs
+`*.parquet` files containing the historical data with a `distance` column containing the calculated distance:
+```csv
+"tripduration",...,"distance"
+364,...,1.34
+...
+```
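+
+A sketch of the hint in PySpark is shown below. It is illustrative only: the end-station column names, the input/output paths, and the unit (kilometres, rounded to two decimals as in the sample above) are assumptions:
+
+```python
+from pyspark.sql import Column, DataFrame, SparkSession
+import pyspark.sql.functions as F
+
+EARTH_RADIUS_KM = 6371.0
+
+
+def haversine_km(lat1: Column, lon1: Column, lat2: Column, lon2: Column) -> Column:
+    # great-circle ("as the crow flies") distance between two coordinates
+    dlat = F.radians(lat2 - lat1)
+    dlon = F.radians(lon2 - lon1)
+    a = (F.sin(dlat / 2) ** 2
+         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.sin(dlon / 2) ** 2)
+    return 2 * EARTH_RADIUS_KM * F.asin(F.sqrt(a))
+
+
+def with_distance(trips: DataFrame) -> DataFrame:
+    distance = haversine_km(
+        F.col("start_station_latitude"), F.col("start_station_longitude"),
+        F.col("end_station_latitude"), F.col("end_station_longitude"),
+    )
+    return trips.withColumn("distance", F.round(distance, 2))
+
+
+if __name__ == "__main__":
+    spark = SparkSession.builder.appName("citibike_distance_sketch").getOrCreate()
+    trips = spark.read.parquet("output/citibike_ingest")
+    with_distance(trips).write.parquet("output/citibike_distance", mode="overwrite")
+```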
+
+##### Run the job using Docker Desktop on Mac or Linux
+
+```bash
+JOB=citibike_distance_calculation ./go.sh run-docker-desktop-job
+```
+
+##### Run the job using Docker Desktop on Windows
+
+```powershell
+$env:JOB = "citibike_distance_calculation"
+.\go.ps1 run-docker-desktop-job
+```
+
+##### Run the job using Colima
+
+```bash
+JOB=citibike_distance_calculation ./go.sh run-colima-job
+```
\ No newline at end of file
diff --git a/README-LOCAL.md b/README-LOCAL.md
deleted file mode 100644
index 98505c8..0000000
--- a/README-LOCAL.md
+++ /dev/null
@@ -1,188 +0,0 @@
-# Data transformations with Python
-
-This is a collection of _Python_ jobs that are supposed to transform data.
-These jobs are using _PySpark_ to process larger volumes of data and are supposed to run on a _Spark_ cluster (via `spark-submit`).
-
-## Pre-requisites
-
-Please make sure you have the following installed and can run them
-
-- Python (3.11.x), you can use for example [pyenv](https://github.com/pyenv/pyenv#installation) to manage your python versions locally
-- [Poetry](https://python-poetry.org/docs/#installation)
-- Java (11)
-
-## Install all dependencies
-
-```bash
-poetry install
-```
-
-## Run tests
-
-To run all tests:
-
-```bash
-make tests
-```
-
-### Run unit tests
-
-```bash
-make unit-test
-```
-
-### Run integration tests
-
-```bash
-make integration-test
-```
-
-## Create package
-
-This will create a `tar.gz` and a `.wheel` in `dist/` folder:
-
-```bash
-poetry build
-```
-
-More: https://python-poetry.org/docs/cli/#build
-
-## Run style checks
-
-```bash
-make style-checks
-```
-
-This is running the linter and a type checker.
-
-## Jobs
-
-There are two applications in this repo: Word Count, and Citibike.
-
-Currently, these exist as skeletons, and have some initial test cases which are defined but ignored.
-For each application, please un-ignore the tests and implement the missing logic.
-
-### Word Count
-
-A NLP model is dependent on a specific input file. This job is supposed to preprocess a given text file to produce this
-input file for the NLP model (feature engineering). This job will count the occurrences of a word within the given text
-file (corpus).
-
-There is a dump of the datalake for this under `resources/word_count/words.txt` with a text file.
-
-#### Input
-
-Simple `*.txt` file containing text.
-
-#### Output
-
-A single `*.csv` file containing data similar to:
-
-```csv
-"word","count"
-"a","3"
-"an","5"
-...
-```
-
-#### Run the job
-
-Please make sure to package the code before submitting the spark job (`poetry build`)
-
-```bash
-poetry run spark-submit \
-    --master local \
-    --py-files dist/data_transformations-*.whl \
-    jobs/word_count.py \
-    <input_file_path> \
-    <output_path>
-```
-
-### Citibike
-
-For analytics purposes the BI department of a bike share company would like to present dashboards, displaying the
-distance each bike was driven. There is a `*.csv` file that contains historical data of previous bike rides. This input
-file needs to be processed in multiple steps. There is a pipeline running these jobs.
-
-![citibike pipeline](docs/citibike.png)
-
-There is a dump of the datalake for this under `resources/citibike/citibike.csv` with historical data.
-
-#### Ingest
-
-Reads a `*.csv` file and transforms it to parquet format. The column names will be sanitized (whitespaces replaced).
-
-##### Input
-
-Historical bike ride `*.csv` file:
-
-```csv
-"tripduration","starttime","stoptime","start station id","start station name","start station latitude",...
-364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
-...
-```
-
-##### Output
-
-`*.parquet` files containing the same content
-
-```csv
-"tripduration","starttime","stoptime","start_station_id","start_station_name","start_station_latitude",...
-364,"2017-07-01 00:00:00","2017-07-01 00:06:05",539,"Metropolitan Ave & Bedford Ave",40.71534825,...
-...
-```
-
-##### Run the job
-
-Please make sure to package the code before submitting the spark job (`poetry build`)
-
-```bash
-poetry run spark-submit \
-    --master local \
-    --py-files dist/data_transformations-*.whl \
-    jobs/citibike_ingest.py \
-    <input_file_path> \
-    <output_path>
-```
-
-#### Distance calculation
-
-This job takes bike trip information and calculates the "as the crow flies" distance traveled for each trip.
-It reads the previously ingested data parquet files.
-
-Hint:
-
-- For distance calculation, consider using [**Harvesine formula**](https://en.wikipedia.org/wiki/Haversine_formula) as an option.
-
-##### Input
-
-Historical bike ride `*.parquet` files
-
-```csv
-"tripduration",...
-364,...
-...
-```
-
-##### Outputs
-
-`*.parquet` files containing historical data with distance column containing the calculated distance.
-
-```csv
-"tripduration",...,"distance"
-364,...,1.34
-...
-```
-
-##### Run the job
-
-Please make sure to package the code before submitting the spark job (`poetry build`)
-
-```bash
-poetry run spark-submit \
-    --master local \
-    --py-files dist/data_transformations-*.whl \
-    jobs/citibike_distance_calculation.py \
-    <input_file_path> \
-    <output_path>
-```
diff --git a/README.md b/README.md
index 62da81c..1c9f7c3 100644
--- a/README.md
+++ b/README.md
@@ -1,25 +1,42 @@
 # Data transformations with Python
-
 This is a collection of _Python_ jobs that are supposed to transform data.
 These jobs are using _PySpark_ to process larger volumes of data and are supposed to run on a _Spark_ cluster (via `spark-submit`).
 
 ## Pre-requisites
+Please make sure you have the following installed and can run them
+* Python (3.11 or later), you can use for example [pyenv](https://github.com/pyenv/pyenv#installation) to manage your python versions locally
+* [Poetry](https://python-poetry.org/docs/#installation)
+* Java (11)
 
-We use [`batect`](https://batect.dev/) to dockerise the tasks in this exercise.
-`batect` is a lightweight wrapper around Docker that helps to ensure tasks run consistently (across linux, mac windows).
-With `batect`, the only dependencies that need to be installed are Docker and Java >=8. Every other dependency is managed inside Docker containers.
-If docker desktop can't be installed then Colima could be used on Mac and Linux.
+## Install all dependencies
+```bash
+poetry install
+```
 
-> **For Windows, docker desktop is the only option for using container to run application
-> otherwise local laptop should be set up.**
+## Setup
+### Run tests
 
-Please make sure you have the following installed and can run them
+#### Run unit tests
+```bash
+poetry run pytest tests/unit
+```
 
-- Docker Desktop or Colima
-- Java (11)
+#### Run integration tests
+```bash
+poetry run pytest tests/integration
+```
+
+#### Run style checks
+```bash
+poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs \
+  data_transformations tests
 
-You could use following instructions as guidelines to install Docker or Colima and Java.
+poetry run pylint data_transformations tests
+```
+This is running the linter and a type checker.
+## Create package
+This will create a `tar.gz` and a `.wheel` in `dist/` folder:
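+```bash
+poetry build
+```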
 ```bash
 # Install pre-requisites needed by batect
 # For mac users:
 ./go.sh install-with-docker-desktop
 OR
 ./go.sh install-with-colima
 
 # For windows/linux users:
 # Please ensure Docker and java >=8 is installed
 scripts\install_choco.ps1
 scripts\install.bat
@@ -34,56 +51,7 @@ scripts\install.bat
 # For local laptop setup ensure that Java 11 with Spark 3.5.1 is available. More details in README-LOCAL.md
 ```
-
-> **If you are using Colima, please ensure that you start Colima. For staring Colima, you could use following command:**
-
-`./go.sh start-colima`
-
-> **Please install poetry if you would like to use lint command. Instructions to install poetry in [README-LOCAL](README-LOCAL.md) **
-
-## List of commands
-
-General pattern apart from installation and starting of Colima is:
-
-`./go.sh run-<type>-<action>`
-
-type could be local, colima or docker-desktop
-
-action could be unit-test, integration-test or job.
-
-Full list of commands for Mac and Linux users is as follows:
-
-| S.No. | Command                                      | Action                                                                                          |
-| :---: | :------------------------------------------ | :---------------------------------------------------------------------------------------------- |
-| 1     | ./go.sh lint                                 | Static analysis, code style, etc. (please install poetry if you would like to use this command) |
-| 2     | ./go.sh linting                              | Static analysis, code style, etc. (please install poetry if you would like to use this command) |
-| 3     | ./go.sh install-with-docker-desktop          | Install the application requirements along with docker desktop                                  |
-| 4     | ./go.sh install-with-colima                  | Install the application requirements along with colima                                          |
-| 5     | ./go.sh start-colima                         | Start Colima                                                                                    |
-| 6     | ./go.sh run-local-unit-test                  | Run unit tests on local machine                                                                 |
-| 7     | ./go.sh run-colima-unit-test                 | Run unit tests on containers using Colima                                                       |
-| 8     | ./go.sh run-docker-desktop-unit-test         | Run unit tests on containers using Docker Desktop                                               |
-| 9     | ./go.sh run-local-integration-test           | Run integration tests on local machine                                                          |
-| 10    | ./go.sh run-colima-integration-test          | Run integration tests on containers using Colima                                                |
-| 11    | ./go.sh run-docker-desktop-integration-test  | Run integration tests on containers using Docker Desktop                                        |
-| 12    | ./go.sh run-local-job                        | Run job on local machine                                                                        |
-| 13    | ./go.sh run-colima-job                       | Run job on containers using Colima                                                              |
-| 14    | ./go.sh run-docker-desktop-job               | Run job on containers using Docker Desktop                                                      |
-| 15    | ./go.sh Usage                                | Display usage                                                                                   |
-
-Full list of commands for Windows users is as follows:
-
-| S.No. | Command                                       | Action                                                                                          |
-| :---: | :-------------------------------------------- | :---------------------------------------------------------------------------------------------- |
-| 1     | .\go.ps1 linting                              | Static analysis, code style, etc. (please install poetry if you would like to use this command) |
-| 2     | .\go.ps1 install-with-docker-desktop          | Install the application requirements along with docker desktop                                  |
-| 3     | .\go.ps1 run-local-unit-test                  | Run unit tests on local machine                                                                 |
-| 4     | .\go.ps1 run-docker-desktop-unit-test         | Run unit tests on containers using Docker Desktop                                               |
-| 5     | .\go.ps1 run-local-integration-test           | Run integration tests on local machine                                                          |
-| 6     | .\go.ps1 run-docker-desktop-integration-test  | Run integration tests on containers using Docker Desktop                                        |
-| 7     | .\go.ps1 run-local-job                        | Run job on local machine                                                                        |
-| 8     | .\go.ps1 run-docker-desktop-job               | Run job on containers using Docker Desktop                                                      |
-| 9     | .\go.ps1 Usage                                | Display usage                                                                                   |
+More: https://python-poetry.org/docs/cli/#build
 
 ## Jobs
 
@@ -93,21 +61,17 @@ Currently, these exist as skeletons, and have some initial test cases which are
 For each application, please un-ignore the tests and implement the missing logic.
 
 ### Word Count
-
 A NLP model is dependent on a specific input file. This job is supposed to preprocess a given text file to produce this
 input file for the NLP model (feature engineering). This job will count the occurrences of a word within the given text
-file (corpus).
+file (corpus). 
 
 There is a dump of the datalake for this under `resources/word_count/words.txt` with a text file.
 
 #### Input
-
 Simple `*.txt` file containing text.
 
 #### Output
-
 A single `*.csv` file containing data similar to:
-
 ```csv
 "word","count"
 "a","3"
 "an","5"
 ...
 ```
 
-#### Run the job using Docker Desktop on Mac or Linux
-
+#### Run the job
+Please make sure to package the code before submitting the spark job (`poetry build`)
 ```bash
-JOB=wordcount ./go.sh run-docker-desktop-job
-```
-
-#### Run the job using Docker Desktop on Windows
-
-```bash
-$env:JOB = wordcount
-.\go.ps1 run-docker-desktop-job
-```
-
-#### Run the job using Colima
-
-```bash
-JOB=wordcount ./go.sh run-colima-job
+poetry run spark-submit \
+    --master local \
+    --py-files dist/data_transformations-*.whl \
+    jobs/word_count.py \
+    <input_file_path> \
+    <output_path>
 ```
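+
+For example, using the bundled sample corpus (the output path here is an arbitrary choice):
+
+```bash
+poetry run spark-submit \
+    --master local \
+    --py-files dist/data_transformations-*.whl \
+    jobs/word_count.py \
+    resources/word_count/words.txt \
+    output/word_count
+```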
 
 ### Citibike
@@ -170,23 +126,15 @@ Historical bike ride `*.csv` file:
 ...
 ```
 
-##### Run the job using Docker Desktop on Mac or Linux
-
-```bash
-JOB=citibike_ingest ./go.sh run-docker-desktop-job
-```
-
-##### Run the job using Docker Desktop on Windows
-
-```bash
-$env:JOB = citibike_ingest
-.\go.ps1 run-docker-desktop-job
-```
-
-##### Run the job using Colima
-
+##### Run the job
+Please make sure to package the code before submitting the spark job (`poetry build`)
 ```bash
-JOB=citibike_ingest ./go.sh run-colima-job
+poetry run spark-submit \
+    --master local \
+    --py-files dist/data_transformations-*.whl \
+    jobs/citibike_ingest.py \
+    <input_file_path> \
+    <output_path>
 ```
 
 #### Distance calculation
@@ -219,29 +167,19 @@ Historical bike ride `*.parquet` files
 ...
 ```
 
 ##### Run the job
-
-##### Run the job using Docker Desktop on Mac or Linux
-
-```bash
-JOB=citibike_distance_calculation ./go.sh run-docker-desktop-job
-```
-
-##### Run the job using Docker Desktop on Windows
-
-```bash
-$env:JOB = citibike_distance_calculation
-.\go.ps1 run-docker-desktop-job
-```
-
-##### Run the job using Colima
-
+Please make sure to package the code before submitting the spark job (`poetry build`)
 ```bash
-JOB=citibike_distance_calculation ./go.sh run-colima-job
+poetry run spark-submit \
+    --master local \
+    --py-files dist/data_transformations-*.whl \
+    jobs/citibike_distance_calculation.py \
+    <input_file_path> \
+    <output_path>
 ```
 
-## Running the code outside container
+## Running the code inside container
 
-If you would like to run the code in your laptop locally without containers then please follow instructions [here](README-LOCAL.md).
+If you would like to run the code in Docker, please follow instructions [here](README-DOCKER.md).
 
 ## Running the code on Gitpod