
# pv-apache-beam

This project demonstrates how to set up a data pipeline that processes multiple files with Apache Beam, using a basic word count and Solardatatools as examples. It shows how to set up and run the pipeline both on your local machine and on Google Cloud Platform.

## Installation and environment setup

Set up a Python virtual environment and install the Python dependencies. In the project folder, run the commands below:

```shell
python3 -m venv venv
source ./venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Create a `.env` file and set the location of your Google Cloud Platform credential JSON file:

```shell
export GOOGLE_APPLICATION_CREDENTIALS='your-credential-location'
```
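A missing or mistyped credential path usually only surfaces once the pipeline tries to talk to GCP. As a sketch, a pipeline script could fail fast with a check like the following (the helper name `require_credentials` is hypothetical, not part of this project):

```python
import os


def require_credentials():
    """Fail fast if GOOGLE_APPLICATION_CREDENTIALS is unset or does not
    point at an existing file."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.isfile(path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set to an existing file; "
            "create the .env file and `source .env` before running the pipeline."
        )
    return path
```

Remember that `export` in `.env` only affects the shell it is sourced in, so run `source .env` in the same terminal you launch the pipeline from.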

## Examples

### wordcount-example

#### Run the example on your local machine

Under the wordcount-example folder, run the command below:

```shell
python -m wordcount --runner DirectRunner \
  --input ./kinglear-1.txt \
  --output ./results
```
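Conceptually, the word-count pipeline tokenizes each input line and aggregates counts per word; Beam simply distributes that work across bundles of lines. A minimal, non-Beam sketch of the same transform-and-aggregate logic (the function name and regex are illustrative, not taken from the `wordcount` module):

```python
import re
from collections import Counter


def count_words(lines):
    """Tokenize each line and count word occurrences -- the same
    transform-and-aggregate step the Beam pipeline runs in parallel."""
    words = (w.lower() for line in lines for w in re.findall(r"[A-Za-z']+", line))
    return Counter(words)
```

With `DirectRunner`, Beam executes this kind of logic locally, which makes it a convenient way to validate a pipeline before moving it to GCP.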

#### Run the example on GCP

### solardatatools-onefile-example

Under the solardatatools-onefile-example folder, run the command below:

```shell
python -m main --runner DirectRunner \
  --input ./data \
  --output ./parallelism/results \
  --direct_num_workers 0
```
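Note that the command mixes two kinds of flags: `--input`/`--output` belong to the example script, while `--runner` and `--direct_num_workers` are consumed by Beam's pipeline options (`--direct_num_workers 0` lets the DirectRunner use all available cores). A plausible sketch of how a `main.py` could separate the two, using `parse_known_args` (the actual module may parse its flags differently):

```python
import argparse


def split_args(argv):
    """Separate this script's own flags from the ones Beam's
    PipelineOptions consumes (--runner, --direct_num_workers, ...)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    known, beam_args = parser.parse_known_args(argv)
    return known, beam_args
```

The leftover `beam_args` list can then be handed to Beam's `PipelineOptions` when the pipeline is constructed.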

### solardatatools-custom-package-example

In this example, the Solardatatools logic is packaged as a custom package named `transformers`, which you have to install first. Under the solardatatools-custom-package-example folder, run the command below:

```shell
pip install ./transformers
```

Then you can run the pipeline:

```shell
python -m main --runner DirectRunner \
  --input ./kinglear-1.txt \
  --output ./results
```

#### Run on your local machine

## Reference