konrads/spark-etl-demo

spark-etl-demo

Demo application for spark-etl library.

Build status (master): Build Status

Suggested BA-Dev-Ops interactions

BA:

The BA works off the sql directory in the BA branch, populating only the app.yaml config and the resource SQL files. When done, the BA notifies the developer.

Dev:

The developer merges the BA branch into master, optionally using utilities that massage the SQL for production (e.g. via sbt stripPrefixes). The developer then initiates the build and ensures the sbt test target passes, since it validates the config and SQL as well as running any code tests. This sbt target should also run in CI; see .travis.yml. The code can then be built and published to the Spark cluster environment.
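The developer steps above could be scripted roughly as follows. This is a minimal sketch and not part of the repository: the branch names `BA` and `master` as literal git branch names, the `--no-ff` merge style, and the `promote_ba_branch` function name are all assumptions. `sbt stripPrefixes` is only run on request, since the text describes it as optional.

```shell
#!/bin/sh
# Hypothetical helper: merge the BA branch into master and gate on `sbt test`.
# Branch names and merge flags are assumptions; adjust to your repository.
promote_ba_branch() {
  ba_branch="${1:-BA}"        # assumed name of the BA working branch
  strip_prefixes="${2:-no}"   # pass "yes" to run `sbt stripPrefixes` first

  git checkout master || return 1
  git merge --no-ff "$ba_branch" || return 1

  if [ "$strip_prefixes" = "yes" ]; then
    sbt stripPrefixes || return 1
  fi

  # Per the README, `sbt test` validates config/SQL as well as code tests.
  sbt test || return 1
}

# Example (not executed here):
#   promote_ba_branch BA yes
```

The function stops at the first failing step, so a failed merge or a failed `sbt test` prevents anything later from running.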

This project uses the spark-etl library; see that project for more information.

Note: to reduce the size of the fat jar, unused dependencies (joda-time & co) have been identified and pruned with the help of the net.virtual-void:sbt-dependency-graph plugin:

sbt dependencyTree

Ops:

Once the releasable artefacts are deployed to the Spark cluster environment, run them with run.sh:

> ./run.sh
  Usage:
    help
    validate-local
    validate-remote
    transform-load
    extract-check
    transform-check

Always start by running validations:

> # check config and SQL
> ./run.sh -Denv.path=<root> validate-local

> # check hdfs paths, other remote connectivity
> ./run.sh -Denv.path=<root> validate-remote

To run transform and persist results:

> ./run.sh -Denv.path=<root> transform-load
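Because validations should always run first, the commands above can be chained so that `transform-load` only starts once both validations have passed. A minimal sketch, assuming `run.sh` exits with a non-zero status when a step fails (the README does not state this); the `validated_transform_load` function and the `RUN_SH` variable are hypothetical names.

```shell
#!/bin/sh
# Hypothetical wrapper: run both validations, then transform-load.
# Assumes run.sh returns a non-zero exit status when a step fails.
RUN_SH="${RUN_SH:-./run.sh}"

validated_transform_load() {
  env_root="$1"   # value passed to run.sh as -Denv.path=<root>
  if [ -z "$env_root" ]; then
    echo "usage: validated_transform_load <root>" >&2
    return 2
  fi
  for step in validate-local validate-remote transform-load; do
    echo "running: $step"
    "$RUN_SH" "-Denv.path=$env_root" "$step" || {
      echo "step failed: $step" >&2
      return 1
    }
  done
}

# Example (not executed here):
#   validated_transform_load /path/to/root
```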

To fetch the YARN logs after the job has run, set PACKAGE_LOGS=true. Note: not for production!

> export PACKAGE_LOGS=true
> ./run.sh -Denv.path=<root> transform-load

> # list zipped up logs (from driver and from the cluster)
> ls logs/current
logs_application_XXXXXXXXXXXXX_YYYYYY.zip

> cd logs/current
> unzip logs_application_XXXXXXXXXXXXX_YYYYYY.zip
Archive:  logs_application_XXXXXXXXXXXXX_YYYYYY.zip
  inflating: application_XXXXXXXXXXXXX_YYYYYY.local.log
  inflating: application_XXXXXXXXXXXXX_YYYYYY.remote.log
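When several runs have been packaged, it can help to pick the newest archive and scan the extracted logs quickly. The helpers below are a sketch only: the function names are hypothetical, and matching on the string `ERROR` assumes the logs use a conventional log-level prefix, which the README does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: print the newest packaged log archive, if any.
latest_log_zip() {
  log_dir="${1:-logs/current}"
  # ls -t sorts by modification time, newest first.
  newest=$(ls -t "$log_dir"/logs_application_*.zip 2>/dev/null | head -n 1)
  [ -n "$newest" ] && echo "$newest"
}

# Hypothetical helper: count lines containing ERROR across the extracted
# *.local.log and *.remote.log files (assumes that marker is used).
count_log_errors() {
  log_dir="${1:-logs/current}"
  cat "$log_dir"/application_*.log 2>/dev/null | grep -c "ERROR"
}

# Example (not executed here):
#   unzip "$(latest_log_zip)" -d logs/current && count_log_errors
```

Note that `grep -c` still prints `0` when nothing matches, but returns a non-zero exit status in that case.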

Lineage

The following lineage/dependency graph was generated via:

sbt genDot
# after graphviz install, eg. brew install graphviz
dot -Tgif src/main/lineage/lineage.dot -o src/main/lineage/lineage.gif

dot-lineage

Legend
  • ellipse = extract
  • component = transform
  • cylinder = load
  • full line = relationship configured explicitly
  • dotted line = relationship derived from SQL
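To see how the legend maps onto Graphviz attributes, the sketch below writes a tiny hand-made DOT file that uses the same shapes and line styles. The node names (`client`, `item`, `client_spending`, `spending_out`), the choice of which edges are solid or dotted, and the output path are invented for illustration; the file that `sbt genDot` actually emits may be structured differently.

```shell
#!/bin/sh
# Write an illustrative DOT file; all node names are hypothetical.
write_sample_lineage() {
  out="${1:-sample-lineage.dot}"
  cat > "$out" <<'EOF'
digraph lineage {
  client          [shape=ellipse];    // extract
  item            [shape=ellipse];    // extract
  client_spending [shape=component];  // transform
  spending_out    [shape=cylinder];   // load

  client -> client_spending [style=dotted];       // derived from SQL
  item -> client_spending [style=dotted];         // derived from SQL
  client_spending -> spending_out [style=solid];  // configured explicitly
}
EOF
}

# Example (not executed here; rendering requires graphviz):
#   write_sample_lineage sample-lineage.dot
#   dot -Tgif sample-lineage.dot -o sample-lineage.gif
```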
