Demo application for spark-etl library.
The BA works off the sql directory in the BA branch, populating only the app.yaml config and the resource SQLs. When done, the BA notifies the developer.
The developer merges the BA branch into master, potentially using utilities that massage SQL for production (e.g. via sbt stripPrefixes).
The developer initiates the build, ensuring that the sbt test target passes, as it validates config and SQL as well as any code tests. This sbt target should also be run within CI; see .travis.yml.
The code can then be built and published to the Spark cluster environment.
This project uses the spark-etl library; please look there for more information.
Note: to reduce the size of the fat jar, unused dependencies (joda-time & co) have been identified and pruned thanks to the net.virtual-void:sbt-dependency-graph plugin:
sbt dependencyTree
Once the releasable artefacts are deployed to the Spark cluster environment, run them with run.sh:
> ./run.sh
Usage:
help
validate-local
validate-remote
transform-load
extract-check
transform-check
Always start by running validations:
> # check config and SQL
> ./run.sh -Denv.path=<root> validate-local
> # check hdfs paths, other remote connectivity
> ./run.sh -Denv.path=<root> validate-remote
To run transform and persist results:
> ./run.sh -Denv.path=<root> transform-load
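The validate-then-run order above can be wrapped in a small helper so that transform-load never starts after a failed validation. This is a minimal sketch, not part of the project: it assumes run.sh exits with a non-zero status when a step fails (this README does not confirm that), and the names run_all and RUNNER are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes run.sh returns a non-zero exit status when a step fails.
# RUNNER and run_all are illustrative names, not part of this project.
RUNNER="${RUNNER:-./run.sh}"

run_all() {
    env_path="$1"
    for step in validate-local validate-remote transform-load; do
        echo "running: $step"
        # stop at the first failing step so transform-load never runs unvalidated
        "$RUNNER" "-Denv.path=$env_path" "$step" || {
            echo "step failed: $step" >&2
            return 1
        }
    done
}
```

After sourcing the file, call run_all with the same value you would pass as env.path, for example run_all /path/to/root.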
To fetch yarn logs after the job is run, set PACKAGE_LOGS=true. Note: not for production!
> export PACKAGE_LOGS=true
> ./run.sh -Denv.path=<root> transform-load
> # list zipped up logs (from driver and from the cluster)
> ls logs/current
logs_application_XXXXXXXXXXXXX_YYYYYY.zip
> cd logs/current
> unzip logs_application_XXXXXXXXXXXXX_YYYYYY.zip
Archive: logs_application_XXXXXXXXXXXXX_YYYYYY.zip
inflating: application_XXXXXXXXXXXXX_YYYYYY.local.log
inflating: application_XXXXXXXXXXXXX_YYYYYY.remote.log
The following lineage/dependency graph was generated via:
sbt genDot
# after graphviz install, eg. brew install graphviz
dot -Tgif src/main/lineage/lineage.dot -o src/main/lineage/lineage.gif
- ellipse = extract
- component = transform
- cylinder = load
- full line = relationship configured explicitly
- dotted line = relationship derived from SQL
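To see how the legend maps onto graphviz attributes, the sketch below writes a tiny hand-made dot file that uses the same shapes and line styles. The node names (client_raw, client_clean, client_out), the LEGEND_DOT variable and the choice of which edge is dotted are all made up for illustration; the actual lineage.dot produced by sbt genDot may be laid out differently.

```shell
#!/bin/sh
# Sketch only: node names and the output path are hypothetical;
# the real lineage.dot produced by genDot may differ.
LEGEND_DOT="${LEGEND_DOT:-legend.dot}"

cat > "$LEGEND_DOT" <<'EOF'
digraph legend {
    client_raw   [shape=ellipse];    // extract
    client_clean [shape=component];  // transform
    client_out   [shape=cylinder];   // load
    client_raw -> client_clean [style=dotted];  // relationship derived from SQL
    client_clean -> client_out [style=solid];   // relationship configured explicitly
}
EOF

# render only when graphviz is installed; a render failure is reported, not fatal
if command -v dot >/dev/null 2>&1; then
    dot -Tgif "$LEGEND_DOT" -o "${LEGEND_DOT%.dot}.gif" || echo "dot render failed" >&2
fi
```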