Preprint available at https://georgheiler.com/publication/ieee-bigdata-19/
A scalable mobility analytics toolkit for Apache Spark.
Capabilities:
- OpenStreetMap (OSM) data is loaded
- various point-of-interest (POI) enrichment methods are compared with regard to efficiency
Comparison of various distributed implementations of a spatial join for POI enrichment.
Results:
As the figure shows, for large enough quantities of data the locality-preserving GeoSpark join (3) is faster than the non-preserving approach (2). This is particularly surprising because the locality-preserving distributed GeoSpark join shuffles more data: it requires an explode, a left join, and an aggregation. In all cases the custom implementation (1), which uses a map-side broadcast join, is optimal.
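For illustration, the sketch below shows what such a map-side broadcast join can look like. It is a minimal, self-contained example and not the repository's actual implementation: the POI polygon, event data, and class names are made up, and it assumes the POI geometries fit into driver memory (with JTS on the classpath) so an STRtree index can be broadcast to every executor, letting each partition enrich its points locally without a shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree

import scala.collection.JavaConverters._

object BroadcastPoiJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-poi-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val factory = new GeometryFactory()

    // small side: one illustrative POI polygon (a box roughly around central Vienna)
    def box(minX: Double, minY: Double, maxX: Double, maxY: Double): Geometry =
      factory.createPolygon(Array(
        new Coordinate(minX, minY), new Coordinate(maxX, minY),
        new Coordinate(maxX, maxY), new Coordinate(minX, maxY),
        new Coordinate(minX, minY)))
    val pois: Seq[(String, Geometry)] = Seq(("vienna-center", box(16.30, 48.15, 16.45, 48.25)))

    // index the POI polygons once on the driver, then broadcast the index
    val index = new STRtree()
    pois.foreach { case (name, geom) => index.insert(geom.getEnvelopeInternal, (name, geom)) }
    index.build()
    val bcIndex = spark.sparkContext.broadcast(index)

    // large side: the mobility events (id, lon, lat)
    val events = Seq((1L, 16.37, 48.21), (2L, 11.40, 47.27)).toDS()

    // map-side enrichment: each partition matches its points locally, no shuffle
    val enriched = events.flatMap { case (id, lon, lat) =>
      val point = factory.createPoint(new Coordinate(lon, lat))
      bcIndex.value
        .query(point.getEnvelopeInternal)
        .asInstanceOf[java.util.List[(String, Geometry)]]
        .asScala
        .collect { case (name, geom) if geom.contains(point) => (id, name) }
    }

    enriched.show() // expected: only event 1 is tagged with "vienna-center"
    spark.stop()
  }
}
```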
Prerequisites:
- JDK 8
- Apache Spark 2.2.3 on the PATH
- on Windows: winutils (see the description below)
- make (if unavailable, you need to copy, paste, and adapt some commands)
- git
To get up and running with this project, follow the steps below both for local development and on your cluster. Then load the data as described in the data generation section.
Download any Spark 2.2.x build from the Spark website, extract it, and make sure it is available on the PATH, i.e. that spark-shell and spark-submit work fine.
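A quick way to confirm the setup is to check the session version inside spark-shell:

```scala
// inside spark-shell: the pre-built `spark` session should report a 2.2.x version
println(spark.version)
```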
Assuming you are using one of the standard clusters from distributors like Cloudera, you should already have everything set up. In my case, for HDP 2.6.x, I need to explicitly switch to Spark 2.x using:
export SPARK_MAJOR_VERSION=2
- an initialization script set up correctly to use the enterprise artifact store; in particular, use the file at Code/build/init.gradle
- optionally, a T-Mobile Austria DevBox environment (HDP 2.6.4 or later, including HDF 3.1 or later, secured via Kerberos) to have a similar environment
git clone [email protected]:complexity-science-hub/distributed-POI-enrichment.git
cd spark-mobility
make build # will also execute test cases
# have a look at one of the generated jars
ls benchmarking-spark/build/libs
make v # outputs the current version
# make changes to the code & validate test cases & commit results
make test
make version # if the version was previously a release, it now points to a snapshot
make release # create a release in git repository (tag)
make publish # publish artifacts to artifact store
Additional tasks in the Makefile are:
- reformat-code
Other useful things:
- to debug dependency hell, use Gradle's dependencyInsight task, e.g. gradle dependencyInsight --dependency <artifact-name>
In a Kerberos-enabled environment you must first authenticate using your keytab:
kinit -kt /etc/security/keytabs/my.keytab myprincipal
An interactive development experience can be achieved by executing:
make replSparkShell # full power of spark configuration properties in an interactive REPL
This opens an interactive shell which starts a (local) Spark session and has the project on the classpath.
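As an illustrative example (the project's own classes are omitted here), the shell can be used for quick smoke tests against that session:

```scala
// confirm the local session and its master setting
println(spark.version)
println(spark.sparkContext.master)

// a minimal DataFrame smoke test
spark.range(5).toDF("id").show()
```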
A make build will not succeed in running the test cases on Windows unless the Hadoop native Windows binaries (winutils) are installed. Additionally, the correct permissions, as outlined here, need to be configured on your local Windows developer notebook.
Also configure your computer with the proxy settings available in the Code/build folder.
Execute the following steps to release a new version. This happens automatically when a merge request is merged to the master branch.
A new tag is created automatically and releases are published.
make clean
make build
make test
make release
make publish
To release a specific version or introduce a new major version, use:
make releaseVersion=1.0.0 release-major