spark mobility

Preprint available at https://georgheiler.com/publication/ieee-bigdata-19/

Scalable mobility analytics toolkit in spark.

Capabilities:

open street map data is loaded
various POI enrichment methods are compared with regards to efficiency

POI enrichment

Comparison of various distributed implementations of a spatial join for point of interest enrichment.

Results:

As seen in Figure for large enough quantities of data, the locality preserving geospark join (3) is faster than the non-preserving approach (2). This is particularly surprising as the amount of data being shuffled is larger in the first case as an explode, left join and aggregation happens for the locality preserving distributed geospark join. In all cases the custom implementation (1) using a map-side broadcast join is optimal.

development

requirements

jdk8
apache-spark 2.2.3 on the path
in case of windows winutils (see description below)
make (if not available you need to copy / paste and adapt some commands)
git

basics

To get up and running with this project you need to follow the steps both for local development and on your cluster. Next, load the data as described in the section data generation

locally

Download any spark 2.2.x build from spark's website and make sure to extract it and have it available on the path, i.e. that spark-shell and spark-submit are working fine.

in the cluster

Assuming you are using one of the standard clusters from distributors like Cloudera you should already have everything set up.

In my case for HDP 2.6.x I need to explicitly switch to spark 2.x using:

export SPARK_MAJOR_VERSION=2

Initialization script set up correctly to use enterprise artifact store, in particular use the file at Code/build/init.gradle
optionally it would be great if you have a T-Mobile Austria DevBox environment available HDP2.6.4 or later including HDF3.1 or later, secured via kerberos to have a similar environment

git clone [email protected]:complexity-science-hub/distributed-POI-enrichment.git
cd spark-mobility
make build # will also execute test cases

# have a look at one of the outputted jars
ls benchmarking-spark/build/libs

make v # outputs the current version

# make changes to the code & validate test cases & commit results
make test

make version # if version previously was a release now points to a snapshot
make release # create a release in git repository (tag)
make publish # publish artifacts to artifact store

additional tasks in makefile are:

reformat-code

Other useful things:

dependency to debug dependency hell use gradles dependencyInsight task.

In a kerberos enabled environment you must already provide a keytab:

kinit -kt /etc/security/keytabs/my.keytab myprincipal

data generation

See data generation details

Interactivity

Interactive development experience can be achieved by executing:

make replSparkShell # full power of spark configuration properties in an interactive REPL

this will open an interactive shell which starts a (local) spark session and has the project on the classpath.

development on windows

A make build will not succeed in running the test cases on windows unless hadoop windows native binaries, winutils, are installed. Also correct permissions like outlined here are required to be configured on your local windows developer notebook.

Also configure your computer with the proxy settings available in the Code/build folder.

release a new version

execute the following steps to release a new version. This happens automatically when a merge request is merged to master branch.

A new tag is created automatically and releases are published.

make clean
make build
make test
make release
make publish

to release a specific version / introduce a new major version use:

make releaseVersion=1.0.0 release-major

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
benchmark-spark		benchmark-spark
common-modules		common-modules
configuration		configuration
data-generator		data-generator
environment		environment
gradle/wrapper		gradle/wrapper
img		img
publication		publication
results		results
.gitignore		.gitignore
.scalafmt.conf		.scalafmt.conf
ExampleDiagnoseCountCode.ipynb		ExampleDiagnoseCountCode.ipynb
LICENSE.TXT		LICENSE.TXT
Makefile		Makefile
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
replInput		replInput
scalastyle-config.xml		scalastyle-config.xml
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spark mobility

POI enrichment

development

requirements

basics

locally

in the cluster

data generation

Interactivity

development on windows

release a new version

About

Releases 2

Packages

Languages

License

complexity-science-hub/distributed-POI-enrichment

Folders and files

Latest commit

History

Repository files navigation

spark mobility

POI enrichment

development

requirements

basics

locally

in the cluster

data generation

Interactivity

development on windows

release a new version

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages