
Streamline Hackathon Boilerplate for GDELT 1.0 Event Database

This repository contains boilerplate Java/Scala code for Apache Flink and Apache Spark that parses and streams the GDELT 1.0 Event Database [1]. It also includes simple aggregation examples on the data.

Run The Boilerplate (Option 1)

You may run the code from your favorite IDE. You just need to select a class with a static main method as the entry point. Whether you use Flink or Spark, the selected processing engine will be launched as an embedded local component. This approach is recommended for development and testing.
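For orientation, here is a minimal sketch of what such an entry point might look like, assuming the Flink Java DataStream API; the class name and the country filter below are illustrative and do not correspond to the repository's actual classes:

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalGdeltJobSketch {

    public static void main(String[] args) throws Exception {
        // Read the same --path and --country arguments used in the run commands below.
        ParameterTool params = ParameterTool.fromArgs(args);
        String path = params.getRequired("path");
        String country = params.get("country", "USA");

        // When started from an IDE, getExecutionEnvironment() launches an
        // embedded local Flink instance instead of contacting a cluster.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // GDELT 1.0 event records are tab-separated lines; this filter is a
        // simple illustration, not the repository's actual parsing logic.
        DataStream<String> lines = env.readTextFile(path);
        lines.filter(line -> line.contains("\t" + country + "\t"))
             .print();

        env.execute("GDELT local sketch");
    }
}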

Run The Boilerplate (Option 2)

Alternatively, you can deploy the job on a local Flink/Spark cluster launched on your machine. To do so, first compile the code by executing the following in the root directory of this repository:

mvn clean package

Apache Flink

After that, you need to submit the job to the Flink JobManager. Please make sure that a standalone (or cluster) version of Flink is running on your machine as explained here [2]. Briefly, you need to start Flink by executing:

/path/to/flink/root/bin/start-local.sh  # Start Flink

Then you can run the long-running jobs:

# Java Job
/path/to/flink/root/bin/flink run \
hackathon-flink-java/target/hackathon-flink-java-0.1-SNAPSHOT.jar \
--path /path/to/data/180-days.csv --country USA

# Scala Job
/path/to/flink/root/bin/flink run \
hackathon-flink-scala/target/hackathon-flink-scala-0.1-SNAPSHOT.jar \
--path /path/to/data/180-days.csv --country USA

Please note that these jobs will run indefinitely. In order to shut down the execution, you need to run

/path/to/flink/root/bin/flink cancel <jobID>

as explained here [3]. You can look up the <jobID> with the flink list command or in the Flink web dashboard.

Apache Spark

After that, you need to submit the job to the Spark cluster. First, you need to start Spark in standalone mode on your local machine as explained here [4]. A quick way to start Spark in standalone mode is to run the following command:

/path/to/spark/root/sbin/start-all.sh # Start Spark

To run a jar file, you need the Spark master URL, which you can find on the master's web UI (by default http://localhost:8080).

Then you can run the long-running jobs:

# Java Job
/path/to/spark/root/bin/spark-submit \
--master spark://<host>:<port> \
--class eu.streamline.hackathon.spark.job.SparkJavaJob \
hackathon-spark-java/target/hackathon-spark-java-0.1-SNAPSHOT.jar \
--path /path/to/data/180-days.csv \
--micro-batch-duration 5000 \
--country USA

# Scala Job
/path/to/spark/root/bin/spark-submit \
--master spark://<host>:<port> \
--class eu.streamline.hackathon.spark.scala.job.SparkScalaJob \
hackathon-spark-scala/target/hackathon-spark-scala-0.1-SNAPSHOT.jar \
--path /path/to/data/180-days.csv \
--micro-batch-duration 5000 \
--country USA
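The --micro-batch-duration argument is given in milliseconds and presumably ends up as the Spark Streaming batch interval. Below is a minimal sketch of that wiring, assuming the Spark Streaming Java API; the class name and the hard-coded values are illustrative, not the repository's actual code:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkJobSketch {

    public static void main(String[] args) throws Exception {
        // In this sketch the duration is hard-coded; the boilerplate reads it
        // from the --micro-batch-duration argument instead.
        long batchMs = 5000L;

        SparkConf conf = new SparkConf().setAppName("GDELT micro-batch sketch");

        // The micro-batch duration is the interval at which Spark Streaming
        // cuts the incoming data into small batches.
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.milliseconds(batchMs));

        // Watch a directory for new GDELT files and print a record count per
        // batch (purely illustrative; the boilerplate processes the CSV differently).
        ssc.textFileStream("/path/to/data/")
           .count()
           .print();

        ssc.start();
        ssc.awaitTermination();
    }
}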

To suppress the logs while the Spark program is running, simply rename

/path/to/spark/root/conf/log4j.properties.template 
to 
/path/to/spark/root/conf/log4j.properties

And change the line:

log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console

Please note that these jobs will run indefinitely. In order to shut down the execution, you can click the kill button on the master's web UI (by default http://localhost:8080).

References

[1] GDELT Project: https://www.gdeltproject.org

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/quickstart/setup_quickstart.html

[3] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/cli.html

[4] https://spark.apache.org/docs/2.2.0/spark-standalone.html
