
MrTweety Analytic


Analyses tweets in real time to figure out the most popular hashtags.

Uses Apache Spark and the Twitter Hosebird Client to compute the top 5 hashtags in a 15-minute sliding window. The result is stored in a JSON file, which a small JavaScript app uses to update a simple website.

See the site live at feud.eu.

Docker

Building

The app is designed to run as a single Docker container.

To build all the jars and generate the Dockerfile, use:

./gradlew build

Then build the Docker image:

docker build -t mrtweety-analytic .

Running

To run the Docker container, use your favourite Docker tool, or simply:

docker run --name mrtweety-analytic -d \
    -p 9090:80 \
    -e TWITTER_CONSUMER_KEY=xxx \
    -e TWITTER_CONSUMER_SECRET=xxx \
    -e TWITTER_ACCESS_TOKEN=xxx \
    -e TWITTER_ACCESS_TOKEN_SECRET=xxx \
    mrtweety-analytic

Replace xxx with your Twitter API credentials. Then point your browser to localhost:9090 to see the result.

The Docker container packages together the Kafka producer, the Spark app, and the JavaScript website.

About

Kafka producer

The producer uses the Twitter Hosebird Client (HBC) to take tweets from Twitter's streaming API and publish them to the Kafka messaging system.
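
Below is a minimal sketch of what this HBC-to-Kafka pipe can look like, not the module's actual code: the "tweet" topic name and the TWITTER_* environment variables come from this README, while the class name, the sample endpoint, and the broker address are illustrative assumptions.

// Minimal sketch of the HBC-to-Kafka pipe (not the module's actual code).
// The "tweet" topic matches the console-consumer command below; the class
// name, endpoint choice and broker address are assumed for illustration.
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesSampleEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.BasicClient;
import com.twitter.hbc.httpclient.auth.OAuth1;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TweetProducer {                       // hypothetical class name
    public static void main(String[] args) throws InterruptedException {
        // HBC delivers raw tweet JSON strings into this queue.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);
        OAuth1 auth = new OAuth1(
                System.getenv("TWITTER_CONSUMER_KEY"),
                System.getenv("TWITTER_CONSUMER_SECRET"),
                System.getenv("TWITTER_ACCESS_TOKEN"),
                System.getenv("TWITTER_ACCESS_TOKEN_SECRET"));
        BasicClient client = new ClientBuilder()
                .hosts(Constants.STREAM_HOST)
                .endpoint(new StatusesSampleEndpoint())
                .authentication(auth)
                .processor(new StringDelimitedProcessor(queue))
                .build();
        client.connect();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (!client.isDone()) {
                String tweetJson = queue.take();   // blocks until a tweet arrives
                producer.send(new ProducerRecord<>("tweet", tweetJson));
            }
        } finally {
            client.stop();
        }
    }
}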

For Kafka to run, you first need to start ZooKeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties

then start Kafka itself:

bin/kafka-server-start.sh config/server.properties

The following command is useful for inspecting messages passing through Kafka in the console:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweet

Spark app

A single-node Spark app that receives data from Kafka and processes it in real time using Spark Streaming. The processed result is stored in a JSON file every 10 seconds.
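
For illustration, here is a minimal sketch of such a windowed top-hashtag computation, not the actual SparkApplication: it assumes the spark-streaming-kafka-0-10 integration, a broker at localhost:9092, and naive whitespace tokenization of the raw tweet JSON; serializing the top 5 to the JSON file is elided.

// Minimal sketch of a windowed hashtag count (not the actual SparkApplication).
// Assumes spark-streaming-kafka-0-10 and a broker at localhost:9092; tokenizes
// the raw tweet JSON by whitespace for brevity instead of parsing it properly.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import scala.Tuple2;

import java.util.*;

public class HashtagTopN {                          // hypothetical class name
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("mrtweety-analytic");
        // 10-second batches match the 10-second result updates described above.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "mrtweety");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("tweet"), kafkaParams));

        // Count hashtags over a 15-minute window, sliding every 10 seconds.
        JavaPairDStream<String, Integer> counts = stream
                .flatMap(record -> Arrays.asList(record.value().split("\\s+")).iterator())
                .filter(word -> word.startsWith("#"))
                .mapToPair(tag -> new Tuple2<>(tag, 1))
                .reduceByKeyAndWindow(Integer::sum, Durations.minutes(15), Durations.seconds(10));

        // Take the top 5 per batch; writing them to the JSON file is elided here.
        counts.foreachRDD(rdd -> {
            List<Tuple2<Integer, String>> top =
                    rdd.mapToPair(Tuple2::swap).sortByKey(false).take(5);
            System.out.println(top);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}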

Build a "fat jar" like this:

./gradlew shadowJar

Optionally, you can change the default path of the result JSON file by setting the RESULT_FILENAME environment variable, e.g.:

export RESULT_FILENAME=/tmp/analytic.json
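
Inside the app, the variable can be read in the usual way. A hypothetical sketch (the actual default path used by SparkApplication is not documented here):

// Hypothetical sketch of resolving the result path; "analytic.json" is an
// illustrative fallback, not the app's documented default.
String resultFilename = System.getenv("RESULT_FILENAME");
if (resultFilename == null) {
    resultFilename = "analytic.json";
}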

Finally, submit the "fat jar" to Spark for execution:

spark-submit --class "cz.zee.mrtweety.analytic.spark.SparkApplication" --master 'local[4]' spark/build/libs/spark-all.jar

© 2017 Jakub Horák