Kafka-R: Real-time Prediction

This tutorial explains how a machine learning model is applied to real-time data. The pipeline predicts on incoming data and retrains the model when the prediction quality drops. It focuses on simplicity and can serve as a baseline for similar projects. You can read more about it in my blog article: Apache Kafka and R: Real-Time Prediction and Model (Re)training.

Prerequisites

Data Flow

Kafka Producer

Let's go through the individual parts of the data flow. A Kafka Producer continuously writes simulated fish data to two Kafka topics: size measurements go to machine-measurement and weights go to machine-weight.
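To make the producer's output concrete, here is a minimal Python sketch of how one simulated fish could be split into two messages, one per topic. It is an illustration, not the repository's producer: the field names (`fish_id`, `length_cm`, `weight_g`), the value ranges, and the linear length-to-weight relation are all assumptions, and no Kafka client is involved; the function only returns the (topic, key, value) tuples a producer would send.

```python
import json
import random


def simulate_fish(fish_id, rng):
    """Return a (measurement, weight) pair of records sharing one id.

    The length range and the linear length-to-weight relation are made up
    for illustration; the real producer may use different fields and values.
    """
    length_cm = round(rng.uniform(20.0, 60.0), 1)
    noise = rng.gauss(0.0, 15.0)
    weight_g = round(max(0.0, 25.0 * length_cm - 300.0 + noise), 1)
    measurement = {"fish_id": fish_id, "length_cm": length_cm}
    weight = {"fish_id": fish_id, "weight_g": weight_g}
    return measurement, weight


def to_topic_messages(fish_id, rng):
    """Build (topic, key, value) tuples for the two topics named above."""
    measurement, weight = simulate_fish(fish_id, rng)
    key = str(fish_id)  # same key on both topics so the records can be joined later
    return [
        ("machine-measurement", key, json.dumps(measurement)),
        ("machine-weight", key, json.dumps(weight)),
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    for fish_id in range(3):
        for topic, key, value in to_topic_messages(fish_id, rng):
            print(topic, key, value)
```

Using the same key on both topics is the design choice that matters here: it is what allows the measurement and the weight of one fish to be matched downstream.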

Kafka Streams

A Kafka Streams application consumes the machine-measurement topic and calls R through a REST API to predict the weight using linear regression. You can find a unit test for the topology as well as an integration test for the REST communication here.
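The REST round trip for a single record can be sketched as follows. This is a Python illustration of the idea only, not the Kafka Streams application itself. The endpoint URL, the port, and the JSON field names (`length_cm`, `predicted_weight_g`) are hypothetical; check the R API in the repository for the actual contract.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port, and path are assumptions.
PREDICT_URL = "http://localhost:8000/predict"


def build_predict_request(measurement, url=PREDICT_URL):
    """Wrap one measurement record in a JSON POST request."""
    body = json.dumps(measurement).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(response_body):
    """Extract the predicted weight from the (assumed) JSON response."""
    payload = json.loads(response_body)
    return float(payload["predicted_weight_g"])


def predict_weight(measurement, url=PREDICT_URL, timeout=5.0):
    """Send the measurement to the prediction service and return the result."""
    request = build_predict_request(measurement, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read())
```

Keeping request building and response parsing separate from the network call mirrors the split between a topology unit test and a REST integration test: the first two functions can be tested without a running service.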

ksqlDB

In ksqlDB, both streams are joined and the predicted weight is compared with the actual weight to obtain the prediction error.
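The logic of the join can be sketched in plain Python. This is not the ksqlDB query from Queries.ksql: the field names are assumptions, the error is taken as an absolute difference (the repository may define it differently), and the time window that a ksqlDB stream-stream join requires is ignored.

```python
def join_and_diff(predictions, weights):
    """Inner-join predictions and actual weights on fish_id and add the error.

    Field names are illustrative. Records without a matching partner are
    dropped, as in an inner join.
    """
    actual_by_id = {w["fish_id"]: w["weight_g"] for w in weights}
    joined = []
    for prediction in predictions:
        fish_id = prediction["fish_id"]
        if fish_id not in actual_by_id:
            continue
        predicted = prediction["predicted_weight_g"]
        actual = actual_by_id[fish_id]
        joined.append(
            {
                "fish_id": fish_id,
                "predicted_weight_g": predicted,
                "actual_weight_g": actual,
                "error_g": abs(predicted - actual),
            }
        )
    return joined


if __name__ == "__main__":
    predictions = [
        {"fish_id": 1, "predicted_weight_g": 500.0},
        {"fish_id": 2, "predicted_weight_g": 720.0},
    ]
    weights = [{"fish_id": 1, "weight_g": 480.0}]
    for row in join_and_diff(predictions, weights):
        print(row)
```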

Kafka Connect

One connector stores data in MongoDB so that it can be used for retraining the regression. The other connector acts as a trigger to do the retraining once the error exceeds a threshold.
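One way to think about the trigger is as a rule over recent errors. The sketch below is a Python illustration under stated assumptions, not the repository's logic: it fires when the mean of the last few errors exceeds a threshold, whereas the actual condition is defined in the ksqlDB queries and may differ. The class name, window size, and threshold are hypothetical.

```python
from collections import deque


class RetrainTrigger:
    """Fire when the mean of the most recent errors exceeds a threshold.

    The windowed-mean rule is an assumption made for this sketch.
    """

    def __init__(self, threshold_g, window_size=5):
        self.threshold_g = threshold_g
        self.recent = deque(maxlen=window_size)

    def observe(self, error_g):
        """Record one error and return True if retraining should start."""
        self.recent.append(error_g)
        window_full = len(self.recent) == self.recent.maxlen
        mean_error = sum(self.recent) / len(self.recent)
        if window_full and mean_error > self.threshold_g:
            # Start a fresh window so one bad stretch fires only once.
            self.recent.clear()
            return True
        return False


if __name__ == "__main__":
    trigger = RetrainTrigger(threshold_g=50.0, window_size=3)
    for error in [10.0, 20.0, 30.0, 200.0, 200.0]:
        print(error, trigger.observe(error))
```

Clearing the window after firing avoids sending a burst of retraining requests while the old model is still producing large errors.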

RStudio

In R the model itself, the prediction function, and the retraining function are stored and accessible via REST API. You can find a test here.
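For readers who want to see what fitting and applying such a model involves, here is a minimal Python sketch of a one-variable least-squares regression. The repository implements this in R; the function names and the single `length` predictor used here are assumptions for illustration only.

```python
def fit_linear(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to one input value."""
    intercept, slope = model
    return intercept + slope * x


if __name__ == "__main__":
    lengths = [1.0, 2.0, 3.0, 4.0]
    weights = [3.0, 5.0, 7.0, 9.0]
    model = fit_linear(lengths, weights)
    print(model, predict(model, 5.0))
```

Retraining in this picture means calling `fit_linear` again on the newer data and replacing the stored model that `predict` uses.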

Run

docker-compose up -d

It starts:

- Zookeeper
- Kafka Broker
- Kafka Topics
  - creates initial topics
- Kafka Connect
  - with MongoDB Source and Sink Connector
  - with HTTP Sink Connector
- ksqlDB Server
- ksqlDB Client
- MongoDB
- Kafka Producer
  - built Docker image executing a fat JAR
- Kafka Streams
  - built Docker image executing a fat JAR
- RStudio
  - built Docker image with RStudio, the plumber, dplyr, and mongolite packages installed, and an entrypoint

Wait until all services have fully started before continuing.
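Instead of waiting a fixed amount of time, you can poll a service until it responds. The Python sketch below polls the Kafka Connect REST API on port 8083 (the port used by the connector commands in this README); it checks only Kafka Connect, not the other services, and the retry count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def connect_is_up(url="http://localhost:8083/connectors", timeout=2.0):
    """Return True if the Kafka Connect REST API answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until(probe, attempts=30, delay_s=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    ready = wait_until(connect_is_up)
    print("Kafka Connect is ready" if ready else "Kafka Connect did not respond")
```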

Start Connectors

First, we start the two Kafka Connectors:

curl -X POST -H "Content-Type: application/json" --data @MongoDBConnector.json http://localhost:8083/connectors | jq
curl -X POST -H "Content-Type: application/json" --data @HTTPConnector.json http://localhost:8083/connectors | jq

Set up ksqlDB Queries

Use the client to access ksqlDB:

docker exec -it ksqldb-cli ksql http://ksqldb-server:8088

Run all queries stored in Queries.ksql.

Inspect Data Pipeline

To gain insight into the pipeline, we look at the stream DIFF_WEIGHT:

SELECT * FROM DIFF_WEIGHT EMIT CHANGES;

We can also see when the retrained model is applied: the prediction error decreases and the model time changes.

In the KTable RETRAIN_WEIGHT, we see the events that trigger the retraining.

SET 'auto.offset.reset'='earliest';
SELECT * FROM RETRAIN_WEIGHT EMIT CHANGES;
