Loading JSON and AVRO data from Confluent Cloud Kafka into StarRocks #22791
This tutorial describes how to load AVRO and JSON data from Confluent Cloud into StarRocks. It uses StarRocks' Routine Load, NOT the Kafka Sink Connector.
Sept 2023 update: StarRocks released a StarRocks Kafka Connector. See https://docs.starrocks.io/en-us/latest/loading/Kafka-connector-starrocks for details.
Prerequisites
For this tutorial you need:
- A StarRocks or CelerData database cluster (setting one up is out of scope for this tutorial).
- A Confluent Cloud cluster (also out of scope for this tutorial).
Create a Kafka topic
You can use the Confluent Cloud UI to create the topic, or create an API key, log in with the Confluent CLI, and then run
confluent kafka topic create quickstart
[JSON] Generate test data in the Kafka topic
Create a Confluent Datagen source connector to generate sample clickstream data. We are not using Kafka Schema Registry here.
Sample JSON data will look like this:
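The exact records depend on the connector configuration, but a clickstream record from the Datagen quickstart has roughly this shape (field names follow the clickstream schema; the values here are illustrative):

```json
{
  "ip": "111.245.174.248",
  "userid": 36,
  "remote_user": "-",
  "time": "21491",
  "_time": 21491,
  "request": "GET /images/track.png HTTP/1.1",
  "status": "302",
  "bytes": "1289",
  "referrer": "-",
  "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"
}
```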
[JSON] Create a Kafka client
Create a C/C++ client connection. StarRocks will connect as if it were the C/C++ client, and the connection page gives you the settings you need to supply on the StarRocks side. Make sure you also create a Kafka cluster API key; those credentials are what StarRocks uses to authenticate to Confluent Cloud. Save all of the configuration data, as you will need it in later steps.
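The client-connection page shows settings along these lines (a sketch; the broker host and credentials below are placeholders for your own values):

```
bootstrap.servers=pkc-xxxxx.<region>.<provider>.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=<KAFKA_API_KEY>
sasl.password=<KAFKA_API_SECRET>
```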
[JSON] Create a database and a table, and query the data.
Create the database.
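A minimal sketch; the database name clickstream_db is just an example:

```sql
CREATE DATABASE clickstream_db;
USE clickstream_db;
```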
Create the aggregate table.
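A sketch of an aggregate table shaped around the sample record above; the table name, key columns, and types are assumptions, and bytes is summed on ingest:

```sql
CREATE TABLE clickstream (
    userid  INT,
    ip      VARCHAR(64),
    request VARCHAR(256),
    status  INT,
    -- value column: rows with the same key are merged and bytes is summed
    bytes   BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(userid, ip, request, status)
DISTRIBUTED BY HASH(userid) BUCKETS 3;
```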
Now load the data.
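A sketch of the Routine Load job, assuming the table above, the quickstart topic created earlier, and the Kafka API key from the client-connection step; replace the broker host and credentials with the values you saved:

```sql
CREATE ROUTINE LOAD clickstream_db.clickstream_load ON clickstream
COLUMNS (userid, ip, request, status, bytes)
PROPERTIES (
    "format" = "json",
    -- map fields in each JSON message to the table columns
    "jsonpaths" = "[\"$.userid\",\"$.ip\",\"$.request\",\"$.status\",\"$.bytes\"]"
)
FROM KAFKA (
    -- the same value Confluent calls bootstrap.servers
    "kafka_broker_list" = "pkc-xxxxx.<region>.<provider>.confluent.cloud:9092",
    "kafka_topic" = "quickstart",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING",
    "property.security.protocol" = "SASL_SSL",
    "property.sasl.mechanism" = "PLAIN",
    "property.sasl.username" = "<KAFKA_API_KEY>",
    "property.sasl.password" = "<KAFKA_API_SECRET>"
);
```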
Tip: Routine Load's "kafka_broker_list" setting takes the same value Confluent calls "bootstrap.servers".
Check "routine load" job status by typing in
In the output, confirm that the State column shows RUNNING and that the Statistic field reports rows being consumed without errors.
Once the data has been loaded, you can do a count.
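For example:

```sql
SELECT COUNT(*) FROM clickstream_db.clickstream;
```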
The count should keep growing as the Datagen connector continues to produce records.
[AVRO] Generate test data in the Kafka topic
You must enable Kafka Schema Registry to use AVRO. Create a Confluent Datagen source connector to generate sample order data.
Sample AVRO data will look like this:
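Field names follow the Datagen orders schema; shown here decoded to JSON, with illustrative values:

```json
{
  "ordertime": 1497014222380,
  "orderid": 18,
  "itemid": "Item_184",
  "orderunits": 9.394,
  "address": {
    "city": "City_61",
    "state": "State_73",
    "zipcode": 94041
  }
}
```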
[AVRO] Create a Kafka client
Create a C/C++ client connection. StarRocks will connect as if it were the C/C++ client, and the connection page gives you the settings you need to supply on the StarRocks side. Make sure you also create a Kafka cluster API key AND a Kafka Schema Registry API key; those credentials are what StarRocks uses to authenticate to Confluent Cloud. Save all of the configuration data, as you will need it in later steps.
Tip: The C/C++ client connection page doesn't show the Schema Registry URI or offer the dialog for creating a Schema Registry API key. The Python client connection page shows that information and includes the dialog for creating your Schema Registry key.
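Taken together, the settings to collect look like this sketch (broker host, Schema Registry endpoint, and all credentials are placeholders):

```
bootstrap.servers=pkc-xxxxx.<region>.<provider>.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=<KAFKA_API_KEY>
sasl.password=<KAFKA_API_SECRET>
schema.registry.url=https://psrc-xxxxx.<region>.<provider>.confluent.cloud
basic.auth.credentials.source=USER_INFO
basic.auth.user.info=<SR_API_KEY>:<SR_API_SECRET>
```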
[AVRO] Create a database and a table, and query the data.
Create the database.
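A minimal sketch; orders_db is just an example name:

```sql
CREATE DATABASE orders_db;
USE orders_db;
```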
Create the aggregate table.
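A sketch shaped around the sample order record above; the table name, key columns, and types are assumptions, and orderunits is summed on ingest:

```sql
CREATE TABLE orders (
    orderid    BIGINT,
    itemid     VARCHAR(32),
    city       VARCHAR(64),
    -- value column: orderunits is summed for rows with the same key
    orderunits DOUBLE SUM DEFAULT "0"
)
AGGREGATE KEY(orderid, itemid, city)
DISTRIBUTED BY HASH(orderid) BUCKETS 3;
```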
Now load the data.
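A sketch of the Routine Load job for Avro. It assumes the table above and a topic named orders (substitute whatever topic your Datagen connector writes to); note that the Schema Registry API key and secret are embedded in the confluent.schema.registry.url value:

```sql
CREATE ROUTINE LOAD orders_db.orders_load ON orders
COLUMNS (orderid, itemid, city, orderunits)
PROPERTIES (
    "format" = "avro",
    -- pick fields out of the Avro record, including the nested address.city
    "jsonpaths" = "[\"$.orderid\",\"$.itemid\",\"$.address.city\",\"$.orderunits\"]"
)
FROM KAFKA (
    -- the same value Confluent calls bootstrap.servers
    "kafka_broker_list" = "pkc-xxxxx.<region>.<provider>.confluent.cloud:9092",
    "kafka_topic" = "orders",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING",
    -- Schema Registry credentials go inside the URL
    "confluent.schema.registry.url" = "https://<SR_API_KEY>:<SR_API_SECRET>@psrc-xxxxx.<region>.<provider>.confluent.cloud",
    "property.security.protocol" = "SASL_SSL",
    "property.sasl.mechanism" = "PLAIN",
    "property.sasl.username" = "<KAFKA_API_KEY>",
    "property.sasl.password" = "<KAFKA_API_SECRET>"
);
```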
Tip: Routine Load's "kafka_broker_list" setting takes the same value Confluent calls "bootstrap.servers".
Check "routine load" job status by typing in
In the output, confirm that the State column shows RUNNING and that the Statistic field reports rows being consumed without errors.
Once the data has been loaded, you can do a count.
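For example:

```sql
SELECT COUNT(*) FROM orders_db.orders;
```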
The count should keep growing as the Datagen connector continues to produce records.