Using AirByte to load data (CSV, JSON, Excel, Feather, Parquet) into StarRocks #23713
- Update the Airbyte connector to use stream load. #25695
- Have the AirByte connector support TLS/HTTPS connections. #24177
- Here is some SQL to transform the JSON into a separate table (a sketch of the idea is shown below).
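A minimal sketch of that kind of transform, assuming Airbyte landed the NYC green-taxi records in a raw table named `_airbyte_raw_green_tripdata` with the JSON payload in an `_airbyte_data` column (these names are assumptions; check your actual table and columns with SHOW TABLES and DESC):

```sql
-- Target table holding just the fields we care about (names are illustrative).
CREATE TABLE green_trips (
    pickup_datetime  DATETIME,
    dropoff_datetime DATETIME,
    trip_distance    DOUBLE,
    total_amount     DOUBLE
)
DUPLICATE KEY (pickup_datetime)
DISTRIBUTED BY HASH(pickup_datetime) BUCKETS 8;

-- Pull the fields we want out of the raw JSON that Airbyte loaded.
-- Adjust the JSON paths and CASTs to match how your records were serialized.
INSERT INTO green_trips
SELECT
    CAST(get_json_string(_airbyte_data, '$.lpep_pickup_datetime')  AS DATETIME),
    CAST(get_json_string(_airbyte_data, '$.lpep_dropoff_datetime') AS DATETIME),
    CAST(get_json_string(_airbyte_data, '$.trip_distance')         AS DOUBLE),
    CAST(get_json_string(_airbyte_data, '$.total_amount')          AS DOUBLE)
FROM _airbyte_raw_green_tripdata;
```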
- Imagine that I had a CSV file with 5,000,000 rows to load into StarRocks. What would the speed be, roughly how many rows per minute?
- Is it possible to control the table type?
-
AirByte has a lot of connectors. One of them is the source file connector, which lets you source CSV, JSON, Excel, Feather, and Parquet files. You can then use the StarRocks AirByte destination connector to load that data into StarRocks. This tutorial covers the EL part of ELT; at the end you should do your own T ("transform") to restructure the data into the shape you need.
Prerequisites
For this tutorial you need:
Docker Desktop or the Podman container runtime installed
This is out of scope for the tutorial.
A StarRocks or CelerData database cluster
This is out of scope for the tutorial. We also used #24771 to expose a local StarRocks cluster to the internet.
An Airbyte instance.
This is out of scope for the tutorial, but we used the AirByte Docker Compose environment shown in https://docs.airbyte.com/quickstart/deploy-airbyte/.
Configure AirByte
First you need to add a new connector. Navigate to settings -> destinations and click on "add a new connector".
The container is available at
starrocks/destination-starrocks:latest
and you can see all versions at https://hub.docker.com/r/starrocks/destination-starrocks.
Next, you'll see "StarRocks" in the list of destination connectors. Select it.
Fill out the information for your StarRocks/CelerData instance.
Screenshot of the connection info for StarRocks allIn1 Container image with serveo.net #24771
Note: The HTTP port for the StarRocks allin1 container is 8040.
Screenshot of the connection info for CelerData Cloud
Note: The HTTP port for cloud.celerdata.com is 443.
For this tutorial, we'll load a Parquet file from the NYC Taxi dataset:
https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet
Note: There is an upper file size limit, so Parquet files like yellow_tripdata are too large and will crash the AirByte worker threads. See airbytehq/airbyte#27174 and airbytehq/airbyte#27169.
Now we need to set up a connection. Pick the source we created.
Then pick the destination that we created.
Do a check before finalizing the connection.
Here is the log output after a successful sync.
And we can query the table after all the data has been inserted, for example with the sanity checks sketched below.
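A quick sanity check, assuming the destination landed the records in a raw table named `_airbyte_raw_green_tripdata` (the actual name depends on your stream and connector version, so check with SHOW TABLES first):

```sql
-- Confirm the row count matches what the sync log reported.
SELECT COUNT(*) FROM _airbyte_raw_green_tripdata;

-- Peek at a few raw records; Airbyte-style raw tables usually keep the
-- payload in a JSON/string column such as _airbyte_data.
SELECT * FROM _airbyte_raw_green_tripdata LIMIT 5;
```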
After the data has been inserted, I would then "transform" from the current data schema to the target schema that you want, using an `INSERT INTO ... SELECT` statement. Another idea is to add generated columns to the table (via SQL ALTER); a sketch follows. See https://docs.starrocks.io/en-us/3.1/sql-reference/sql-statements/generated_columns for an example.
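For the generated-column route, here is a minimal sketch that reuses the assumed raw table and `_airbyte_data` column from the query above; verify the syntax against the generated-columns docs for your StarRocks version:

```sql
-- Add a column computed from the raw JSON payload so queries
-- don't have to re-parse the JSON every time.
ALTER TABLE _airbyte_raw_green_tripdata
    ADD COLUMN trip_distance DOUBLE
    AS CAST(get_json_string(_airbyte_data, '$.trip_distance') AS DOUBLE);
```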