Loading and querying NYC Yellow or Green Taxi Data (Parquet format) with StarRocks #24504
Closed
Replies: 2 comments
-
Need to do this with a smaller dataset, like the Green Taxi Data, to save time and resources.
-
This is a more complex example and more typical of an analytics query.
-
Prerequisites
For this tutorial you need to:
Have Docker Desktop or podman container runtime installed
This is out of scope for the tutorial.
Have a MySQL client
This is out of scope for the tutorial.
Have a StarRocks or CelerData database cluster
This is out of scope for the tutorial.
Download the NYC Yellow or Green Taxi Data and upload it into an S3 bucket
You can download the NYC Yellow Taxi Data at
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
and the data dictionary can be found at
https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
I downloaded the January 2023 data (about 50 MB) at
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
The next step is to upload the data into an object store. I used the AWS S3 service.
Alternatively, you can use the Green Taxi Data, which is a smaller data set. Compared with Yellow, Green lacks one data field that Yellow has and has one additional data field of its own.
RFE: Import parquet data from http:// URIs. #23903
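As a small illustration of the download step, the sketch below builds the file URL for a given month and fetches it with the Python standard library. The base URL comes from the January 2023 Yellow link above. The assumptions are that other months and the Green data set follow the same naming pattern, and the helper names `trip_data_url` and `download` are made up for this example.

```python
from urllib.request import urlretrieve

# Base URL taken from the January 2023 Yellow link in this tutorial.
# Assumption: other months and the Green set use the same naming pattern.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(color: str, year: int, month: int) -> str:
    """Build the download URL for one month of trip data."""
    if color not in ("yellow", "green"):
        raise ValueError(f"unsupported data set: {color!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"


def download(color: str, year: int, month: int) -> str:
    """Download one month into the current directory; return the file name."""
    url = trip_data_url(color, year, month)
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)  # network call, only runs when invoked
    return filename


if __name__ == "__main__":
    print(trip_data_url("yellow", 2023, 1))
```

Once the file is on disk, it can be uploaded with whatever tooling your object store provides, for example `aws s3 cp` from the AWS CLI if you use S3.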
Create a database, database table and query the data
Create the database.
Create the table based off of the data dictionary.
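A rough sketch of what the Yellow statements might look like is given here as a Python helper that generates the SQL text. It is not the original SQL from this post. The database name `nyc_taxi` and table name `yellow_trips` are hypothetical. The column names and types are assumptions read off the public Yellow data dictionary, so check them against your Parquet file. The key, bucketing, and `replication_num` settings are illustrative choices for a single-node cluster.

```python
# Assumed column list for the Yellow data; verify against your file.
YELLOW_COLUMNS = [
    ("VendorID", "BIGINT"),
    ("tpep_pickup_datetime", "DATETIME"),
    ("tpep_dropoff_datetime", "DATETIME"),
    ("passenger_count", "DOUBLE"),
    ("trip_distance", "DOUBLE"),
    ("RatecodeID", "DOUBLE"),
    ("store_and_fwd_flag", "VARCHAR(8)"),
    ("PULocationID", "BIGINT"),
    ("DOLocationID", "BIGINT"),
    ("payment_type", "BIGINT"),
    ("fare_amount", "DOUBLE"),
    ("extra", "DOUBLE"),
    ("mta_tax", "DOUBLE"),
    ("tip_amount", "DOUBLE"),
    ("tolls_amount", "DOUBLE"),
    ("improvement_surcharge", "DOUBLE"),
    ("total_amount", "DOUBLE"),
    ("congestion_surcharge", "DOUBLE"),
    ("airport_fee", "DOUBLE"),
]


def create_database_sql(database: str) -> str:
    """Return a CREATE DATABASE statement."""
    return f"CREATE DATABASE IF NOT EXISTS {database};"


def create_table_sql(database: str, table: str, columns) -> str:
    """Return a CREATE TABLE statement for a duplicate-key table."""
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    first_column = columns[0][0]  # key columns must lead the column list
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"DUPLICATE KEY({first_column})\n"
        f"DISTRIBUTED BY HASH({first_column}) BUCKETS 4\n"
        # Assumption: one backend node, so a single replica.
        'PROPERTIES ("replication_num" = "1");'
    )


if __name__ == "__main__":
    print(create_database_sql("nyc_taxi"))
    print(create_table_sql("nyc_taxi", "yellow_trips", YELLOW_COLUMNS))
```

The printed statements can be pasted into the MySQL client. Bucketing on `VendorID` keeps the sketch short, but a higher-cardinality column would likely spread the data more evenly.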
or for Green Taxi Data use the following sql
Execute local disk parquet file load
First, copy the file onto the disk of the machine that is running StarRocks.
Then execute the MySQL client command to load the file.
Execute the load command to get data from S3 into StarRocks.
The load is similar for Green Taxi Data.
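To give a feel for the S3 load step, here is a sketch that assembles a Broker Load statement as text. It follows my understanding of the StarRocks `LOAD LABEL` syntax and the `aws.s3.*` credential properties, so treat it as a starting point and check it against the documentation for your version. The function name `broker_load_sql`, the label, the bucket path, and the credentials are all placeholders.

```python
def broker_load_sql(
    database: str,
    table: str,
    label: str,
    s3_path: str,
    access_key: str,
    secret_key: str,
    region: str,
) -> str:
    """Return a Broker Load statement that reads one Parquet file from S3."""
    return (
        f"LOAD LABEL {database}.{label}\n"
        "(\n"
        f'    DATA INFILE("{s3_path}")\n'
        f"    INTO TABLE {table}\n"
        '    FORMAT AS "parquet"\n'
        ")\n"
        "WITH BROKER\n"
        "(\n"
        # Assumption: access-key authentication rather than an instance profile.
        '    "aws.s3.use_instance_profile" = "false",\n'
        f'    "aws.s3.access_key" = "{access_key}",\n'
        f'    "aws.s3.secret_key" = "{secret_key}",\n'
        f'    "aws.s3.region" = "{region}"\n'
        ")\n"
        'PROPERTIES ("timeout" = "3600");'
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(
        broker_load_sql(
            database="nyc_taxi",
            table="yellow_trips",
            label="yellow_2023_01",
            s3_path="s3a://my-bucket/yellow_tripdata_2023-01.parquet",
            access_key="<access key>",
            secret_key="<secret key>",
            region="us-east-1",
        )
    )
```

The label must be unique within the database, so a new label is needed for each load attempt. For the Green data, only the file path, table, and label would change.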
Check the status of the load. Keep running the show load command until you see a success or failure.
Finally, query the data.
And you will see this as a result.
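The "keep running show load" step described above can also be automated. The sketch below polls until the load reaches a terminal state. It assumes the terminal states are reported as `FINISHED` and `CANCELLED`, and it takes a caller-supplied `fetch_state` function (hypothetical, for example one that runs the `SHOW LOAD` statement through your MySQL driver and returns the state column) so that the loop itself has no database dependency.

```python
import time

# Assumption: these are the terminal states that SHOW LOAD reports.
TERMINAL_STATES = {"FINISHED", "CANCELLED"}


def show_load_sql(label: str) -> str:
    """Return the statement used to check one load job by label."""
    return f'SHOW LOAD WHERE LABEL = "{label}";'


def wait_for_load(fetch_state, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Call fetch_state() until it returns a terminal state, then return it.

    fetch_state is any zero-argument callable returning the current state
    as a string. Raises TimeoutError after max_polls unsuccessful checks.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("load did not reach a terminal state")


if __name__ == "__main__":
    # Simulated state sequence standing in for real SHOW LOAD results.
    states = iter(["PENDING", "LOADING", "FINISHED"])
    print(show_load_sql("yellow_2023_01"))
    print(wait_for_load(lambda: next(states), poll_seconds=0))
```

Passing the state lookup in as a function keeps the polling logic easy to test and lets you use whichever MySQL client library you prefer.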
Visualize this dataset via Apache Superset
Check out the StarRocks and Apache Superset tutorial at #23210 or the StarRocks and preset.io tutorial at #24506.