Adding quickstart for localpod (#224)
* Adding quickstart for localpod

* Update localpod/spicepod.yaml

Co-authored-by: Evgenii Khramkov <[email protected]>

* Update localpod/README.md

Co-authored-by: Evgenii Khramkov <[email protected]>

* Adding links to documentation

---------

Co-authored-by: Evgenii Khramkov <[email protected]>
slyons and ewgenius authored Oct 31, 2024
1 parent db2a84c commit 1501bf0
Showing 5 changed files with 185 additions and 0 deletions.
1 change: 1 addition & 0 deletions localpod/.gitignore
@@ -0,0 +1 @@
.spice
132 changes: 132 additions & 0 deletions localpod/README.md
@@ -0,0 +1,132 @@
# Local dataset replication

The [Localpod](https://docs.spiceai.org/components/data-connectors/localpod) Data Connector links datasets in a parent/child relationship within the current Spicepod. This lets you set up multiple levels of data acceleration for a single dataset while ensuring the data is downloaded only once from the remote source. In the example below, `time_series` is accelerated in memory from the file source, and `local_time_series` chains a second, file-backed DuckDB acceleration off it.

```yaml
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
```

:::note

The parent dataset must have `refresh_mode` set to `full` for the `localpod` Data Connector to function. See [synchronized refreshes](https://docs.spiceai.org/components/data-connectors/localpod#synchronized-refreshes) for more information.

:::

## Running this quickstart
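
If you haven't already, clone the quickstarts repository and change into this directory (assuming the standard `spiceai/quickstarts` layout):

```shell
git clone https://github.com/spiceai/quickstarts.git
cd quickstarts/localpod
```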

Start the Spice runtime with `spice run`.

You should see terminal output like so:

```shell
$ spice run
2024/10/29 18:31:38 INFO Checking for latest Spice runtime release...
2024/10/29 18:31:38 INFO Spice.ai runtime starting...
2024-10-30T01:31:38.912802Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-10-30T01:31:38.913151Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2024-10-30T01:31:38.913247Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-10-30T01:31:38.921580Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
2024-10-30T01:31:39.112883Z INFO runtime: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
2024-10-30T01:31:39.123137Z INFO runtime: Tool [document_similarity] ready to use
2024-10-30T01:31:39.123166Z INFO runtime: Tool [table_schema] ready to use
2024-10-30T01:31:39.123172Z INFO runtime: Tool [sql] ready to use
2024-10-30T01:31:39.123180Z INFO runtime: Tool [list_datasets] ready to use
2024-10-30T01:31:39.123183Z INFO runtime: Tool [get_readiness] ready to use
2024-10-30T01:31:39.123187Z INFO runtime: Tool [random_sample] ready to use
2024-10-30T01:31:39.123193Z INFO runtime: Tool [sample_distinct_columns] ready to use
2024-10-30T01:31:39.123197Z INFO runtime: Tool [top_n_sample] ready to use
2024-10-30T01:31:39.125295Z INFO runtime: Dataset time_series registered (file:data.csv), acceleration (arrow, 15s refresh), results cache enabled.
2024-10-30T01:31:39.126352Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset time_series
2024-10-30T01:31:39.128337Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset time_series in 1ms.
2024-10-30T01:31:39.136703Z INFO runtime::datafusion: Localpod dataset local_time_series synchronizing refreshes with parent table time_series
2024-10-30T01:31:39.136764Z INFO runtime: Dataset local_time_series registered (localpod:time_series), acceleration (duckdb:file, 10s refresh), results cache enabled.
2024-10-30T01:31:39.137955Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset local_time_series
2024-10-30T01:31:39.139139Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset local_time_series in 1ms.
```
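
Before querying, you can optionally confirm that both datasets registered; a quick check, assuming the runtime's `/v1/datasets` HTTP endpoint on the default port shown in the log above:

```shell
# List the datasets the runtime has registered (returns JSON)
curl http://localhost:8090/v1/datasets
```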
### Querying the `localpod`

In a new terminal, start `spice sql` and run these two queries to validate that both datasets contain the same number of rows:

```shell
$ spice sql

sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 0        |
+----------+

Time: 0.004800375 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 0        |
+----------+

Time: 0.005054417 seconds. 1 rows.
```
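
The same queries can also be issued over the runtime's HTTP API; a sketch, assuming the default `/v1/sql` endpoint on port 8090 from the startup log:

```shell
# POST the SQL text as the request body; the runtime returns JSON rows
curl -X POST http://localhost:8090/v1/sql \
  -H "Content-Type: text/plain" \
  -d "SELECT COUNT(*) FROM local_time_series"
```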
### Updating the parent dataset

Let's insert new data into the parent dataset and watch the `localpod` dataset update. In a new terminal, navigate to this quickstart's directory and run:

```shell
$ ./generate_data.sh
```
In the terminal where `spice run` is running, you should see messages (within the 15-second `refresh_check_interval`) indicating the new data was loaded into both datasets:
```shell
2024-10-30T01:37:24.266411Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset time_series in 3ms.
2024-10-30T01:37:24.266422Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset local_time_series in 3ms.
```
Running the same SQL queries as above now returns updated results:
```shell
sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 1000     |
+----------+

Time: 0.006115708 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 1000     |
+----------+

Time: 0.005385625 seconds. 1 rows.
```
The `local_time_series` dataset is accelerated locally with [DuckDB](https://docs.spiceai.org/components/data-accelerators/duckdb) in `file` mode, so its data persists on disk and stays in sync with the parent dataset's refreshes.
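
The DuckDB file created by `mode: file` lives under the runtime's local working directory, which is why this quickstart's `.gitignore` excludes `.spice`. To see what was written (the exact layout under `.spice` may vary by runtime version):

```shell
ls -R .spice
```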
1 change: 1 addition & 0 deletions localpod/data.csv
@@ -0,0 +1 @@
timestamp,val1,val2
31 changes: 31 additions & 0 deletions localpod/generate_data.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Set the output file name
output_file="data.csv"

# Number of rows to generate
num_rows=1000

# Write the header to the file if it doesn't exist yet
if [ ! -f "$output_file" ]; then
  echo "timestamp,val1,val2" > "$output_file"
fi

# Loop to generate each row
for ((i=1; i<=num_rows; i++))
do
  # Get the current timestamp
  timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  # Generate random values for val1 and val2
  val1=$((RANDOM % 100))
  val2=$((RANDOM % 100))

  # Append the row to the CSV file
  echo "$timestamp,$val1,$val2" >> "$output_file"

  # Optional: sleep for 1 second between rows
  # sleep 1
done

echo "CSV file generated: $output_file"
20 changes: 20 additions & 0 deletions localpod/spicepod.yaml
@@ -0,0 +1,20 @@
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
