Adding quickstart for localpod (#224)
* Adding quickstart for localpod

* Update localpod/spicepod.yaml

Co-authored-by: Evgenii Khramkov <[email protected]>

* Update localpod/README.md

Co-authored-by: Evgenii Khramkov <[email protected]>

* Adding links to documentation

---------

Co-authored-by: Evgenii Khramkov <[email protected]>
slyons and ewgenius authored Oct 31, 2024
1 parent db2a84c commit 1501bf0
Showing 5 changed files with 185 additions and 0 deletions.
1 change: 1 addition & 0 deletions localpod/.gitignore
@@ -0,0 +1 @@
.spice
132 changes: 132 additions & 0 deletions localpod/README.md
@@ -0,0 +1,132 @@
# Local dataset replication

The [Localpod](https://docs.spiceai.org/components/data-connectors/localpod) Data Connector links datasets in a parent/child relationship within the current Spicepod. This lets you set up multiple levels of data acceleration for a single dataset while ensuring the data is downloaded only once from the remote source. In the example below, `time_series` is accelerated in memory from the file source, and `local_time_series` chains a second, file-backed DuckDB acceleration off it.

```yaml
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
```

:::note

The parent dataset must have `refresh_mode` set to `full` for the `localpod` Data Connector to function. See [synchronized refreshes](https://docs.spiceai.org/components/data-connectors/localpod#synchronized-refreshes) for more information.

:::

## Running this quickstart
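
If you haven't already, clone the quickstarts repository and change into this directory (assuming the standard `spiceai/quickstarts` layout):

```shell
git clone https://github.com/spiceai/quickstarts.git
cd quickstarts/localpod
```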

Start the Spice runtime with `spice run`.

You should see terminal output like so:

```shell
$ spice run
2024/10/29 18:31:38 INFO Checking for latest Spice runtime release...
2024/10/29 18:31:38 INFO Spice.ai runtime starting...
2024-10-30T01:31:38.912802Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-10-30T01:31:38.913151Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2024-10-30T01:31:38.913247Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-10-30T01:31:38.921580Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
2024-10-30T01:31:39.112883Z INFO runtime: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
2024-10-30T01:31:39.123137Z INFO runtime: Tool [document_similarity] ready to use
2024-10-30T01:31:39.123166Z INFO runtime: Tool [table_schema] ready to use
2024-10-30T01:31:39.123172Z INFO runtime: Tool [sql] ready to use
2024-10-30T01:31:39.123180Z INFO runtime: Tool [list_datasets] ready to use
2024-10-30T01:31:39.123183Z INFO runtime: Tool [get_readiness] ready to use
2024-10-30T01:31:39.123187Z INFO runtime: Tool [random_sample] ready to use
2024-10-30T01:31:39.123193Z INFO runtime: Tool [sample_distinct_columns] ready to use
2024-10-30T01:31:39.123197Z INFO runtime: Tool [top_n_sample] ready to use
2024-10-30T01:31:39.125295Z INFO runtime: Dataset time_series registered (file:data.csv), acceleration (arrow, 15s refresh), results cache enabled.
2024-10-30T01:31:39.126352Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset time_series
2024-10-30T01:31:39.128337Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset time_series in 1ms.
2024-10-30T01:31:39.136703Z INFO runtime::datafusion: Localpod dataset local_time_series synchronizing refreshes with parent table time_series
2024-10-30T01:31:39.136764Z INFO runtime: Dataset local_time_series registered (localpod:time_series), acceleration (duckdb:file, 10s refresh), results cache enabled.
2024-10-30T01:31:39.137955Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset local_time_series
2024-10-30T01:31:39.139139Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset local_time_series in 1ms.
```
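
Before querying, you can optionally confirm that both datasets registered; a quick check, assuming the runtime's `/v1/datasets` HTTP endpoint on the default port shown in the log above:

```shell
# List the datasets the runtime has registered (returns JSON)
curl http://localhost:8090/v1/datasets
```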
### Querying the `localpod`

In a new terminal, start `spice sql` and run these two queries to validate that both datasets contain the same number of rows:

```shell
$ spice sql

sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 0        |
+----------+

Time: 0.004800375 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 0        |
+----------+

Time: 0.005054417 seconds. 1 rows.
```
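
The same queries can also be issued over the runtime's HTTP API; a sketch, assuming the default `/v1/sql` endpoint on port 8090 from the startup log:

```shell
# POST the SQL text as the request body; the runtime returns JSON rows
curl -X POST http://localhost:8090/v1/sql \
  -H "Content-Type: text/plain" \
  -d "SELECT COUNT(*) FROM local_time_series"
```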
### Updating the parent dataset

Let's insert new data into the parent dataset and watch the `localpod` dataset update. In a new terminal, navigate to this quickstart's directory and run:

```shell
$ ./generate_data.sh
```
In the terminal where `spice run` is running, you should see messages (within the 15-second `refresh_check_interval`) indicating the new data was loaded into both datasets:
```shell
2024-10-30T01:37:24.266411Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset time_series in 3ms.
2024-10-30T01:37:24.266422Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset local_time_series in 3ms.
```
Running the same SQL queries as above now returns updated results:
```shell
sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 1000     |
+----------+

Time: 0.006115708 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 1000     |
+----------+

Time: 0.005385625 seconds. 1 rows.
```
The `local_time_series` dataset is accelerated locally with [DuckDB](https://docs.spiceai.org/components/data-accelerators/duckdb) in `file` mode, so its data persists on disk and stays in sync with the parent dataset's refreshes.
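
The DuckDB file created by `mode: file` lives under the runtime's local working directory, which is why this quickstart's `.gitignore` excludes `.spice`. To see what was written (the exact layout under `.spice` may vary by runtime version):

```shell
ls -R .spice
```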
1 change: 1 addition & 0 deletions localpod/data.csv
@@ -0,0 +1 @@
timestamp,val1,val2
31 changes: 31 additions & 0 deletions localpod/generate_data.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Set the output file name
output_file="data.csv"

# Number of rows to generate
num_rows=1000

# Write the header to the file if it doesn't exist yet
if [ ! -f "$output_file" ]; then
  echo "timestamp,val1,val2" > "$output_file"
fi

# Loop to generate each row
for ((i=1; i<=num_rows; i++))
do
  # Get the current timestamp
  timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  # Generate random values for val1 and val2
  val1=$((RANDOM % 100))
  val2=$((RANDOM % 100))

  # Append the row to the CSV file
  echo "$timestamp,$val1,$val2" >> "$output_file"

  # Optional: sleep for 1 second between rows
  # sleep 1
done

echo "CSV file generated: $output_file"
20 changes: 20 additions & 0 deletions localpod/spicepod.yaml
@@ -0,0 +1,20 @@
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
