Adding quickstart for localpod #224

Merged (4 commits) on Oct 31, 2024
1 change: 1 addition & 0 deletions localpod/.gitignore
@@ -0,0 +1 @@
.spice
132 changes: 132 additions & 0 deletions localpod/README.md
@@ -0,0 +1,132 @@
# Local dataset replication

The [Localpod](https://docs.spiceai.org/components/data-connectors/localpod) Data Connector links datasets in a parent/child relationship within the current Spicepod. This makes it possible to set up multiple levels of data acceleration for a single dataset while downloading the data from the remote source only once.

```yaml
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
```

:::note

The parent dataset must have `refresh_mode` set to `full` for the `localpod` data connector to function. See [synchronized refreshes](https://docs.spiceai.org/components/data-connectors/localpod#synchronized-refreshes) for more information.

:::

## Running this quickstart

In a new terminal, navigate to this sample directory and start the Spice runtime with `spice run`.

You should see terminal output like so:

```shell
$ spice run
2024/10/29 18:31:38 INFO Checking for latest Spice runtime release...
2024/10/29 18:31:38 INFO Spice.ai runtime starting...
2024-10-30T01:31:38.912802Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-10-30T01:31:38.913151Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2024-10-30T01:31:38.913247Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-10-30T01:31:38.921580Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
2024-10-30T01:31:39.112883Z INFO runtime: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
2024-10-30T01:31:39.123137Z INFO runtime: Tool [document_similarity] ready to use
2024-10-30T01:31:39.123166Z INFO runtime: Tool [table_schema] ready to use
2024-10-30T01:31:39.123172Z INFO runtime: Tool [sql] ready to use
2024-10-30T01:31:39.123180Z INFO runtime: Tool [list_datasets] ready to use
2024-10-30T01:31:39.123183Z INFO runtime: Tool [get_readiness] ready to use
2024-10-30T01:31:39.123187Z INFO runtime: Tool [random_sample] ready to use
2024-10-30T01:31:39.123193Z INFO runtime: Tool [sample_distinct_columns] ready to use
2024-10-30T01:31:39.123197Z INFO runtime: Tool [top_n_sample] ready to use
2024-10-30T01:31:39.125295Z INFO runtime: Dataset time_series registered (file:data.csv), acceleration (arrow, 15s refresh), results cache enabled.
2024-10-30T01:31:39.126352Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset time_series
2024-10-30T01:31:39.128337Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset time_series in 1ms.
2024-10-30T01:31:39.136703Z INFO runtime::datafusion: Localpod dataset local_time_series synchronizing refreshes with parent table time_series
2024-10-30T01:31:39.136764Z INFO runtime: Dataset local_time_series registered (localpod:time_series), acceleration (duckdb:file, 10s refresh), results cache enabled.
2024-10-30T01:31:39.137955Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset local_time_series
2024-10-30T01:31:39.139139Z INFO runtime::accelerated_table::refresh_task: Loaded 0 rows for dataset local_time_series in 1ms.
```
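Optionally, confirm the runtime is ready before querying. This is a hedged check: the `/v1/ready` path is assumed from Spice's readiness-probe documentation, so verify it against the docs for your runtime version:

```shell
# Returns HTTP 200 once the runtime reports ready on the HTTP port from the logs
$ curl -i http://localhost:8090/v1/ready
```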



### Querying the `localpod`

In a new terminal, start `spice sql` and run these two queries to validate that both datasets contain the same number of rows:

```shell
$ spice sql

sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 0 |
+----------+

Time: 0.004800375 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 0 |
+----------+


Time: 0.005054417 seconds. 1 rows.
```

### Updating the parent dataset

Let's insert new data into the parent dataset and see the `localpod` dataset update. In a new terminal, navigate to this sample directory and run the following:

```shell
$ ./generate_data.sh
```
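The script appends 1,000 rows of random values to `data.csv`, writing the CSV header only when the file does not already exist. A quick spot-check of the result with standard shell tools:

```shell
# Count lines (header + 1,000 data rows after the first run)
$ wc -l data.csv

# Inspect the header and the first couple of rows
$ head -3 data.csv
```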

In the terminal where `spice run` is running, you should see messages indicating the new data has been loaded:

```shell
2024-10-30T01:37:24.266411Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset time_series in 3ms.
2024-10-30T01:37:24.266422Z INFO runtime::accelerated_table::refresh_task: Loaded 1,000 rows (48.16 kiB) for dataset local_time_series in 3ms.
```

Running the same SQL queries as above now returns updated results:

```shell
sql> SELECT COUNT(*) FROM time_series;
+----------+
| count(*) |
+----------+
| 1000 |
+----------+

Time: 0.006115708 seconds. 1 rows.
sql> SELECT COUNT(*) FROM local_time_series;
+----------+
| count(*) |
+----------+
| 1000 |
+----------+

Time: 0.005385625 seconds. 1 rows.
```

The `local_time_series` dataset responds faster because it's accelerated locally with [DuckDB](https://docs.spiceai.org/components/data-accelerators/duckdb).
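Besides `spice sql`, queries can also be sent to the runtime's HTTP interface on port 8090 (seen in the startup logs). A minimal sketch, assuming Spice's `POST /v1/sql` endpoint; check the HTTP API docs for your version:

```shell
# Issue the same count query over HTTP; the response body is the query result
$ curl -X POST http://localhost:8090/v1/sql \
    -d 'SELECT COUNT(*) FROM local_time_series'
```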
1 change: 1 addition & 0 deletions localpod/data.csv
@@ -0,0 +1 @@
timestamp,val1,val2
31 changes: 31 additions & 0 deletions localpod/generate_data.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Set the output file name
output_file="data.csv"

# Number of rows to generate
num_rows=1000

# Write the header to the file (only if it doesn't already exist)
if [ ! -f "$output_file" ]; then
  echo "timestamp,val1,val2" > "$output_file"
fi

# Loop to generate each row
for ((i=1; i<=num_rows; i++))
do
  # Get the current timestamp
  timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  # Generate random values for val1 and val2
  val1=$((RANDOM % 100))
  val2=$((RANDOM % 100))

  # Append the row to the CSV file
  echo "$timestamp,$val1,$val2" >> "$output_file"

  # Optional: sleep to delay each row generation by 1 second
  # sleep 1
done

echo "CSV file generated: $output_file"
20 changes: 20 additions & 0 deletions localpod/spicepod.yaml
@@ -0,0 +1,20 @@
version: v1beta1
kind: Spicepod
name: localpod

datasets:
  - from: file:data.csv
    name: time_series
    description: locally generated time series data
    params:
      file_format: csv
    acceleration:
      enabled: true
      refresh_check_interval: 15s
      refresh_mode: full
  - from: localpod:time_series
    name: local_time_series
    acceleration:
      enabled: true
      engine: duckdb
      mode: file