Tweaks to quickstarts (#180)
phillipleblanc authored Sep 16, 2024
1 parent e1e1be9 commit 48357f2
Showing 4 changed files with 74 additions and 54 deletions.
111 changes: 62 additions & 49 deletions databricks/README.md
# Spice on Databricks

Spice can read data straight from a Databricks instance. This guide walks through creating a Spice app, configuring the Databricks connection, and loading and querying a dataset. It assumes:

- Spice is installed (see the [Getting Started](https://docs.spiceai.org/getting-started) documentation).
- The Databricks instance is running against AWS S3 storage in `us-east-1`.
- Basic AWS authentication is configured (with the environment variables `AWS_ACCESS_KEY_ID` & `AWS_SECRET_ACCESS_KEY`; see the export example after this list).
- A Databricks personal access token is available (as the environment variable `DATABRICKS_TOKEN`).
- A table already exists in Databricks, called `spice_data.public.awesome_table`.
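
For example, the assumed credentials can be exported in the shell before starting (the values are placeholders):

```shell
export AWS_ACCESS_KEY_ID=<your AWS access key id>
export AWS_SECRET_ACCESS_KEY=<your AWS secret access key>
export DATABRICKS_TOKEN=<your Databricks personal access token>
```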

1. Initialize a Spice app

```shell
spice init databricks_demo
cd databricks_demo
```

1. Configure Spice with the Databricks credentials (from within the `databricks_demo` directory)

```shell
spice login databricks \
--token $DATABRICKS_TOKEN \
--aws-access-key-id $AWS_ACCESS_KEY_ID \
--aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
--aws-region us-east-1
```

Executing `spice login` and successfully authenticating will create a `.env` file in the `databricks_demo` directory with the Databricks credentials.

1. Start the Spice runtime

```shell
>>> spice run
2024-03-27T05:27:52.696536Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-03-27T05:27:52.696543Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-03-27T05:27:52.696606Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
```

1. In a new terminal, working in the `databricks_demo` directory, configure a Databricks dataset in the spicepod. The table provided must be a reference to a table in the Databricks Unity Catalog.

```shell
>>> spice dataset configure
Saved datasets/my_table/dataset.yaml
```

1. Edit the dataset to add `mode: delta_lake` to the `params` section:

```yaml
params:
mode: delta_lake
databricks_endpoint: <existing_endpoint>
```

1. Confirm that the runtime has registered the new table (in the terminal running the Spice runtime)

```shell
2024-03-27T05:27:54.051229Z INFO runtime: Dataset my_table registered (databricks:spice_data.public.awesome_table), results cache enabled.
```

1. Check the table exists from the Spice REPL

```shell
>>> spice sql
Welcome to the Spice.ai SQL REPL! Type 'help' for help.
Time: 0.008540708 seconds
```


```shell
sql> describe spice.public.my_table
+-----------------------+------------------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+------------------------------+-------------+
```

1. Query against the Databricks table. Since the table isn't accelerated, the Spice runtime will make a network call to the object storage service.
```shell
>>> spice sql
sql> SELECT avg(total_amount), avg(tip_amount), count(1), passenger_count FROM my_table GROUP BY passenger_count ORDER BY passenger_count ASC;
```

## (Optional): Accelerating Databricks

To improve the query performance, the Databricks dataset can be accelerated.
1. Edit the dataset, `my_table` (the complete resulting file is sketched after these steps):
```shell
echo """acceleration:
enabled: true""" >> datasets/my_table/dataset.yaml
```
2. Confirm that the Spice runtime has picked up the change and the dataset is now accelerated (`ACCELERATION` shows `true`):
```shell
>>> spice datasets
FROM NAME REPLICATION ACCELERATION DEPENDSON STATUS
databricks:spice_data.public.awesome_table my_table false true Ready
```
3. Rerun the query
```shell
>>> spice sql
sql> select avg(total_amount), avg(tip_amount), count(1), passenger_count from my_table group by passenger_count order by passenger_count asc;
+----------------------------+--------------------------+-----------------+-----------------+
| AVG(my_table.total_amount) | AVG(my_table.tip_amount) | COUNT(Int64(1)) | passenger_count |
+----------------------------+--------------------------+-----------------+-----------------+
| 25.32781693945653 | 3.072259971396793 | 31465 | 0 |
| 26.205230445474996 | 3.3712622884680052 | 2188739 | 1 |
| 29.520659930930304 | 3.7171302113290854 | 405103 | 2 |
| 29.138309044290263 | 3.5370455392167615 | 91262 | 3 |
| 30.877266710278306 | 3.466037634201712 | 51974 | 4 |
| 26.269129111203988 | 3.3797078135259317 | 33506 | 5 |
| 25.801183286359798 | 3.344098778687425 | 22353 | 6 |
| 57.735 | 8.37 | 8 | 7 |
| 95.66803921568626 | 11.972156862745097 | 51 | 8 |
| 18.45 | 3.05 | 1 | 9 |
| 25.81173663332435 | 1.545956750046378 | 140162 | |
+----------------------------+--------------------------+-----------------+-----------------+
Time: 0.0227835 seconds
```
Note: A dataset can also be accelerated at configuration time by answering yes (y) to the `locally accelerate (y/n)?` prompt.
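
For reference, after these steps `datasets/my_table/dataset.yaml` from this walkthrough should look roughly like the sketch below. The layout is inferred from the snippets above rather than copied from the repository, and the endpoint value is a placeholder:

```yaml
from: databricks:spice_data.public.awesome_table
name: my_table
params:
  mode: delta_lake
  databricks_endpoint: <databricks endpoint>
acceleration:
  enabled: true
```
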
10 changes: 8 additions & 2 deletions graphql/README.md

Follow these steps to get started with GraphQL as a Data Connector.

- The latest version of Spice. [Install Spice](https://docs.spiceai.org/getting-started/installation)
- A GraphQL endpoint with a query that returns data in JSON format.
- The GitHub GraphQL API (<https://api.github.com/graphql>) is a good example to get started with. [GitHub GraphQL API](https://docs.github.com/en/graphql)

**Step 1.** Edit the `spicepod.yaml` file in this directory and replace the `graphql_quickstart` dataset params with the connection parameters for the GraphQL instance (a sketch follows the list):

- `[local_table_name]`: the desired name for the federated table within Spice
- `[graphql_endpoint]`: the URL of the GraphQL endpoint
- `[graphql_query]`: the query to execute
- `[json_pointer]`: the JSON pointer to the data in the GraphQL response
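
As an illustration only, a filled-in dataset entry might look like the following. The `graphql:` prefix on `from` and the exact parameter key names are assumptions here rather than values taken from this repository, so confirm them against the GraphQL data connector docs linked below:

```yaml
datasets:
  - from: graphql:[graphql_endpoint]
    name: [local_table_name]
    params:
      # Query to execute against the endpoint
      graphql_query: |
        [graphql_query]
      # JSON pointer to the records inside the GraphQL response
      json_pointer: [json_pointer]
```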

See the [GraphQL data connector docs](https://docs.spiceai.org/components/data-connectors/graphql) for more configuration options.
To securely store GraphQL auth params, see [Secret Stores](https://docs.spiceai.org/components/secret-stores).
Add the following environment variable to a `.env` file:

```bash
GH_TOKEN=<your GitHub token>
```

**Step 2.** Run the Spice runtime with `spice run` from the directory with the `spicepod.yaml` file.
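
For example, assuming this quickstart's directory is named `graphql`:

```bash
cd graphql
spice run
```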

2 changes: 1 addition & 1 deletion kubernetes/README.md
# Follow these steps to get started running Spice.ai in Kubernetes

**Step 1.** (Optional) Start a local [`kind`](https://kind.sigs.k8s.io/) cluster:
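
A minimal sketch of that optional step (the cluster name is arbitrary):

```shell
# Create a local Kubernetes cluster using kind
kind create cluster --name spice
```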

5 changes: 3 additions & 2 deletions spiceai/README.md
# Spice Quickstart Tutorial using the Spice.ai Cloud Platform

The Spice.ai Cloud Platform has many datasets that can be used within Spice. A valid login for the Spice.ai Cloud Platform is required to access the datasets. Before beginning this quickstart, [link your GitHub account to Spice.ai](https://spice.ai/login) to get access to the platform.

```bash
spice login
```

A browser window will open displaying a device code, and a matching code will appear in the terminal. Select Approve if the authorization codes match.

![Screenshot](./device_login.png)
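
Based on the registration log below, the dataset entry in `spicepod.yaml` looks approximately like this sketch (the field layout is inferred, not copied from the file):

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth_recent_blocks
    description: ethereum recent blocks
```
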
The Spice runtime terminal will show that the dataset has been loaded:
```console
2024-07-23T01:01:50.403937Z INFO runtime: Dataset eth_recent_blocks registered (spice.ai/eth.recent_blocks), results cache enabled.
```
