From 48357f2714a8ce7a3e49d7f0a10c1d2994916d83 Mon Sep 17 00:00:00 2001
From: Phillip LeBlanc
Date: Tue, 17 Sep 2024 03:47:20 +0900
Subject: [PATCH] Tweaks to quickstarts (#180)

---
 databricks/README.md | 111 ++++++++++++++++++++++++-------------------
 graphql/README.md    |  10 +++-
 kubernetes/README.md |   2 +-
 spiceai/README.md    |   5 +-
 4 files changed, 74 insertions(+), 54 deletions(-)

diff --git a/databricks/README.md b/databricks/README.md
index d4bd5e3..990818c 100644
--- a/databricks/README.md
+++ b/databricks/README.md
@@ -1,6 +1,7 @@
-## Spice on Databricks
+# Spice on Databricks

Spice can read data straight from a Databricks instance. This guide will create an app, configure Databricks, load and query a dataset. It assumes:
+
- Spice is installed (see the [Getting Started](https://docs.spiceai.org/getting-started) documentation).
- The Databricks instance is running against AWS S3 storage in `us-east-1`.
- Basic AWS authentication is configured (with environment variable credentials `AWS_ACCESS_KEY_ID` & `AWS_SECRET_ACCESS_KEY`).
@@ -8,31 +9,35 @@ Spice can read data straight from a Databricks instance. This guide will create
- The credentials have `s3:GetObject` & `s3:ListBucket` permissions for the S3 folder with table data.
- A table already exists in Databricks, called `spice_data.public.awesome_table`.

1. Initialize a Spice app
+
   ```shell
   spice init databricks_demo
   cd databricks_demo
   ```

-1. Start the Spice runtime
-   ```shell
-   >>> spice run
-   2024-03-27T05:27:52.696536Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
-   2024-03-27T05:27:52.696543Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
-   2024-03-27T05:27:52.696606Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
-   ```
-
1. In another terminal, working in the `databricks_demo` directory, configure Spice with the Databricks credentials
+
   ```shell
   spice login databricks \
     --token $DATABRICKS_TOKEN \
     --aws-access-key-id $AWS_ACCESS_KEY_ID \
     --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
     --aws-region us-east-1
-  ```
+   ```

   Executing `spice login` and successfully authenticating will create a `.env` file in the `databricks_demo` directory with the Databricks credentials.

-1. Configure a Databricks dataset into the spicepod. The table provided must be a reference to a table in the Databricks unity catalog.
+1. Start the Spice runtime
+
+   ```shell
+   >>> spice run
+   2024-03-27T05:27:52.696536Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
+   2024-03-27T05:27:52.696543Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
+   2024-03-27T05:27:52.696606Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
+   ```
+
+1. Configure a Databricks dataset into the spicepod. The table provided must be a reference to a table in the Databricks unity catalog.
+
   ```shell
   >>> spice dataset configure
@@ -44,21 +49,22 @@ Spice can read data straight from a Databricks instance. This guide will create
   Saved datasets/my_table/dataset.yaml
   ```

-1. Edit the dataset to add `mode: delta_lake` and `databricks_cluster_id: ` to the `params` section:
+1. Edit the dataset to add `mode: delta_lake` to the `params` section:

   ```yaml
   params:
     mode: delta_lake
     databricks_endpoint:
-    databricks_cluster_id:
   ```

1. Confirm that the runtime has registered the new table (in the original terminal)
+
   ```shell
   2024-03-27T05:27:54.051229Z INFO runtime: Dataset my_table registered (databricks:spice_data.public.awesome_table), results cache enabled.
   ```

1. Check the table exists from the Spice REPL
+
   ```shell
   >>> spice sql
   Welcome to the Spice.ai SQL REPL! Type 'help' for help.
@@ -76,9 +82,8 @@ Spice can read data straight from a Databricks instance. This guide will create
   Time: 0.008540708 seconds
   ```

-
-   ```shell
-   sql> describe datafusion.public.my_table
+   ```shell
+   sql> describe spice.public.my_table
   +-----------------------+------------------------------+-------------+
   | column_name           | data_type                    | is_nullable |
   +-----------------------+------------------------------+-------------+
@@ -106,6 +111,7 @@ Spice can read data straight from a Databricks instance. This guide will create
   ```

1. Query against the Databricks table. Since the table isn't accelerated, the spice runtime will make a network call to the object storage service.
+
   ```shell
   >>> spice sql
   sql> SELECT avg(total_amount), avg(tip_amount), count(1), passenger_count FROM my_table GROUP BY passenger_count ORDER BY passenger_count ASC;
@@ -129,40 +135,47 @@ Spice can read data straight from a Databricks instance. This guide will create
   ```

## (Optional): Accelerating Databricks
-To improve the query performance, the Databricks dataset can be accelerated.
+
+To improve the query performance, the Databricks dataset can be accelerated.
+
1. Edit the dataset, `my_table`.
-```shell
-echo """acceleration:
-    enabled: true""" >> datasets/my_table/dataset.yaml
-```
+
+   ```shell
+   echo """acceleration:
+       enabled: true""" >> datasets/my_table/dataset.yaml
+   ```
+
2. The Spice runtime should be updated (i.e. `ACCELERATION=true`)
-```shell
->>> spice datasets
-FROM NAME REPLICATION ACCELERATION DEPENDSON STATUS
-databricks:spice_data.public.awesome_table my_table false true Ready
-```
+
+   ```shell
+   >>> spice datasets
+
+   FROM NAME REPLICATION ACCELERATION DEPENDSON STATUS
+   databricks:spice_data.public.awesome_table my_table false true Ready
+   ```
+
3. Rerun the query
-```shell
->>> spice sql
-sql> select avg(total_amount), avg(tip_amount), count(1), passenger_count from my_table group by passenger_count order by passenger_count asc;
-+----------------------------+--------------------------+-----------------+-----------------+
-| AVG(my_table.total_amount) | AVG(my_table.tip_amount) | COUNT(Int64(1)) | passenger_count |
-+----------------------------+--------------------------+-----------------+-----------------+
-| 25.32781693945653 | 3.072259971396793 | 31465 | 0 |
-| 26.205230445474996 | 3.3712622884680052 | 2188739 | 1 |
-| 29.520659930930304 | 3.7171302113290854 | 405103 | 2 |
-| 29.138309044290263 | 3.5370455392167615 | 91262 | 3 |
-| 30.877266710278306 | 3.466037634201712 | 51974 | 4 |
-| 26.269129111203988 | 3.3797078135259317 | 33506 | 5 |
-| 25.801183286359798 | 3.344098778687425 | 22353 | 6 |
-| 57.735 | 8.37 | 8 | 7 |
-| 95.66803921568626 | 11.972156862745097 | 51 | 8 |
-| 18.45 | 3.05 | 1 | 9 |
-| 25.81173663332435 | 1.545956750046378 | 140162 | |
-+----------------------------+--------------------------+-----------------+-----------------+
-
-Time: 0.0227835 seconds
-```
-
-Note: A dataset can be accelerated when configured by specifying yes (y) to `locally accelerate (y/n)?`.
\ No newline at end of file
+
+   ```shell
+   >>> spice sql
+   sql> select avg(total_amount), avg(tip_amount), count(1), passenger_count from my_table group by passenger_count order by passenger_count asc;
+   +----------------------------+--------------------------+-----------------+-----------------+
+   | AVG(my_table.total_amount) | AVG(my_table.tip_amount) | COUNT(Int64(1)) | passenger_count |
+   +----------------------------+--------------------------+-----------------+-----------------+
+   | 25.32781693945653 | 3.072259971396793 | 31465 | 0 |
+   | 26.205230445474996 | 3.3712622884680052 | 2188739 | 1 |
+   | 29.520659930930304 | 3.7171302113290854 | 405103 | 2 |
+   | 29.138309044290263 | 3.5370455392167615 | 91262 | 3 |
+   | 30.877266710278306 | 3.466037634201712 | 51974 | 4 |
+   | 26.269129111203988 | 3.3797078135259317 | 33506 | 5 |
+   | 25.801183286359798 | 3.344098778687425 | 22353 | 6 |
+   | 57.735 | 8.37 | 8 | 7 |
+   | 95.66803921568626 | 11.972156862745097 | 51 | 8 |
+   | 18.45 | 3.05 | 1 | 9 |
+   | 25.81173663332435 | 1.545956750046378 | 140162 | |
+   +----------------------------+--------------------------+-----------------+-----------------+
+
+   Time: 0.0227835 seconds
+   ```
+
+Note: A dataset can be accelerated when configured by specifying yes (y) to `locally accelerate (y/n)?`.
diff --git a/graphql/README.md b/graphql/README.md
index bb2b3ea..8573cdc 100644
--- a/graphql/README.md
+++ b/graphql/README.md
@@ -6,7 +6,7 @@ Follow these steps to get started with GraphQL as a Data Connector.

- The latest version of Spice. [Install Spice](https://docs.spiceai.org/getting-started/installation)
- A GraphQL endpoint with a query that returns data in JSON format.
-  - The GitHub GraphQL API (https://api.github.com/graphql) is a good example to get started with. [GitHub GraphQL API](https://docs.github.com/en/graphql)
+  - The GitHub GraphQL API (<https://api.github.com/graphql>) is a good example to get started with. [GitHub GraphQL API](https://docs.github.com/en/graphql)

**Step 1.** Edit the `spicepod.yaml` file in this directory and replace the `graphql_quickstart` dataset params with the connection parameters for the GraphQL instance, where `[local_table_name]` is the desired name for the federated table within Spice, `[graphql_endpoint]` is the URL to the GraphQL endpoint, `[graphql_query]` is the query to execute, and `[json_pointer]` is the JSON pointer to the data in the GraphQL response.
@@ -46,7 +46,13 @@ datasets:

See the [GraphQL data connector docs](https://docs.spiceai.org/components/data-connectors/graphql) for more configuration options.

-To securely store GraphQL auth params, see [Secret Stores](https://docs.spiceai.org/components/secret-stores)
+To securely store GraphQL auth params, see [Secret Stores](https://docs.spiceai.org/components/secret-stores).
+
+Add the following environment variable to a `.env` file:
+
+```bash
+GH_TOKEN=
+```

**Step 2.** Run the Spice runtime with `spice run` from the directory with the `spicepod.yaml` file.

diff --git a/kubernetes/README.md b/kubernetes/README.md
index f8f1130..dd515b7 100644
--- a/kubernetes/README.md
+++ b/kubernetes/README.md
@@ -1,4 +1,4 @@
-### Follow these steps to get started running Spice.ai in Kubernetes.
+# Follow these steps to get started running Spice.ai in Kubernetes

**Step 1.** (Optional) Start a local [`kind`](https://kind.sigs.k8s.io/) cluster:

diff --git a/spiceai/README.md b/spiceai/README.md
index ca1aaf7..0e1da1c 100644
--- a/spiceai/README.md
+++ b/spiceai/README.md
@@ -1,4 +1,4 @@
-## Spice Quickstart Tutorial using the Spice.ai Cloud Platform
+# Spice Quickstart Tutorial using the Spice.ai Cloud Platform

The Spice.ai Cloud Platform has many datasets that can be used within Spice. A valid login for the Spice.ai Cloud Platform is required to access the datasets. Before beginning this quickstart, [link your GitHub account to Spice.ai](https://spice.ai/login) to get access to the platform.

@@ -14,6 +14,7 @@ cd spiceai-demo
```bash
spice login
```
+
A browser window will open displaying a code that will appear in the terminal. Select Approve if the authorization codes match.

![Screenshot](./device_login.png)

@@ -90,7 +91,7 @@ description: ethereum recent blocks

The Spice runtime terminal will show that the dataset has been loaded:

-```
+```console
2024-07-23T01:01:50.403937Z INFO runtime: Dataset eth_recent_blocks registered (spice.ai/eth.recent_blocks), results cache enabled.
```
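For reference, and not part of the patch itself: the sketch below shows a minimal `spicepod.yaml` that ties together two pieces the patched quickstarts describe — the Spice.ai Cloud `eth.recent_blocks` dataset registered in the spiceai README's runtime log, and local acceleration as enabled in the optional Databricks step. The app name and field layout are assumptions based on the standard Spicepod schema, so verify against the Spice docs before relying on it.

```yaml
version: v1beta1
kind: Spicepod
name: spiceai-demo
datasets:
  # Federated dataset from the Spice.ai Cloud Platform, matching the runtime log above:
  # "Dataset eth_recent_blocks registered (spice.ai/eth.recent_blocks)".
  - from: spice.ai/eth.recent_blocks
    name: eth_recent_blocks
    # Local acceleration, mirroring the `acceleration: enabled: true` block that the
    # optional Databricks step appends to the dataset definition.
    acceleration:
      enabled: true
```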