update docs for dask-databricks
skirui-source committed Nov 10, 2023
1 parent 2369ee8 commit 4fbc992
Showing 2 changed files with 58 additions and 76 deletions.
68 changes: 5 additions & 63 deletions source/platforms/databricks-dask.md
@@ -1,54 +1,3 @@
# Databricks

## Dask RAPIDS in a Databricks MNMG Cluster

You can launch a Dask RAPIDS cluster on a multi-node GPU Databricks cluster.

```{warning}
It is also possible to use [Spark RAPIDS](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html) with Dask on the same Databricks cluster. To do this, you
must provide an init script that downloads the `rapids-4-spark-xxxx.jar` plugin and then configure Spark to load it.
```
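As a rough sketch of what such an init script could look like (the version, jar name, and `/databricks/jars` path here are illustrative assumptions rather than tested values):

```bash
#!/bin/bash
# Hypothetical sketch of a Spark RAPIDS init script; values are illustrative.
set -e

SPARK_RAPIDS_VERSION="23.10.0"  # assumed release; substitute the one you need
JAR="rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar"

# Download the plugin jar from Maven Central into Databricks' jar directory
# so that Spark picks it up on the classpath.
wget -O "/databricks/jars/${JAR}" \
  "https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/${JAR}"
```

With the jar in place, the cluster's Spark configuration also needs `spark.plugins` set to `com.nvidia.spark.SQLPlugin` so the accelerator is loaded at startup.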

### Init script

Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask and the RAPIDS Accelerator for Apache Spark.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI, select **Create** > **File** from the menu, and create an `init.sh` script with the following contents:

```bash
#!/bin/bash

set -e

echo "DB_IS_DRIVER = $DB_IS_DRIVER"
echo "DB_DRIVER_IP = $DB_DRIVER_IP"

# dask-cuda provides the `dask cuda worker` command used below
pip install "dask[complete]" dask-cuda

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "This node is the Dask scheduler."
  dask scheduler &
else
  echo "This node is a Dask worker."
  echo "Connecting to Dask scheduler at $DB_DRIVER_IP:8786"
  # Wait for the scheduler to start
  while ! nc -z "$DB_DRIVER_IP" 8786; do
    echo "Scheduler not available yet. Waiting..."
    sleep 10
  done
  dask cuda worker "tcp://$DB_DRIVER_IP:8786" &
fi
```

NOTE: The logic in the above script will instead be packaged as an importable library.

### Launch a Databricks cluster

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

In order to launch a GPU node, uncheck **Use Photon Acceleration**, then choose a Databricks ML GPU Runtime from the drop-down, for instance `13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)`. Once you have done this you should be able to select one of the GPU-accelerated instance types for the Driver and Worker nodes.

Optionally, enable autoscaling of worker nodes based on load.
@@ -59,17 +8,10 @@
Expand the **Advanced Options** section, open the **Init Scripts** tab, and add the init script.

You can also configure cluster log delivery, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.

Once your cluster has started, create a new notebook or open an existing one.

### Test RAPIDS

At the top of your notebook, run any of the following `pip` install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

Connect to the Dask client using the scheduler address and submit tasks.

```python
from dask.distributed import Client
import os

client = Client(f'{os.environ["SPARK_LOCAL_IP"]}:8786')
```
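As a quick sanity check, a minimal snippet (assuming `cudf` was installed with the first command above) confirms the GPU library imports and runs:

```python
import cudf

# Create a small GPU series and compute on it
s = cudf.Series([1, 2, 3])
print(s.sum())  # 6
```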

You can also run this HPO workflow example (link) to get started with Dask and Spark RAPIDS. Refer to the blog (link) for more information.
66 changes: 53 additions & 13 deletions source/platforms/databricks.md
@@ -1,12 +1,42 @@
# Databricks

## Databricks Notebooks

You can install RAPIDS libraries into a Databricks GPU Notebook environment.

## Dask RAPIDS in a Databricks MNMG Cluster

You can launch a Dask RAPIDS cluster on a multi-node GPU Databricks cluster. To do this, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) that installs Dask before launching the Databricks cluster.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI, select **Create** > **File** from the menu, and create an `init.sh` script with the following contents:

```bash
#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
  bokeh==3.2.2 \
  cudf-cu11 \
  "dask[complete]" \
  dask-cudf-cu11 \
  dask-cuda==23.10.0 \
  dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

```{note}
If you only need to install RAPIDS in a Databricks GPU Notebook environment, skip this section and proceed directly to launching a Databricks cluster.
```

## Launch a Databricks cluster

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node" or "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

@@ -28,7 +58,7 @@
It is also possible to use the Databricks ML GPU Runtime to enable GPU nodes, however…

Select **Create Compute**.

## Databricks notebook

Once your cluster has started, create a new notebook or open an existing one.

@@ -44,14 +74,6 @@
At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does not…
````

At the top of your notebook, run any of the following `pip` install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test RAPIDS

```python
# @@ -66,6 +88,24 @@ (preceding lines collapsed in the diff view)
gdf

```

You can also connect a Dask client to the cluster and submit tasks.

```python
from dask.distributed import Client
from dask_databricks import DatabricksCluster

# Connect to the Dask cluster launched by the init script
cluster = DatabricksCluster()
client = Client(cluster)

def inc(x):
    return x + 1

x = client.submit(inc, 10)
x.result()  # returns 11
```
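Because the init script also installed `cudf` and `dask-cudf`, the same client can drive GPU dataframes across the CUDA workers. A minimal sketch, assuming the cluster above is running:

```python
import cudf
import dask_cudf

# Build a small GPU dataframe and partition it across the CUDA workers
gdf = cudf.DataFrame({"a": list(range(100))})
ddf = dask_cudf.from_cudf(gdf, npartitions=4)

# The aggregation executes on the workers' GPUs
print(ddf["a"].sum().compute())  # 4950
```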

## Databricks Spark

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
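At its core, enabling the accelerator comes down to Spark configuration along these lines (illustrative values; see the linked documentation for the authoritative setup):

```text
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled true
```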
