update docs for dask-databricks
skirui-source committed Nov 10, 2023
1 parent 2369ee8 commit 4fbc992
Showing 2 changed files with 58 additions and 76 deletions.
68 changes: 5 additions & 63 deletions source/platforms/databricks-dask.md
@@ -1,54 +1,3 @@
# Databricks

## Dask RAPIDS in a Databricks MNMG Cluster

You can launch a Dask RAPIDS cluster on a multi-node GPU Databricks cluster.

```{warning}
It is also possible to use [Spark RAPIDS](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html) with Dask on the same Databricks cluster. To do this, you
must provide an init script that downloads the `rapids-4-spark-xxxx.jar` plugin and then configure Spark to load it.
```
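As a rough sketch of what such an init script could look like (the version, jar name, and `/databricks/jars` path here are illustrative assumptions rather than tested values):

```bash
#!/bin/bash
# Hypothetical sketch of a Spark RAPIDS init script; values are illustrative.
set -e

SPARK_RAPIDS_VERSION="23.10.0"  # assumed release; substitute the one you need
JAR="rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar"

# Download the plugin jar from Maven Central into Databricks' jar directory
# so that Spark picks it up on the classpath.
wget -O "/databricks/jars/${JAR}" \
  "https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/${JAR}"
```

With the jar in place, the cluster's Spark configuration also needs `spark.plugins` set to `com.nvidia.spark.SQLPlugin` so the accelerator is loaded at startup.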

### Init script

Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask and the RAPIDS Accelerator for Apache Spark.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI, select **Create** > **File** from the menu, and create an `init.sh` script with the following contents:

```bash
#!/bin/bash

set -e

echo "DB_IS_DRIVER = $DB_IS_DRIVER"
echo "DB_DRIVER_IP = $DB_DRIVER_IP"

# dask-cuda provides the `dask cuda worker` command used below
pip install "dask[complete]" dask-cuda

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "This node is the Dask scheduler."
  dask scheduler &
else
  echo "This node is a Dask worker."
  echo "Connecting to Dask scheduler at $DB_DRIVER_IP:8786"
  # Wait for the scheduler to start
  while ! nc -z "$DB_DRIVER_IP" 8786; do
    echo "Scheduler not available yet. Waiting..."
    sleep 10
  done
  dask cuda worker "tcp://$DB_DRIVER_IP:8786" &
fi
```

NOTE: The logic in the above script will instead be packaged as an importable library.

### Launch a Databricks cluster

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

In order to launch a GPU node, uncheck **Use Photon Acceleration**, then choose a Databricks ML GPU Runtime from the drop-down, for instance `13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)`. Once you have done this you should be able to select one of the GPU-accelerated instance types for the Driver and Worker nodes.

Optionally, enable autoscaling of worker nodes based on load.
@@ -59,17 +8,10 @@
Expand the **Advanced Options** section, open the **Init Scripts** tab, and add the init script.

You can also configure cluster log delivery, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs/<cluster-id>/init_scripts/`. Refer to the [Databricks docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.

Once your cluster has started, create a new notebook or open an existing one.

### Test RAPIDS

At the top of your notebook, run any of the following `pip` install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

Connect to the Dask client using the scheduler address and submit tasks.

```python
from dask.distributed import Client
import os

client = Client(f'{os.environ["SPARK_LOCAL_IP"]}:8786')
```
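As a quick sanity check, a minimal snippet (assuming `cudf` was installed with the first command above) confirms the GPU library imports and runs:

```python
import cudf

# Create a small GPU series and compute on it
s = cudf.Series([1, 2, 3])
print(s.sum())  # 6
```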

You can also run this HPO workflow example (link) to get started with Dask and Spark RAPIDS. Refer to the blog (link) for more information.
66 changes: 53 additions & 13 deletions source/platforms/databricks.md
@@ -1,12 +1,42 @@
# Databricks

## Databricks Notebooks

You can install RAPIDS libraries into a Databricks GPU Notebook environment.

## Dask RAPIDS in a Databricks MNMG Cluster

You can launch a Dask RAPIDS cluster on a multi-node GPU Databricks cluster. To do this, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) that installs Dask before launching the Databricks cluster.

Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI, select **Create** > **File** from the menu, and create an `init.sh` script with the following contents:

```bash
#!/bin/bash
set -e

# The Databricks Python directory isn't on the path in
# databricksruntime/gpu-tensorflow:cuda11.8 for some reason
export PATH="/databricks/python/bin:$PATH"

# Install RAPIDS (cudf & dask-cudf) and dask-databricks
/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
  bokeh==3.2.2 \
  cudf-cu11 \
  "dask[complete]" \
  dask-cudf-cu11 \
  dask-cuda==23.10.0 \
  dask-databricks

# Start the Dask cluster with CUDA workers
dask databricks run --cuda

```

```{note}
If you only need to install RAPIDS in a Databricks GPU Notebook environment, skip this section and proceed directly to launching a Databricks cluster.
```

## Launch a Databricks cluster

Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node" or "Single node".

![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)

@@ -28,7 +58,7 @@
It is also possible to use the Databricks ML GPU Runtime to enable GPU nodes, however…

Select **Create Compute**.

## Databricks notebook

Once your cluster has started, create a new notebook or open an existing one.

@@ -44,14 +74,6 @@
At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does not…
````

At the top of your notebook, run any of the following `pip` install commands to install your preferred RAPIDS libraries.

```text
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
```

### Test RAPIDS

```python
# @@ -66,6 +88,24 @@ (preceding lines collapsed in the diff view)
gdf

```

You can also connect a Dask client to the cluster and submit tasks.

```python
from dask.distributed import Client
from dask_databricks import DatabricksCluster

# Connect to the Dask cluster launched by the init script
cluster = DatabricksCluster()
client = Client(cluster)

def inc(x):
    return x + 1

x = client.submit(inc, 10)
x.result()  # returns 11
```
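Because the init script also installed `cudf` and `dask-cudf`, the same client can drive GPU dataframes across the CUDA workers. A minimal sketch, assuming the cluster above is running:

```python
import cudf
import dask_cudf

# Build a small GPU dataframe and partition it across the CUDA workers
gdf = cudf.DataFrame({"a": list(range(100))})
ddf = dask_cudf.from_cudf(gdf, npartitions=4)

# The aggregation executes on the workers' GPUs
print(ddf["a"].sum().compute())  # 4950
```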

## Databricks Spark

You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
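At its core, enabling the accelerator comes down to Spark configuration along these lines (illustrative values; see the linked documentation for the authoritative setup):

```text
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled true
```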
