diff --git a/source/platforms/databricks-dask.md b/source/platforms/databricks-dask.md
index 258d8601..3ab34a69 100644
--- a/source/platforms/databricks-dask.md
+++ b/source/platforms/databricks-dask.md
@@ -1,54 +1,3 @@
-# Databricks
-
-## DASK Rapids in Databricks MNMG Cluster
-
-You can launch Dask RAPIDS cluster on a multi-node GPU Databricks cluster
-
-```{warning}
-It is also possible to use [Spark RAPIDS](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/databricks.html) with Dask on the same Databricks cluster. To do this, the user
-must provide an init script that downloads the `rapids-4-spark-xxxx.jar`` plugin and then configure Spark to load this plugin.
-```
-
-### Init script
-
-Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask and the RAPIDS Accelerator for Apache Spark.
-
-Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI and select **Create** > **File** from the menu, create an `init.sh` script with contents:
-
-```python
-#!/bin/bash
-
-set -e
-
-echo "DB_IS_DRIVER = $DB_IS_DRIVER"
-echo "DB_DRIVER_IP = $DB_DRIVER_IP"
-
-pip install "dask[complete]"
-
-if [[ $DB_IS_DRIVER = "TRUE" ]]; then
-  echo "This node is the Dask scheduler."
-  dask scheduler &
-else
-  echo "This node is a Dask worker."
-  echo "Connecting to Dask scheduler at $DB_DRIVER_IP:8786"
-  # Wait for the scheduler to start
-  while ! nc -z $DB_DRIVER_IP 8786; do
-    echo "Scheduler not available yet. Waiting..."
-    sleep 10
-  done
-  dask cuda worker tcp://$DB_DRIVER_IP:8786 &
-fi
-
-```
-
-NOTE: The above script will be packaged as a library to be imported instead.
-
-### Launch a Databricks cluster
-
-Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node".
-
-![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)
-
 In order to launch a GPU node uncheck **Use Photon Acceleration**. Then choose Databricks ML GPU Runtime from the drop down. For instance,`13.3 LTS ML (GPU, Scala 2.12, Spark 3.4.1)`. Once you have done this you should be able to select one of the GPU-accelerated instance types for the Driver and Worker nodes. Optional to enable autoscale for worker nodes based on load.
@@ -59,17 +8,10 @@ Expand the **Advanced Options** section and open the **Init Scripts** tab and ad
 You can also configure cluster log delivery, which will write the init script logs to DBFS in a subdirectory called `dbfs:/cluster-logs//init_scripts/`. Refer to [docs](https://docs.databricks.com/en/init-scripts/logs.html) for more information.
 
-Once your cluster has started create a new notebook or open an existing one.
-
-### Test Rapids
-
-Connect to the dask client using the scheduler address and submit tasks.
+At the top of your notebook, run any of the following `pip` install commands to install your preferred RAPIDS libraries.
 
-```python
-from dask.distributed import Client
-import os
-
-client = Client(f'{os.environ["SPARK_LOCAL_IP"]}:8786')
+```text
+!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
+!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
+!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
 ```
-
-You can run also this HPO workflow example (link) to get started on using Dask X Spark RAPIDS. Refer to blog (link) for more information.
 
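+To verify the install, try one of the libraries in the next cell (a minimal sketch, assuming the `cudf-cu11` install above succeeded):
+
+```python
+import cudf
+
+# Build a small Series on the GPU and reduce it; prints 6
+print(cudf.Series([1, 2, 3]).sum())
+```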
diff --git a/source/platforms/databricks.md b/source/platforms/databricks.md
index 03f96658..065c12f6 100644
--- a/source/platforms/databricks.md
+++ b/source/platforms/databricks.md
@@ -1,12 +1,42 @@
 # Databricks
 
-## Databricks Notebooks
-
 You can install RAPIDS libraries into a Databricks GPU Notebook environment.
 
-### Launch a single-node Databricks cluster
+## Dask RAPIDS on a Databricks MNMG cluster
+
+You can launch a Dask RAPIDS cluster on a multi-node, multi-GPU (MNMG) Databricks cluster. To do this, you must first create an [initialization script](https://docs.databricks.com/en/init-scripts/index.html) to install Dask before launching the Databricks cluster.
+
+Databricks recommends storing all cluster-scoped init scripts using workspace files. Each user has a Home directory configured under the `/Users` directory in the workspace. Navigate to your home directory in the UI, select **Create** > **File** from the menu, and create an `init.sh` script with the following contents:
+
+```bash
+#!/bin/bash
+set -e
+
+# The Databricks Python directory isn't on the PATH in the
+# databricksruntime/gpu-tensorflow:cuda11.8 image, so add it explicitly
+export PATH="/databricks/python/bin:$PATH"
+
+# Install RAPIDS (cudf & dask-cudf) and dask-databricks
+/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \
+    bokeh==3.2.2 \
+    cudf-cu11 \
+    "dask[complete]" \
+    dask-cudf-cu11 \
+    dask-cuda==23.10.0 \
+    dask-databricks
+
+# Start the Dask cluster with CUDA workers
+dask databricks run --cuda
+
+```
+
+```{note}
+If you only need to install RAPIDS in a Databricks GPU Notebook environment, skip this section and proceed directly to launching a Databricks cluster below.
+```
+
+## Launch a Databricks cluster
 
-Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**.
+Navigate to the **All Purpose Compute** tab of the **Compute** section in Databricks and select **Create Compute**. Name your cluster and choose "Multi node" or "Single node".
 
 ![Screenshot of the Databricks compute page](../images/databricks-create-compute.png)
@@ -28,7 +58,7 @@ It is also possible to use the Databricks ML GPU Runtime to enable GPU nodes, ho
 
 Select **Create Compute**.
 
-### Install RAPIDS in your notebook
+## Databricks notebook
 
 Once your cluster has started create a new notebook or open an existing one.
@@ -44,14 +74,6 @@ At the time of writing the `databricksruntime/gpu-pytorch:cuda11.8` image does n
 ````
 
-At the top of your notebook run any of the following `pip` install commands to install your preferred RAPIDS libraries.
-
-```text
-!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
-!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
-!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
-```
-
 ### Test Rapids
 
 ```python
 import cudf
@@ -66,6 +88,24 @@
 gdf
 ```
 
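+As a slightly bigger smoke test (a minimal sketch, assuming the `cudf` import above succeeded), you can check that a groupby aggregation also runs on the GPU:
+
+```python
+import cudf
+
+# Aggregate on the GPU, mirroring the pandas API
+df = cudf.DataFrame({"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 4.0]})
+print(df.groupby("key").val.mean())
+```
+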
+You can also connect a Dask client to the cluster started by the init script and submit tasks.
+
+```python
+from dask.distributed import Client
+from dask_databricks import DatabricksCluster
+
+cluster = DatabricksCluster()
+client = Client(cluster)
+
+def inc(x):
+    return x + 1
+
+x = client.submit(inc, 10)
+x.result()
+# 11
+
+```
+
 ## Databricks Spark
 
 You can also use the RAPIDS Accelerator for Apache Spark 3.x on Databricks. See the [Spark RAPIDS documentation](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html) for more information.
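+
+Enabling the accelerator ultimately comes down to loading the RAPIDS plugin into Spark, for example via the cluster's **Spark config** (a minimal sketch; the linked guide covers the Databricks-specific init script and full configuration):
+
+```text
+spark.plugins com.nvidia.spark.SQLPlugin
+```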