RAPIDS/Dask Databricks MNMG #296

Closed · 6 tasks done

jacobtomlinson opened this issue Oct 18, 2023 · 0 comments

This is a high-level tracking issue for supporting Dask RAPIDS deployments on Databricks.

High-level goals

  • Make it easy for users to launch a RAPIDS Dask cluster on an MNMG Databricks cluster.
  • Support running both Dask RAPIDS and Spark RAPIDS on the same MNMG Databricks cluster.
  • Show a workflow that uses Spark RAPIDS and Dask RAPIDS together (or alternative implementations of the same workload on the same cluster).
  • Show a workflow that uses dask-deltatable to read data from Delta Lake.

Databricks multi-node technical architecture

When you launch a multi-node cluster on Databricks, a Spark driver node and many Spark worker nodes are provisioned. When you run a notebook session, the notebook kernel executes on the driver node and the Spark cluster can be leveraged using pyspark.
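For illustration, a minimal sketch of driving the cluster from a notebook through pyspark (the `spark.range` workload is just a placeholder; in a Databricks notebook `getOrCreate()` returns the session already attached to the cluster):

```python
from pyspark.sql import SparkSession

# Sketch: the notebook kernel runs on the driver node, and getOrCreate()
# returns the SparkSession already attached to the cluster, so anything
# submitted through it executes on the Spark worker nodes.
spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)  # a distributed DataFrame (placeholder workload)
print(df.count())            # the count is computed on the workers
```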

To use Spark RAPIDS with this Databricks cluster, the user must provide an init script that downloads the rapids-4-spark-xxxx.jar plugin and then configure Spark to load it. Spark queries then leverage libcudf under the hood and benefit from GPU acceleration.
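For reference, a minimal sketch of the Spark-side settings the plugin needs once the jar is on the classpath (the plugin class name com.nvidia.spark.SQLPlugin is the documented one; on Databricks these settings would normally live in the cluster's Spark config alongside the init script rather than being set in notebook code):

```python
from pyspark.sql import SparkSession

# Sketch: configure Spark to load the RAPIDS plugin. On Databricks these
# settings usually go in the cluster's Spark config UI rather than here.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # load the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")             # run eligible query stages on the GPU
    .getOrCreate()
)
```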

Adding Dask to Databricks clusters

This driver/worker layout mirrors a Dask cluster, which has a scheduler and workers on separate nodes, and the Spark RAPIDS example shows that Databricks provides a mechanism to run a script on every node at startup to install plugins.

Because a script can be run on every node at startup, we could create a Dask Runner (xref dask/community#346) that bootstraps a Dask cluster as part of the init script process.

When the init script runs it appears to have access to a number of environment variables, the most useful of which are DB_IS_DRIVER and DB_DRIVER_IP. These provide all the information we need for the driver node to start a dask scheduler process and for the worker nodes to start dask cuda worker processes that connect to that scheduler. These processes would need to be started in the background so that the init script can complete and the rest of the Spark cluster startup can continue.
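A minimal sketch of that bootstrap decision, written in Python for readability (a real init script would likely be a shell script; the TRUE/FALSE value of DB_IS_DRIVER, the `dask scheduler` / `dask cuda worker` CLI commands, and the default scheduler port 8786 are assumptions here):

```python
import os
import subprocess

# Sketch of the init-script logic: inspect the Databricks-provided environment
# variables and start the appropriate Dask process in the background so the
# init script can return and Spark startup can continue.
is_driver = os.environ.get("DB_IS_DRIVER", "FALSE").upper() == "TRUE"  # assumed TRUE/FALSE values
driver_ip = os.environ["DB_DRIVER_IP"]

if is_driver:
    # Driver node: run the Dask scheduler alongside the Spark driver.
    subprocess.Popen(["dask", "scheduler"])
else:
    # Worker nodes: run a GPU worker (from dask-cuda) that connects back to
    # the scheduler on the driver; 8786 is the default scheduler port.
    subprocess.Popen(["dask", "cuda", "worker", f"tcp://{driver_ip}:8786"])
```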

Then, because the notebook kernel runs on the driver node, we should be able to use dask_cudf and the other RAPIDS Dask integrations to communicate with the Dask scheduler on localhost, the same way pyspark does.
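A rough sketch of what the notebook side could then look like (the scheduler address, port, and data path are placeholders; this just uses the standard distributed Client and dask_cudf APIs):

```python
from dask.distributed import Client
import dask_cudf

# Connect to the Dask scheduler started by the init script on this (driver)
# node; 8786 is the default scheduler port.
client = Client("localhost:8786")

# Placeholder workload: read a dataset into GPU memory across the workers.
ddf = dask_cudf.read_parquet("/dbfs/path/to/data.parquet")  # placeholder path
print(ddf.head())
```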

Deliverables

Tasks

  1. doc platform/databricks (skirui-source)
  2. 3 of 3 subtasks (skirui-source)
  3. skirui-source

Other info

@jacobtomlinson has started experimenting with the Dask Runners concept in https://github.com/jacobtomlinson/dask-runners (which will likely move to the dask-contrib org at some point). This is probably a reasonable place to prototype this in a PR.

@jacobtomlinson changed the title from "Create Databricks MNMG Runner" to "RAPIDS/Dask Databricks MNMG" on Jan 4, 2024