This issue is intended to be a high-level tracking issue for supporting Dask RAPIDS deployments on Databricks.
High-level goals
Make it easy for users to launch a RAPIDS Dask cluster on an MNMG Databricks cluster.
Support running both Dask RAPIDS and Spark RAPIDS on the same MNMG Databricks cluster.
Show a workflow that uses both Spark RAPIDS and Dask RAPIDS together (or show alternate implementations of the same thing using the same cluster).
Show a workflow that uses dask-deltatable to read data from Delta Lake (see the sketch below).
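To make that last goal concrete, here is a minimal, hedged sketch of reading a Delta Lake table into a Dask DataFrame. The entry point name and the table path are assumptions: recent dask-deltatable releases expose read_deltalake, while older ones used read_delta_table.

```python
# Hedged sketch of the dask-deltatable goal above. The function name and path are
# assumptions: recent dask-deltatable releases expose `read_deltalake`, older ones
# used `read_delta_table`; adjust to whichever the installed version provides.
import dask_deltatable as ddt

# Read a Delta Lake table (e.g. one written by Spark on DBFS) into a Dask DataFrame.
ddf = ddt.read_deltalake("/dbfs/path/to/delta-table")  # hypothetical path

# Downstream this could be moved onto GPU-backed partitions, e.g. via dask_cudf.
print(ddf.head())
```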
Databricks multi-node technical architecture
When you launch a multi-node cluster on Databricks, a Spark driver node and many Spark worker nodes are provisioned. When you run a notebook session, the notebook kernel is executed on the driver node and the Spark cluster can be leveraged using pyspark.
To use Spark RAPIDS with this Databricks cluster, the user must provide an init script that downloads the rapids-4-spark-xxxx.jar plugin and then configure Spark to load it. Spark queries will then leverage libcudf under the hood and benefit from GPU acceleration.
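For reference, a hedged sketch of the Spark configuration involved. On Databricks these settings would normally go in the cluster's Spark config alongside the init script rather than in notebook code; the config keys and plugin class below follow the Spark RAPIDS documentation as I understand it and should be treated as assumptions.

```python
# Hedged sketch: the kind of Spark configuration that enables the RAPIDS plugin.
# On Databricks these settings normally live in the cluster's Spark config (next to
# the init script that downloads rapids-4-spark-xxxx.jar), not in notebook code; the
# keys below are assumptions based on the Spark RAPIDS docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # load the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")             # enable GPU query acceleration
    .getOrCreate()
)

# With the plugin active, ordinary DataFrame queries run on libcudf under the hood.
spark.range(1_000_000).selectExpr("sum(id)").show()
```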
Adding Dask to Databricks clusters
This driver/worker cluster architecture is the same shape as a Dask cluster, which has a scheduler and workers on different nodes, and the Spark RAPIDS example shows that Databricks provides a mechanism to run a script on every node at startup to install plugins.
Because a script runs on every node at startup, we could create a Dask Runner (xref dask/community#346) that bootstraps a Dask cluster as part of the init script process.
When the init script is run it appears to have access to a number of environment variables, the most useful of which are DB_IS_DRIVER and DB_DRIVER_IP. These provide all the information we need to have the driver node start a dask scheduler process and the worker nodes start a dask cuda worker process that connects to the scheduler. These processes would need to be started in the background so that the init script can complete and move on with the rest of the Spark cluster startup.
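As a rough illustration of the shape this could take, below is a hedged Python sketch of the per-node bootstrap logic an init script might invoke. The environment variable values, scheduler port, and CLI invocations are assumptions; depending on installed versions the CLI may instead be the hyphenated dask-scheduler / dask-cuda-worker commands.

```python
# Hedged sketch of per-node bootstrap logic an init script could invoke.
# Assumptions: dask, distributed, and dask-cuda are installed on every node,
# DB_IS_DRIVER is the string "TRUE" on the driver node, and the scheduler
# listens on the default port 8786.
import os
import subprocess


def start_dask_processes():
    if os.environ.get("DB_IS_DRIVER", "").upper() == "TRUE":
        # Driver node: start the scheduler in the background so the init
        # script can return and the Spark cluster startup can continue.
        subprocess.Popen(["dask", "scheduler"])
    else:
        # Worker node: start a GPU worker pointed at the scheduler on the driver.
        scheduler = f"tcp://{os.environ['DB_DRIVER_IP']}:8786"
        subprocess.Popen(["dask", "cuda", "worker", scheduler])


if __name__ == "__main__":
    start_dask_processes()
```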
Then, when using a notebook with the kernel running on the driver node, we should be able to use dask_cudf and other RAPIDS Dask integrations to communicate with the Dask scheduler on localhost, the same way pyspark does.
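From the notebook side that would look roughly like the following; the scheduler port, data path, and column name are assumptions for illustration.

```python
# Hedged sketch: from a Databricks notebook (kernel on the driver node), connect to
# the scheduler bootstrapped by the init script. The default port 8786 and the data
# path are assumptions.
from dask.distributed import Client
import dask_cudf

client = Client("localhost:8786")

# GPU-backed dataframe work is distributed across the dask-cuda workers.
ddf = dask_cudf.read_parquet("/dbfs/path/to/data/*.parquet")  # hypothetical path
print(ddf.groupby("key").size().compute())  # "key" is an illustrative column name
```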
Deliverables
Tasks
Other info
@jacobtomlinson has started experimenting with the Dask Runners concept in https://github.com/jacobtomlinson/dask-runners (which will likely move to the dask-contrib org at some point). This is probably a reasonable place to prototype this in a PR.