
Update API reference and examples in docs #561

Merged: 23 commits, Apr 12, 2021

Commits
2ec29ae
First pass
charlesbluca Apr 6, 2021
8b6e955
Clients connect to the cluster
charlesbluca Apr 6, 2021
470dbef
Add important info to --net-devices help
charlesbluca Apr 6, 2021
922f1b4
Clarify installation instructions
charlesbluca Apr 6, 2021
fdcf3ad
Make docstrings/click help consistent
charlesbluca Apr 6, 2021
a92b820
Add examples of restricting workers
charlesbluca Apr 6, 2021
28e98b0
Add basic examples
charlesbluca Apr 7, 2021
cdf22fe
Clarify defaults for all parameters
charlesbluca Apr 7, 2021
9b22f2d
Minor clarification to death timeout
charlesbluca Apr 7, 2021
dd653e2
Clarify memory limit docstrings
charlesbluca Apr 8, 2021
40dd8e6
Update requirements
charlesbluca Apr 8, 2021
ed1b3bd
Run pre-commit hooks
charlesbluca Apr 8, 2021
7ec7d1d
Change code block highlighting
charlesbluca Apr 8, 2021
6bcce31
Add UCX example
charlesbluca Apr 8, 2021
8c34127
Last major changes
charlesbluca Apr 9, 2021
0b4d3d5
Merge remote-tracking branch 'upstream/branch-0.20' into update-docs
charlesbluca Apr 9, 2021
8eccc5a
Pass n_workers by kwarg
charlesbluca Apr 9, 2021
4baa15d
Bring up ucx_net_devices ValueError
charlesbluca Apr 12, 2021
65e47f7
Merge branch 'branch-0.20' into update-docs
charlesbluca Apr 12, 2021
1166fc9
Clarify that CUDA toolkit is on conda
charlesbluca Apr 12, 2021
c11cb9a
Clarify drivers / toolkit in installation
charlesbluca Apr 12, 2021
5e2184f
Merge remote-tracking branch 'upstream/branch-0.20' into update-docs
charlesbluca Apr 12, 2021
db7ad88
Clarify failure case for CLI
charlesbluca Apr 12, 2021
1 change: 1 addition & 0 deletions .gitignore
@@ -21,6 +21,7 @@ dask_cuda.egg-info/
python/build
python/cudf/bindings/*.cpp
dask-worker-space/
docs/_build/

## Patching
*.diff
214 changes: 109 additions & 105 deletions dask_cuda/cli/dask_cuda_worker.py
@@ -19,154 +19,160 @@

@click.command(context_settings=dict(ignore_unknown_options=True))
@click.argument("scheduler", type=str, required=False)
@click.option(
"--tls-ca-file",
type=pem_file_option_type,
default=None,
help="CA cert(s) file for TLS (in PEM format)",
)
@click.option(
"--tls-cert",
type=pem_file_option_type,
default=None,
help="certificate file for TLS (in PEM format)",
)
@click.option(
"--tls-key",
type=pem_file_option_type,
default=None,
help="private key file for TLS (in PEM format)",
)
@click.option("--dashboard-address", type=str, default=":0", help="dashboard address")
@click.option(
"--dashboard/--no-dashboard",
"dashboard",
default=True,
show_default=True,
required=False,
help="Launch dashboard",
@click.argument(
"preload_argv", nargs=-1, type=click.UNPROCESSED, callback=validate_preload_argv
)
@click.option(
"--host",
type=str,
default=None,
help="Serving host. Should be an ip address that is"
" visible to the scheduler and other workers. "
"See --listen-address and --contact-address if you "
"need different listen and contact addresses. "
"See --interface.",
help="""Serving host; should be an IP address visible to the scheduler and other
workers""",
)
@click.option(
"--interface",
type=str,
default=None,
help="The external interface used to connect to the scheduler, usually "
"an ethernet interface is used for connection, and not an InfiniBand "
"interface (if one is available).",
"--nthreads",
type=int,
default=1,
show_default=True,
help="Number of threads per process",
)
@click.option("--nthreads", type=int, default=1, help="Number of threads per process.")
@click.option(
"--name",
type=str,
default=None,
help="A unique name for this worker like 'worker-1'. "
"If used with --nprocs then the process number "
"will be appended like name-0, name-1, name-2, ...",
help="""A unique name for this worker like ``worker-1``; if used with ``--nprocs``,
then the process number will be appended to the worker name, e.g. ``worker-1-0``,
``worker-1-1``, ``worker-1-2``, ...""",
)
@click.option(
"--memory-limit",
default="auto",
help="Bytes of memory per process that the worker can use. "
"This can be an integer (bytes), "
"float (fraction of total system memory), "
"string (like 5GB or 5000M), "
"'auto', or zero for no memory management",
show_default=True,
help="""Bytes of memory per process that the worker can use; can be an integer
(bytes), float (fraction of total system memory), string (like ``5GB`` or
``5000M``), ``auto``, or 0 for no memory management""",
)
@click.option(
"--device-memory-limit",
default="0.8",
help="Specifies the size of the CUDA device LRU cache, which "
"is used to determine when the worker starts spilling to host "
"memory. This can be a float (fraction of total device "
"memory), an integer (bytes), a string (like 5GB or 5000M), "
"and 'auto' or 0 to disable spilling to host (i.e., allow "
"full device memory usage). Default is 0.8, 80% of the "
"worker's total device memory.",
show_default=True,
help="""Specifies the size of the CUDA device LRU cache, which is used to determine
when the worker starts spilling to host memory; can be an integer (bytes), float
(fraction of total device memory), string (like ``5GB`` or ``5000M``), ``auto``,
or 0 to disable spilling to host (i.e., allow full device memory usage)""",
)
@click.option(
"--rmm-pool-size",
default=None,
help="If specified, initialize each worker with an RMM pool of "
"the given size, otherwise no RMM pool is created. This can be "
"an integer (bytes) or string (like 5GB or 5000M)."
"NOTE: This size is a per worker (i.e., per GPU) configuration, "
"and not cluster-wide!",
help="""If specified, initialize each worker with an RMM pool of the given size; can
be an integer (bytes) or string (like ``5GB`` or ``5000M``); this size is per-worker
(GPU) and not cluster-wide

.. note::
This size is a per-worker (GPU) configuration, and not cluster-wide""",
)
@click.option(
"--rmm-managed-memory/--no-rmm-managed-memory",
default=False,
help="If enabled, initialize each worker with RMM and set it to "
"use managed memory. If disabled, RMM may still be used if "
"--rmm-pool-size is specified, but in that case with default "
"(non-managed) memory type."
"WARNING: managed memory is currently incompatible with NVLink, "
"trying to enable both will result in an exception.",
show_default=True,
help="""If enabled, initialize each worker with RMM and set it to use managed
memory; if disabled, RMM may still be used by specifying ``--rmm-pool-size``

.. warning::
Managed memory is currently incompatible with NVLink; trying to enable both will
result in an exception.""",
)
@click.option(
"--rmm-log-directory",
default=None,
help="Directory to write per-worker RMM log files to; the client "
"and scheduler are not logged here."
"NOTE: Logging will only be enabled if --rmm-pool-size or "
"--rmm-managed-memory are specified.",
help="""Directory to write per-worker RMM log files to; the client
and scheduler are not logged here.

.. note::
Logging will only be enabled if ``--rmm-pool-size`` or ``--rmm-managed-memory``
are specified.""",
)
@click.option("--pid-file", type=str, default="", help="File to write the process PID")
@click.option(
"--resources",
type=str,
default="",
help="""Resources for task constraints like ``GPU=2 MEM=10e9``; resources are
applied separately to each worker process (only relevant when starting multiple
worker processes with ``--nprocs``)""",
)
@click.option(
"--reconnect/--no-reconnect",
"--dashboard/--no-dashboard",
"dashboard",
default=True,
help="Reconnect to scheduler if disconnected",
show_default=True,
required=False,
help="Launch the dashboard",
)
@click.option(
"--dashboard-address",
type=str,
default=":0",
show_default=True,
help="Relative address to serve dashboard (if enabled)",
)
@click.option("--pid-file", type=str, default="", help="File to write the process PID")
@click.option(
"--local-directory", default=None, type=str, help="Directory to place worker files"
)
@click.option(
"--resources",
"--scheduler-file",
type=str,
default="",
help='Resources for task constraints like "GPU=2 MEM=10e9". '
"Resources are applied separately to each worker process "
"(only relevant when starting multiple worker processes with '--nprocs').",
help="""Filename to JSON encoded scheduler information; use with ``dask-scheduler``
``--scheduler-file``""",
)
@click.option(
"--scheduler-file",
"--interface",
type=str,
default="",
help="Filename to JSON encoded scheduler information. "
"Use with dask-scheduler --scheduler-file",
default=None,
help="""External interface used to connect to the scheduler; usually an ethernet
interface is used for connection, and not an InfiniBand interface (if one is
available)""",
)
@click.option(
"--death-timeout",
type=str,
default=None,
help="Seconds to wait for a scheduler before closing",
)
@click.option(
"--dashboard-prefix", type=str, default=None, help="Prefix for the Dashboard"
)
@click.option(
"--preload",
type=str,
multiple=True,
is_eager=True,
help="Module that should be loaded by each worker process "
'like "foo.bar" or "/path/to/foo.py"',
help="""Module that should be loaded by each worker process like ``foo.bar`` or
``/path/to/foo.py``""",
)
@click.argument(
"preload_argv", nargs=-1, type=click.UNPROCESSED, callback=validate_preload_argv
@click.option(
"--dashboard-prefix", type=str, default=None, help="Prefix for the dashboard"
)
@click.option(
"--tls-ca-file",
type=pem_file_option_type,
default=None,
help="CA certificate(s) file for TLS (in PEM format)",
)
@click.option(
"--tls-cert",
type=pem_file_option_type,
default=None,
help="Certificate file for TLS (in PEM format)",
)
@click.option(
"--tls-key",
type=pem_file_option_type,
default=None,
help="Private key file for TLS (in PEM format)",
)
@click.option(
"--enable-tcp-over-ucx/--disable-tcp-over-ucx",
default=False,
show_default=True,
help="Enable TCP communication over UCX",
)
@click.option(
@@ -175,38 +181,36 @@
help="Enable InfiniBand communication",
)
@click.option(
"--enable-rdmacm/--disable-rdmacm",
"--enable-nvlink/--disable-nvlink",
default=False,
help="Enable RDMA connection manager, currently requires InfiniBand enabled.",
show_default=True,
help="Enable NVLink communication",
)
@click.option(
"--enable-nvlink/--disable-nvlink",
"--enable-rdmacm/--disable-rdmacm",
default=False,
help="Enable NVLink communication",
show_default=True,
help="Enable RDMA connection manager, currently requires InfiniBand enabled.",
)
@click.option(
"--net-devices",
type=str,
default=None,
help="When None (default), 'UCX_NET_DEVICES' will be left to its default. "
"Otherwise, it must be a non-empty string with the interface name, such as "
"such as 'eth0' or 'auto' to allow for automatically choosing the closest "
"interface based on the system's topology. Normally used only with "
"--enable-infiniband to specify the interface to be used by the worker, "
"such as 'mlx5_0:1' or 'ib0'. "
"WARNING: 'auto' requires UCX-Py to be installed and compiled with hwloc "
"support. Additionally that will always use the closest interface, and "
"that may cause unexpected errors if that interface is not properly "
"configured or is disconnected, for that reason it's limited to "
"InfiniBand only and will still cause unpredictable errors if not _ALL_ "
"interfaces are connected and properly configured.",
help="""If specified, workers will use this interface for UCX communication; can be
a string (like ``eth0``), or ``auto`` to pick the optimal interface based on
the system's topology; typically only used with ``--enable-infiniband``

.. warning::
``auto`` requires UCX-Py to be installed and compiled with hwloc support;
unexpected errors can occur when using ``auto`` if any interfaces are
disconnected or improperly configured""",
)
@click.option(
"--enable-jit-unspill/--disable-jit-unspill",
default=None,
help="Enable just-in-time unspilling. This is experimental and doesn't "
"support memory spilling to disk Please see proxy_object.ProxyObject "
"and proxify_host_file.ProxifyHostFile.",
help="""Enable just-in-time unspilling; this is experimental and doesn't support
memory spilling to disk; see ``proxy_object.ProxyObject`` and
``proxify_host_file.ProxifyHostFile`` for more info""",
)
def main(
scheduler,
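The size strings accepted by ``--memory-limit``, ``--device-memory-limit``, and ``--rmm-pool-size`` above follow Dask's byte-size conventions. A quick sketch of how such values resolve, using ``dask.utils.parse_bytes`` (the helper Dask uses for this parsing — an assumption worth checking against the installed version):

```python
from dask.utils import parse_bytes

# Human-readable size strings resolve to integer byte counts
assert parse_bytes("5GB") == 5_000_000_000
assert parse_bytes("5000M") == 5_000_000_000

# A bare float such as 0.8 is handled separately by dask-cuda as a
# fraction of total memory, e.g. for --device-memory-limit on a
# hypothetical 32 GB GPU:
total_device_memory = 32_000_000_000  # assumed device size, in bytes
print(int(0.8 * total_device_memory))  # 25600000000
```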
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -1,3 +1,4 @@
dask_cuda
numpydoc==1.1.0
sphinx_click==2.7.1
sphinx_rtd_theme==0.5.1
16 changes: 11 additions & 5 deletions docs/source/api.rst
@@ -1,13 +1,19 @@
API
===

Setup
------
.. currentmodule:: dask_cuda.initialize
.. autofunction:: initialize

Cluster
-------
.. currentmodule:: dask_cuda
.. autoclass:: LocalCUDACluster
:members:

Worker
------
.. click:: dask_cuda.cli.dask_cuda_worker:main
:prog: dask-cuda-worker
:nested: none

Client initialization
---------------------
.. currentmodule:: dask_cuda.initialize
.. autofunction:: initialize
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -48,6 +48,8 @@
"sphinx.ext.intersphinx",
"sphinx.ext.extlinks",
"numpydoc",
"sphinx_click",
"sphinx_rtd_theme",
]

numpydoc_show_class_members = False
26 changes: 21 additions & 5 deletions docs/source/index.rst
@@ -1,16 +1,32 @@
Dask-CUDA
=========

Dask-CUDA is tool for using `Dask <https://dask.org>`_ on GPUs. It extends Dask's `Single-Machine Cluster <https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster>`_ and `Workers <https://distributed.dask.org/en/latest/worker.html>`_ for optimized distributed GPU workloads.
Dask-CUDA is a library for distributed computing in Python using GPUs.
It extends `Dask.distributed <https://distributed.dask.org/en/latest/>`_'s single-machine `LocalCluster <https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster>`_ and `Worker <https://distributed.dask.org/en/latest/worker.html>`_ for use in GPU workloads.

Motivation
----------

While Distributed can be used to leverage GPU workloads through libraries such as `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_, `CuPy <https://cupy.dev/>`_, and `Numba <https://numba.pydata.org/>`_, Dask-CUDA offers several unique features unavailable to Distributed:

- **Automatic instantiation of per-GPU workers** -- Using Dask-CUDA's LocalCUDACluster or ``dask-cuda-worker`` CLI will automatically launch one worker for each GPU available on the executing node, avoiding the need to explicitly select GPUs.
- **Automatic setting of CPU affinity** -- The setting of CPU affinity for each GPU is done automatically, preventing memory transfers from taking suboptimal paths.
- **Automatic selection of InfiniBand devices** -- When UCX communication is enabled over InfiniBand, Dask-CUDA automatically selects the optimal InfiniBand device for each GPU (see :doc:`UCX <ucx>` for instructions on enabling UCX communication).
- **Memory spilling from GPU** -- For memory-intensive workloads, Dask-CUDA supports spilling from device to host memory when a GPU reaches the default or user-specified memory utilization limit.

Contents
--------

.. toctree::
:maxdepth: 1
:hidden:
:caption: Getting Started

install
quickstart
specializations
worker
ucx
api

.. toctree::
:maxdepth: 1
:caption: Additional Features

ucx