From cdf6e19e269be76841efff51f06171575da1743d Mon Sep 17 00:00:00 2001 From: Lawrence Mitchell Date: Wed, 8 Feb 2023 12:27:19 +0000 Subject: [PATCH] More prose --- docs/dask_cudf/source/api.rst | 12 +++++- docs/dask_cudf/source/index.rst | 74 ++++++++++++++++++++++++++++++++- 2 files changed, 83 insertions(+), 3 deletions(-) diff --git a/docs/dask_cudf/source/api.rst b/docs/dask_cudf/source/api.rst index 9fa83ec846b..893f5dd7434 100644 --- a/docs/dask_cudf/source/api.rst +++ b/docs/dask_cudf/source/api.rst @@ -26,6 +26,14 @@ data reading facilities, followed by calling read_text, read_parquet +.. warning:: + + FIXME: where should the following live? + + .. autofunction:: dask_cudf.concat + + .. autofunction:: dask_cudf.from_delayed + Grouping ======== @@ -51,8 +59,8 @@ if possible. .. autofunction:: dask_cudf.groupby_agg -Dask Collections -================ +DataFrames and Series +===================== The core distributed objects provided by Dask-cuDF are the :class:`.DataFrame` and :class:`.Series`. These inherit respectively diff --git a/docs/dask_cudf/source/index.rst b/docs/dask_cudf/source/index.rst index 30f535d6f93..66aedcbbd48 100644 --- a/docs/dask_cudf/source/index.rst +++ b/docs/dask_cudf/source/index.rst @@ -25,9 +25,81 @@ When running on multi-GPU systems, `Dask-CUDA simplify the setup of the cluster, taking advantage of all features of the GPU and networking hardware. +Using Dask-cuDF +--------------- + +When installed, Dask-cuDF registers itself as a dataframe backend for +Dask. This means that in many cases, using cuDF-backed dataframes requires +only small changes to an existing workflow. The minimal change is to +select cuDF as the dataframe backend in :doc:`Dask's +configuration `. To do so, we must set the option +``dataframe.backend`` to ``cudf``. From Python, this can be achieved +like so:: + + import dask + + dask.config.set({"dataframe.backend": "cudf"}) + +Alternatively, you can set ``DASK_DATAFRAME__BACKEND=cudf`` in the +environment before running your code. + +Dataframe creation from on-disk formats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If your workflow creates Dask dataframes from on-disk formats +(for example using :func:`dask.dataframe.read_parquet`), then setting +the backend may well be enough to migrate your workflow. + +For example, consider reading a dataframe from parquet:: + + import dask.dataframe as dd + + # By default, we obtain a pandas-backed dataframe + df = dd.read_parquet("data.parquet", ...) + + +To obtain a cuDF-backed dataframe, we must set the +``dataframe.backend`` configuration option:: + + import dask + import dask.dataframe as dd + + dask.config.set({"dataframe.backend": "cudf"}) + # This gives us a cuDF-backed dataframe + df = dd.read_parquet("data.parquet", ...) + +This code will use cuDF's GPU-accelerated :func:`parquet reader +` to read partitions of the data. + +Dataframe creation from in-memory formats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you already have a dataframe in memory and want to convert it to a +cuDF-backend one, there are two options depending on whether the +dataframe is already a Dask one or not. If you have a Dask dataframe, +then :func:`dask_cudf.from_dask_dataframe` will convert for you; if +you have a pandas dataframe then you can either call +:func:`dask.dataframe.from_pandas` followed by +:func:`~dask_cudf.from_dask_dataframe` or first convert the dataframe +with :func:`cudf.from_pandas` and then parallelise this with +:func:`dask_cudf.from_cudf`. + +API Reference +------------- + +Generally speaking, Dask-cuDF tries to offer exactly the same API as +Dask itself. There are, however, some minor differences mostly because +cuDF does not :doc:`perfectly mirror ` +the pandas API, or because cuDF provides additional configuration +flags (these mostly occur in data reading and writing interfaces). + +As a result, straightforward workflows can be migrated without too +much trouble, but more complex ones that utilise more features may +need a bit of tweaking. The API documentation describes details of the +differences and all functionality that Dask-cuDF supports. + .. toctree:: :maxdepth: 2 - :caption: Contents: api