Commit

More prose

wence- committed Feb 8, 2023
1 parent 18c00e2 commit cdf6e19
Showing 2 changed files with 83 additions and 3 deletions.
12 changes: 10 additions & 2 deletions docs/dask_cudf/source/api.rst
@@ -26,6 +26,14 @@ data reading facilities, followed by calling
read_text,
read_parquet

.. warning::

FIXME: where should the following live?

.. autofunction:: dask_cudf.concat

.. autofunction:: dask_cudf.from_delayed

Grouping
========

@@ -51,8 +59,8 @@ if possible.
.. autofunction:: dask_cudf.groupby_agg


Dask Collections
================
DataFrames and Series
=====================

The core distributed objects provided by Dask-cuDF are the
:class:`.DataFrame` and :class:`.Series`. These inherit respectively
74 changes: 73 additions & 1 deletion docs/dask_cudf/source/index.rst
@@ -25,9 +25,81 @@ When running on multi-GPU systems, `Dask-CUDA
simplify the setup of the cluster, taking advantage of all features of
the GPU and networking hardware.

Using Dask-cuDF
---------------

When installed, Dask-cuDF registers itself as a dataframe backend for
Dask. This means that in many cases, using cuDF-backed dataframes requires
only small changes to an existing workflow. The minimal change is to
select cuDF as the dataframe backend in :doc:`Dask's
configuration <dask:configuration>`. To do so, we must set the option
``dataframe.backend`` to ``cudf``. From Python, this can be achieved
like so::

import dask

dask.config.set({"dataframe.backend": "cudf"})

Alternatively, you can set ``DASK_DATAFRAME__BACKEND=cudf`` in the
environment before running your code.
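
For example, in a POSIX shell you might write::

    # Select the cuDF backend for every subsequently launched process;
    # Dask translates the DASK_DATAFRAME__BACKEND environment variable
    # into the dataframe.backend configuration key.
    export DASK_DATAFRAME__BACKEND=cudf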

Dataframe creation from on-disk formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your workflow creates Dask dataframes from on-disk formats
(for example using :func:`dask.dataframe.read_parquet`), then setting
the backend may well be enough to migrate your workflow.

For example, consider reading a dataframe from parquet::

import dask.dataframe as dd

# By default, we obtain a pandas-backed dataframe
df = dd.read_parquet("data.parquet", ...)


To obtain a cuDF-backed dataframe, we must set the
``dataframe.backend`` configuration option::

import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})
# This gives us a cuDF-backed dataframe
df = dd.read_parquet("data.parquet", ...)

This code will use cuDF's GPU-accelerated :func:`parquet reader
<cudf.read_parquet>` to read partitions of the data.

Dataframe creation from in-memory formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you already have a dataframe in memory and want to convert it to a
cuDF-backed one, there are two options, depending on whether the
dataframe is already a Dask one. If you have a Dask dataframe, then
:func:`dask_cudf.from_dask_dataframe` will convert it for you. If you
have a pandas dataframe, you can either call
:func:`dask.dataframe.from_pandas` followed by
:func:`~dask_cudf.from_dask_dataframe`, or first convert the dataframe
with :func:`cudf.from_pandas` and then parallelise the result with
:func:`dask_cudf.from_cudf`.
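
As a sketch (assuming a CUDA-capable GPU with cuDF and Dask-cuDF
installed; the small example frame is purely illustrative), the two
routes look like this::

    import pandas as pd
    import dask.dataframe as dd

    import cudf
    import dask_cudf

    pdf = pd.DataFrame({"a": [1, 2, 3, 4]})

    # Route 1: parallelise with Dask first, then convert the
    # partitions to cuDF.
    ddf = dd.from_pandas(pdf, npartitions=2)
    gddf1 = dask_cudf.from_dask_dataframe(ddf)

    # Route 2: convert to cuDF first, then parallelise with Dask-cuDF.
    gdf = cudf.from_pandas(pdf)
    gddf2 = dask_cudf.from_cudf(gdf, npartitions=2)

Both routes produce equivalent cuDF-backed collections.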

API Reference
-------------

Generally speaking, Dask-cuDF tries to offer exactly the same API as
Dask itself. There are, however, some minor differences, mostly because
cuDF does not :doc:`perfectly mirror <cudf:user_guide/PandasCompat>`
the pandas API, or because cuDF provides additional configuration
flags. These differences mostly occur in the data reading and writing
interfaces.

As a result, straightforward workflows can usually be migrated without
much trouble, but more complex ones that exercise more of the API may
need some tweaking. The API documentation describes the details of
these differences and all of the functionality that Dask-cuDF supports.

.. toctree::
:maxdepth: 2
:caption: Contents:

api

