From 66d457225145c0ab3df587db36ebe206990defc4 Mon Sep 17 00:00:00 2001 From: rjzamora Date: Tue, 3 Nov 2020 20:38:09 -0800 Subject: [PATCH 1/2] remove dask-cudf documentation and add README page --- docs/cudf/source/dask-cudf.md | 78 ----------------------------------- docs/cudf/source/index.rst | 1 - python/dask_cudf/README.md | 12 ++++++ 3 files changed, 12 insertions(+), 79 deletions(-) delete mode 100644 docs/cudf/source/dask-cudf.md create mode 100644 python/dask_cudf/README.md diff --git a/docs/cudf/source/dask-cudf.md b/docs/cudf/source/dask-cudf.md deleted file mode 100644 index 92ef4eb1c46..00000000000 --- a/docs/cudf/source/dask-cudf.md +++ /dev/null @@ -1,78 +0,0 @@ -Multi-GPU with Dask-cuDF -======================== - -cuDF is a single-GPU library. For Multi-GPU cuDF solutions we use [Dask](https://dask.org/) and the [dask-cudf package](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf), which is able to scale cuDF across multiple GPUs on a single machine, or multiple GPUs across many machines in a cluster. - -[Dask DataFrame](http://docs.dask.org/en/latest/dataframe.html) was originally designed to scale Pandas, orchestrating many Pandas DataFrames spread across many CPUs into a cohesive parallel DataFrame. Because cuDF currently implements only a subset of Pandas’s API, not all Dask DataFrame operations work with cuDF. - -The following is tested and expected to work: - -What works ----------- - -- Data ingestion - - ``dask_cudf.read_csv`` - - Use standard Dask ingestion with Pandas, then convert to cuDF (For - Parquet and other formats this is often decently fast) -- Linear operations - - Element-wise operations: ``df.x + df.y``, ``df ** 2`` - - Assignment: ``df['z'] = df.x + df.y`` - - Row-wise selections: ``df[df.x > 0]`` - - Loc: ``df.loc['2001-01-01': '2005-02-02']`` - - Date time/string accessors: ``df.timestamp.dt.dayofweek`` - - ... and most similar operations in this category that are already implemented in cuDF -- Reductions - - Like ``sum``, ``mean``, ``max``, ``count``, and so on on ``Series`` objects - - Support for reductions on full dataframes - - ``std`` - - Custom reductions with [dask.dataframe.reduction](http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.reduction) -- Groupby aggregations - - On single columns: ``df.groupby('x').y.max()`` - - With custom aggregations: - - groupby standard deviation - - grouping on multiple columns - - groupby agg for multiple outputs -- Joins: - - On full unsorted columns: ``left.merge(right, on='id')`` (expensive) - - On sorted indexes: ``left.merge(right, left_index=True, right_index=True)`` (fast) - - On large and small dataframes: ``left.merge(cudf_df, on='id')`` (fast) -- Rolling operations -- Converting to and from other forms - - Dask + Pandas to Dask + cuDF ``df.map_partitions(cudf.from_pandas)`` - - Dask + cuDF to Dask + Pandas ``df.map_partitions(lambda df: df.to_pandas())`` - - cuDF to Dask + cuDF: ``dask.dataframe.from_pandas(df, npartitions=20)`` - - Dask + cuDF to cuDF: ``df.compute()`` - -Additionally all generic Dask operations, like ``compute``, ``persist``, -``visualize`` and so on work regardless. - - -Developing the API ------------------- - -Above we mention the following: - -> and most similar operations in this category that are already implemented in cuDF - -This is because it is difficult to create a comprehensive list of operations in -the cuDF and Pandas libraries. The API is large enough to be difficult to track -effectively. For any operation that operates row-wise like ``fillna`` or -``query`` things will likely, but not certainly work. If operations don't work -it is often due to a slight inconsistency between Pandas and cuDF that is -generally easy to fix. We encourage users to look at the [cuDF issue -tracker](https://github.com/rapidsai/cudf/issues) to see if their issue has -already been reported and, if not, -[raise a new issue](https://github.com/rapidsai/cudf/issues/new). - - -Navigating the API ------------------- - -This project reuses the -[Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) project, which -was originally designed for Pandas, with the newer library cuDF. Because we use -the same Dask classes for both projects there are often methods that are -implemented for Pandas, but not yet for cuDF. As a result users looking at the -full Dask DataFrame API can be misleading, and often lead to frustration when -operations that are advertised in the Dask API do not work as expected with -cuDF. We apologize for this in advance. diff --git a/docs/cudf/source/index.rst b/docs/cudf/source/index.rst index 8a100487374..db4134800be 100644 --- a/docs/cudf/source/index.rst +++ b/docs/cudf/source/index.rst @@ -8,7 +8,6 @@ Welcome to cuDF's documentation! api.rst 10min.ipynb basics.rst - dask-cudf.md 10min-cudf-cupy.ipynb guide-to-udfs.ipynb internals.md diff --git a/python/dask_cudf/README.md b/python/dask_cudf/README.md new file mode 100644 index 00000000000..02f9ab0c000 --- /dev/null +++ b/python/dask_cudf/README.md @@ -0,0 +1,12 @@ +# Dask-cuDF + +A [cuDF](https://github.com/rapidsai/cudf)-backed [Dask-DataFrame](https://docs.dask.org/en/latest/dataframe.html) API for out-of-core and multi-GPU ETL. + +## Brief Introduction + +Dask is a task-based library for parallel scheduling and execution. In addition to its central scheduling machinery, the library also includes the [Dask-DataFrame](https://docs.dask.org/en/latest/dataframe.html) module, which is a scalable version of the [Pandas](https://pandas.pydata.org/) DataFrame/Series API. Dask-cuDF builds upon Dask-DataFrame to provide a convenient API for the decomposition and processing of large cuDF DataFrame and/or Series objects. + +### Documentation Links + +- [10 Minutes to cuDF and Dask-cuDF](https://docs.rapids.ai/api/cudf/stable/10min.html) +- [Dask-CUDA](https://github.com/rapidsai/dask-cuda) (for multi-GPU Scaling) \ No newline at end of file From e9cea4aad7508e4fc60cfbcb99803046d996bacc Mon Sep 17 00:00:00 2001 From: rjzamora Date: Tue, 3 Nov 2020 21:00:53 -0800 Subject: [PATCH 2/2] changelog --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 45ffed754f3..5b10a10908d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -30,6 +30,7 @@ - PR #6564 Load JNI library dependencies with a thread pool - PR #6573 Create `cudf::detail::byte_cast` for `cudf::byte_cast` - PR #6597 Use thread-local to track CUDA device in JNI +- PR #6665 Update stale dask-cudf documentation ## Bug Fixes