From 74b0db03267bbaceb2807d2d2c26eceec971decd Mon Sep 17 00:00:00 2001 From: Ashwin Srinath Date: Thu, 5 May 2022 09:41:51 -0400 Subject: [PATCH 01/14] Add comparison to Pandas doc --- .../source/user_guide/pandas-comparison.md | 155 ++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 docs/cudf/source/user_guide/pandas-comparison.md diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md new file mode 100644 index 00000000000..328040a660e --- /dev/null +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -0,0 +1,155 @@ +# Comparison of cuDF and Pandas + +cuDF is a DataFrame library that closely matches the Pandas API, but +leverages NVIDIA GPUs for performing computations for speed. However, +there are some differences between cuDF and Pandas, both in terms API +and behavior. This page documents the similarities and differences +between cuDF and Pandas. + +## Supported operations + +cuDF supports many of the same data structures and operations as +Pandas. This includes `Series`, `DataFrame`, `Index` and +operations on them such as unary and binary operations, indexing, +filtering, concatenating, joining, groupby and window operations - +among many others. + +The best way to check if we support a particular Pandas API is to search +our [API docs](/api_docs/index). + +## Data types + +cuDF supports many of the commonly-used data types in Pandas, +including numeric, datetime, timestamp, string, and categorical data +types. In addition, we support special data types for decimal, list +and "struct" values. See the section on [Data Types](data-types) for +details. + +Note that we do not support custom data types like Pandas' +`ExtensionDtype`. + +## Null (or "missing") values + +Unlike Pandas, *all* data types in cuDF are nullable, +meaning they can contain missing values (represented by `cudf.NA`). 
+ +```{code} python +>>> s = cudf.Series([1, 2, cudf.NA]) +>>> s +>>> s +0 1 +1 2 +2 +dtype: int64 +``` + +Nulls are not coerced to `nan` in any situation; +compare the behaviour of cuDF with Pandas below: + +```{code} python +>>> s = cudf.Series([1, 2, cudf.NA], dtype="category") +>>> s +0 1 +1 2 +2 +dtype: category +Categories (2, int64): [1, 2] + +>>> s = pd.Series([1, 2, pd.NA], dtype="category") +>>> s +0 1 +1 2 +2 NaN +dtype: category +Categories (2, int64): [1, 2] +``` + +See the docs on [missing data](missing-data) for +details. + +## Iteration + +Iterating over a cuDF `Series`, `DataFrame` or `Index` is not +supported. This is because iterating over data that resides on the GPU +will yield *extremely* poor performance, as GPUs are optimized for +highly parallel operations rather than sequential operations. + +In the vast majority of cases, it is possible to avoid iteration and +use an existing function or method to accomplish the same task. If you +absolutely must iterate, copy the data from GPU to CPU by using +`.to_arrow()` or `.to_pandas()`, then copy the result back to GPU +using `.from_arrow()` or `.from_pandas()`. + +## Result ordering + +By default, `join` (or `merge`) and `groupby` operations in cuDF +do *not* guarantee output ordering by default. +Compare the results obtained from Pandas and cuDF below: + +```{code} python + >>> import cupy as cp + >>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)}) + >>> df.groupby("a").mean().head() + b + a + 742 694.5 + 29 840.0 + 459 525.5 + 442 363.0 + 666 7.0 + >>> df.to_pandas().groupby("a").mean().head() + b + a + 2 643.75 + 6 48.00 + 7 631.00 + 9 906.00 + 10 640.00 +``` + +To match Pandas behavior, you must explicitly pass `sort=True`: + +```{code} python +>>> df.to_pandas().groupby("a", sort=True).mean().head() + b +a +2 643.75 +6 48.00 +7 631.00 +9 906.00 +10 640.00 +``` + +## Column names + +Unlike Pandas, cuDF does not support duplicate column names. 
+It is best to use strings for column names. + +## No true `"object"` data type + +In Pandas and NumPy, the `"object"` data type is used for +collections of arbitrary Python objects. For example, in Pandas you +can do the following: + +```{code} python +>>> import pandas as pd +>>> s = pd.Series(["a", 1, [1, 2, 3]]) +0 a +1 1 +2 [1, 2, 3] +dtype: object +``` + +For compatibility with Pandas, cuDF reports the data type for strings +as `"object"`, but we do *not* support storing or operating on +collections of arbitrary Python objects. + +## `.apply()` function limitations + +The `.apply()` function in Pandas accecpts a user-defined function +(UDF) that can include arbitrary operations that are applied to each +value of a `Series`, `DataFrame`, or in the case of a groupby, +each group. cuDF also supports `apply()`, but it relies on Numba to +JIT compile the UDF and execute it on the GPU. This can be extremely +fast, but imposes a few limitations on what operations are allowed in +the UDF. See the docs on [UDFs](guide-to-udfs) for details. From daaa7ac2c552d417fbaa63cacfb75c621772651c Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Thu, 5 May 2022 10:24:54 -0400 Subject: [PATCH 02/14] [skip-ci] Update docs/cudf/source/user_guide/pandas-comparison.md --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 328040a660e..7a7ad78f9fb 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -2,7 +2,7 @@ cuDF is a DataFrame library that closely matches the Pandas API, but leverages NVIDIA GPUs for performing computations for speed. However, -there are some differences between cuDF and Pandas, both in terms API +there are some differences between cuDF and Pandas, both in terms of API and behavior.
This page documents the similarities and differences between cuDF and Pandas. From f0add92cb0ae4c1114d3a1698d26fca3423f6b67 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Thu, 5 May 2022 10:26:42 -0400 Subject: [PATCH 03/14] [skip-ci] Update docs/cudf/source/user_guide/pandas-comparison.md --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 7a7ad78f9fb..053f5091ccc 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -83,7 +83,7 @@ using `.from_arrow()` or `.from_pandas()`. ## Result ordering By default, `join` (or `merge`) and `groupby` operations in cuDF -do *not* guarantee output ordering by default. +do *not* guarantee output ordering. Compare the results obtained from Pandas and cuDF below: ```{code} python From 5a08dd22d21bdbea0a25dd73ba765d8b5f8275dd Mon Sep 17 00:00:00 2001 From: Ashwin Srinath Date: Fri, 13 May 2022 10:16:06 -0400 Subject: [PATCH 04/14] Add a bit about floating point computation --- .../source/user_guide/pandas-comparison.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 328040a660e..bf7f5641a68 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -120,6 +120,20 @@ a 10 640.00 ``` +## Floating-point computation + +cuDF leverages GPUs to execute operations in parallel. This means the +order of operations is not always deterministic. This impacts the +determinism of floating-point operations because floating-point +arithmetic is non-associative, that is, `(a + b) + c` is not always equal to `a + (b + c)`.
+ +For example, `s.sum()` is not guaranteed to produce identical results +to Pandas nor produce identical results from run to run, when `s` is a +Series of floats. If you need to compare floating point results, you +should typically do so using the functions provided in the +[`cudf.testing`](testing-functions) module, which allow you to compare +values up to a desired precision. + ## Column names Unlike Pandas, cuDF does not support duplicate column names. @@ -153,3 +167,7 @@ each group. cuDF also supports `apply()`, but it relies on Numba to JIT compile the UDF and execute it on the GPU. This can be extremely fast, but imposes a few limitations on what operations are allowed in the UDF. See the docs on [UDFs](guide-to-udfs) for details. + + +[floating-point]: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ +[testing-functions]: https://docs.rapids.ai/api/cudf/nightly/api_docs/general_utilities.html#testing-functions From 0910c7ff7136d79cad08aa8f5322e40ba621e2bb Mon Sep 17 00:00:00 2001 From: Ashwin Srinath Date: Fri, 13 May 2022 10:18:28 -0400 Subject: [PATCH 05/14] Change intro --- docs/cudf/source/user_guide/pandas-comparison.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index bf7f5641a68..f2345052fa7 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -1,9 +1,9 @@ # Comparison of cuDF and Pandas cuDF is a DataFrame library that closely matches the Pandas API, but -leverages NVIDIA GPUs for performing computations for speed. However, -there are some differences between cuDF and Pandas, both in terms API -and behavior. This page documents the similarities and differences +it is *not* a full drop-in replacement for Pandas. There are some +differences between cuDF and Pandas, both in terms of API and +behaviour. 
This page documents the similarities and differences between cuDF and Pandas. ## Supported operations From 595ab5cf33db95fb8dd655b4a10a68416c95960a Mon Sep 17 00:00:00 2001 From: Ashwin Srinath Date: Fri, 13 May 2022 10:44:52 -0400 Subject: [PATCH 06/14] Addressing review comments --- docs/cudf/source/user_guide/index.md | 1 + docs/cudf/source/user_guide/pandas-comparison.md | 8 ++------ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/docs/cudf/source/user_guide/index.md b/docs/cudf/source/user_guide/index.md index 2750c75790a..d47ea158a69 100644 --- a/docs/cudf/source/user_guide/index.md +++ b/docs/cudf/source/user_guide/index.md @@ -4,6 +4,7 @@ :maxdepth: 2 10min +pandas-comparison data-types io missing-data diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index f2345052fa7..f19608c068e 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -131,8 +131,8 @@ For example, `s.sum()` is not guaranteed to produce identical results to Pandas nor produce identical results from run to run, when `s` is a Series of floats. If you need to compare floating point results, you should typically do so using the functions provided in the -[`cudf.testing`](testing-functions) module, which allow you to compare -values up to a desired precision. +[`cudf.testing`](/api_docs/general_utilities.html#testing-functions) +module, which allow you to compare values up to a desired precision. ## Column names @@ -167,7 +167,3 @@ each group. cuDF also supports `apply()`, but it relies on Numba to JIT compile the UDF and execute it on the GPU. This can be extremely fast, but imposes a few limitations on what operations are allowed in the UDF. See the docs on [UDFs](guide-to-udfs) for details. 
- - -[floating-point]: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ -[testing-functions]: https://docs.rapids.ai/api/cudf/nightly/api_docs/general_utilities.html#testing-functions From 7844e40b7b0d9db64555368104f573c2097d8337 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 10:45:19 -0400 Subject: [PATCH 07/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 053f5091ccc..3130e7a5b4e 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -43,7 +43,7 @@ meaning they can contain missing values (represented by `cudf.NA`). dtype: int64 ``` -Nulls are not coerced to `nan` in any situation; +Nulls are not coerced to `NaN` in any situation; compare the behaviour of cuDF with Pandas below: ```{code} python From f43d96d3f06afb7575c47fa4f3b52fb03d38971d Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 10:45:33 -0400 Subject: [PATCH 08/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 3130e7a5b4e..0d06194e402 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -44,7 +44,7 @@ dtype: int64 ``` Nulls are not coerced to `NaN` in any situation; -compare the behaviour of cuDF with Pandas below: +compare the behavior of cuDF with Pandas below: ```{code} python >>> s = 
cudf.Series([1, 2, cudf.NA], dtype="category") From d5353122705b5cd21cb2b88a1a97cdb16126b600 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 10:45:43 -0400 Subject: [PATCH 09/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 0d06194e402..0346c8a119c 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -146,7 +146,7 @@ collections of arbitrary Python objects. ## `.apply()` function limitations -The `.apply()` function in Pandas accecpts a user-defined function +The `.apply()` function in Pandas accepts a user-defined function (UDF) that can include arbitrary operations that are applied to each value of a `Series`, `DataFrame`, or in the case of a groupby, each group. cuDF also supports `apply()`, but it relies on Numba to From c1c72de1fe407799d9e28e06dca4219506e539dc Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 10:45:55 -0400 Subject: [PATCH 10/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 0346c8a119c..12822b39939 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -149,7 +149,7 @@ collections of arbitrary Python objects. 
The `.apply()` function in Pandas accepts a user-defined function (UDF) that can include arbitrary operations that are applied to each value of a `Series`, `DataFrame`, or in the case of a groupby, -each group. cuDF also supports `apply()`, but it relies on Numba to +each group. cuDF also supports `.apply()`, but it relies on Numba to JIT compile the UDF and execute it on the GPU. This can be extremely fast, but imposes a few limitations on what operations are allowed in the UDF. See the docs on [UDFs](guide-to-udfs) for details. From df75b99952f362a5b7a539df6ae079118c34185a Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 12:59:52 -0400 Subject: [PATCH 11/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 12822b39939..b542e8f701b 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -36,7 +36,6 @@ meaning they can contain missing values (represented by `cudf.NA`). 
```{code} python >>> s = cudf.Series([1, 2, cudf.NA]) >>> s ->>> s 0 1 1 2 2 From 70f3cbb8a6252b2f25ce7ddfcc8265c5f6fda1a8 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 12:59:58 -0400 Subject: [PATCH 12/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index b542e8f701b..35200da5372 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -21,7 +21,7 @@ our [API docs](/api_docs/index). cuDF supports many of the commonly-used data types in Pandas, including numeric, datetime, timestamp, string, and categorical data -types. In addition, we support special data types for decimal, list +types. In addition, we support special data types for decimal, list, and "struct" values. See the section on [Data Types](data-types) for details. From f1cd3c795729a582c0fa32f8ec0a290ed8502c3b Mon Sep 17 00:00:00 2001 From: Ashwin Srinath <3190405+shwina@users.noreply.github.com> Date: Fri, 13 May 2022 13:00:06 -0400 Subject: [PATCH 13/14] Update docs/cudf/source/user_guide/pandas-comparison.md Co-authored-by: Bradley Dice --- docs/cudf/source/user_guide/pandas-comparison.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index 35200da5372..3a91f58b7a7 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -63,8 +63,7 @@ dtype: category Categories (2, int64): [1, 2] ``` -See the docs on [missing data](missing-data) for -details. +See the docs on [missing data](missing-data) for details. 
## Iteration From c26f4b9463cd4af690e47a810f6c9177e12ef464 Mon Sep 17 00:00:00 2001 From: Ashwin Srinath Date: Mon, 16 May 2022 09:05:44 -0400 Subject: [PATCH 14/14] [skip-ci] Mention that unique strings are best for column names --- docs/cudf/source/user_guide/pandas-comparison.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md index feacea5d896..d23880f02b4 100644 --- a/docs/cudf/source/user_guide/pandas-comparison.md +++ b/docs/cudf/source/user_guide/pandas-comparison.md @@ -135,7 +135,7 @@ module, which allow you to compare values up to a desired precision. ## Column names Unlike Pandas, cuDF does not support duplicate column names. -It is best to use strings for column names. +It is best to use unique strings for column names. ## No true `"object"` data type