From 593ec43dc9f26c64771190ccb059ac494bc5c77f Mon Sep 17 00:00:00 2001
From: Ashwin Srinath
Date: Mon, 1 May 2023 06:52:09 -0400
Subject: [PATCH 1/4] Fix and clarify notes on result ordering

---
 .../source/user_guide/pandas-comparison.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
index ba04a231f41..8ba6a6734dc 100644
--- a/docs/cudf/source/user_guide/pandas-comparison.md
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -82,9 +82,16 @@ using `.from_arrow()` or `.from_pandas()`.
 
 ## Result ordering
 
-By default, `join` (or `merge`) and `groupby` operations in cuDF
-do *not* guarantee output ordering.
-Compare the results obtained from Pandas and cuDF below:
+In Pandas, `join` and `groupby` operations provide certain guarantees
+about the order of rows in the result returned. In Pandas `join`, the
+order of join keys are either preserved or sorted lexicographically by
+default. `groupby` sorts the group keys, and preserves the order of
+rows within each group. In some cases, disabling this option in Pandas
+can yield better performance.
+
+By contrast, cuDF's default behavior is to return rows in a
+non-deterministic order to maximize performance. Compare the results
+obtained from Pandas and cuDF below:
 
 ```{code} python
 >>> import cupy as cp
@@ -107,10 +114,11 @@ Compare the results obtained from Pandas and cuDF below:
 10  640.00
 ```
 
-To match Pandas behavior, you must explicitly pass `sort=True`:
+If you require that results be returned in a predictable order, you
+must pass the `sort=True` option explicitly:
 
 ```{code} python
->>> df.to_pandas().groupby("a", sort=True).mean().head()
+>>> df.groupby("a", sort=True).mean().head()
          b
 a
 2   643.75

From 8f31bafaa0d2f6c948f4c8ff229c3c2d53fbba58 Mon Sep 17 00:00:00 2001
From: Ashwin Srinath
Date: Wed, 24 May 2023 15:42:55 -0400
Subject: [PATCH 2/4] Clarify use of index labels

---
 docs/cudf/source/user_guide/pandas-comparison.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
index 8ba6a6734dc..4e1887b6105 100644
--- a/docs/cudf/source/user_guide/pandas-comparison.md
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -114,8 +114,11 @@ obtained from Pandas and cuDF below:
 10  640.00
 ```
 
-If you require that results be returned in a predictable order, you
-must pass the `sort=True` option explicitly:
+In most cases, the rows of a DataFrame are accessed by index labels
+rather than by position, so the order in which rows are returned
+doesn't matter. However, if you require that results be returned in a
+predictable (sorted) order, you can pass the `sort=True` option
+explicitly:
 
 ```{code} python
 >>> df.groupby("a", sort=True).mean().head()

From f1469ca8c4d83c3d4a53b67e9df2ef492e4732b3 Mon Sep 17 00:00:00 2001
From: Lawrence Mitchell
Date: Fri, 26 May 2023 10:57:35 +0100
Subject: [PATCH 3/4] Minor grammar fix

---
 docs/cudf/source/user_guide/pandas-comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
index 4e1887b6105..837919ca587 100644
--- a/docs/cudf/source/user_guide/pandas-comparison.md
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -83,7 +83,7 @@ using `.from_arrow()` or `.from_pandas()`.
 ## Result ordering
 
 In Pandas, `join` and `groupby` operations provide certain guarantees
-about the order of rows in the result returned. In Pandas `join`, the
+about the order of rows in the result returned. In a Pandas `join`, the
 order of join keys are either preserved or sorted lexicographically by
 default. `groupby` sorts the group keys, and preserves the order of
 rows within each group. In some cases, disabling this option in Pandas

From 29f9f98c686c29bcfedab47761c46a0f247f573f Mon Sep 17 00:00:00 2001
From: Vyas Ramasubramani
Date: Thu, 11 Apr 2024 20:41:42 +0000
Subject: [PATCH 4/4] Address review

---
 docs/cudf/source/user_guide/pandas-comparison.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
index 26671562e28..4aaaa8a93df 100644
--- a/docs/cudf/source/user_guide/pandas-comparison.md
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -89,10 +89,11 @@ using `.from_arrow()` or `.from_pandas()`.
 
 In Pandas, `join` (or `merge`), `value_counts` and `groupby` operations provide
 certain guarantees about the order of rows in the result returned. In a Pandas
-`join`, the order of join keys are either preserved or sorted lexicographically
-by default. `groupby` sorts the group keys, and preserves the order of rows
-within each group. In some cases, disabling this option in Pandas can yield
-better performance.
+`join`, the order of join keys is (depending on the particular style of join
+being performed) either preserved or sorted lexicographically by default.
+`groupby` sorts the group keys, and preserves the order of rows within each
+group. In some cases, disabling this option in Pandas can yield better
+performance.
 
 By contrast, cuDF's default behavior is to return rows in a non-deterministic
 order to maximize performance. Compare the results