From ff22a7ac0d565be2b2221c6080966eb0338676ee Mon Sep 17 00:00:00 2001
From: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Date: Thu, 11 Apr 2024 21:01:40 -0400
Subject: [PATCH] Fix and clarify notes on result ordering (#13255)

I noticed when answering #13254 that the code example in this section of our documentation was incorrect and the text itself could use some improving.

Authors:
  - Ashwin Srinath (https://github.com/shwina)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: https://github.com/rapidsai/cudf/pull/13255
---
 .../source/user_guide/pandas-comparison.md | 27 +++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
index 549d91b771a..4aaaa8a93df 100644
--- a/docs/cudf/source/user_guide/pandas-comparison.md
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -87,9 +87,17 @@ using `.from_arrow()` or `.from_pandas()`.
 
 ## Result ordering
 
-By default, `join` (or `merge`), `value_counts` and `groupby` operations in cuDF
-do *not* guarantee output ordering.
-Compare the results obtained from Pandas and cuDF below:
+In Pandas, `join` (or `merge`), `value_counts` and `groupby` operations provide
+certain guarantees about the order of rows in the result returned. In a Pandas
+`join`, the order of join keys is (depending on the particular style of join
+being performed) either preserved or sorted lexicographically by default.
+`groupby` sorts the group keys, and preserves the order of rows within each
+group. In some cases, disabling this option in Pandas can yield better
+performance.
+
+By contrast, cuDF's default behavior is to return rows in a
+non-deterministic order to maximize performance. Compare the results
+obtained from Pandas and cuDF below:
 
 ```{code} python
 >>> import cupy as cp
@@ -114,13 +122,16 @@ a
 4 342.000000
 ```
 
-To match Pandas behavior, you must explicitly pass `sort=True`
-or enable the `mode.pandas_compatible` option when trying to
-match Pandas behavior with `sort=False`:
+In most cases, the rows of a DataFrame are accessed by index labels
+rather than by position, so the order in which rows are returned
+doesn't matter. However, if you require that results be returned in a
+predictable (sorted) order, you can pass the `sort=True` option
+explicitly or enable the `mode.pandas_compatible` option when trying
+to match Pandas behavior with `sort=False`:
 
 ```{code} python
->>> df.to_pandas().groupby("a", sort=True).mean().head()
- b
+>>> df.groupby("a", sort=True).mean().head()
+ b
 a
 0 70.000000
 1 356.333333