Fix and clarify notes on result ordering #13255

Merged
21 changes: 16 additions & 5 deletions docs/cudf/source/user_guide/pandas-comparison.md
@@ -82,9 +82,16 @@ using `.from_arrow()` or `.from_pandas()`.

## Result ordering

By default, `join` (or `merge`) and `groupby` operations in cuDF
do *not* guarantee output ordering.
Compare the results obtained from Pandas and cuDF below:
In Pandas, `join` and `groupby` operations provide certain guarantees
about the order of rows in the result returned. In a Pandas `join`, the
order of join keys is either preserved or sorted lexicographically by
default. `groupby` sorts the group keys, and preserves the order of
rows within each group. In some cases, disabling sorting in Pandas
(for example, with `sort=False`) can yield better performance.
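
The Pandas guarantees described above can be illustrated with a small sketch. The DataFrame and its values are made up for illustration and are not taken from the example in this document; only Pandas is used, so the snippet says nothing about cuDF's output order.

```python
import pandas as pd

# Hypothetical toy data: keys in column "a" appear in the order 3, 1, 2.
df = pd.DataFrame({"a": [3, 1, 3, 2, 1], "b": [10, 20, 30, 40, 50]})

# Default (sort=True): group keys come back sorted.
sorted_keys = df.groupby("a").sum().index.tolist()

# sort=False: group keys come back in order of first appearance.
first_seen_keys = df.groupby("a", sort=False).sum().index.tolist()

# Rows within each group keep their original relative order.
rows_per_group = df.groupby("a")["b"].apply(list).to_dict()

print(sorted_keys)      # [1, 2, 3]
print(first_seen_keys)  # [3, 1, 2]
print(rows_per_group)   # {1: [20, 50], 2: [40], 3: [10, 30]}
```

With `sort=False`, Pandas skips sorting the group keys, which is where the potential performance gain comes from.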

By contrast, cuDF's default behavior is to return rows in a
non-deterministic order to maximize performance. Compare the results
obtained from Pandas and cuDF below:

```{code} python
>>> import cupy as cp
@@ -107,10 +114,14 @@ Compare the results obtained from Pandas and cuDF below:
10 640.00
```
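
When two results contain the same rows in a different order, one way to check that they agree is to compare them by index label rather than by position. The following is a minimal sketch using Pandas only; the frame `expected` and its values are hypothetical, and `shuffled` merely stands in for a result whose row order differs.

```python
import pandas as pd

# Hypothetical aggregated result, indexed by group key "a".
expected = pd.DataFrame(
    {"b": [1.5, 2.5, 3.5]},
    index=pd.Index([2, 3, 4], name="a"),
)

# The same rows in a different order, standing in for an unsorted result.
shuffled = expected.loc[[4, 2, 3]]

# A positional comparison sees the two frames as different ...
same_by_position = expected.equals(shuffled)

# ... but after aligning on the index they are identical.
same_by_label = expected.equals(shuffled.sort_index())

# Label-based lookup returns the same value regardless of row order.
value_for_key_2 = shuffled.loc[2, "b"]

print(same_by_position)  # False
print(same_by_label)     # True
print(value_for_key_2)   # 1.5
```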

To match Pandas behavior, you must explicitly pass `sort=True`:
In most cases, the rows of a DataFrame are accessed by index labels
rather than by position, so the order in which rows are returned
doesn't matter. However, if you require that results be returned in a
predictable (sorted) order, you can pass the `sort=True` option
explicitly:

```{code} python
>>> df.to_pandas().groupby("a", sort=True).mean().head()
>>> df.groupby("a", sort=True).mean().head()
b
a
2 643.75