Fix and clarify notes on result ordering #13255

Merged
27 changes: 19 additions & 8 deletions docs/cudf/source/user_guide/pandas-comparison.md
@@ -87,9 +87,17 @@ using `.from_arrow()` or `.from_pandas()`.

## Result ordering

-By default, `join` (or `merge`), `value_counts` and `groupby` operations in cuDF
-do *not* guarantee output ordering.
-Compare the results obtained from Pandas and cuDF below:
+In Pandas, `join` (or `merge`), `value_counts` and `groupby` operations provide
+certain guarantees about the order of rows in the returned result. In a Pandas
+`join`, the order of join keys is (depending on the particular style of join
+being performed) either preserved or sorted lexicographically by default.
+`groupby` sorts the group keys, and preserves the order of rows within each
+group. In some cases, disabling this sorting in Pandas (`sort=False`) can
+yield better performance.
+
+By contrast, cuDF's default behavior is to return rows in a
+non-deterministic order to maximize performance. Compare the results
+obtained from Pandas and cuDF below:

```{code} python
>>> import cupy as cp
@@ -114,13 +122,16 @@ a
4 342.000000
```
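
For reference, the Pandas-side ordering guarantees described in the new text can be sketched with a small, CPU-only example. This is an illustrative sketch rather than part of the changed file: the frames `df`, `left` and `right` and their values are made up for the example, and it only exercises Pandas (the cuDF side needs a GPU and, by default, makes no ordering promise).

```python
import pandas as pd

# Hypothetical data: group keys deliberately appear out of order.
df = pd.DataFrame({"a": [3, 1, 3, 2, 1], "b": [10, 20, 30, 40, 50]})

# Default Pandas behavior: group keys come back sorted.
sorted_means = df.groupby("a", sort=True).mean()

# With sort=False, Pandas returns groups in order of first appearance.
first_seen_means = df.groupby("a", sort=False).mean()

# A left join in Pandas preserves the key order of the left frame.
left = pd.DataFrame({"k": ["c", "a", "b"], "x": [1, 2, 3]})
right = pd.DataFrame({"k": ["a", "b", "c"], "y": [10, 20, 30]})
joined = left.merge(right, on="k", how="left")

print(sorted_means.index.tolist())      # [1, 2, 3]
print(first_seen_means.index.tolist())  # [3, 1, 2]
print(joined["k"].tolist())             # ['c', 'a', 'b']
```

Code that relies on any of these orderings is the kind that may need `sort=True` or `mode.pandas_compatible` when moved to cuDF.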

-To match Pandas behavior, you must explicitly pass `sort=True`
-or enable the `mode.pandas_compatible` option when trying to
-match Pandas behavior with `sort=False`:
+In most cases, the rows of a DataFrame are accessed by index labels
+rather than by position, so the order in which rows are returned
+doesn't matter. However, if you require that results be returned in a
+predictable (sorted) order, you can pass the `sort=True` option
+explicitly or enable the `mode.pandas_compatible` option when trying
+to match Pandas behavior with `sort=False`:

```{code} python
->>> df.to_pandas().groupby("a", sort=True).mean().head()
-b
+>>> df.groupby("a", sort=True).mean().head()
+b
a
0 70.000000
1 356.333333