Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a section to the docs that compares cuDF with Pandas #10796

Merged
merged 15 commits into from
May 16, 2022

Conversation

shwina
Copy link
Contributor

@shwina shwina commented May 5, 2022

Adds a section to the docs that calls out the similarities and differences from Pandas at a high level.

This is inspired by CuPy's page documenting the differences from NumPy.

@shwina shwina added non-breaking Non-breaking change doc Documentation labels May 5, 2022
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
Note that we do not support custom data types like Pandas'
`ExtensionDtype`.

## Null (or "missing") values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great place to call out the subtle differences in null handling logic we have vs pandas. Most of it can be dug up from the source code here but a good summary might be something like this (I think this is all of them?)

Nulls in cuDF behave differently from pandas in several edge cases. In cuDF, the rule is that nulls always propagate, whereas in pandas they may not if the mathematical result can be inferred without knowing the missing value: 
- `NA ** 0 == 1`
- `1 ** NA == 1`
- `NA | True == True`
- `True or NA == True`
- `False and NA == False`

Maybe a table or something might be better than this.

Copy link
Contributor

@bdice bdice May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these cases are also described in the docs (as a cross-reference with the source code linked above):

I find it a little concerning that we differ in this way because it means that cuDF cannot be consistent in its behaviors between scalars and columns. It should be specifically noted that scalar operations act like Pandas (because we use the same magic NA singleton object), and column operations always propagate NA.

>>> import cudf
>>> cudf.NA ** 0
1
>>> cudf.Scalar(cudf.NA, dtype=float) ** 0
Scalar(1.0, dtype=float64)
>>> cudf.Series([cudf.NA], dtype=float) ** 0
0    <NA>
dtype: float64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the difference in column vs scalar behaviour is problematic. I think @brandon-b-miller has thought a lot about this, where maybe we should take this discussion offline and come back and raise a separate issue if needed.

For this PR, I'll hold off on adding any further information about null behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend thoroughly reading the discussion on pandas-dev/pandas#29997 before we relitigate any of that discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our discussions offline, I'm going to hold off on documenting the exceptional cases here. I think our priority should be to first align the behavior of nulls in all three of the following cases:

  • Scalar operations involving NA
  • Column operations involving NA
  • Operations in UDFs involving NA

We can choose to always return NA in all three cases, or make an exception for ** in all three cases, but we must be consistent. That done, we can come back here to document the difference from Pandas - if any.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some small edits. Thanks @shwina!

docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
docs/cudf/source/user_guide/pandas-comparison.md Outdated Show resolved Hide resolved
Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marking my approval for now since my only concern was around the null behavior.

@shwina
Copy link
Contributor Author

shwina commented May 16, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 09b7045 into rapidsai:branch-22.06 May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Documentation non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants