Add a section to the docs that compares cuDF with Pandas #10796
I think this is a great place to call out the subtle differences in null handling logic we have vs pandas. Most of it can be dug up from the source code here but a good summary might be something like this (I think this is all of them?)
Maybe a table or something might be better than this.
All these cases are also described in the docs (as a cross-reference with the source code linked above):
I find it a little concerning that we differ in this way because it means that cuDF cannot be consistent in its behaviors between scalars and columns. It should be specifically noted that scalar operations act like Pandas (because we use the same magic `NA` singleton object), and column operations always propagate `NA`.
Yeah, the difference in column vs scalar behaviour is problematic. I think @brandon-b-miller has thought a lot about this; maybe we should take this discussion offline and come back to raise a separate issue if needed.
For this PR, I'll hold off on adding any further information about null behaviour.
I would recommend thoroughly reading the discussion on pandas-dev/pandas#29997 before we relitigate any of that discussion.
Based on our discussions offline, I'm going to hold off on documenting the exceptional cases here. I think our priority should be to first align the behavior of nulls in all three of the following cases:
NA
NA
NA
We can choose to always return `NA` in all three cases, or make an exception for `**` in all three cases, but we must be consistent. That done, we can come back here to document the difference from Pandas, if any.
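To make the two options concrete, here is a toy model in plain Python. It is neither cuDF nor Pandas code: the `NA` sentinel, `pow_propagate`, `pow_with_exception`, and `column_pow` are hypothetical names made up for illustration. The "exception" rule is assumed to mirror the documented Pandas scalar behavior, where `NA ** 0` and `1 ** NA` both evaluate to 1. The point of the sketch is only that one rule can be applied to a scalar pair and to every element of a column, so the two cannot disagree.

```python
class _NAType:
    """Hypothetical stand-in for a missing-value singleton (not pandas.NA or cudf.NA)."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so `is NA` checks work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = _NAType()


def pow_propagate(base, exp):
    """Option 1: return NA whenever either operand is NA."""
    if base is NA or exp is NA:
        return NA
    return base ** exp


def pow_with_exception(base, exp):
    """Option 2: 1 ** anything and anything ** 0 are 1; otherwise propagate NA."""
    if base is not NA and base == 1:
        return 1
    if exp is not NA and exp == 0:
        return 1
    return pow_propagate(base, exp)


def column_pow(rule, bases, exps):
    """Apply the same scalar rule element-wise, so columns match scalars."""
    return [rule(b, e) for b, e in zip(bases, exps)]


bases = [NA, 1, 2, NA]
exps = [0, NA, 3, 2]

print(column_pow(pow_propagate, bases, exps))       # [NA, NA, 8, NA]
print(column_pow(pow_with_exception, bases, exps))  # [1, 1, 8, NA]
```

Whichever rule is chosen, routing both the scalar path and the column path through it is what gives the consistency asked for here.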