Add a section to the docs that compares cuDF with Pandas #10796

shwina · 2022-05-05T13:42:46Z

Adds a section to the docs that calls out the similarities and differences from Pandas at a high level.

This is inspired by CuPy's page documenting the differences from NumPy.

docs/cudf/source/user_guide/pandas-comparison.md

brandon-b-miller · 2022-05-09T13:35:07Z

docs/cudf/source/user_guide/pandas-comparison.md

+Note that we do not support custom data types like Pandas'
+`ExtensionDtype`.
+
+## Null (or "missing") values


I think this is a great place to call out the subtle differences between our null-handling logic and pandas'. Most of them can be dug up from the source code here, but a good summary might be something like this (I think this is all of them?):

Nulls in cuDF behave differently from pandas in several edge cases. In cuDF, the rule is that nulls always propagate, whereas in pandas they may not if the mathematical result can be inferred without knowing the missing value:

- `NA ** 0 == 1`
- `1 ** NA == 1`
- `NA | True == True`
- `True or NA == True`
- `False and NA == False`

Maybe a table or something might be better than this.
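To make the two rules in the summary concrete, here is a minimal pure-Python sketch. It does not call cuDF or pandas; `NA`, `pow_always_propagate`, `pow_inferring`, `or_inferring`, and `and_inferring` are hypothetical names invented for this illustration, and the "inferring" functions only model the edge cases listed above.

```python
class _NAType:
    """Hypothetical stand-in for a missing value (not cudf.NA or pandas.NA)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def pow_always_propagate(base, exp):
    """'Nulls always propagate' rule: any NA operand yields NA."""
    if base is NA or exp is NA:
        return NA
    return base ** exp


def pow_inferring(base, exp):
    """Inferring rule: return a value when it cannot depend on the NA."""
    if exp is not NA and exp == 0:
        return 1  # x ** 0 == 1 for every x
    if base is not NA and base == 1:
        return 1  # 1 ** x == 1 for every x
    if base is NA or exp is NA:
        return NA
    return base ** exp


def or_inferring(a, b):
    """Three-valued OR: a True operand decides the result even next to NA."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return a or b


def and_inferring(a, b):
    """Three-valued AND: a False operand decides the result even next to NA."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return a and b


# The edge cases from the summary, under each rule.
print(pow_always_propagate(NA, 0), pow_inferring(NA, 0))
print(pow_always_propagate(1, NA), pow_inferring(1, NA))
print(or_inferring(NA, True), and_inferring(False, NA))
```

The two `pow_*` functions differ only on inputs where the result is known regardless of the missing value, which is exactly the set of cases the summary enumerates.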

bdice · 2022-05-09

All of these cases are also described in the pandas docs (as a cross-reference with the source code linked above):

https://pandas.pydata.org/docs/user_guide/missing_data.html#propagation-in-arithmetic-and-comparison-operations

https://pandas.pydata.org/docs/user_guide/missing_data.html#logical-operations

I find it a little concerning that we differ in this way because it means that cuDF cannot be consistent in its behaviors between scalars and columns. It should be specifically noted that scalar operations act like Pandas (because we use the same magic NA singleton object), and column operations always propagate NA.

    >>> import cudf
    >>> cudf.NA ** 0
    1
    >>> cudf.Scalar(cudf.NA, dtype=float) ** 0
    Scalar(1.0, dtype=float64)
    >>> cudf.Series([cudf.NA], dtype=float) ** 0
    0    <NA>
    dtype: float64

shwina · 2022-05-09

Yeah, the difference between column and scalar behaviour is problematic. I think @brandon-b-miller has thought a lot about this; maybe we should take this discussion offline, then come back and raise a separate issue if needed.

For this PR, I'll hold off on adding any further information about null behaviour.

vyasr · 2022-05-09

I would recommend thoroughly reading the discussion on pandas-dev/pandas#29997 before we relitigate any of it.

shwina · 2022-05-16

Based on our discussions offline, I'm going to hold off on documenting the exceptional cases here. I think our priority should be to first align the behavior of nulls in all three of the following cases:

- Scalar operations involving NA
- Column operations involving NA
- Operations in UDFs involving NA

We can choose to always return NA in all three cases, or make an exception for ** in all three cases, but we must be consistent. That done, we can come back here to document the difference from Pandas - if any.
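One way to check that kind of consistency is to run the same inputs through each code path and report where the results differ. The sketch below is pure Python and only models the scalar and column behaviour shown in the REPL session earlier in this thread; `scalar_pow`, `column_pow`, and `find_disagreements` are hypothetical helpers, not cuDF APIs, and a UDF path is left out because its behaviour is not stated in this discussion.

```python
class _NAType:
    """Hypothetical stand-in for a missing value (not cudf.NA)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def scalar_pow(base, exp):
    """Models the scalar path: `x ** 0` is 1 even when x is NA."""
    if exp is not NA and exp == 0:
        return 1
    if base is NA or exp is NA:
        return NA
    return base ** exp


def column_pow(column, exp):
    """Models the column path: NA always propagates, element by element."""
    return [NA if (value is NA or exp is NA) else value ** exp for value in column]


def find_disagreements(paths, inputs):
    """Return the inputs for which the named paths do not all agree."""
    mismatches = []
    for base, exp in inputs:
        results = {name: fn(base, exp) for name, fn in paths.items()}
        # Compare by repr so that NA compares equal to itself across paths.
        if len({repr(result) for result in results.values()}) > 1:
            mismatches.append(((base, exp), results))
    return mismatches


paths = {
    "scalar": scalar_pow,
    # Wrap the column path so it can be called on a single element.
    "column": lambda base, exp: column_pow([base], exp)[0],
}
mismatches = find_disagreements(paths, [(NA, 0), (2, 3), (NA, 2)])
for args, results in mismatches:
    print(args, results)
```

Under these assumptions, only the `(NA, 0)` input is reported, since that is the one case where the modelled scalar and column paths diverge. A third entry for UDFs could be added to `paths` once its intended behaviour is settled.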

bdice

LGTM with some small edits. Thanks @shwina!

brandon-b-miller

marking my approval for now since my only concern was around the null behavior.

shwina · 2022-05-16T18:28:29Z

@gpucibot merge

Add comparison to Pandas doc

74b0db0

shwina added the non-breaking (Non-breaking change) and doc (Documentation) labels May 5, 2022

shwina commented May 5, 2022

[skip-ci] Update docs/cudf/source/user_guide/pandas-comparison.md

daaa7ac

shwina commented May 5, 2022

[skip-ci] Update docs/cudf/source/user_guide/pandas-comparison.md

f0add92

jrhemstad reviewed May 5, 2022

bdice requested changes May 6, 2022

brandon-b-miller reviewed May 9, 2022

shwina and others added 7 commits May 13, 2022 10:16

Add a bit about floating point computation

5a08dd2

Change intro

0910c7f

Addressing review comments

595ab5c

Update docs/cudf/source/user_guide/pandas-comparison.md

7844e40

Co-authored-by: Bradley Dice <[email protected]>

Update docs/cudf/source/user_guide/pandas-comparison.md

f43d96d

Co-authored-by: Bradley Dice <[email protected]>

Update docs/cudf/source/user_guide/pandas-comparison.md

d535312

Co-authored-by: Bradley Dice <[email protected]>

Update docs/cudf/source/user_guide/pandas-comparison.md

c1c72de

Co-authored-by: Bradley Dice <[email protected]>

bdice approved these changes May 13, 2022

shwina and others added 5 commits May 13, 2022 12:59

Update docs/cudf/source/user_guide/pandas-comparison.md

df75b99

Co-authored-by: Bradley Dice <[email protected]>

Update docs/cudf/source/user_guide/pandas-comparison.md

70f3cbb

Co-authored-by: Bradley Dice <[email protected]>

Update docs/cudf/source/user_guide/pandas-comparison.md

f1cd3c7

Co-authored-by: Bradley Dice <[email protected]>

Merge branch 'pandas-comparison-docs' of github.com:shwina/cudf into …

ef919cc

…pandas-comparison-docs

[skip-ci] Mention that unique strings are best for column names

c26f4b9

brandon-b-miller approved these changes May 16, 2022

rapids-bot bot merged commit 09b7045 into rapidsai:branch-22.06 May 16, 2022

shwina mentioned this pull request Jun 2, 2022

[DOC] RAPIDS 22.06 Release Blog Outline #10878

Closed


