Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a section to the docs that compares cuDF with Pandas #10796

Merged
merged 15 commits into from
May 16, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
:maxdepth: 2

10min
pandas-comparison
data-types
io
missing-data
Expand Down
167 changes: 167 additions & 0 deletions docs/cudf/source/user_guide/pandas-comparison.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
# Comparison of cuDF and Pandas
shwina marked this conversation as resolved.
Show resolved Hide resolved

cuDF is a DataFrame library that closely matches the Pandas API, but
it is *not* a full drop-in replacement for Pandas. There are some
differences between cuDF and Pandas, both in terms of API and
behaviour. This page documents the similarities and differences
between cuDF and Pandas.

## Supported operations

cuDF supports many of the same data structures and operations as
Pandas. This includes `Series`, `DataFrame`, `Index` and
operations on them such as unary and binary operations, indexing,
filtering, concatenating, joining, groupby and window operations -
among many others.

The best way to check if we support a particular Pandas API is to search
our [API docs](/api_docs/index).
shwina marked this conversation as resolved.
Show resolved Hide resolved

## Data types

cuDF supports many of the commonly-used data types in Pandas,
including numeric, datetime, timestamp, string, and categorical data
types. In addition, we support special data types for decimal, list,
and "struct" values. See the section on [Data Types](data-types) for
details.

Note that we do not support custom data types like Pandas'
`ExtensionDtype`.

## Null (or "missing") values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great place to call out the subtle differences in null handling logic we have vs pandas. Most of it can be dug up from the source code here but a good summary might be something like this (I think this is all of them?)

Nulls in cuDF behave differently from pandas in several edge cases. In cuDF, the rule is that nulls always propagate, whereas in pandas they may not if the mathematical result can be inferred without knowing the missing value: 
- `NA ** 0 == 1`
- `1 ** NA == 1`
- `NA | True == True`
- `True or NA == True`
- `False and NA == False`

Maybe a table or something might be better than this.

Copy link
Contributor

@bdice bdice May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these cases are also described in the docs (as a cross-reference with the source code linked above):

I find it a little concerning that we differ in this way because it means that cuDF cannot be consistent in its behaviors between scalars and columns. It should be specifically noted that scalar operations act like Pandas (because we use the same magic NA singleton object), and column operations always propagate NA.

>>> import cudf
>>> cudf.NA ** 0
1
>>> cudf.Scalar(cudf.NA, dtype=float) ** 0
Scalar(1.0, dtype=float64)
>>> cudf.Series([cudf.NA], dtype=float) ** 0
0    <NA>
dtype: float64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the difference in column vs scalar behaviour is problematic. I think @brandon-b-miller has thought a lot about this, where maybe we should take this discussion offline and come back and raise a separate issue if needed.

For this PR, I'll hold off on adding any further information about null behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend thoroughly reading the discussion on pandas-dev/pandas#29997 before we relitigate any of that discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our discussions offline, I'm going to hold off on documenting the exceptional cases here. I think our priority should be to first align the behavior of nulls in all three of the following cases:

  • Scalar operations involving NA
  • Column operations involving NA
  • Operations in UDFs involving NA

We can choose to always return NA in all three cases, or make an exception for ** in all three cases, but we must be consistent. That done, we can come back here to document the difference from Pandas - if any.


Unlike Pandas, *all* data types in cuDF are nullable,
meaning they can contain missing values (represented by `cudf.NA`).

```{code} python
>>> s = cudf.Series([1, 2, cudf.NA])
>>> s
0 1
1 2
2 <NA>
dtype: int64
```

Nulls are not coerced to `NaN` in any situation;
compare the behavior of cuDF with Pandas below:

```{code} python
>>> s = cudf.Series([1, 2, cudf.NA], dtype="category")
>>> s
0 1
1 2
2 <NA>
dtype: category
Categories (2, int64): [1, 2]

>>> s = pd.Series([1, 2, pd.NA], dtype="category")
>>> s
0 1
1 2
2 NaN
dtype: category
Categories (2, int64): [1, 2]
```

See the docs on [missing data](missing-data) for details.

## Iteration

Iterating over a cuDF `Series`, `DataFrame` or `Index` is not
supported. This is because iterating over data that resides on the GPU
will yield *extremely* poor performance, as GPUs are optimized for
highly parallel operations rather than sequential operations.

In the vast majority of cases, it is possible to avoid iteration and
use an existing function or method to accomplish the same task. If you
absolutely must iterate, copy the data from GPU to CPU by using
`.to_arrow()` or `.to_pandas()`, then copy the result back to GPU
using `.from_arrow()` or `.from_pandas()`.

## Result ordering
shwina marked this conversation as resolved.
Show resolved Hide resolved

By default, `join` (or `merge`) and `groupby` operations in cuDF
do *not* guarantee output ordering.
Compare the results obtained from Pandas and cuDF below:

```{code} python
>>> import cupy as cp
>>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)})
>>> df.groupby("a").mean().head()
b
a
742 694.5
29 840.0
459 525.5
442 363.0
666 7.0
>>> df.to_pandas().groupby("a").mean().head()
b
a
2 643.75
6 48.00
7 631.00
9 906.00
10 640.00
```

To match Pandas behavior, you must explicitly pass `sort=True`:

```{code} python
>>> df.to_pandas().groupby("a", sort=True).mean().head()
b
a
2 643.75
6 48.00
7 631.00
9 906.00
10 640.00
```

## Floating-point computation

cuDF leverages GPUs to execute operations in parallel. This means the
order of operations is not always deterministic. This impacts the
determinism of floating-point operations because floating-point
arithmetic is non-associative, that is, `a + b` is not equal to `b + a`.

For example, `s.sum()` is not guaranteed to produce identical results
to Pandas nor produce identical results from run to run, when `s` is a
Series of floats. If you need to compare floating point results, you
should typically do so using the functions provided in the
[`cudf.testing`](/api_docs/general_utilities.html#testing-functions)
module, which allow you to compare values up to a desired precision.

## Column names

Unlike Pandas, cuDF does not support duplicate column names.
It is best to use unique strings for column names.

## No true `"object"` data type

In Pandas and NumPy, the `"object"` data type is used for
shwina marked this conversation as resolved.
Show resolved Hide resolved
collections of arbitrary Python objects. For example, in Pandas you
can do the following:

```{code} python
>>> import pandas as pd
>>> s = pd.Series(["a", 1, [1, 2, 3]])
0 a
1 1
2 [1, 2, 3]
dtype: object
```

For compatibilty with Pandas, cuDF reports the data type for strings
as `"object"`, but we do *not* support storing or operating on
collections of arbitrary Python objects.

## `.apply()` function limitations

The `.apply()` function in Pandas accepts a user-defined function
(UDF) that can include arbitrary operations that are applied to each
value of a `Series`, `DataFrame`, or in the case of a groupby,
each group. cuDF also supports `.apply()`, but it relies on Numba to
JIT compile the UDF and execute it on the GPU. This can be extremely
fast, but imposes a few limitations on what operations are allowed in
the UDF. See the docs on [UDFs](guide-to-udfs) for details.