Skip to content

Commit

Permalink
Change from rst -> md
Browse files Browse the repository at this point in the history
  • Loading branch information
shwina committed Apr 22, 2022
1 parent 1f4b884 commit 705365b
Show file tree
Hide file tree
Showing 2 changed files with 155 additions and 164 deletions.
155 changes: 155 additions & 0 deletions docs/cudf/source/user_guide/pandas-comparison.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
# Comparison of cuDF and Pandas

cuDF is a DataFrame library that closely matches the Pandas API, but
leverages NVIDIA GPUs for performing computations for speed. However,
there are some differences between cuDF and Pandas, both in terms API
and behavior. This page documents the similarities and differences
between cuDF and Pandas.

## Supported operations

cuDF supports many of the same data structures and operations as
Pandas. This includes `Series`, `DataFrame`, `Index` and
operations on them such as unary and binary operations, indexing,
filtering, concatenating, joining, groupby and window operations -
among many others.

The best way to see if we support a particular Pandas API is to search
our [API docs](/api_docs/index).

## Data types

cuDF supports many common data types supported by Pandas, including
numeric, datetime, timestamp, string, and categorical data types. In
addition, we support special data types for decimal, list and "struct"
values. See the section on [Data Types](data-types) for
details.

Note that we do not support custom data types like Pandas'
`ExtensionDtype`.

## Null (or "missing") values

Unlike Pandas, *all* data types in cuDF are nullable,
meaning they can contain missing values (represented by `cudf.NA`).

```{code} python
>>> s = cudf.Series([1, 2, cudf.NA])
>>> s
>>> s
0 1
1 2
2 <NA>
dtype: int64
```

Nulls are not coerced to `nan` in any situation;
compare the behaviour of cuDF with Pandas below:

```{code} python
>>> s = cudf.Series([1, 2, cudf.NA], dtype="category")
>>> s
0 1
1 2
2 <NA>
dtype: category
Categories (2, int64): [1, 2]
>>> s = pd.Series([1, 2, pd.NA], dtype="category")
>>> s
0 1
1 2
2 NaN
dtype: category
Categories (2, int64): [1, 2]
```

See the docs on [missing data](Working-with-missing-data) for
details.

## Iteration

Iterating over a cuDF `Series`, `DataFrame` or `Index` is not
supported. This is because iterating over data that resides on the GPU
will yield *extremely* poor performance, as GPUs are optimized for
highly parallel operations rather than sequential operations.

In the vast majority of cases, it is possible to avoid iteration and
use an existing function or method to accomplish the same task. If you
absolutely must iterate, copy the data from GPU to CPU by using
`.to_arrow()` or `.to_pandas()`, then copy the result back to GPU
using `.from_arrow()` or `.from_pandas()`.

## Result ordering

By default, `join` (or `merge`) and `groupby` operations in cuDF
do *not* guarantee output ordering by default.
Compare the results obtained from Pandas and cuDF below:

```{code} python
>>> import cupy as cp
>>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)})
>>> df.groupby("a").mean().head()
b
a
742 694.5
29 840.0
459 525.5
442 363.0
666 7.0
>>> df.to_pandas().groupby("a").mean().head()
b
a
2 643.75
6 48.00
7 631.00
9 906.00
10 640.00
```

To match Pandas behavior, you must explicitly pass `sort=True`:

```{code} python
>>> df.to_pandas().groupby("a", sort=True).mean().head()
b
a
2 643.75
6 48.00
7 631.00
9 906.00
10 640.00
```

## Column names

Unlike Pandas, cuDF does not support duplicate column names.
It is best to use strings for column names.

## No true `"object"` data type

In Pandas and NumPy, the `"object"` data type is used for
collections of arbitrary Python objects. For example, in Pandas you
can do the following:

```{code} python
>>> import pandas as pd
>>> s = pd.Series(["a", 1, [1, 2, 3]])
0 a
1 1
2 [1, 2, 3]
dtype: object
```

For compatibilty with Pandas, cuDF reports the data type for strings
as `"object"`, but we do *not* support storing or operating on
collections of arbitrary Python objects.

## `.apply()` function limitations

The `.apply()` function in Pandas accecpts a user-defined function
(UDF) that can include arbitrary operations that are applied to each
value of a `Series`, `DataFrame`, or in the case of a groupby,
each group. cuDF also supports `apply()`, but it relies on Numba to
JIT compile the UDF and execute it on the GPU. This can be extremely
fast, but imposes a few limitations on what operations are allowed in
the UDF. See the docs on [UDFs](guide-to-udfs) for details.
164 changes: 0 additions & 164 deletions docs/cudf/source/user_guide/pandas-comparison.rst

This file was deleted.

0 comments on commit 705365b

Please sign in to comment.