Interchange dataframe protocol (#9071)
This PR is a basic implementation of the [interchange dataframe protocol](https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py) for cudf.
As is well known, there are many dataframe libraries, and one library's weakness is often handled by another. To work across these libraries, we currently rely on `pandas`, with methods like `from_pandas` and `to_pandas`.
This is a bad design, as each library must maintain an additional dependency on pandas and its peculiarities.
This protocol provides a high-level API that dataframe libraries must implement in order to communicate with one another.
Thus, we get rid of the tight coupling with pandas and depend only on the protocol API, which leaves each library free to choose its own implementation details.
To illustrate:

- `df_obj = cudf_dataframe.__dataframe__()`

  `df_obj` can be consumed by any library implementing the protocol.
- `df = cudf.from_dataframe(any_supported_dataframe)`

  Here we create a `cudf.DataFrame` from any dataframe object supporting the protocol.
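
The producer/consumer round trip described above can be sketched in plain Python without a GPU. This is a toy model, not cudf's implementation: `ProducerFrame`, `ConsumerFrame`, `ToyInterchangeFrame`, `ToyColumn`, and `to_list()` are hypothetical names made up for illustration. Only `__dataframe__`, `column_names()`, and `get_column_by_name()` mirror names from the protocol, and the real protocol exchanges memory buffers through `get_buffers()` rather than Python lists.

```python
from typing import Any, Dict, List


class ToyColumn:
    """Hypothetical column wrapper; the real protocol exposes buffers instead."""

    def __init__(self, values: List[Any]) -> None:
        self._values = list(values)

    def to_list(self) -> List[Any]:
        # Hypothetical helper for this sketch; not part of the protocol.
        return list(self._values)


class ToyInterchangeFrame:
    """Simplified stand-in for the object returned by ``__dataframe__``."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> ToyColumn:
        return ToyColumn(self._data[name])


class ProducerFrame:
    """Hypothetical library dataframe that exports itself via the protocol."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self.data = {name: list(values) for name, values in data.items()}

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ToyInterchangeFrame(self.data)


class ConsumerFrame:
    """Hypothetical dataframe of a second, unrelated library."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self.data = data


def from_dataframe(df: Any, allow_copy: bool = False) -> ConsumerFrame:
    # The consumer only touches the protocol surface, never producer internals.
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support the __dataframe__ protocol")
    xchg = df.__dataframe__(allow_copy=allow_copy)
    columns = {
        name: xchg.get_column_by_name(name).to_list()
        for name in xchg.column_names()
    }
    return ConsumerFrame(columns)


source = ProducerFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
result = from_dataframe(source)
print(result.data)
```

Neither class imports or knows about the other; the only contract between them is the object returned by `__dataframe__()`.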

So far, it supports the following:

- Column dtypes: `uint8`, `int`, `float`, `bool` and `categorical`.
- Missing values are handled for all these dtypes.
- `string` support is on the way.
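
This description does not say how missing values are represented. One common representation is an Arrow-style validity bitmask, where bit `i` (least-significant bit first) is 1 when row `i` is valid. The sketch below assumes that layout; `apply_validity_bitmask` is a hypothetical helper, not a cudf or protocol function.

```python
from typing import List, Optional


def apply_validity_bitmask(values: List[int], mask: bytes) -> List[Optional[int]]:
    """Replace entries whose validity bit is 0 with None.

    Assumes an Arrow-style bitmask: bit i of the mask (LSB first within
    each byte) is 1 when values[i] is valid and 0 when it is missing.
    """
    out: List[Optional[int]] = []
    for i, value in enumerate(values):
        byte = mask[i // 8]
        is_valid = (byte >> (i % 8)) & 1
        out.append(value if is_valid else None)
    return out


# Mask 0b00001101: rows 0, 2 and 3 are valid, row 1 is missing.
decoded = apply_validity_bitmask([10, 20, 30, 40], bytes([0b00001101]))
print(decoded)  # [10, None, 30, 40]
```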

Additionally, we support dataframes from CPU devices, such as `pandas` dataframes. However, this is not testable here because pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.
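
The patch used for that local test is not part of this commit. As a generic sketch of the monkey-patching technique, a `__dataframe__` method can be attached to an existing class at runtime. `LegacyFrame` and `InterchangeView` below are hypothetical stand-ins, not pandas classes.

```python
class LegacyFrame:
    """Hypothetical CPU dataframe class that predates the protocol."""

    def __init__(self, columns):
        self.columns = columns


class InterchangeView:
    """Hypothetical, heavily simplified interchange object."""

    def __init__(self, frame, allow_copy):
        self._frame = frame
        self._allow_copy = allow_copy

    def column_names(self):
        return list(self._frame.columns)


def _dataframe(self, nan_as_null=False, allow_copy=True):
    # Same signature as the __dataframe__ method added in this commit.
    return InterchangeView(self, allow_copy)


# Before patching, the class does not expose the protocol entry point.
assert not hasattr(LegacyFrame, "__dataframe__")

# Monkey patch: attach the entry point to the class at runtime.
LegacyFrame.__dataframe__ = _dataframe

frame = LegacyFrame({"x": [1, 2], "y": [3, 4]})
view = frame.__dataframe__()
print(view.column_names())  # ['x', 'y']
```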

Authors:
  - Ismaël Koné (https://github.com/iskode)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9071
iskode authored Nov 17, 2021
1 parent 17e6f5b commit 32bacfa
Showing 4 changed files with 1,061 additions and 2 deletions.
2 changes: 1 addition & 1 deletion python/cudf/cudf/__init__.py
@@ -42,7 +42,7 @@
UInt64Index,
interval_range,
)
-from cudf.core.dataframe import DataFrame, from_pandas, merge
+from cudf.core.dataframe import DataFrame, from_pandas, merge, from_dataframe
from cudf.core.series import Series
from cudf.core.multiindex import MultiIndex
from cudf.core.cut import cut
13 changes: 12 additions & 1 deletion python/cudf/cudf/core/dataframe.py
@@ -40,7 +40,7 @@
is_string_dtype,
is_struct_dtype,
)
-from cudf.core import column, reshape
+from cudf.core import column, df_protocol, reshape
from cudf.core.abc import Serializable
from cudf.core.column import (
as_column,
@@ -6329,6 +6329,17 @@ def explode(self, column, ignore_index=False):

return super()._explode(column, ignore_index)

+    def __dataframe__(
+        self, nan_as_null: bool = False, allow_copy: bool = True
+    ):
+        return df_protocol.__dataframe__(
+            self, nan_as_null=nan_as_null, allow_copy=allow_copy
+        )
+
+
+def from_dataframe(df, allow_copy=False):
+    return df_protocol.from_dataframe(df, allow_copy=allow_copy)
+

def make_binop_func(op, postprocess=None):
# This function is used to wrap binary operations in Frame with an