Merge branch 'branch-24.08' into fix-offsetalator-many-rows
Showing 41 changed files with 895 additions and 114 deletions.
@@ -0,0 +1,121 @@
# cudf.pandas
The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [in the user guide](../cudf_pandas/index.rst).
This document explains how the fast-slow proxy mechanism works and documents the internal environment variables that can be used to debug `cudf.pandas` itself.

## fast-slow proxy mechanism
`cudf.pandas` works by wrapping each Pandas type and its corresponding cuDF type in a new proxy type, also known as a fast-slow proxy type.
A proxy type attempts each computation on the fast (cuDF) object first and falls back to the slow (Pandas) object if the fast version fails.

### Types
#### Wrapped Types and Proxy Types
The "wrapped" types/classes are the Pandas-specific and cuDF-specific types that have been wrapped into proxy types.
Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively.
In the snippet below, `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object.
Also note that the module `xpd` is a wrapped module and contains the cuDF and Pandas modules as attributes.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as xpd

# The proxy module exposes the wrapped modules as attributes.
cudf = xpd._fsproxy_fast
pd = xpd._fsproxy_slow

s1 = cudf.Series([1, 2])  # wrapped (fast) object
s2 = pd.Series([1, 2])    # wrapped (slow) object
s3 = xpd.Series([1, 2])   # fast-slow proxy object
```

```{note}
Users should never have to interact with the wrapped objects directly in this way.
This code is purely for demonstrative purposes.
```

#### The Different Kinds of Proxy Types
In `cudf.pandas`, there are two main kinds of proxy types: final types and intermediate types.

##### Final and Intermediate Proxy Types
Final types are types for which known operations exist for converting an object of a "fast" type to a "slow" type and vice versa.
For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas`, and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`.
Intermediate types are the types of the results of operations invoked on final types.
For example, `xpd.DataFrameGroupBy` is an intermediate type that is created during a groupby operation on the final type `xpd.DataFrame`.

##### Attributes and Callable Proxy Types
Final proxy types are typically classes or modules, both of which have attributes.
Classes also have methods.
These attributes and methods must be wrapped as well to support the fast-slow proxy scheme.

#### Creating New Proxy Types
`_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy_type`, respectively.
Creating a new final type looks like this:

```python
DataFrame = make_final_proxy_type(
    "DataFrame",
    cudf.DataFrame,
    pd.DataFrame,
    fast_to_slow=lambda fast: fast.to_pandas(),
    slow_to_fast=cudf.from_pandas,
)
```
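To make the role of the two converters concrete, here is a minimal, self-contained sketch that does not use cuDF or Pandas. `ToyFinalProxy` and its methods are hypothetical names invented for this illustration; `array.array` stands in for the "fast" type and a plain `list` for the "slow" type. It is not how `make_final_proxy_type` is implemented.

```python
# Toy illustration only: ToyFinalProxy is a hypothetical class,
# not part of cudf.pandas.
from array import array


class ToyFinalProxy:
    """Holds either a "fast" or a "slow" object and converts on demand."""

    def __init__(self, wrapped, fast_type, fast_to_slow, slow_to_fast):
        self._wrapped = wrapped
        self._fast_type = fast_type
        self._fast_to_slow = fast_to_slow
        self._slow_to_fast = slow_to_fast

    @property
    def is_fast(self):
        return isinstance(self._wrapped, self._fast_type)

    def as_slow(self):
        # Convert only if the proxy currently wraps the fast object.
        if self.is_fast:
            return self._fast_to_slow(self._wrapped)
        return self._wrapped

    def as_fast(self):
        # Convert only if the proxy currently wraps the slow object.
        if self.is_fast:
            return self._wrapped
        return self._slow_to_fast(self._wrapped)


# "fast" stand-in: array.array of doubles; "slow" stand-in: plain list.
proxy = ToyFinalProxy(
    array("d", [1.0, 2.0]),
    fast_type=array,
    fast_to_slow=lambda fast: fast.tolist(),
    slow_to_fast=lambda slow: array("d", slow),
)
slow_view = proxy.as_slow()
fast_view = proxy.as_fast()
```

The point of the sketch is only that a final type needs a known conversion in both directions. Intermediate types have no such pair of converters.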

### The Fallback Mechanism
Proxied calls are implemented with fallback via [`_fast_slow_function_call`](https://github.com/rapidsai/cudf/blob/57aeeb78d85e169ac18b82f51d2b1cbd01b0608d/python/cudf/cudf/pandas/fast_slow_proxy.py#L869). This implements the mechanism by which we attempt operations the fast way (using cuDF) and then fall back to the slow way (using Pandas) on failure.
The function looks like this:
```python
def _fast_slow_function_call(func: Callable, *args, **kwargs):
    try:
        ...
        # Fast path: convert proxy arguments to their cuDF counterparts.
        fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
        result = func(*fast_args, **fast_kwargs)
        ...
    except Exception:
        ...
        # Slow path: convert proxy arguments to their Pandas counterparts.
        slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs)
        result = func(*slow_args, **slow_kwargs)
        ...
    return _maybe_wrap_result(result, func, *args, **kwargs), fast
```
As we can see, the function attempts to call `func` the fast way using cuDF, and if any `Exception` occurs, it calls the function using Pandas.
In essence, this `try`/`except` is what allows `cudf.pandas` to support the bulk of the Pandas API.

At the end, the function wraps the result from either path in a fast-slow proxy object, if necessary.

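The same try-then-fall-back pattern can be sketched in plain Python without a GPU. Everything below (`fast_mean`, `slow_mean`, `call_with_fallback`) is a hypothetical stand-in written for this illustration. Unlike `_fast_slow_function_call`, which calls one `func` with converted arguments, the toy version takes two separate functions to keep the example short.

```python
# Toy illustration only: these names are hypothetical and are not the
# cudf.pandas implementation.
def fast_mean(values):
    # Pretend the fast path only supports lists of floats.
    if not all(isinstance(v, float) for v in values):
        raise TypeError("fast path supports floats only")
    return sum(values) / len(values)


def slow_mean(values):
    # The slow path accepts anything float() can convert.
    return sum(float(v) for v in values) / len(values)


def call_with_fallback(fast_func, slow_func, *args, **kwargs):
    """Try the fast function; on any Exception, retry with the slow one."""
    try:
        return fast_func(*args, **kwargs), True
    except Exception:
        return slow_func(*args, **kwargs), False


result_fast, used_fast = call_with_fallback(fast_mean, slow_mean, [1.0, 2.0, 3.0])
result_slow, used_slow = call_with_fallback(fast_mean, slow_mean, ["1", "2", "3"])
```

Both calls return the same mean. The second element of the returned tuple records which path produced it, mirroring the `fast` flag returned in the excerpt above.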
#### Converting Proxy Objects
Before `func` is called, the proxy object and its attributes need to be converted to either their cuDF or Pandas implementations.
This conversion is handled by the function `_transform_arg`, which both `_fast_arg` and `_slow_arg` call.

`_transform_arg` is a recursive function that calls itself depending on the type of argument passed to it (e.g. `_transform_arg` is called for each element in a list of arguments).
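
A rough sketch of this kind of recursive conversion is shown below. `ToyProxy` and `toy_transform_arg` are hypothetical names, and the set of container types handled here (list, tuple, dict) is an assumption made for the example rather than a description of what `_transform_arg` actually covers.

```python
# Toy illustration only: ToyProxy and toy_transform_arg are hypothetical.
class ToyProxy:
    def __init__(self, fast, slow):
        self.fast = fast
        self.slow = slow


def toy_transform_arg(arg, attribute_name):
    """Recursively replace each ToyProxy with its `fast` or `slow` attribute."""
    if isinstance(arg, ToyProxy):
        return getattr(arg, attribute_name)
    if isinstance(arg, (list, tuple)):
        # Rebuild the same container type from the transformed elements.
        return type(arg)(toy_transform_arg(a, attribute_name) for a in arg)
    if isinstance(arg, dict):
        return {k: toy_transform_arg(v, attribute_name) for k, v in arg.items()}
    return arg  # non-proxy leaves pass through unchanged


args = ([ToyProxy("gpu_a", "cpu_a"), 1], {"other": ToyProxy("gpu_b", "cpu_b")})
fast_args = toy_transform_arg(args, "fast")
slow_args = toy_transform_arg(args, "slow")
```

The same nested structure comes back in both cases, with only the proxies swapped for the requested implementation.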

### Using Metaclasses
`cudf.pandas` uses a [metaclass](https://docs.python.org/3/glossary.html#term-metaclass) called `_FastSlowProxyMeta` to find class attributes and classmethods of fast-slow proxy types.
For example, in the snippet below, the `xpd.Series` type is an instance of `_FastSlowProxyMeta`.
Therefore, we can access the property `_fsproxy_fast` defined in the metaclass.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as xpd

print(xpd.Series._fsproxy_fast)  # output is cudf.core.series.Series
```
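The underlying Python behaviour, that a property defined on a metaclass is reachable from the class itself, can be shown without cuDF. `ToyProxyMeta`, `ToySeries` and `_toy_fast` are hypothetical names used only for this sketch.

```python
# Toy illustration only: ToyProxyMeta and ToySeries are hypothetical names.
class ToyProxyMeta(type):
    @property
    def _toy_fast(cls):
        # A property defined on the metaclass is looked up on the class itself.
        return cls._fast_type


class ToySeries(metaclass=ToyProxyMeta):
    _fast_type = list  # stand-in for a cuDF type


fast_type = ToySeries._toy_fast
visible_on_instance = hasattr(ToySeries(), "_toy_fast")
```

Note that the property is available on the class but not on its instances, because instance attribute lookup does not consult the metaclass.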

## debugging `cudf.pandas`
Several environment variables are available for debugging purposes.

Setting the environment variable `CUDF_PANDAS_DEBUGGING` produces a warning when the results from cuDF and Pandas differ from one another.
For example, the snippet below produces the warning that follows it.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as pd
import numpy as np

# Force the slow (Pandas) path of `mean` to return a wrong value.
setattr(pd.Series.mean, "_fsproxy_slow", lambda self, *args, **kwargs: np.float64(1))
s = pd.Series([1, 2, 3])
s.mean()
```
```
UserWarning: The results from cudf and pandas were different. The exception was
Arrays are not almost equal to 7 decimals
ACTUAL: 1.0
DESIRED: 2.0.
```
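The idea of running both paths and warning on a mismatch can be sketched with the standard library alone. `compare_results` is a hypothetical helper written for this illustration, and the tolerance and message format are assumptions. It does not reproduce the comparison that `cudf.pandas` performs.

```python
# Toy illustration only: compare_results is a hypothetical helper, not the
# cudf.pandas implementation.
import math
import warnings


def compare_results(fast_result, slow_result, rel_tol=1e-7):
    """Warn (rather than raise) when the two results disagree."""
    if not math.isclose(fast_result, slow_result, rel_tol=rel_tol):
        warnings.warn(
            f"The results were different: fast={fast_result}, slow={slow_result}",
            UserWarning,
        )
        return False
    return True


# Capture the warning so the example can inspect it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = compare_results(2.0, 1.0)
```

Emitting a warning instead of raising keeps the user's program running while still surfacing the discrepancy.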
@@ -27,4 +27,5 @@ testing
benchmarking
options
pylibcudf
cudf_pandas
```
@@ -0,0 +1,6 @@
====
Avro
====

.. automodule:: cudf._lib.pylibcudf.io.avro
   :members:
docs/cudf/source/user_guide/api_docs/pylibcudf/io/index.rst (18 changes: 18 additions & 0 deletions)
@@ -0,0 +1,18 @@
===
I/O
===

I/O Utility Classes
===================

.. automodule:: cudf._lib.pylibcudf.io.types
   :members:


I/O Functions
=============

.. toctree::
    :maxdepth: 1

    avro
docs/cudf/source/user_guide/api_docs/pylibcudf/strings/contains.rst (6 changes: 6 additions & 0 deletions)
@@ -0,0 +1,6 @@
========
contains
========

.. automodule:: cudf._lib.pylibcudf.strings.contains
   :members:
@@ -4,4 +4,5 @@ strings
.. toctree::
    :maxdepth: 1

    contains
    replace
@@ -0,0 +1,25 @@
# =============================================================================
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

set(cython_sources avro.pyx types.pyx)

set(linked_libraries cudf::cudf)
rapids_cython_create_modules(
  CXX
  SOURCE_FILES "${cython_sources}"
  LINKED_LIBRARIES "${linked_libraries}" MODULE_PREFIX pylibcudf_io_ ASSOCIATED_TARGETS cudf
)

set(targets_using_arrow_headers pylibcudf_io_avro pylibcudf_io_types)
link_to_pyarrow_headers("${targets_using_arrow_headers}")
@@ -0,0 +1,4 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

from . cimport avro, types
from .types cimport SourceInfo, TableWithMetadata
@@ -0,0 +1,4 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

from . import avro, types
from .types import SourceInfo, TableWithMetadata
@@ -0,0 +1,12 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
from cudf._lib.pylibcudf.io.types cimport SourceInfo, TableWithMetadata
from cudf._lib.pylibcudf.libcudf.io.avro cimport avro_reader_options
from cudf._lib.pylibcudf.libcudf.types cimport size_type


cpdef TableWithMetadata read_avro(
    SourceInfo source_info,
    list columns = *,
    size_type skip_rows = *,
    size_type num_rows = *
)