Skip to content

Commit

Permalink
Merge branch 'branch-24.08' into feature/combine-fast-and-slow-paths
Browse files Browse the repository at this point in the history
  • Loading branch information
galipremsagar authored Jun 6, 2024
2 parents 276c095 + d1e511e commit 4637b04
Show file tree
Hide file tree
Showing 98 changed files with 2,412 additions and 851 deletions.
306 changes: 306 additions & 0 deletions CHANGELOG.md

Large diffs are not rendered by default.

34 changes: 17 additions & 17 deletions cpp/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -68,12 +68,14 @@ function(ConfigureTest CMAKE_TEST_NAME)
INSTALL_COMPONENT_SET testing
)

set_tests_properties(
${CMAKE_TEST_NAME}
PROPERTIES
ENVIRONMENT
"GTEST_CUDF_STREAM_MODE=new_${_CUDF_TEST_STREAM_MODE}_default;LD_PRELOAD=$<TARGET_FILE:cudf_identify_stream_usage_mode_${_CUDF_TEST_STREAM_MODE}>"
)
if(CUDF_BUILD_STREAMS_TEST_UTIL)
set_tests_properties(
${CMAKE_TEST_NAME}
PROPERTIES
ENVIRONMENT
"GTEST_CUDF_STREAM_MODE=new_${_CUDF_TEST_STREAM_MODE}_default;LD_PRELOAD=$<TARGET_FILE:cudf_identify_stream_usage_mode_${_CUDF_TEST_STREAM_MODE}>"
)
endif()
endfunction()

# ##################################################################################################
Expand Down Expand Up @@ -401,14 +403,10 @@ ConfigureTest(SPAN_TEST utilities_tests/span_tests.cu)
ConfigureTest(SPAN_TEST_DEVICE_VECTOR utilities_tests/span_tests.cu)

# Overwrite the environments set by ConfigureTest
set_tests_properties(
SPAN_TEST
PROPERTIES
ENVIRONMENT
"GTEST_FILTER=-${_allowlist_filter};GTEST_CUDF_STREAM_MODE=new_cudf_default;LD_PRELOAD=$<TARGET_FILE:cudf_identify_stream_usage_mode_cudf>"
)
set_tests_properties(
SPAN_TEST_DEVICE_VECTOR PROPERTIES ENVIRONMENT "GTEST_FILTER=${_allowlist_filter}"
set_property(
TEST SPAN_TEST SPAN_TEST_DEVICE_VECTOR
APPEND
PROPERTY ENVIRONMENT "GTEST_FILTER=-${_allowlist_filter}"
)

# ##################################################################################################
Expand Down Expand Up @@ -671,9 +669,11 @@ target_include_directories(JIT_PARSER_TEST PRIVATE "$<BUILD_INTERFACE:${CUDF_SOU

# ##################################################################################################
# * stream testing ---------------------------------------------------------------------------------
ConfigureTest(
STREAM_IDENTIFICATION_TEST identify_stream_usage/test_default_stream_identification.cu
)
if(CUDF_BUILD_STREAMS_TEST_UTIL)
ConfigureTest(
STREAM_IDENTIFICATION_TEST identify_stream_usage/test_default_stream_identification.cu
)
endif()

ConfigureTest(STREAM_BINARYOP_TEST streams/binaryop_test.cpp STREAM_MODE testing)
ConfigureTest(STREAM_CONCATENATE_TEST streams/concatenate_test.cpp STREAM_MODE testing)
Expand Down
121 changes: 121 additions & 0 deletions docs/cudf/source/developer_guide/cudf_pandas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# cudf.pandas
The use of the cuDF pandas accelerator mode (`cudf.pandas`) is explained [in the user guide](../cudf_pandas/index.rst).
The purpose of this document is to explain how the fast-slow proxy mechanism works and document internal environment variables that can be used to debug `cudf.pandas` itself.

## fast-slow proxy mechanism
`cudf.pandas` works by wrapping each Pandas type and its corresponding cuDF type in a new proxy type also known as a fast-slow proxy type.
The purpose of proxy types is to attempt computations on the fast (cuDF) object first, and then fall back to running on the slow (Pandas) object if the fast version fails.

### Types:
#### Wrapped Types and Proxy Types
The "wrapped" types/classes are the Pandas and cuDF specific types that have been wrapped into proxy types.
Wrapped objects and proxy objects are instances of wrapped types and proxy types, respectively.
In the snippet below `s1` and `s2` are wrapped objects and `s3` is a fast-slow proxy object.
Also note that the module `xpd` is a wrapped module and contains cuDF and Pandas modules as attributes.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as xpd

cudf = xpd._fsproxy_fast
pd = xpd._fsproxy_slow

s1 = cudf.Series([1,2])
s2 = pd.Series([1,2])
s3 = xpd.Series([1,2])
```

```{note}
Note that users should never have to interact with the wrapped objects directly in this way.
This code is purely for demonstrative purposes.
```

#### The Different Kinds of Proxy Types
In `cudf.pandas`, there are two main kinds of proxy types: final types and intermediate types.

##### Final and Intermediate Proxy Types
Final types are types for which known operations exist for converting an object of a "fast" type to a "slow" type and vice versa.
For example, `cudf.DataFrame` can be converted to Pandas using the method `to_pandas`, and `pd.DataFrame` can be converted to cuDF using the function `cudf.from_pandas`.
Intermediate types are the types of the results of operations invoked on final types.
For example, `xpd.DataFrameGroupBy` is an intermediate type that will be created during a groupby operation on the final type `xpd.DataFrame`.

##### Attributes and Callable Proxy Types
Final proxy types are typically classes or modules, both of which have attributes.
Classes also have methods.
These attributes and methods must be wrapped as well to support the fast-slow proxy scheme.

#### Creating New Proxy Types
`_FinalProxy` and `_IntermediateProxy` types are created using the functions `make_final_proxy_type` and `make_intermediate_proxy` type, respectively.
Creating a new final type looks like this.

```python
DataFrame = make_final_proxy_type(
"DataFrame",
cudf.DataFrame,
pd.DataFrame,
fast_to_slow=lambda fast: fast.to_pandas(),
slow_to_fast=cudf.from_pandas,
)
```

### The Fallback Mechanism
Proxied calls are implemented with fallback via [`_fast_slow_function_call`](https://github.com/rapidsai/cudf/blob/57aeeb78d85e169ac18b82f51d2b1cbd01b0608d/python/cudf/cudf/pandas/fast_slow_proxy.py#L869). This implements the mechanism by which we attempt operations the fast way (using cuDF) and then fall back to the slow way (using Pandas) on failure.
The function looks like this:
```python
def _fast_slow_function_call(func: Callable, *args, **kwargs):
try:
...
fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
result = func(*fast_args, **fast_kwargs)
...
except Exception:
...
slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs)
result = func(*slow_args, **slow_kwargs)
...
return _maybe_wrap_result(result, func, *args, **kwargs), fast
```
As we can see the function attempts to call `func` the fast way using cuDF and if any `Exception` occurs, it calls the function using Pandas.
In essence, this `try-except` is what allows `cudf.pandas` to support the bulk of the Pandas API.

At the end, the function wraps the result from either path in a fast-slow proxy object, if necessary.

#### Converting Proxy Objects
Note that before the `func` is called, the proxy object and its attributes need to be converted to either their cuDF or Pandas implementations.
This conversion is handled in the function `_transform_arg` which both `_fast_arg` and `_slow_arg` call.

`_transform_arg` is a recursive function that will call itself depending on the type or argument passed to it (eg. `_transform_arg` is called for each element in a list of arguments).

### Using Metaclasses
`cudf.pandas` uses a [metaclass](https://docs.python.org/3/glossary.html#term-metaclass) called (`_FastSlowProxyMeta`) to find class attributes and classmethods of fast-slow proxy types.
For example, in the snippet below, the `xpd.Series` type is an instance of `_FastSlowProxyMeta`.
Therefore we can access the property `_fsproxy_fast` defined in the metaclass.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as xpd

print(xpd.Series._fsproxy_fast) # output is cudf.core.series.Series
```

## debugging `cudf.pandas`
Several environment variables are available for debugging purposes.

Setting the environment variable `CUDF_PANDAS_DEBUGGING` produces a warning when the results from cuDF and Pandas differ from one another.
For example, the snippet below produces the warning below.
```python
import cudf.pandas
cudf.pandas.install()
import pandas as pd
import numpy as np

setattr(pd.Series.mean, "_fsproxy_slow", lambda self, *args, **kwargs: np.float64(1))
s = pd.Series([1,2,3])
s.mean()
```
```
UserWarning: The results from cudf and pandas were different. The exception was
Arrays are not almost equal to 7 decimals
ACTUAL: 1.0
DESIRED: 2.0.
```
1 change: 1 addition & 0 deletions docs/cudf/source/developer_guide/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,4 +27,5 @@ testing
benchmarking
options
pylibcudf
cudf_pandas
```
9 changes: 8 additions & 1 deletion docs/cudf/source/user_guide/api_docs/pylibcudf/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ This page provides API documentation for pylibcudf.

.. toctree::
:maxdepth: 1
:caption: API Documentation
:caption: Top-level modules

aggregation
binaryop
Expand All @@ -17,6 +17,7 @@ This page provides API documentation for pylibcudf.
filling
gpumemoryview
groupby
io/index.rst
join
lists
merge
Expand All @@ -32,3 +33,9 @@ This page provides API documentation for pylibcudf.
table
types
unary

.. toctree::
:maxdepth: 2
:caption: Subpackages

strings/index.rst
6 changes: 6 additions & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/io/avro.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
====
Avro
====

.. automodule:: cudf._lib.pylibcudf.io.avro
:members:
18 changes: 18 additions & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/io/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
===
I/O
===

I/O Utility Classes
===================

.. automodule:: cudf._lib.pylibcudf.io.types
:members:


I/O Functions
=============

.. toctree::
:maxdepth: 1

avro
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
========
contains
========

.. automodule:: cudf._lib.pylibcudf.strings.contains
:members:
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
strings
=======

.. toctree::
:maxdepth: 1

contains
replace
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
=======
replace
=======

.. automodule:: cudf._lib.pylibcudf.strings.replace
:members:
50 changes: 14 additions & 36 deletions python/cudf/cudf/_lib/avro.pyx
Original file line number Diff line number Diff line change
@@ -1,20 +1,12 @@
# Copyright (c) 2020-2024, NVIDIA CORPORATION.

from libcpp.string cimport string
from libcpp.utility cimport move
from libcpp.vector cimport vector
from cudf._lib.utils cimport data_from_pylibcudf_io

from cudf._lib.io.utils cimport make_source_info
from cudf._lib.pylibcudf.libcudf.io.avro cimport (
avro_reader_options,
read_avro as libcudf_read_avro,
)
from cudf._lib.pylibcudf.libcudf.io.types cimport table_with_metadata
from cudf._lib.pylibcudf.libcudf.types cimport size_type
from cudf._lib.utils cimport data_from_unique_ptr
import cudf._lib.pylibcudf as plc
from cudf._lib.pylibcudf.io.types import SourceInfo


cpdef read_avro(datasource, columns=None, skip_rows=-1, num_rows=-1):
cpdef read_avro(datasource, columns=None, skip_rows=0, num_rows=-1):
"""
Cython function to call libcudf read_avro, see `read_avro`.
Expand All @@ -28,28 +20,14 @@ cpdef read_avro(datasource, columns=None, skip_rows=-1, num_rows=-1):

if not isinstance(num_rows, int) or num_rows < -1:
raise TypeError("num_rows must be an int >= -1")
if not isinstance(skip_rows, int) or skip_rows < -1:
raise TypeError("skip_rows must be an int >= -1")

cdef vector[string] c_columns
if columns is not None and len(columns) > 0:
c_columns.reserve(len(columns))
for col in columns:
c_columns.push_back(str(col).encode())

cdef avro_reader_options options = move(
avro_reader_options.builder(make_source_info([datasource]))
.columns(c_columns)
.skip_rows(<size_type> skip_rows)
.num_rows(<size_type> num_rows)
.build()
if not isinstance(skip_rows, int) or skip_rows < 0:
raise TypeError("skip_rows must be an int >= 0")

return data_from_pylibcudf_io(
plc.io.avro.read_avro(
SourceInfo([datasource]),
columns,
skip_rows,
num_rows
)
)

cdef table_with_metadata c_result

with nogil:
c_result = move(libcudf_read_avro(options))

names = [info.name.decode() for info in c_result.metadata.schema_info]

return data_from_unique_ptr(move(c_result.tbl), column_names=names)
8 changes: 4 additions & 4 deletions python/cudf/cudf/_lib/csv.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -151,14 +151,14 @@ cdef csv_reader_options make_csv_reader_options(
)

if quoting == 1:
c_quoting = quote_style.QUOTE_ALL
c_quoting = quote_style.ALL
elif quoting == 2:
c_quoting = quote_style.QUOTE_NONNUMERIC
c_quoting = quote_style.NONNUMERIC
elif quoting == 3:
c_quoting = quote_style.QUOTE_NONE
c_quoting = quote_style.NONE
else:
# Default value
c_quoting = quote_style.QUOTE_MINIMAL
c_quoting = quote_style.MINIMAL

cdef csv_reader_options csv_reader_options_c = move(
csv_reader_options.builder(c_source_info)
Expand Down
2 changes: 1 addition & 1 deletion python/cudf/cudf/_lib/parquet.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -491,7 +491,7 @@ def write_parquet(
"Valid values are '1.0' and '2.0'"
)

dict_policy = (
cdef cudf_io_types.dictionary_policy dict_policy = (
cudf_io_types.dictionary_policy.ADAPTIVE
if use_dictionary
else cudf_io_types.dictionary_policy.NEVER
Expand Down
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/pylibcudf/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -50,3 +50,4 @@ link_to_pyarrow_headers(pylibcudf_interop)

add_subdirectory(libcudf)
add_subdirectory(strings)
add_subdirectory(io)
3 changes: 2 additions & 1 deletion python/cudf/cudf/_lib/pylibcudf/interop.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ ARROW_TO_PYLIBCUDF_TYPES = {
pa.timestamp('us'): type_id.TIMESTAMP_MICROSECONDS,
pa.timestamp('ns'): type_id.TIMESTAMP_NANOSECONDS,
pa.date32(): type_id.TIMESTAMP_DAYS,
pa.null(): type_id.EMPTY,
}

LIBCUDF_TO_ARROW_TYPES = {
Expand Down Expand Up @@ -245,7 +246,7 @@ def _to_arrow_datatype(cudf_object, **kwargs):
return pa.list_(value_type)
else:
try:
return ARROW_TO_PYLIBCUDF_TYPES[cudf_object.id()]
return LIBCUDF_TO_ARROW_TYPES[cudf_object.id()]
except KeyError:
raise TypeError(
f"Unable to convert {cudf_object.id()} to arrow datatype"
Expand Down
25 changes: 25 additions & 0 deletions python/cudf/cudf/_lib/pylibcudf/io/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# =============================================================================
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

set(cython_sources avro.pyx types.pyx)

set(linked_libraries cudf::cudf)
rapids_cython_create_modules(
CXX
SOURCE_FILES "${cython_sources}"
LINKED_LIBRARIES "${linked_libraries}" MODULE_PREFIX pylibcudf_io_ ASSOCIATED_TARGETS cudf
)

set(targets_using_arrow_headers pylibcudf_io_avro pylibcudf_io_types)
link_to_pyarrow_headers("${targets_using_arrow_headers}")
Loading

0 comments on commit 4637b04

Please sign in to comment.