Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch pylibcudf-enabled types to use enum class in Cython #13931

Merged
merged 13 commits into from
Aug 25, 2023
66 changes: 55 additions & 11 deletions docs/cudf/source/developer_guide/pylibcudf.md
Original file line number Diff line number Diff line change
Expand Up @@ -86,26 +86,70 @@ cpdef Table gather(
cpp_copying.gather(
source_table.view(),
gather_map.view(),
py_policy_to_c_policy(bounds_policy)
bounds_policy
)
)
return Table.from_libcudf(move(c_result))
```

There are a couple of notable points from the snippet above:
- The object returned from libcudf is immediately converted to a pylibcudf type.
- `cudf::gather` accepts a `cudf::out_of_bounds_policy` enum parameter, which is mirrored by the `cdef `class OutOfBoundsPolicy` as mentioned in [the data structures example above](data-structures).
- `cudf::gather` accepts a `cudf::out_of_bounds_policy` enum parameter. `OutOfBoundsPolicy` is an alias for this type in pylibcudf that matches our Python naming conventions (CapsCase instead of snake\_case).

## Miscellaneous Notes

### Cython Scoped Enums and Casting
Cython does not support scoped enumerations.
It assumes that enums correspond to their underlying value types and will thus attempt operations that are invalid.
To fix this, many places in pylibcudf Cython code contain double casts that look like
### Cython Scoped Enums
Cython 3 introduced support for scoped enumerations.
However, this support has some bugs as well as some easy pitfalls.
Our usage of enums is intended to minimize the complexity of our code while also working around Cython's limitations.

```{warning}
The guidance in this section may change often as Cython is updated and our understanding of best practices evolves.
```

- All pxd files that declare a C++ enum should use `cpdef enum class` declarations.
- Reason: This declaration makes the C++ enum available in Cython code while also transparently creating a Python enum.
- Any pxd file containing only C++ declarations must still have a corresponding pyx file if any of the declarations are scoped enums.
- Reason: The creation of the Python enum requires that Cython actually generate the necessary code Python C API code, which will not happen if only a pxd file is present.
vyasr marked this conversation as resolved.
Show resolved Hide resolved
- If a C++ enum will be part of a pylibcudf module's public API, then it should be imported (not cimported) directly into the pyx file and aliased with a name that matches our Python class naming conventions (CapsCase) instead of our C++ naming convention (snake\_case).
- Reason: We want to expose the enum to both Python and Cython consumers of the module. As a side effect, this aliasing avoids [this Cython bug](https://github.com/cython/cython/issues/5609).
- Note: Once the above Cython bug is resolved, the enum should also be aliased into the pylibcudf pxd file when it is cimported so that Python and Cython usage will match.

Here is an example of appropriate enum usage.


```cython
return <cpp_type> (
<underlying_type_t_cpp_type> py_policy
)
# cpp/copying.pxd
cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:
# cpdef here so that we export both a cdef enum class and a Python enum.Enum.
cpdef enum class out_of_bounds_policy(bool):
NULLIFY
DONT_CHECK


# cpp/copying.pyx
# This file is empty, but is required to compile the Python enum in cpp/copying.pxd


# pylibcudf/copying.pxd

# cimport the enum using the exact name
# Once https://github.com/cython/cython/issues/5609 is resolved,
# this import should instead be
# from cudf._lib.cpp.copying cimport out_of_bounds_policy as OutOfBoundsPolicy
from cudf._lib.cpp.copying cimport out_of_bounds_policy


# pylibcudf/copying.pyx
# Access cpp.copying members that aren't part of this module's public API via
# this module alias
from cudf._lib.cpp cimport copying as cpp_copying
from cudf._lib.cpp.copying cimport out_of_bounds_policy

# This import exposes the enum in the public API of this module.
# It requires a no-cython-lint tag because it will be unused: all typing of
# parameters etc will need to use the Cython name `out_of_bounds_policy` until
# the Cython bug is resolved.
from cudf._lib.cpp.copying import \
out_of_bounds_policy as OutOfBoundsPolicy # no-cython-lint
```
where `cpp_type` is some libcudf enum with a specified underlying type.
This double-cast will be removed when we migrate to Cython 3, which adds proper support for C++ scoped enumerations.
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,7 @@ foreach(target IN LISTS targets_using_arrow_headers)
target_include_directories(${target} PRIVATE "${PYARROW_INCLUDE_DIR}")
endforeach()

add_subdirectory(cpp)
add_subdirectory(io)
add_subdirectory(nvtext)
add_subdirectory(pylibcudf)
Expand Down
23 changes: 23 additions & 0 deletions python/cudf/cudf/_lib/cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
# =============================================================================
# Copyright (c) 2022-2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

set(cython_sources copying.pyx types.pyx)

set(linked_libraries cudf::cudf)

rapids_cython_create_modules(
CXX
SOURCE_FILES "${cython_sources}"
LINKED_LIBRARIES "${linked_libraries}" ASSOCIATED_TARGETS cudf MODULE_PREFIX cpp
)
6 changes: 3 additions & 3 deletions python/cudf/cudf/_lib/cpp/copying.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -19,9 +19,9 @@ from cudf._lib.exception_handler cimport cudf_exception_handler
ctypedef const scalar constscalar

cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:
ctypedef enum out_of_bounds_policy:
NULLIFY 'cudf::out_of_bounds_policy::NULLIFY'
DONT_CHECK 'cudf::out_of_bounds_policy::DONT_CHECK'
cpdef enum class out_of_bounds_policy(bool):
NULLIFY
DONT_CHECK

cdef unique_ptr[table] gather (
const table_view& source_table,
Expand Down
Empty file.
65 changes: 35 additions & 30 deletions python/cudf/cudf/_lib/cpp/types.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,11 @@ from libc.stdint cimport int32_t, uint32_t


cdef extern from "cudf/types.hpp" namespace "cudf" nogil:
# The declaration below is to work around
# https://github.com/cython/cython/issues/5637
"""
#define __PYX_ENUM_CLASS_DECL enum
"""
ctypedef int32_t size_type
ctypedef uint32_t bitmask_type
ctypedef uint32_t char_utf8
Expand Down Expand Up @@ -49,36 +54,36 @@ cdef extern from "cudf/types.hpp" namespace "cudf" nogil:
ALL_EQUAL "cudf::nan_equality::ALL_EQUAL"
NANS_UNEQUAL "cudf::nan_equality::UNEQUAL"

ctypedef enum type_id "cudf::type_id":
EMPTY "cudf::type_id::EMPTY"
INT8 "cudf::type_id::INT8"
INT16 "cudf::type_id::INT16"
INT32 "cudf::type_id::INT32"
INT64 "cudf::type_id::INT64"
UINT8 "cudf::type_id::UINT8"
UINT16 "cudf::type_id::UINT16"
UINT32 "cudf::type_id::UINT32"
UINT64 "cudf::type_id::UINT64"
FLOAT32 "cudf::type_id::FLOAT32"
FLOAT64 "cudf::type_id::FLOAT64"
BOOL8 "cudf::type_id::BOOL8"
TIMESTAMP_DAYS "cudf::type_id::TIMESTAMP_DAYS"
TIMESTAMP_SECONDS "cudf::type_id::TIMESTAMP_SECONDS"
TIMESTAMP_MILLISECONDS "cudf::type_id::TIMESTAMP_MILLISECONDS"
TIMESTAMP_MICROSECONDS "cudf::type_id::TIMESTAMP_MICROSECONDS"
TIMESTAMP_NANOSECONDS "cudf::type_id::TIMESTAMP_NANOSECONDS"
DICTIONARY32 "cudf::type_id::DICTIONARY32"
STRING "cudf::type_id::STRING"
LIST "cudf::type_id::LIST"
STRUCT "cudf::type_id::STRUCT"
NUM_TYPE_IDS "cudf::type_id::NUM_TYPE_IDS"
DURATION_SECONDS "cudf::type_id::DURATION_SECONDS"
DURATION_MILLISECONDS "cudf::type_id::DURATION_MILLISECONDS"
DURATION_MICROSECONDS "cudf::type_id::DURATION_MICROSECONDS"
DURATION_NANOSECONDS "cudf::type_id::DURATION_NANOSECONDS"
DECIMAL32 "cudf::type_id::DECIMAL32"
DECIMAL64 "cudf::type_id::DECIMAL64"
DECIMAL128 "cudf::type_id::DECIMAL128"
cpdef enum class type_id(int32_t):
EMPTY
INT8
INT16
INT32
INT64
UINT8
UINT16
UINT32
UINT64
FLOAT32
FLOAT64
BOOL8
TIMESTAMP_DAYS
TIMESTAMP_SECONDS
TIMESTAMP_MILLISECONDS
TIMESTAMP_MICROSECONDS
TIMESTAMP_NANOSECONDS
DICTIONARY32
STRING
LIST
STRUCT
NUM_TYPE_IDS
DURATION_SECONDS
DURATION_MILLISECONDS
DURATION_MICROSECONDS
DURATION_NANOSECONDS
DECIMAL32
DECIMAL64
DECIMAL128

cdef cppclass data_type:
data_type() except +
Expand Down
Empty file.
5 changes: 3 additions & 2 deletions python/cudf/cudf/_lib/pylibcudf/__init__.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,14 @@ from . cimport copying
from .column cimport Column
from .gpumemoryview cimport gpumemoryview
from .table cimport Table
from .types cimport DataType, TypeId
# TODO: cimport type_id once
# https://github.com/cython/cython/issues/5609 is resolved
from .types cimport DataType

__all__ = [
"Column",
"DataType",
"Table",
"TypeId",
"copying",
"gpumemoryview",
]
5 changes: 3 additions & 2 deletions python/cudf/cudf/_lib/pylibcudf/column.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ from cudf._lib.cpp.column.column cimport column, column_contents
from cudf._lib.cpp.types cimport size_type

from .gpumemoryview cimport gpumemoryview
from .types cimport DataType, TypeId
from .types cimport DataType, type_id
from .utils cimport int_to_bitmask_ptr, int_to_void_ptr


Expand Down Expand Up @@ -179,10 +179,11 @@ cdef class Column:
cpdef list children(self):
return self._children


cdef class ListColumnView:
"""Accessor for methods of a Column that are specific to lists."""
def __init__(self, Column col):
if col.type().id() != TypeId.LIST:
if col.type().id() != type_id.LIST:
raise TypeError("Column is not a list type")
self._column = col

Expand Down
20 changes: 2 additions & 18 deletions python/cudf/cudf/_lib/pylibcudf/copying.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -2,30 +2,14 @@

from libcpp cimport bool as cbool

from cudf._lib.cpp cimport copying as cpp_copying
from cudf._lib.cpp.copying cimport out_of_bounds_policy

from .column cimport Column
from .table cimport Table

ctypedef cbool underlying_type_t_out_of_bounds_policy


# Enum representing possible enum policies. This is the Cython representation
# of libcudf's out_of_bounds_policy.
cpdef enum OutOfBoundsPolicy:
NULLIFY = <underlying_type_t_out_of_bounds_policy> cpp_copying.NULLIFY
DONT_CHECK = (
<underlying_type_t_out_of_bounds_policy> cpp_copying.DONT_CHECK
)


cdef cpp_copying.out_of_bounds_policy py_policy_to_c_policy(
OutOfBoundsPolicy py_policy
) nogil


cpdef Table gather(
Table source_table,
Column gather_map,
OutOfBoundsPolicy bounds_policy
out_of_bounds_policy bounds_policy
)
20 changes: 8 additions & 12 deletions python/cudf/cudf/_lib/pylibcudf/copying.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -8,27 +8,23 @@ from libcpp.utility cimport move
# we really want here would be
# cimport libcudf... libcudf.copying.algo(...)
from cudf._lib.cpp cimport copying as cpp_copying
from cudf._lib.cpp.copying cimport out_of_bounds_policy

from cudf._lib.cpp.copying import \
out_of_bounds_policy as OutOfBoundsPolicy # no-cython-lint

from cudf._lib.cpp.table.table cimport table

from .column cimport Column
from .table cimport Table


cdef inline cpp_copying.out_of_bounds_policy py_policy_to_c_policy(
OutOfBoundsPolicy py_policy
) nogil:
"""Convert a Cython policy the corresponding libcudf policy type."""
return <cpp_copying.out_of_bounds_policy> (
<underlying_type_t_out_of_bounds_policy> py_policy
)


# TODO: Is it OK to reference the corresponding libcudf algorithm in the
# documentation? Otherwise there's a lot of room for duplication.
cpdef Table gather(
Table source_table,
Column gather_map,
OutOfBoundsPolicy bounds_policy
out_of_bounds_policy bounds_policy
):
"""Select rows from source_table according to the provided gather_map.

Expand All @@ -40,7 +36,7 @@ cpdef Table gather(
The table object from which to pull data.
gather_map : Column
The list of row indices to pull out of the source table.
bounds_policy : OutOfBoundsPolicy
bounds_policy : out_of_bounds_policy
Controls whether out of bounds indices are checked and nullified in the
output or if indices are assumed to be in bounds.

Expand All @@ -55,7 +51,7 @@ cpdef Table gather(
cpp_copying.gather(
source_table.view(),
gather_map.view(),
py_policy_to_c_policy(bounds_policy)
bounds_policy
)
)
return Table.from_libcudf(move(c_result))
55 changes: 2 additions & 53 deletions python/cudf/cudf/_lib/pylibcudf/types.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -3,64 +3,13 @@
from libc.stdint cimport int32_t
from libcpp cimport bool as cbool

from cudf._lib.cpp.types cimport data_type, interpolation, null_policy, type_id

ctypedef int32_t underlying_type_t_type_id


# Enum representing possible data type ids. This is the Cython representation
# of libcudf's type_id.
cpdef enum TypeId:
EMPTY = <underlying_type_t_type_id> type_id.EMPTY
INT8 = <underlying_type_t_type_id> type_id.INT8
INT16 = <underlying_type_t_type_id> type_id.INT16
INT32 = <underlying_type_t_type_id> type_id.INT32
INT64 = <underlying_type_t_type_id> type_id.INT64
UINT8 = <underlying_type_t_type_id> type_id.UINT8
UINT16 = <underlying_type_t_type_id> type_id.UINT16
UINT32 = <underlying_type_t_type_id> type_id.UINT32
UINT64 = <underlying_type_t_type_id> type_id.UINT64
FLOAT32 = <underlying_type_t_type_id> type_id.FLOAT32
FLOAT64 = <underlying_type_t_type_id> type_id.FLOAT64
BOOL8 = <underlying_type_t_type_id> type_id.BOOL8
TIMESTAMP_DAYS = <underlying_type_t_type_id> type_id.TIMESTAMP_DAYS
TIMESTAMP_SECONDS = <underlying_type_t_type_id> type_id.TIMESTAMP_SECONDS
TIMESTAMP_MILLISECONDS = (
<underlying_type_t_type_id> type_id.TIMESTAMP_MILLISECONDS
)
TIMESTAMP_MICROSECONDS = (
<underlying_type_t_type_id> type_id.TIMESTAMP_MICROSECONDS
)
TIMESTAMP_NANOSECONDS = (
<underlying_type_t_type_id> type_id.TIMESTAMP_NANOSECONDS
)
DICTIONARY32 = <underlying_type_t_type_id> type_id.DICTIONARY32
STRING = <underlying_type_t_type_id> type_id.STRING
LIST = <underlying_type_t_type_id> type_id.LIST
STRUCT = <underlying_type_t_type_id> type_id.STRUCT
NUM_TYPE_IDS = <underlying_type_t_type_id> type_id.NUM_TYPE_IDS
DURATION_SECONDS = <underlying_type_t_type_id> type_id.DURATION_SECONDS
DURATION_MILLISECONDS = (
<underlying_type_t_type_id> type_id.DURATION_MILLISECONDS
)
DURATION_MICROSECONDS = (
<underlying_type_t_type_id> type_id.DURATION_MICROSECONDS
)
DURATION_NANOSECONDS = (
<underlying_type_t_type_id> type_id.DURATION_NANOSECONDS
)
DECIMAL32 = <underlying_type_t_type_id> type_id.DECIMAL32
DECIMAL64 = <underlying_type_t_type_id> type_id.DECIMAL64
DECIMAL128 = <underlying_type_t_type_id> type_id.DECIMAL128


cdef type_id py_type_to_c_type(TypeId py_type_id) nogil
from cudf._lib.cpp.types cimport data_type, type_id


cdef class DataType:
cdef data_type c_obj

cpdef TypeId id(self)
cpdef type_id id(self)
cpdef int32_t scale(self)

@staticmethod
Expand Down
Loading