Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at adding testing for pylibcudf #15300

Merged
merged 49 commits into from
Apr 1, 2024
Merged
Show file tree
Hide file tree
Changes from 47 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
c9a920a
Update developer guide
vyasr Mar 14, 2024
0243360
Add some basic utilities for testing equality
vyasr Mar 14, 2024
2deb817
Add a function to copy a column
vyasr Mar 14, 2024
cf10f9d
Add some tests of copying
vyasr Mar 14, 2024
1a32411
Implement num_rows and num_columns for a table
vyasr Mar 14, 2024
260ec20
Add test of gather and fix a number of minor issues
vyasr Mar 14, 2024
9f2c2ff
Move pylibcudf tests to separate top-level directory
vyasr Mar 14, 2024
c7f86af
Add pytest.ini
vyasr Mar 14, 2024
fb576ed
Test new directory in pytests
vyasr Mar 14, 2024
ad70d26
Fix parameter name
vyasr Mar 14, 2024
a4e677d
Implement __eq__ for DataType
vyasr Mar 14, 2024
aef8395
Fix C++ doc
vyasr Mar 14, 2024
ecf8546
Add remaining tests so that every API is tested at least once
vyasr Mar 14, 2024
6179315
Add missing case
vyasr Mar 14, 2024
678e773
Test exception handling for gather
vyasr Mar 14, 2024
6c67886
Add tests of scatter exception cases
vyasr Mar 14, 2024
02e7d02
Update exceptions thrown by various C++ copying APIs.
vyasr Mar 15, 2024
ff0346a
Update all docs
vyasr Mar 15, 2024
639b486
Add tests of the remaining error cases
vyasr Mar 15, 2024
7b23085
Fix typo
vyasr Mar 15, 2024
fc9b2c7
Revert C++ changes
vyasr Mar 19, 2024
a91c57f
Merge remote-tracking branch 'upstream/branch-24.04' into feat/pylibc…
vyasr Mar 19, 2024
db97c3e
Fix compilation
vyasr Mar 19, 2024
e28333f
Merge remote-tracking branch 'upstream/branch-24.04' into feat/pylibc…
vyasr Mar 20, 2024
1174d09
Switch testing utilities to use the new interop APIs and make sure th…
vyasr Mar 20, 2024
e3a0123
Remove parallelism in CI
vyasr Mar 20, 2024
a93681f
Apply suggestions from code review
vyasr Mar 20, 2024
6381212
Fix style
vyasr Mar 20, 2024
fd87b0a
Merge remote-tracking branch 'upstream/branch-24.06' into feat/pylibc…
vyasr Mar 22, 2024
593753a
Use cudf_raises wrapper to verify that exceptions are coming from lib…
vyasr Mar 22, 2024
3a482a9
Also move all the pyarrow objects to fixtures
vyasr Mar 22, 2024
9b4fd29
Implement scattering properly with arrow
vyasr Mar 22, 2024
49c6e39
Remove hardcoded values from a few more tests
vyasr Mar 22, 2024
b10892e
Also standardize the mask
vyasr Mar 22, 2024
8dcf320
Generalize tests to work for floats
vyasr Mar 22, 2024
911027c
Add tests of strings
vyasr Mar 25, 2024
7cc2310
Add tests of bool
vyasr Mar 25, 2024
45cbc64
Add tests of lists
vyasr Mar 25, 2024
452bf33
Remove skipped tests in favor of simple hardcoded results
vyasr Mar 25, 2024
f0c525a
Add tests of struct
vyasr Mar 25, 2024
62337a2
Centralize the struct type
vyasr Mar 25, 2024
f83bd66
Some cleanup and reorganization of fixtures and helper functions
vyasr Mar 25, 2024
78759d9
Add a note on testing nullable data
vyasr Mar 25, 2024
f94d548
Address PR feedback
vyasr Mar 25, 2024
cd27ab1
Merge remote-tracking branch 'upstream/branch-24.06' into feat/pylibc…
vyasr Mar 25, 2024
3327d85
Some updates to the dev guide
vyasr Mar 25, 2024
469b536
Remove erroneous error
vyasr Mar 25, 2024
fcd4a98
Address feedback
vyasr Mar 26, 2024
be6edb3
Merge branch 'branch-24.06' into feat/pylibcudf_testing
vyasr Mar 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions ci/test_python_cudf.sh
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,14 @@ EXITCODE=0
trap "EXITCODE=1" ERR
set +e

rapids-logger "pytest pylibcudf"
pushd python/cudf/cudf/pylibcudf_tests
python -m pytest \
--cache-clear \
--dist=worksteal \
.
popd

rapids-logger "pytest cudf"
./ci/run_cudf_pytests.sh \
--junitxml="${RAPIDS_TESTS_DIR}/junit-cudf.xml" \
Expand Down
8 changes: 8 additions & 0 deletions ci/test_wheel_cudf.sh
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,14 @@ if [[ "$(arch)" == "aarch64" && ${RAPIDS_BUILD_TYPE} == "pull-request" ]]; then
rapids-logger "Run smoke tests for cudf"
python ./ci/wheel_smoke_test_cudf.py
else
rapids-logger "pytest pylibcudf"
pushd python/cudf/cudf/pylibcudf_tests
python -m pytest \
--cache-clear \
--dist=worksteal \
.
popd

rapids-logger "pytest cudf"
pushd python/cudf/cudf/tests
python -m pytest \
Expand Down
3 changes: 3 additions & 0 deletions cpp/include/cudf/copying.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,8 @@ std::unique_ptr<column> empty_like(scalar const& input);
* If the `mask_alloc` allocates a validity mask that mask is also uninitialized
* and the validity bits and the null count should be set by the caller.
*
* @throws cudf::data_type_error if input type is not of fixed width.
*
* @param input Immutable view of input column to emulate
* @param mask_alloc Optional, Policy for allocating null mask. Defaults to RETAIN
* @param mr Device memory resource used to allocate the returned column's device memory
Expand Down Expand Up @@ -360,6 +362,7 @@ void copy_range_in_place(column_view const& source,
*
* @throws std::out_of_range for any invalid range.
* @throws cudf::data_type_error if @p target and @p source have different types.
* @throws cudf::data_type_error if the data type is not fixed width, string, or dictionary
*
* @param source The column to copy from inside the range
* @param target The column to copy from outside the range
Expand Down
5 changes: 3 additions & 2 deletions cpp/src/copying/copy.cpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -122,7 +122,8 @@ std::unique_ptr<column> allocate_like(column_view const& input,
rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
CUDF_EXPECTS(is_fixed_width(input.type()), "Expects only fixed-width type column");
CUDF_EXPECTS(
is_fixed_width(input.type()), "Expects only fixed-width type column", cudf::data_type_error);
mask_state allocate_mask = should_allocate_mask(mask_alloc, input.nullable());

return std::make_unique<column>(input.type(),
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/copying/copy_range.cu
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ struct out_of_place_copy_range_dispatch {
std::enable_if_t<not cudf::is_rep_layout_compatible<T>(), std::unique_ptr<cudf::column>>
operator()(Args...)
{
CUDF_FAIL("Unsupported type for out of place copy.");
CUDF_FAIL("Unsupported type for out of place copy.", cudf::data_type_error);
}
};

Expand Down
11 changes: 10 additions & 1 deletion cpp/src/copying/scatter.cu
Original file line number Diff line number Diff line change
Expand Up @@ -144,7 +144,9 @@ struct column_scalar_scatterer_impl<string_view, MapIterator> {
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr) const
{
CUDF_EXPECTS(source.get().type() == target.type(), "scalar and column types must match");
CUDF_EXPECTS(source.get().type() == target.type(),
"scalar and column types must match",
cudf::data_type_error);

auto const scalar_impl = static_cast<string_scalar const*>(&source.get());
auto const source_view = string_view(scalar_impl->data(), scalar_impl->size());
Expand All @@ -166,6 +168,9 @@ struct column_scalar_scatterer_impl<list_view, MapIterator> {
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr) const
{
CUDF_EXPECTS(source.get().type() == target.type(),
"scalar and column types must match",
cudf::data_type_error);
auto result =
lists::detail::scatter(source, scatter_iter, scatter_iter + scatter_rows, target, stream, mr);

Expand Down Expand Up @@ -249,6 +254,10 @@ struct column_scalar_scatterer_impl<struct_view, MapIterator> {
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr) const
{
CUDF_EXPECTS(source.get().type() == target.type(),
"scalar and column types must match",
cudf::data_type_error);

// For each field of `source`, copy construct a scalar from the field
// and dispatch to the corresponding scalar scatterer

Expand Down
66 changes: 66 additions & 0 deletions docs/cudf/source/developer_guide/pylibcudf.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,72 @@ There are a couple of notable points from the snippet above:
- The object returned from libcudf is immediately converted to a pylibcudf type.
- `cudf::gather` accepts a `cudf::out_of_bounds_policy` enum parameter. `OutOfBoundsPolicy` is an alias for this type in pylibcudf that matches our Python naming conventions (CapsCase instead of snake\_case).

## Testing

When writing pylibcudf tests, it is important to remember that all the APIs should be tested in the C++ layer in libcudf already.
The primary purpose of pylibcudf tests is to ensure the correctness of the _bindings_; the correctness of the underlying implementation should generally be validated in libcudf.
If pylibcudf tests uncover a libcudf bug, a suitable libcudf test should be added to cover this case rather than relying solely on pylibcudf testing.

pylibcudf's ``conftest.py`` contains some standard parametrized dtype fixture lists that may in turn be used to parametrize other fixtures.
Fixtures allocating data should leverage these dtype lists wherever possible to simplify testing across the matrix of important types.
Where appropriate, new fixture lists may be added.

To run tests as efficiently as possible, the test suite should make generous use of fixtures.
The simplest general structure to follow is for pyarrow array/table/scalar fixtures to be parametrized by one of the dtype list.
Then, a corresponding pylibcudf fixture may be created using a simple `from_arrow` call.
This approach ensures consistent global coverage across types for various tests.

In general, pylibcudf tests should prefer validating against a corresponding pyarrow implementation rather than hardcoding data.
This approach is more resilient to changes to input data, particularly given the fixture strategy outlined above.
Standard tools for comparing between pylibcudf and pyarrow types are provided in the utils module.

Here is an example demonstrating the above points:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pytest
from cudf._lib import pylibcudf as plc
from utils import assert_column_eq

# The pa_dtype fixture is defined in conftest.py.
@pytest.fixture(scope="module")
def pa_column(pa_dtype):
pa.array([1, 2, 3])


@pytest.fixture(scope="module")
def column(pa_column):
return plc.interop.from_arrow(pa_column)


def test_foo(pa_column, column):
index = 1
result = plc.foo(column)
expected = pa.foo(pa_column)

assert_column_eq(result, expected)
```

Some guidelines on what should be tested:
- Tests SHOULD comprehensively cover the API, including all possible combinations of arguments required to ensure good test coverage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a word about testing over a range of types here? The rule I've been following is that if the type is an explicit argument, then all supported types should be tested. However many APIs are supported for a range of types even though the type isn't an argument - scatter works for ints, floats, etc. Do we need to test all of them? Is it okay to just test one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely, we should figure out how we want to treat these. I worry that trying to test the Cartesian product of all types will lead to a prohibitively large number of tests, but maybe that's OK as long as we keep the runtime per test minimal? I know you said you've got thousands in your binops PR and that's taking a couple of minutes. Have you tried following the guidelines on fixtures that I wrote up in this PR? If not, can you see if that makes much of a difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One compromise I've gone with in the past is to test over a matrix of "kind" rather than the full matrix. For instance, {"uint64", "int64", "float64", "string", "bool", "List", "Struct"}. Might save a few orders of magnitude. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems like a reasonable compromise to me. That way we are probably guaranteed to hit every distinct code path.

Another way of thinking about it: what we discussed was that pylibcudf should be focused on testing every code path and ensuring that all the logic that isn't testing in libcudf gets tested. All of the internal bits of pylibcudf involved in things like converting columns from libcudf, for example, need to be tested for every type since there is definitely type-specific logic there around converting different types of buffers etc. However, I don't think we need to test every user-facing API for every type because most APIs don't differ at all for different types in pylibcudf, only in libcudf. IOW something like copying a column is identical for string and float columns in pylibcudf up to differences in libcudf.

Thinking this way does require us to be a more cognizant of possible divergences in pylibcudf though, and it opens up an increased possibility for future error if we create new code branches in the future. A possible compromise would be to pick a couple of modules that we know exercise the internals sufficiently and test them for all types, then for others only test a sufficient subset of types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with basically this compromise. The chosen types are now standardized via a session-scope fixture, which should make it much easier to conform to these decisions across the whole test suite.

- pylibcudf SHOULD NOT attempt to stress test large data sizes, and SHOULD instead defer to libcudf tests.
- Exception: In special cases where constructing suitable large tests is difficult in C++ (such as creating suitable input data for I/O testing), tests may be added to pylibcudf instead.
- Nullable data should always be tested.
- Expected exceptions should be tested. Tests should be written from the user's perspective in mind, and if the API is not currently throwing the appropriate exception it should be updated.
- Important note: If the exception should be produced by libcudf, the underlying libcudf API should be updated to throw the desired exception in C++. Such changes may require consultation with libcudf devs in nontrivial cases. [This issue](https://github.com/rapidsai/cudf/issues/12885) provides an overview and an indication of acceptable exception types that should cover most use cases. In rare cases a new C++ exception may need to be introduced in [`error.hpp`](https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/include/cudf/utilities/error.hpp). If so, this exception will also need to be mapped to a suitable Python exception in [`exception_handler.pxd`](https://github.com/rapidsai/cudf/blob/branch-24.04/python/cudf/cudf/_lib/exception_handler.pxd).

Some guidelines on how best to use pytests.
- By default, fixtures producing device data containers should be of module scope and treated as immutable by tests. Allocating data on the GPU is expensive and slows tests. Almost all pylibcudf operations are out of place operations, so module-scoped fixtures should not typically be problematic to work with. Session-scoped fixtures would also work, but they are harder to reason about since they live in a different module, and if they need to change for any reason they could affect an arbitrarily large number of tests. Module scope is a good balance.
- Where necessary, mutable fixtures should be named as such (e.g. `mutable_col`) and be of function scope. If possible, they can be implemented as simply making a copy of a corresponding module-scope immutable fixture to avoid duplicating the generation logic.

Tests should be organized corresponding to pylibcudf modules, i.e. one test module for each pylibcudf module.

The following sections of the cuDF Python testing guide also generally apply to pylibcudf unless superseded by any statements above:
- [](#test_parametrization)
- [](#xfailing_tests)
- [](#testing_warnings)
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved

## Miscellaneous Notes

### Cython Scoped Enums
Expand Down
6 changes: 6 additions & 0 deletions docs/cudf/source/developer_guide/testing.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,8 @@ Typically, exception cases require specific assertions or other special logic, s
The main exception to this rule is tests based on comparison to pandas.
Such tests may test exceptional cases alongside more typical cases since the logic is generally identical.

(test_parametrization)=

### Parametrization: custom fixtures and `pytest.mark.parametrize`

When it comes to parametrizing tests written with `pytest`,
Expand Down Expand Up @@ -140,6 +142,8 @@ def test_odds():

Other approaches are also possible, and the best solution should be discussed on a case-by-case basis during PR review.

(xfailing_tests)=

### Tests with expected failures (`xfail`s)

In some circumstances it makes sense to mark a test as _expected_ to
Expand Down Expand Up @@ -218,6 +222,8 @@ This way, when the bug is fixed, the test suite will fail at this
point (and we will remember to update the test).


(testing_warnings)=

### Testing code that throws warnings

Some code may be expected to throw warnings.
Expand Down
42 changes: 21 additions & 21 deletions python/cudf/cudf/_lib/cpp/copying.pxd
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2023, NVIDIA CORPORATION.
# Copyright (c) 2020-2024, NVIDIA CORPORATION.

from libc.stdint cimport int32_t, int64_t, uint8_t
from libcpp cimport bool
Expand Down Expand Up @@ -33,19 +33,19 @@ cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:
const column_view& input,
size_type offset,
const scalar& fill_values
) except +
) except +cudf_exception_handler

cdef unique_ptr[table] scatter (
const table_view& source_table,
const column_view& scatter_map,
const table_view& target_table,
) except +
) except +cudf_exception_handler

cdef unique_ptr[table] scatter (
const vector[reference_wrapper[constscalar]]& source_scalars,
const column_view& indices,
const table_view& target,
) except +
) except +cudf_exception_handler

cpdef enum class mask_allocation_policy(int32_t):
NEVER
Expand All @@ -54,99 +54,99 @@ cdef extern from "cudf/copying.hpp" namespace "cudf" nogil:

cdef unique_ptr[column] empty_like (
const column_view& input_column
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] allocate_like (
const column_view& input_column,
mask_allocation_policy policy
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] allocate_like (
const column_view& input_column,
size_type size,
mask_allocation_policy policy
) except +
) except +cudf_exception_handler

cdef unique_ptr[table] empty_like (
const table_view& input_table
) except +
) except +cudf_exception_handler

cdef void copy_range_in_place (
const column_view& input_column,
mutable_column_view& target_column,
size_type input_begin,
size_type input_end,
size_type target_begin
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] copy_range (
const column_view& input_column,
const column_view& target_column,
size_type input_begin,
size_type input_end,
size_type target_begin
) except +
) except +cudf_exception_handler

cdef vector[column_view] slice (
const column_view& input_column,
vector[size_type] indices
) except +
) except +cudf_exception_handler

cdef vector[table_view] slice (
const table_view& input_table,
vector[size_type] indices
) except +
) except +cudf_exception_handler

cdef vector[column_view] split (
const column_view& input_column,
vector[size_type] splits
) except +
) except +cudf_exception_handler

cdef vector[table_view] split (
const table_view& input_table,
vector[size_type] splits
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] copy_if_else (
const column_view& lhs,
const column_view& rhs,
const column_view& boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] copy_if_else (
const scalar& lhs,
const column_view& rhs,
const column_view& boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] copy_if_else (
const column_view& lhs,
const scalar& rhs,
const column_view boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[column] copy_if_else (
const scalar& lhs,
const scalar& rhs,
const column_view boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[table] boolean_mask_scatter (
const table_view& input,
const table_view& target,
const column_view& boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[table] boolean_mask_scatter (
const vector[reference_wrapper[constscalar]]& input,
const table_view& target,
const column_view& boolean_mask
) except +
) except +cudf_exception_handler

cdef unique_ptr[scalar] get_element (
const column_view& input,
size_type index
) except +
) except +cudf_exception_handler

cpdef enum class sample_with_replacement(bool):
FALSE
Expand Down
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/pylibcudf/column.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ cdef class Column:
cpdef gpumemoryview data(self)
cpdef gpumemoryview null_mask(self)
cpdef list children(self)
cpdef Column copy(self)

cpdef ListColumnView list_view(self)

Expand Down
9 changes: 8 additions & 1 deletion python/cudf/cudf/_lib/pylibcudf/column.pyx
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# Copyright (c) 2023-2024, NVIDIA CORPORATION.

from cython.operator cimport dereference
from libcpp.memory cimport unique_ptr
from libcpp.memory cimport make_unique, unique_ptr
from libcpp.utility cimport move

from rmm._lib.device_buffer cimport DeviceBuffer
Expand Down Expand Up @@ -274,6 +274,13 @@ cdef class Column:
"""The children of the column."""
return self._children

cpdef Column copy(self):
"""Create a copy of the column."""
cdef unique_ptr[column] c_result
with nogil:
c_result = move(make_unique[column](self.view()))
return Column.from_libcudf(move(c_result))


cdef class ListColumnView:
"""Accessor for methods of a Column that are specific to lists."""
Expand Down
6 changes: 3 additions & 3 deletions python/cudf/cudf/_lib/pylibcudf/copying.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -58,12 +58,12 @@ cpdef Column copy_range(
size_type target_begin,
)

cpdef Column shift(Column input, size_type offset, Scalar fill_values)

cpdef list split(ColumnOrTable input, list splits)
cpdef Column shift(Column input, size_type offset, Scalar fill_value)

cpdef list slice(ColumnOrTable input, list indices)

cpdef list split(ColumnOrTable input, list splits)

cpdef Column copy_if_else(
LeftCopyIfElseOperand lhs,
RightCopyIfElseOperand rhs,
Expand Down
Loading
Loading