Migrate column factories to pylibcudf #15257

Merged

Contributor

@brandon-b-miller brandon-b-miller commented Mar 8, 2024

This PR implements the column_factories.hpp factories using pylibcudf and migrates the cuDF Cython code to use them. cc @vyasr

@brandon-b-miller brandon-b-miller added the feature request, Python, and non-breaking labels Mar 8, 2024
@github-actions github-actions bot added the CMake label Mar 8, 2024
Contributor

vyasr commented Mar 19, 2024

FYI @brandon-b-miller we should address the TODO I noted in this commit in this PR.

copy-pr-bot bot commented Apr 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf, conda, and Java labels Apr 17, 2024
@brandon-b-miller brandon-b-miller changed the base branch from branch-24.04 to branch-24.06 April 17, 2024 13:37
@brandon-b-miller brandon-b-miller removed the conda, Java, and libcudf labels Apr 17, 2024
Contributor

@wence- wence- left a comment

Structure looks good overall. Two things:

  1. It feels like the fused types for empty column construction can be simpler.
  2. It would be great if the datatype conversions didn't rely on quite so many third-party libraries.

Comment on lines +93 to +94
elif isinstance(pyarrow_object, pa.ListType):
    return DataType(type_id.LIST)
Contributor

question: The pylibcudf DataType object doesn't have a concept of nested lists, but the pyarrow datatype does: for pyarrow, the datatype is List(element_type), whereas for pylibcudf it's List. Does that potentially cause problems here?

Contributor Author

My understanding is that the "nestedness" is a property of the column and not as much a property of the type, at least from libcudf's perspective.

For the from_arrow case, there's nothing to return except for DataType(type_id.LIST). For to_arrow, we have the column to inspect for the non-dtype cases, but not for the dtype-only case. Since we can't figure out exactly what kind of pa.List or pa.Struct to return, I chose to error there.
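
To make the asymmetry concrete, here is a minimal sketch using stdlib stand-ins only. `TypeId`, `DataType`, `ArrowList`, `from_arrow`, and `to_arrow` below are hypothetical names modeled on the discussion, not the actual pylibcudf or pyarrow APIs, and scalar arrow types are represented as plain strings.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TypeId(Enum):
    # Stand-in for the type_id enum (only a few members shown).
    INT32 = auto()
    FLOAT64 = auto()
    LIST = auto()


@dataclass(frozen=True)
class DataType:
    # Stand-in for DataType: a type id with no element type attached.
    id: TypeId


@dataclass(frozen=True)
class ArrowList:
    # Stand-in for an arrow list type, which does carry its element type.
    value_type: object


def from_arrow(arrow_type):
    """Map an arrow-like type to a DataType; nesting information is dropped."""
    if isinstance(arrow_type, ArrowList):
        return DataType(TypeId.LIST)
    # Scalar types are modeled as names such as "int32".
    return DataType(TypeId[arrow_type.upper()])


def to_arrow(dtype):
    """Dtype-only conversion: refuse LIST because the element type is unknown."""
    if dtype.id is TypeId.LIST:
        raise TypeError("cannot infer the element type of a LIST from the dtype alone")
    return dtype.id.name.lower()


# Two different list types collapse to the same DataType.
assert from_arrow(ArrowList("int32")) == from_arrow(ArrowList(ArrowList("float64")))
```

Because the forward direction discards the element type, the reverse direction has nothing to rebuild it from, which is why raising looks like the honest option in the dtype-only case.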

Contributor

There simply isn't a 1-1 mapping between pyarrow and pylibcudf data types for nested types because in pylibcudf (read: libcudf) the nested types are encoded in the column as Brandon pointed out. So I don't think there's much else we can do in the from_arrow case.

For the to_arrow case, maybe we should accept a kwarg column that can be inspected in the case of list/struct types. WDYT?
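
One way the optional column argument could work, sketched with stdlib stand-ins. `Column`, `TypeId`, and `to_arrow_type` are hypothetical, the result is an arrow-style type string rather than a real pyarrow type, and the model is simplified so that a list column's only child is its element column.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class TypeId(Enum):
    INT32 = auto()
    STRING = auto()
    LIST = auto()


@dataclass
class Column:
    # Hypothetical stand-in: a column knows its type id and its child columns.
    type_id: TypeId
    children: List["Column"] = field(default_factory=list)


def to_arrow_type(type_id: TypeId, column: Optional[Column] = None) -> str:
    """Return an arrow-style type string, inspecting `column` for list types."""
    if type_id is not TypeId.LIST:
        return type_id.name.lower()
    if column is None:
        raise TypeError("a column is required to resolve a LIST type")
    # Simplifying assumption: the last child holds the list elements.
    element = column.children[-1]
    return f"list<{to_arrow_type(element.type_id, element)}>"
```

Flat types ignore the column entirely, so existing dtype-only callers would keep working, while nested types recurse through the children until they reach a leaf.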

Contributor Author

Walrus (walri?) abound in my attempt here: bffe500

Comment on lines 96 to 99
return DataType(
    SUPPORTED_NUMPY_TO_LIBCUDF_TYPES.get(
        np.dtype(pyarrow_object.to_pandas_dtype()))
)
Contributor

issue: So conversion, much wow.

We would like to get to a state where pylibcudf doesn't depend on arrow, numpy, and, in this case, transitively pandas. Can we not introduce that dependency here by explicitly enumerating the handling of all types?

Contributor

@vyasr vyasr May 22, 2024

We definitely shouldn't have a pandas dependence in pylibcudf. pyarrow dependence should be isolated to the interop module, which should only be loaded conditionally. numpy dependence I was less worried about, but from a quick search the only usage I currently see in pylibcudf is here, so it seems like we might be able to strip that out fairly painlessly too.

This PR adds an extra np.dtype call in column.pyx in place of enumerating. I'm in the same boat as Lawrence and think we should avoid using numpy if it's easy to enumerate. I don't know how painful that will be to maintain for all dtypes though.

Contributor Author

b3e934c reverts the earlier stab at things and adds two explicit mappings.
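
The contents of the two mappings are not shown in this thread. As a rough sketch of the enumerated approach, assuming a forward table plus a derived inverse, with type-name strings standing in for pyarrow type objects and a hypothetical `TypeId` enum:

```python
from enum import Enum, auto


class TypeId(Enum):
    # Stand-in for the type_id enum (only a few members shown).
    INT8 = auto()
    INT32 = auto()
    INT64 = auto()
    FLOAT32 = auto()
    FLOAT64 = auto()
    BOOL8 = auto()


# Forward table: every supported type is listed explicitly, so no numpy or
# pandas round trip is needed to find the matching type id.
ARROW_TO_TYPE_ID = {
    "int8": TypeId.INT8,
    "int32": TypeId.INT32,
    "int64": TypeId.INT64,
    "float": TypeId.FLOAT32,
    "double": TypeId.FLOAT64,
    "bool": TypeId.BOOL8,
}

# Inverse table derived from the forward one, so the two cannot drift apart.
TYPE_ID_TO_ARROW = {type_id: name for name, type_id in ARROW_TO_TYPE_ID.items()}


def type_id_from_arrow_name(name):
    """Look up a type id, failing loudly for anything not enumerated."""
    try:
        return ARROW_TO_TYPE_ID[name]
    except KeyError:
        raise TypeError(f"unsupported arrow type: {name}") from None
```

The maintenance cost raised above shows up here as one dictionary entry per supported type; in exchange, an unsupported type fails with a clear error instead of silently mapping to `None` the way a bare `.get()` would.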

@brandon-b-miller brandon-b-miller requested a review from wence- May 22, 2024 21:29
@vyasr vyasr added the pylibcudf label May 28, 2024
Contributor

@vyasr vyasr left a comment

Some small things, but generally looks good now. Thanks!

Contributor

@wence- wence- left a comment

Tiny niggles, looks great.

Contributor

wence- commented Jun 4, 2024

/merge

@rapids-bot rapids-bot bot merged commit eb46016 into rapidsai:branch-24.08 Jun 4, 2024
69 checks passed
Labels
3 - Ready for Review: Ready for review by team
CMake: CMake build issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
pylibcudf: Issues specific to the pylibcudf package
Python: Affects Python cuDF API.
Projects
Status: Done
6 participants