Migrate column factories to pylibcudf #15257

brandon-b-miller · 2024-03-08T19:05:44Z

This PR implements the column_factories.hpp factories in pylibcudf and migrates the cuDF Cython code to use them. cc @vyasr

vyasr · 2024-03-19T00:19:28Z

FYI @brandon-b-miller we should address the TODO I noted in this commit in this PR.

copy-pr-bot · 2024-04-17T13:36:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wence-

Structure looks good overall. Two things:

It feels like the fused types for empty column construction can be simpler.
It would be great if the datatype conversions didn't rely on quite so many third-party libraries.

python/cudf/cudf/_lib/pylibcudf/column_factories.pyx

wence- · 2024-05-22T14:50:16Z

python/cudf/cudf/_lib/pylibcudf/interop.pyx

+    elif isinstance(pyarrow_object, pa.ListType):
+        return DataType(type_id.LIST)


question: The pylibcudf DataType object doesn't have a concept of nested lists, but the pyarrow datatype does: for pyarrow, the datatype is List(element_type), whereas for pylibcudf it's List. Does that potentially cause problems here?

My understanding is that the "nestedness" is a property of the column and not as much a property of the type, at least from libcudf's perspective.

For the from_arrow case, there's nothing to return except DataType(type_id.LIST). For to_arrow, we have the column to inspect in the non-dtype cases, but not in the dtype-only case. Since we can't figure out exactly what kind of pa.List or pa.Struct to return, I chose to raise an error there.

There simply isn't a 1-1 mapping between pyarrow and pylibcudf data types for nested types because in pylibcudf (read: libcudf) the nested types are encoded in the column as Brandon pointed out. So I don't think there's much else we can do in the from_arrow case.

For the to_arrow case, maybe we should accept a kwarg column that can be inspected in the case of list/struct types. WDYT?
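To make the asymmetry discussed above concrete, here is a minimal sketch in plain Python. Every class and function in it (TypeId, DataType, ArrowList, Column, from_arrow_type, to_arrow_type) is a hypothetical stand-in written for illustration, not the pylibcudf or pyarrow API, and the column model is simplified to a single child per list column. It assumes only what the thread states: the list type tag carries no element type, while the column does.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class TypeId(Enum):
    INT32 = auto()
    INT64 = auto()
    LIST = auto()


@dataclass(frozen=True)
class DataType:
    """Stand-in for a libcudf-style type: a bare tag with no element type."""
    id: TypeId


@dataclass(frozen=True)
class ArrowPrimitive:
    """Stand-in for a non-nested Arrow type."""
    name: str


@dataclass(frozen=True)
class ArrowList:
    """Stand-in for an Arrow list type, which records its element type."""
    element: object  # ArrowPrimitive or another ArrowList


@dataclass
class Column:
    """Simplified column: nesting lives in the children, not in the dtype."""
    dtype: DataType
    children: list = field(default_factory=list)


_PRIMITIVES = {"int32": TypeId.INT32, "int64": TypeId.INT64}
_PRIMITIVE_NAMES = {type_id: name for name, type_id in _PRIMITIVES.items()}


def from_arrow_type(arrow_type) -> DataType:
    if isinstance(arrow_type, ArrowList):
        # The element type is dropped: every list maps to the same tag.
        return DataType(TypeId.LIST)
    return DataType(_PRIMITIVES[arrow_type.name])


def to_arrow_type(dtype: DataType, column: Optional[Column] = None):
    if dtype.id is not TypeId.LIST:
        return ArrowPrimitive(_PRIMITIVE_NAMES[dtype.id])
    if column is None:
        # With only the tag, the element type cannot be recovered.
        raise TypeError("cannot rebuild a list type from the type tag alone")
    child = column.children[0]
    return ArrowList(to_arrow_type(child.dtype, child))
```

In this sketch a list of int32 and a list of lists of int64 both convert to the same DataType, so the reverse direction has to either raise or be handed a column to inspect, which is the optional-column idea floated above.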

Walrus (walri?) abound in my attempt here bffe500

https://www.quora.com/What-is-the-plural-of-%E2%80%98walrus%E2%80%99-Is-it-%E2%80%98walruses%E2%80%99-or-%E2%80%98walri%E2%80%99-considering-that-the-root-word-ends-in-us

😄

wence- · 2024-05-22T14:51:45Z

python/cudf/cudf/_lib/pylibcudf/interop.pyx

+        return DataType(
+            SUPPORTED_NUMPY_TO_LIBCUDF_TYPES.get(
+                np.dtype(pyarrow_object.to_pandas_dtype()))
+            )


issue: So conversion, much wow.

We would like to get to a state where pylibcudf doesn't depend on arrow, numpy, and, in this case, transitively pandas. Can we not introduce that dependency here by explicitly enumerating the handling of all types?

We definitely shouldn't have a pandas dependence in pylibcudf. pyarrow dependence should be isolated to the interop module, which should only be loaded conditionally. numpy dependence I was less worried about, but from a quick search the only usage I currently see in pylibcudf is here, so it seems like we might be able to strip that out fairly painlessly too.
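One common way to keep a dependency optional, sketched here with only the standard library, is to import it on demand and fail with a clear message when it is missing. The helper names optional_import and require are hypothetical and are not taken from pylibcudf; how the interop module is actually loaded is not shown in this thread.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(name: str, feature: str) -> ModuleType:
    """Import an optional dependency, or explain which feature needs it."""
    module = optional_import(name)
    if module is None:
        raise ImportError(
            f"{feature} requires the optional dependency {name!r}"
        )
    return module
```

A conversion function could then call require("pyarrow", "Arrow interop") at its top, so that importing the rest of the package never touches pyarrow.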

This PR adds an extra np.dtype call in column.pyx in place of enumerating. I'm in the same boat as Lawrence and think we should avoid using numpy if it's easy to enumerate. I don't know how painful that will be to maintain for all dtypes though.

b3e934c reverts the earlier stab at things and adds two explicit mappings.
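As a rough illustration of what explicit enumeration looks like, the sketch below writes out two lookup tables and adds a check that they stay mutual inverses, which addresses the maintenance worry raised above. It is not the code from b3e934c: the Arrow side is represented by plain strings so the example runs without pyarrow, the TypeId enum is a local stand-in that merely borrows libcudf's type id names, and the helper names are hypothetical.

```python
from enum import Enum, auto


class TypeId(Enum):
    INT8 = auto()
    INT32 = auto()
    INT64 = auto()
    FLOAT32 = auto()
    FLOAT64 = auto()
    BOOL8 = auto()
    STRING = auto()


# Assumption: Arrow types are keyed by name here purely for illustration.
ARROW_TO_TYPE_ID = {
    "int8": TypeId.INT8,
    "int32": TypeId.INT32,
    "int64": TypeId.INT64,
    "float32": TypeId.FLOAT32,
    "float64": TypeId.FLOAT64,
    "bool": TypeId.BOOL8,
    "string": TypeId.STRING,
}

TYPE_ID_TO_ARROW = {
    TypeId.INT8: "int8",
    TypeId.INT32: "int32",
    TypeId.INT64: "int64",
    TypeId.FLOAT32: "float32",
    TypeId.FLOAT64: "float64",
    TypeId.BOOL8: "bool",
    TypeId.STRING: "string",
}


def type_id_from_arrow_name(name: str) -> TypeId:
    """Look up a type id without going through numpy or pandas."""
    try:
        return ARROW_TO_TYPE_ID[name]
    except KeyError:
        raise TypeError(f"unsupported Arrow type: {name!r}") from None


def mappings_are_inverse() -> bool:
    """Guard against the two hand-written tables drifting apart."""
    forward_ok = all(
        TYPE_ID_TO_ARROW.get(type_id) == name
        for name, type_id in ARROW_TO_TYPE_ID.items()
    )
    return forward_ok and len(ARROW_TO_TYPE_ID) == len(TYPE_ID_TO_ARROW)
```

Running a check like mappings_are_inverse() in the test suite is one way to keep the cost of hand-enumerating every dtype manageable.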

python/cudf/cudf/pylibcudf_tests/test_column_factories.py

python/cudf/cudf/_lib/pylibcudf/interop.pyx

vyasr

Some small things, but generally looks good now. Thanks!

vyasr · 2024-06-03T20:36:06Z

python/cudf/cudf/_lib/pylibcudf/interop.pyx

+    elif isinstance(pyarrow_object, pa.ListType):
+        return DataType(type_id.LIST)


https://www.quora.com/What-is-the-plural-of-%E2%80%98walrus%E2%80%99-Is-it-%E2%80%98walruses%E2%80%99-or-%E2%80%98walri%E2%80%99-considering-that-the-root-word-ends-in-us

😄

python/cudf/cudf/pylibcudf_tests/test_column_factories.py

python/cudf/cudf/pylibcudf_tests/test_interop.py

python/cudf/cudf/pylibcudf_tests/test_column_factories.py

wence-

Tiny niggles, looks great.

python/cudf/cudf/_lib/pylibcudf/interop.pyx

wence- · 2024-06-04T10:35:14Z

/merge

python/cudf/cudf/_lib/pylibcudf/interop.pyx

begin column factorires

70386b8

brandon-b-miller added the feature request, Python, and non-breaking labels Mar 8, 2024

github-actions bot added the CMake label Mar 8, 2024

brandon-b-miller added 2 commits March 8, 2024 11:08

updates

38ca43d

progress

6055670

brandon-b-miller added 8 commits March 19, 2024 05:40

moving things around re: fused types

535c812

compiles

b5c888d

add back the rest of the column factories

4b4d7b3

cleanup

7857097

Merge branch 'branch-24.04' into pylibcudf-column-factories

5b41c2e

add TypeId back in

4f78361

Merge branch 'branch-24.06' into pylibcudf-column-factories

b1bce3e

fix up make_empty_column

ba490b1

github-actions bot added the libcudf, conda, and Java labels Apr 17, 2024

brandon-b-miller changed the base branch from branch-24.04 to branch-24.06 April 17, 2024 13:37

brandon-b-miller removed the conda, Java, and libcudf labels Apr 17, 2024

brandon-b-miller added 4 commits April 17, 2024 09:19

test_make_empty_column

d0eb39f

Merge branch 'branch-24.06' into pylibcudf-column-factories

eec143c

few more tests

d0e1ed5

add more make_numeric_column tests

dfdbc77

brandon-b-miller added 3 commits May 22, 2024 04:49

plumbing, fixes

c4f874f

to_arrow updates

9cadc1c

small test fixes

3a478be

wence- requested changes May 22, 2024

brandon-b-miller and others added 4 commits May 22, 2024 13:42

use explicit mappings

b3e934c

dont validate the values themselves

10b07b8

Update python/cudf/cudf/_lib/pylibcudf/libcudf/column/column_factorie…

1e73dfe

…s.pxd Co-authored-by: Lawrence Mitchell <[email protected]>

listify parameterization

092a2a5

brandon-b-miller requested a review from wence- May 22, 2024 21:29

style

abda755

mroeschke reviewed May 24, 2024

python/cudf/cudf/_lib/pylibcudf/interop.pyx (outdated, resolved)

vyasr added the pylibcudf label May 28, 2024

brandon-b-miller added 2 commits May 30, 2024 06:01

Merge branch 'branch-24.08' into pylibcudf-column-factories

f09afa1

fix up to_arrow for datatype and add some tests

bffe500

brandon-b-miller changed the base branch from branch-24.06 to branch-24.08 May 30, 2024 14:29

brandon-b-miller requested a review from vyasr May 30, 2024 14:31

brandon-b-miller mentioned this pull request May 30, 2024

Ensure literals have correct dtype #15890

Merged

3 tasks

lithomas1 mentioned this pull request Jun 3, 2024

[FEA] Implement all libcudf modules required by cuDF Python in pylibcudf #15162

Closed

vyasr approved these changes Jun 3, 2024

brandon-b-miller and others added 2 commits June 3, 2024 16:14

Apply suggestions from code review

1dfbea4

Co-authored-by: Vyas Ramasubramani <[email protected]>

style fix

498a002

wence- approved these changes Jun 4, 2024

python/cudf/cudf/_lib/pylibcudf/interop.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/interop.pyx (resolved)

python/cudf/cudf/_lib/pylibcudf/interop.pyx (resolved)

wence- added 3 commits June 4, 2024 11:05

Translate date32

8c67671

Minor fixes

dba2ab3

Fix whitespace

463ed02

rapids-bot bot merged commit eb46016 into rapidsai:branch-24.08 Jun 4, 2024
69 checks passed

wence- reviewed Jun 4, 2024

python/cudf/cudf/_lib/pylibcudf/interop.pyx (resolved)

wence- mentioned this pull request Jun 25, 2024

Migrate lists/extract to pylibcudf #16071

Merged

3 tasks

vyasr mentioned this pull request Oct 28, 2024

[FEA] Support more flexible construction of nested columns in pylibcudf #17192

Open
