Migrate expressions to pylibcudf #16056

lithomas1 · 2024-06-18T18:50:38Z

Description

xref #15162

Migrates expressions to use pylibcudf.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cudf/cudf/_lib/pylibcudf/expressions.pyx

vyasr

I think we should take a slightly different route here than we currently are with Literals. Instead of having the objects own both a libcudf scalar and a libcudf expression, I think we should have them own a pylibcudf Scalar and a libcudf expression. Right now the only way to construct pylibcudf Scalars is via arrow interop, but the logic that we're using in this file to generate different scalar types is essentially the same as what we would want in a Scalar constructor. Furthermore, once we do that we could replace the branching logic in this PR using a libcudf type_dispatcher-based approach that should make it cleaner and easier to maintain. This separation of concerns would also allow us to transparently support arrow scalars, like you asked. I realize that's a pretty big change from how this PR is currently structured, though, and I'm happy to work with you on that if you'd like!

lithomas1 · 2024-06-26T00:03:34Z

Draft pending resolution of Vyas's comments.

lithomas1 · 2024-07-15T17:28:35Z

@vyasr Let's put this in as is for now?

lithomas1 · 2024-07-15T23:52:06Z

python/cudf/cudf/_lib/pylibcudf/expressions.pxd

-    pass
+    # Hold on to the input expressions so
+    # they don't get gc'ed
+    cdef Expression right


There was previously a use-after-free here (in the old cudf/_lib/expressions.pyx), but I think it went unnoticed because the inputs to Operations hadn't been gc'ed yet.

In general, if the C++ side takes a reference, we need to hold on to the object on the Python side, right?

Yup we do, that's a good catch.

The previous approach was "person constructing the expression must keep everything alive".
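To illustrate the lifetime point in this thread, here is a minimal pure-Python sketch, not the actual Cython. The class names and the `operands_alive` helper are hypothetical, and the `left` attribute is an assumption (the diff above only shows `right`). Weak references stand in for the non-owning references that the C++ side keeps to its operands.

```python
import gc
import weakref


class Expression:
    """Hypothetical stand-in for a node that wraps a C++ expression."""


class Leaf(Expression):
    def __init__(self, name):
        self.name = name


class DanglingOperation(Expression):
    """Mimics the old behaviour: operands are referenced but not owned."""

    def __init__(self, left, right):
        # Weak references model the C++ side, which does not keep
        # the Python operand objects alive.
        self._left = weakref.ref(left)
        self._right = weakref.ref(right)

    def operands_alive(self):
        return self._left() is not None and self._right() is not None


class OwningOperation(DanglingOperation):
    """Mimics the fix: the node also stores strong references."""

    def __init__(self, left, right):
        super().__init__(left, right)
        # Analogous to holding `cdef Expression right` on the node.
        self.left = left
        self.right = right


def build(cls):
    # The operands are temporaries: after this returns, the only
    # references left are the ones `cls` chose to store.
    return cls(Leaf("a"), Leaf("b"))


dangling = build(DanglingOperation)
owning = build(OwningOperation)
gc.collect()

print("dangling operands alive:", dangling.operands_alive())
print("owning operands alive:", owning.operands_alive())
```

In the real extension type the dangling case is not a harmless `None` but a read of freed memory, which is why storing the operands on the node matters.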

vyasr

Let's merge and refactor later.

vyasr · 2024-07-16T01:23:34Z

Yup we do, that's a good catch.

wence-

Ideally, we would cull the numpy requirement and use pylibcudf scalars.

wence- · 2024-07-16T10:45:21Z

The previous approach was "person constructing the expression must keep everything alive".

wence- · 2024-07-16T10:47:45Z

python/cudf/cudf/_lib/pylibcudf/expressions.pyx

@@ -0,0 +1,210 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.
+
+import numpy as np


I would love to drop the runtime numpy dependency. What would it take to use pylibcudf datatypes instead?

wence- · 2024-07-16T10:57:23Z

python/cudf/cudf/_lib/pylibcudf/expressions.pyx

+    value : Union[int, float, str, np.datetime64, np.timedelta64]
+        A scalar value to use.
+    """
+    def __cinit__(self, value):


I think this should accept a value and a pylibcudf DataType; then we don't have to do (dangerous) introspection of the value to determine what scalar we will make.

Alternatively, we could accept a pylibcudf Scalar, and then (?) borrow the reference for our c_scalar?

Either of those changes would also obviate the need to have a runtime dependency on numpy.

In both cases we need to have a dispatch/type_mapping from datatype to concrete scalar type but it's more explicit than introspection.
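As a small illustration of why value introspection is fragile, here is a pure-Python sketch. Both functions and the type-name strings are hypothetical and are not part of pylibcudf; they only show how an `isinstance` chain can silently pick the wrong type, whereas an explicit type argument cannot.

```python
def infer_type_name(value):
    """Hypothetical introspection: guess a type name from a Python value."""
    if isinstance(value, int):
        return "INT64"
    if isinstance(value, bool):
        # Never reached: bool is a subclass of int, so the branch
        # above already claimed every bool.
        return "BOOL8"
    if isinstance(value, float):
        return "FLOAT64"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"cannot infer a type for {type(value).__name__}")


def describe_literal(value, type_name):
    """Hypothetical explicit form: the caller states the type."""
    return (type_name, value)


# The guess is wrong for bools and cannot express narrower integers.
print(infer_type_name(True))
print(infer_type_name(1))

# With an explicit type there is nothing to guess.
print(describe_literal(1, "INT8"))
```

The same ambiguity applies to integer width and to datetime units, which is the information a DataType or Scalar argument carries explicitly.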

Having thought a bit more, we should use a pylibcudf Scalar I think, and then do something like:

    def __cinit__(self, Scalar value):
        self.scalar = value
        cdef data_type typ = value.type()
        cdef type_id tid = typ.id()
        if not (plc.traits.is_numeric(typ) or plc.traits.is_chrono(typ) or tid == STRING):
            raise ...
        if tid == plc.TypeID.INT8:
            self.c_obj = <unique_ptr[scalar]>move(make_unique[literal](<numeric_scalar[int8_t] &>dereference(value.c_obj)))
        elif tid == plc.TypeID.INT16:
            ...

It would be nice if we could pass a type-erased scalar to the ast::literal constructor, but my type-dispatching-fu is not sufficient.
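The dispatch-on-type-id idea can be sketched in plain Python as a lookup table. This is only an illustration under stated assumptions: `TypeId`, `Scalar`, `_LITERAL_BUILDERS`, and `make_literal` are hypothetical stand-ins, and the returned tuples merely name the concrete scalar type that the Cython code would cast to.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TypeId(Enum):
    """Hypothetical subset of type ids."""
    INT8 = auto()
    INT16 = auto()
    FLOAT64 = auto()
    STRING = auto()
    LIST = auto()


@dataclass(frozen=True)
class Scalar:
    """Hypothetical type-erased scalar: a type id plus a value."""
    type_id: TypeId
    value: object


# One entry per supported type id; each builder names the concrete
# scalar type that would back the literal.
_LITERAL_BUILDERS = {
    TypeId.INT8: lambda s: ("numeric_scalar[int8_t]", s.value),
    TypeId.INT16: lambda s: ("numeric_scalar[int16_t]", s.value),
    TypeId.FLOAT64: lambda s: ("numeric_scalar[double]", s.value),
    TypeId.STRING: lambda s: ("string_scalar", s.value),
}


def make_literal(scalar):
    """Build a literal description by dispatching on the scalar's type id."""
    try:
        builder = _LITERAL_BUILDERS[scalar.type_id]
    except KeyError:
        raise TypeError(
            f"unsupported scalar type: {scalar.type_id.name}"
        ) from None
    return builder(scalar)


print(make_literal(Scalar(TypeId.INT8, 1)))
```

Compared with inspecting the Python value, the table makes the supported set explicit, and unsupported types fail in one place.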

wence- · 2024-07-16T14:02:53Z

Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

lithomas1 · 2024-07-16T14:11:48Z

Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

Yeah, I kinda thought about this for a bit, but it's a pretty involved fix.
(but eventually, I would like this to take a pylibcudf Scalar)

Easy way out is:
Just convert everything to a pyarrow scalar and call interop on it (this is what DeviceScalar does). This is probably not good for the goal of getting pyarrow out of the Cython.

I could also try to delete the datetime/timedelta handling code in this PR.
(I don't think that gets used in cuDF Python)

The more correct way is to use the interchange mechanisms on the array object/scalar object we take in which I talked about with Vyas.

IMO, the best way to do this is to use DLPack, so we only have to support one thing.
(Not sure if we need C++ changes here, though.)

wence- · 2024-07-16T14:24:48Z

(but eventually, I would like this to take a pylibcudf Scalar)

Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

Effectively you would just replace the current switch statement that dispatches on numpy-like types with one that dispatches on scalar.type().id(), I think.

What am I missing?

wence- · 2024-07-16T14:25:25Z

I could also try to delete the datetime/timedelta handling code in this PR. (I don't think that gets used in cuDF Python)

This is kind of critical for predicate pushdown in cudf-polars.

lithomas1 · 2024-07-16T14:34:42Z

(but eventually, I would like this to take a pylibcudf Scalar)

Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

I meant creating the Scalar itself.
The Scalar constructor has to accept a lot of possible inputs (e.g. Python scalars, Numpy scalars, Pyarrow scalars, potentially cupy scalars)

I think your proposed approach would work after we fixed the Scalar constructor.
(actually now that I think about it, maybe fixing this isn't a very immediate concern, I thought we'd need pyarrow in the Cython, but we can just force people to do plc.interop.from_arrow in their Python code outside the Cython)

EDIT: I will try the approach taking in a Scalar.

wence- · 2024-07-16T16:07:20Z

I think your proposed approach would work after we fixed the Scalar constructor.

Yeah. From the pylibcudf point of view, at least for the expression API, I think it is not (or should not be) pylibcudf's job to help with the construction of a Scalar in the first place.

wence-

Thanks! Trivial grammar.

python/cudf/cudf/_lib/pylibcudf/expressions.pyx

Co-authored-by: Lawrence Mitchell

lithomas1 · 2024-07-16T20:09:24Z

/merge

Migrate expressions to pylibcudf

c256f1e

lithomas1 added the "feature request" and "non-breaking" labels Jun 18, 2024

github-actions bot added the "Python", "CMake", and "pylibcudf" labels Jun 18, 2024

lithomas1 commented Jun 18, 2024

lithomas1 marked this pull request as ready for review June 18, 2024 18:51

lithomas1 requested a review from a team as a code owner June 18, 2024 18:51

lithomas1 requested review from bdice and Matt711 June 18, 2024 18:51

lithomas1 added 3 commits June 18, 2024 20:50

fix typo in docs

a57b132

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

5bc6917

…f-expr

add to init file

7c0d72d

vyasr requested changes Jun 25, 2024

lithomas1 marked this pull request as draft June 26, 2024 00:03

lithomas1 added 2 commits July 3, 2024 16:26

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

04213d1

…f-expr

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

2bc2fb8

…f-expr

lithomas1 marked this pull request as ready for review July 15, 2024 17:28

lithomas1 requested a review from vyasr July 15, 2024 17:28

lithomas1 mentioned this pull request Jul 15, 2024

Migrate Parquet reader to pylibcudf #16078

Merged

3 tasks

sync parquet changes

0210fd2

lithomas1 commented Jul 15, 2024

vyasr approved these changes Jul 16, 2024

wence- requested changes Jul 16, 2024

Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcud…

6f3ae30

…f-expr

take in scalars instead

649454c

lithomas1 requested a review from wence- July 16, 2024 16:42

wence- approved these changes Jul 16, 2024

lithomas1 and others added 2 commits July 16, 2024 10:48

Apply suggestions from code review

a6f305b

Co-authored-by: Lawrence Mitchell

Merge branch 'branch-24.08' into pylibcudf-expr

2590f6e

rapids-bot bot merged commit 6a954e2 into rapidsai:branch-24.08 Jul 16, 2024
85 checks passed

lithomas1 deleted the pylibcudf-expr branch July 16, 2024 22:32

lithomas1 mentioned this pull request Jul 22, 2024

[FEA] Implement all libcudf modules required by cuDF Python in pylibcudf #15162

Closed
