-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate expressions to pylibcudf #16056
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should take a slightly different route here than we currently are with Literals. Instead of having the objects own both a libcudf scalar and a libcudf expression, I think we should have them own a pylibcudf Scalar and a libcudf expression. Right now the only way to construct pylibcudf Scalars is via arrow interop, but the logic that we're using in this file to generate different scalar types is essentially the same as what we would want in a Scalar constructor. Furthermore, once we do that we could replace the branching logic in this PR using a libcudf type_dispatcher
-based approach that should make it cleaner and easier to maintain. This separation of concerns would also allow us to transparently support arrow scalars, like you asked. I realize that's a pretty big change from how this PR is currently structured, though, and I'm happy to work with you on that if you'd like!
Draft pending resolution of Vyas's comments. |
@vyasr Let's put this in as is for now? |
pass | ||
# Hold on to the input expressions so | ||
# they don't get gc'ed | ||
cdef Expression right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a use-after-free here (in the old cudf/_lib/expressions.pyx) previously, but it didn't get caught I think because the inputs to Operations weren't gc'ed yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, if the C++ side takes a reference, we need to hold on to the object on the Python side, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup we do, that's a good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous approach was "person constructing the expression must keep everything alive".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge and refactor later.
pass | ||
# Hold on to the input expressions so | ||
# they don't get gc'ed | ||
cdef Expression right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup we do, that's a good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we would cull the numpy requirement and use pylibcudf scalars.
pass | ||
# Hold on to the input expressions so | ||
# they don't get gc'ed | ||
cdef Expression right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous approach was "person constructing the expression must keep everything alive".
@@ -0,0 +1,210 @@ | |||
# Copyright (c) 2024, NVIDIA CORPORATION. | |||
|
|||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to drop the runtime numpy dependency. What would it take to use pylibcudf datatypes instead?
value : Union[int, float, str, np.datetime64, np.timedelta64] | ||
A scalar value to use. | ||
""" | ||
def __cinit__(self, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should accept a value
and a pylibcudf DataType
then we don't have to do (dangerous) introspection of the value to determine what scalar
we will make.
Alternatively, we could accept a pylibcudf Scalar
, and then (?) borrow the reference for our c_scalar
?
Either of those changes would also obviate the need to have a runtime dependency on numpy.
In both cases we need to have a dispatch/type_mapping from datatype to concrete scalar
type but it's more explicit than introspection.
Having thought a bit more, we should use a pylibcudf Scalar
I think, and then do something like:
def __cinit__(self, Scalar value):
self.scalar = value
cdef data_type typ = value.type()
cdef type_id tid = typ.id()
if not (plc.traits.is_numeric(data_type) or plc.traits.is_chrono(data_type) or tid == STRING):
raise ...
if tid == plc.TypeID.INT8:
self.c_obj = <unique_ptr[scalar]>move(make_unique[literal](<numeric_scalar[int8_t] &>dereference(value.c_obj)))
elif tid == plc.TypeId.INT16:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we could pass a type-erased scalar to the ast::literal
constructor, but my type-dispatching-fu is not sufficient.
Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately? |
Yeah, I kinda thought about this for a bit, but it's a pretty involved fix. Easy way out is: I could also try to delete the datetime/timedelta handling code in this PR. The more correct way is to use the interchange mechanisms on the array object/scalar object we take in which I talked about with Vyas. IMO, the best way to do this is to use DLPack, so we only have to support one thing. |
Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Effectively you would just replace the current switch statement that dispatches on numpy-like types with one that dispatches on What am I missing? |
This is kind of critical for predicate pushdown in cudf-polars. |
I meant creating the Scalar itself. I think your proposed approach would work after we fixed the Scalar constructor. EDIT: I will try the approach taking in a Scalar. |
Yeah, I think from the pylibcudf point of view in some sense, for the expression API, it is not (or should not be) the job of pylibcudf to help with the construction of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Trivial grammar.
Co-authored-by: Lawrence Mitchell <[email protected]>
/merge |
Description
xref #15162
Migrates expresions to use pylibcudf.
Checklist