Migrate expressions to pylibcudf #16056

Merged: 11 commits, Jul 16, 2024

@lithomas1 lithomas1 commented Jun 18, 2024

Description

xref #15162

Migrates expressions to use pylibcudf.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lithomas1 lithomas1 added the "feature request" (New feature or request) and "non-breaking" (Non-breaking change) labels Jun 18, 2024
@github-actions github-actions bot added the "Python" (Affects Python cuDF API.), "CMake" (CMake build issue), and "pylibcudf" (Issues specific to the pylibcudf package) labels Jun 18, 2024
@lithomas1 lithomas1 marked this pull request as ready for review June 18, 2024 18:51
@lithomas1 lithomas1 requested a review from a team as a code owner June 18, 2024 18:51
@lithomas1 lithomas1 requested review from bdice and Matt711 June 18, 2024 18:51

@vyasr vyasr left a comment

I think we should take a slightly different route here than we currently are with Literals. Instead of having the objects own both a libcudf scalar and a libcudf expression, I think we should have them own a pylibcudf Scalar and a libcudf expression. Right now the only way to construct pylibcudf Scalars is via arrow interop, but the logic that we're using in this file to generate different scalar types is essentially the same as what we would want in a Scalar constructor. Furthermore, once we do that we could replace the branching logic in this PR using a libcudf type_dispatcher-based approach that should make it cleaner and easier to maintain. This separation of concerns would also allow us to transparently support arrow scalars, like you asked. I realize that's a pretty big change from how this PR is currently structured, though, and I'm happy to work with you on that if you'd like!

@lithomas1 lithomas1 marked this pull request as draft June 26, 2024 00:03
@lithomas1

Draft pending resolution of Vyas's comments.

@lithomas1 lithomas1 marked this pull request as ready for review July 15, 2024 17:28
@lithomas1 lithomas1 requested a review from vyasr July 15, 2024 17:28
@lithomas1

@vyasr Let's put this in as is for now?

pass
# Hold on to the input expressions so
# they don't get gc'ed
cdef Expression right

@lithomas1 lithomas1 Jul 15, 2024

There was a use-after-free here previously (in the old cudf/_lib/expressions.pyx), but I think it wasn't caught because the inputs to Operations hadn't been gc'ed yet.

Contributor Author

In general, if the C++ side takes a reference, we need to hold on to the object on the Python side, right?

Contributor

Yup we do, that's a good catch.

Contributor

The previous approach was "person constructing the expression must keep everything alive".
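
The lifetime concern in this thread can be illustrated with a small pure-Python analogy. This is only a sketch: `Expr`, `BorrowingOp`, `OwningOp`, and `build` are hypothetical names, and weak references stand in for a C++ object that holds a bare reference to its operands. It is not the code from this PR.

```python
import gc
import weakref


class Expr:
    """Hypothetical stand-in for a wrapper around a C++ expression."""


class BorrowingOp:
    """Mimics the old behaviour: records operands without owning them."""

    def __init__(self, left, right):
        # Weak references only, analogous to C++ holding a bare reference.
        self._left = weakref.ref(left)
        self._right = weakref.ref(right)

    def operands_alive(self):
        return self._left() is not None and self._right() is not None


class OwningOp(BorrowingOp):
    """Mimics the fix: additionally keep strong references to the operands."""

    def __init__(self, left, right):
        super().__init__(left, right)
        # Hold on to the inputs so they cannot be collected while we exist.
        self.left = left
        self.right = right


def build(op_type):
    # The operands are temporaries; only the returned operation survives.
    return op_type(Expr(), Expr())


borrowing = build(BorrowingOp)
owning = build(OwningOp)
gc.collect()
```

With the borrowing variant, the caller has to keep every operand alive, which matches the previous "person constructing the expression must keep everything alive" contract. The owning variant moves that responsibility into the object itself.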

@vyasr vyasr left a comment

Let's merge and refactor later.

@wence- wence- left a comment

Ideally, we would cull the numpy requirement and use pylibcudf scalars.

@@ -0,0 +1,210 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

import numpy as np
Contributor

I would love to drop the runtime numpy dependency. What would it take to use pylibcudf datatypes instead?

value : Union[int, float, str, np.datetime64, np.timedelta64]
A scalar value to use.
"""
def __cinit__(self, value):
Contributor

I think this should accept a value and a pylibcudf DataType, so that we don't have to do (dangerous) introspection of the value to determine what scalar we will make.

Alternatively, we could accept a pylibcudf Scalar, and then (?) borrow the reference for our c_scalar?

Either of those changes would also obviate the need to have a runtime dependency on numpy.

In both cases we need to have a dispatch/type_mapping from datatype to concrete scalar type but it's more explicit than introspection.

Having thought a bit more, we should use a pylibcudf Scalar I think, and then do something like:

def __cinit__(self, Scalar value):
    self.scalar = value
    cdef data_type typ = value.type()
    cdef type_id tid = typ.id()
    if not (plc.traits.is_numeric(typ) or plc.traits.is_chrono(typ) or tid == STRING):
        raise ...
    if tid == plc.TypeId.INT8:
        # literal is an expression, so the owning pointer is an expression pointer
        self.c_obj = <unique_ptr[expression]>move(make_unique[literal](<numeric_scalar[int8_t] &>dereference(value.c_obj)))
    elif tid == plc.TypeId.INT16:
        ...
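
To illustrate why introspecting the value is called dangerous in this comment, here is a minimal pure-Python sketch. The `TypeId` enum, `infer_type_id`, and `typed_literal` are hypothetical names used for illustration only; they are not pylibcudf APIs.

```python
from enum import Enum, auto


class TypeId(Enum):
    """Hypothetical subset of type ids, for illustration only."""

    BOOL8 = auto()
    INT8 = auto()
    INT64 = auto()
    FLOAT64 = auto()


def infer_type_id(value):
    """Guess a type id from the Python value alone (the fragile approach)."""
    # Order matters: bool is a subclass of int, so checking int first
    # would silently classify True as an integer.
    if isinstance(value, bool):
        return TypeId.BOOL8
    if isinstance(value, int):
        # A Python int carries no width, so a default has to be picked.
        return TypeId.INT64
    if isinstance(value, float):
        return TypeId.FLOAT64
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def typed_literal(value, type_id):
    """Pair a value with a caller-supplied type id (the explicit approach)."""
    if not isinstance(type_id, TypeId):
        raise TypeError("type_id must be a TypeId")
    return (type_id, value)
```

In this sketch the inferred path can only ever produce one integer width, whereas the explicit path lets the caller ask for `TypeId.INT8` directly. That is the trade-off behind passing a DataType or a ready-made Scalar.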

Contributor

It would be nice if we could pass a type-erased scalar to the ast::literal constructor, but my type-dispatching-fu is not sufficient.

@wence-
wence- commented Jul 16, 2024

Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

@lithomas1

> Ah, I realise I had the same high-level issue as @vyasr, it seems like you'd decided to do that separately?

Yeah, I kinda thought about this for a bit, but it's a pretty involved fix.
(but eventually, I would like this to take a pylibcudf Scalar)

Easy way out is:
Just convert everything to a pyarrow scalar and call interop on it (this is what DeviceScalar does). This is probably not good for the goal of getting pyarrow out of the Cython.

I could also try to delete the datetime/timedelta handling code in this PR.
(I don't think that gets used in cuDF Python)

The more correct way is to use the interchange mechanisms on the array object/scalar object we take in which I talked about with Vyas.

IMO, the best way to do this is to use DLPack, so we only have to support one thing.
(Not sure if we need C++ changes here, though.)

@wence-
wence- commented Jul 16, 2024

> (but eventually, I would like this to take a pylibcudf Scalar)

Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

Effectively you would just replace the current switch statement that dispatches on numpy-like types with one that dispatches on scalar.type().id(), I think.

What am I missing?
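
The "big switch statement" described in this comment can be sketched in plain Python as a lookup table keyed on the scalar's type id. Everything here is a hypothetical illustration: `TypeId`, `Scalar`, `_LITERAL_BUILDERS`, and `make_literal` are invented names, and the returned tuples merely label which concrete scalar type a real implementation would cast to.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TypeId(Enum):
    """Hypothetical subset of type ids, for illustration only."""

    INT8 = auto()
    INT16 = auto()
    STRING = auto()
    LIST = auto()


@dataclass(frozen=True)
class Scalar:
    """Hypothetical type-erased scalar: a type id plus an opaque payload."""

    type_id: TypeId
    payload: object


# One entry per supported type id; each builder names the concrete
# scalar type that the type-erased object would be cast to.
_LITERAL_BUILDERS = {
    TypeId.INT8: lambda s: ("numeric_scalar[int8_t]", s.payload),
    TypeId.INT16: lambda s: ("numeric_scalar[int16_t]", s.payload),
    TypeId.STRING: lambda s: ("string_scalar", s.payload),
}


def make_literal(scalar):
    """Dispatch on the scalar's type id instead of inspecting its value."""
    try:
        builder = _LITERAL_BUILDERS[scalar.type_id]
    except KeyError:
        raise TypeError(f"no literal for {scalar.type_id.name}") from None
    return builder(scalar)
```

Unsupported type ids fail loudly with a `TypeError` in this sketch, which mirrors the `raise ...` branch in the earlier Cython outline.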

@wence-
wence- commented Jul 16, 2024

> I could also try to delete the datetime/timedelta handling code in this PR. (I don't think that gets used in cuDF Python)

This is kind of critical for predicate pushdown in cudf-polars.

@lithomas1
lithomas1 commented Jul 16, 2024

> > (but eventually, I would like this to take a pylibcudf Scalar)
>
> Why is this one so hard? I think you just need to have a big switch statement in the constructor that takes the type-erased Scalar object and constructs (via casting) the correct numeric_scalar/timestamp_scalar/duration_scalar.

I meant creating the Scalar itself.
The Scalar constructor has to accept a lot of possible inputs (e.g. Python scalars, Numpy scalars, Pyarrow scalars, potentially cupy scalars)

I think your proposed approach would work after we fixed the Scalar constructor.
(actually now that I think about it, maybe fixing this isn't a very immediate concern, I thought we'd need pyarrow in the Cython, but we can just force people to do plc.interop.from_arrow in their Python code outside the Cython)

EDIT: I will try the approach taking in a Scalar.

@wence-
wence- commented Jul 16, 2024

> I think your proposed approach would work after we fixed the Scalar constructor.

Yeah, I think from the pylibcudf point of view in some sense, for the expression API, it is not (or should not be) the job of pylibcudf to help with the construction of a Scalar in the first place.

@lithomas1 lithomas1 requested a review from wence- July 16, 2024 16:42

@wence- wence- left a comment

Thanks! Trivial grammar.

@lithomas1

/merge

@rapids-bot rapids-bot bot merged commit 6a954e2 into rapidsai:branch-24.08 Jul 16, 2024
85 checks passed
@lithomas1 lithomas1 deleted the pylibcudf-expr branch July 16, 2024 22:32