-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement DataFrame.eval using libcudf ASTs #8022
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #8022 +/- ##
================================================
+ Coverage 86.35% 86.42% +0.07%
================================================
Files 142 143 +1
Lines 22335 22438 +103
================================================
+ Hits 19287 19393 +106
+ Misses 3048 3045 -3
Continue to review full report at Codecov.
|
Moving to 21.10 |
368a4b5
to
997a746
Compare
eed923f
to
6ff2d3c
Compare
6ff2d3c
to
44ef2a5
Compare
I have some thoughts about the mixed dtype issue, but they're probably a little out of scope of this PR. Let's follow up when you get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have several comments that I would like to see addressed. For now, I am requesting changes. If you would like to punt all these comments/fixes to a future PR, I will approve and let you do that separately so that this large chunk of changes doesn't get stale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments but this looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all my comments were addressed here.
@gpucibot merge |
This PR exposes
libcudf
's expression parsing functionality incudf
and uses it to implementDataFrame.eval
. The implementation is mostly feature-complete, but there are a few limitations relative to thepandas
API and a couple of gotchas around type casting. The implementation is reasonably performant, improving upon an equivalentdf.apply
even accounting for JIT-compilation overhead. This implementation provides a stepping stone to leveraginglibcudf
's AST implementation for more complex tasks incudf
such as conditional joins.The most significant issue with the current implementation is the lack of casting between integral types, meaning that operations can only be performed between columns of the exact same dtype. For example, operations between int8 and int16 would fail. This becomes particularly problematic for constants e.g.
df.eval('x+1')
. The best paths to improve this are at the C++ level of the expression evaluation, so I think we'll have to live with this limitation for now if we want to move forward.Resolves #9112