Fix use of row UDFs at intermediate query stages #409
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #409      +/-   ##
==========================================
+ Coverage   88.94%   89.15%   +0.21%
==========================================
  Files          68       68
  Lines        3337     3338       +1
  Branches      657      658       +1
==========================================
+ Hits         2968     2976       +8
+ Misses        297      287      -10
- Partials       72       75       +3
Continue to review full report at Codecov.
Looks good 🙂 just a couple suggestions:
dask_sql/datacontainer.py
df = column_args[0].to_frame()
for col in column_args[1:]:
    df[col.name] = col
for name, col in zip(self.names, column_args):
If we pass the first parameter column name to to_frame, we lose a layer off the resulting HLG and don't have to deal with a superfluous column:
df = column_args[0].to_frame(self.names[0])
for name, col in zip(self.names[1:], column_args[1:]):
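
For context, a minimal sketch of the difference with plain dask (the series contents and the target name "a" are made up for illustration):

import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series([1, 2, 3]), npartitions=1)

# Without an explicit name, to_frame() produces a column that still has to be
# renamed afterwards, which adds another layer to the resulting high-level graph.
df_unnamed = s.to_frame()

# Passing the target name directly gives the desired column label in one step.
df_named = s.to_frame("a")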
Some suggestions for my first review:
LGTM!
As we discussed internally, it would probably be nice to follow this PR up with some work to make function registration more robust so we don't have to assume argument order - I can open up an issue to track that
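
For reference, a row UDF registration currently looks roughly like the sketch below (based on the register_function API from the dask-sql docs; the exact signature, in particular the row_udf flag, is an assumption and may differ between versions). The order of the (name, type) pairs is what has to line up with the argument order at the SQL call site, which is the ordering assumption mentioned above:

import numpy as np
from dask_sql import Context

c = Context()

def my_udf(row):
    # Row UDF: scalars are accessed by the registered column labels.
    return row["a"] + row["b"]

# Assumption: the (name, type) pairs match the order of the arguments
# passed to MY_UDF in the SQL query.
c.register_function(
    my_udf, "MY_UDF", [("a", np.int64), ("b", np.int64)], np.int64, row_udf=True
)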
Row UDFs in dask-sql are analogous to the functions expected by pandas.DataFrame.apply and expect a row of data containing all the scalars in a single row. The UDF accesses these scalars from the row via the corresponding column labels within the function:
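
For illustration, a minimal sketch of such a UDF (the function name and the columns a and b are only examples):

import pandas as pd

def my_udf(row):
    # Scalars are looked up on the row by their column labels.
    return row["a"] + row["b"]

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df.apply(my_udf, axis=1)  # 0 -> 11, 1 -> 22
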
This works fine if the dataframe that ends up calling apply from dask actually contains columns named a and b. However, this is likely only the case for simple queries and fails if any columns are aliased during more complex operations, such as joins.

This PR proposes to solve the problem by retaining the names of the original variables the function was registered with and reapplying them to the data later, just before calling apply.
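
A simplified sketch of that idea (not the actual dask_sql code; the aliased labels and helper variables are invented for illustration):

import pandas as pd
import dask.dataframe as dd

def my_udf(row):
    return row["a"] + row["b"]

# Names the UDF was registered with.
registered_names = ["a", "b"]

# At an intermediate query stage the operands may carry aliased labels, e.g. after a join.
ddf = dd.from_pandas(
    pd.DataFrame({"lhs_x": [1, 2], "rhs_y": [10, 20]}), npartitions=1
)

# Reapply the registered names just before calling apply, so that the label
# lookups inside the UDF resolve regardless of upstream aliasing.
ddf = ddf.rename(columns=dict(zip(ddf.columns, registered_names)))
result = ddf.apply(my_udf, axis=1, meta=(None, "int64"))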