Fix use of row UDFs at intermediate query stages #409
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #409      +/-   ##
==========================================
+ Coverage   88.94%   89.15%   +0.21%
==========================================
  Files          68       68
  Lines        3337     3338       +1
  Branches      657      658       +1
==========================================
+ Hits         2968     2976       +8
+ Misses        297      287      -10
- Partials       72       75       +3
Continue to review full report at Codecov.
Looks good 🙂 just a couple suggestions:
dask_sql/datacontainer.py
df = column_args[0].to_frame()
for col in column_args[1:]:
    df[col.name] = col
for name, col in zip(self.names, column_args):
If we pass the first parameter column name to to_frame, we lose a layer off the resulting HLG and don't have to deal with a superfluous column:
df = column_args[0].to_frame(self.names[0])
for name, col in zip(self.names[1:], column_args[1:]):
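
For context, a minimal sketch of the difference with plain dask (the series contents and the target name "a" are made up for illustration):

import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series([1, 2, 3]), npartitions=1)

# Without an explicit name, to_frame() produces a column that still has to be
# renamed afterwards, which adds another layer to the resulting high-level graph.
df_unnamed = s.to_frame()

# Passing the target name directly gives the desired column label in one step.
df_named = s.to_frame("a")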
Some suggestions for my first review:
LGTM!
As we discussed internally, it would probably be nice to follow this PR up with some work to make function registration more robust so we don't have to assume argument order - I can open up an issue to track that
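
For reference, a row UDF registration currently looks roughly like the sketch below (based on the register_function API from the dask-sql docs; the exact signature, in particular the row_udf flag, is an assumption and may differ between versions). The order of the (name, type) pairs is what has to line up with the argument order at the SQL call site, which is the ordering assumption mentioned above:

import numpy as np
from dask_sql import Context

c = Context()

def my_udf(row):
    # Row UDF: scalars are accessed by the registered column labels.
    return row["a"] + row["b"]

# Assumption: the (name, type) pairs match the order of the arguments
# passed to MY_UDF in the SQL query.
c.register_function(
    my_udf, "MY_UDF", [("a", np.int64), ("b", np.int64)], np.int64, row_udf=True
)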
Row UDFs in dask-sql are analogous to the functions expected by pandas.DataFrame.apply and expect a row of data containing all the scalars in a single row. The UDF accesses these scalars from the row via the corresponding column labels within the function:
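
For illustration, a minimal sketch of such a UDF (the function name and the columns a and b are only examples):

import pandas as pd

def my_udf(row):
    # Scalars are looked up on the row by their column labels.
    return row["a"] + row["b"]

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df.apply(my_udf, axis=1)  # 0 -> 11, 1 -> 22
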
This works fine if the dataframe that ends up calling apply from dask actually contains columns named a and b. However, this is likely only the case for simple queries and fails if any columns are aliased during more complex operations, such as joins.

This PR proposes to solve the problem by retaining the names of the original variables the function was registered with and reapplying them to the data later, just before calling apply.
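
A simplified sketch of that idea (not the actual dask_sql code; the aliased labels and helper variables are invented for illustration):

import pandas as pd
import dask.dataframe as dd

def my_udf(row):
    return row["a"] + row["b"]

# Names the UDF was registered with.
registered_names = ["a", "b"]

# At an intermediate query stage the operands may carry aliased labels, e.g. after a join.
ddf = dd.from_pandas(
    pd.DataFrame({"lhs_x": [1, 2], "rhs_y": [10, 20]}), npartitions=1
)

# Reapply the registered names just before calling apply, so that the label
# lookups inside the UDF resolve regardless of upstream aliasing.
ddf = ddf.rename(columns=dict(zip(ddf.columns, registered_names)))
result = ddf.apply(my_udf, axis=1, meta=(None, "int64"))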