Correlated subqueries #683

sarahyurick · 2022-08-11T23:23:56Z

Previously, a query like the one below raised "ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.":

from dask_sql import Context
import dask.dataframe as dd
import pandas as pd

c = Context()

names = ["Miracle", "Sunshine", "Pretty woman", "Handsome man", "Barbie", "Cool painting", "Black square #1000", "Mountains"]
prices = [300, 700, 2800, 2300, 250, 5000, 50, 1300]
ids = [11, 12, 13, 14, 15, 16, 17, 18]
artist_id = [1, 1, 2, 2, 3, 3, 3, 4]
paintings = dd.from_pandas(pd.DataFrame({"id": ids, "name": names, "artist_id": artist_id, "listed_price": prices}), npartitions=1)
c.create_table("paintings", paintings)

sql1 = """
SELECT name, listed_price
FROM paintings
WHERE listed_price > (
    SELECT AVG(listed_price)
    FROM paintings
)
"""
c.sql(sql1).compute()

I'm not sure this is the way we should go about it (it may not be generalizable enough), but here is an initial quick fix for that example. The general idea: since we are comparing listed_price to a 1x1 table containing AVG(listed_price), that table has to be converted to a single scalar value by calling compute() and casting the result.
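To illustrate the idea outside of dask_sql, here is a minimal sketch of collapsing a 1x1 aggregate table into a scalar before the comparison. This is not the code in this PR: the helper name scalar_from_1x1 is hypothetical, and plain pandas stands in for Dask so the example runs on its own (the hasattr check marks where a lazy Dask frame would be computed).

```python
import pandas as pd


def scalar_from_1x1(table):
    """Collapse a 1x1 table into a Python float (hypothetical helper)."""
    # Dask collections are lazy and expose compute(); pandas objects do not.
    if hasattr(table, "compute"):
        table = table.compute()
    if table.shape != (1, 1):
        raise ValueError(f"expected a 1x1 table, got shape {table.shape}")
    # Positional access avoids depending on the index label of the single row.
    return float(table.iloc[0, 0])


paintings = pd.DataFrame(
    {
        "name": ["Miracle", "Sunshine", "Pretty woman", "Handsome man",
                 "Barbie", "Cool painting", "Black square #1000", "Mountains"],
        "listed_price": [300, 700, 2800, 2300, 250, 5000, 50, 1300],
    }
)

# Stand-in for the subquery result: a 1x1 frame holding AVG(listed_price).
avg_table = paintings[["listed_price"]].agg(["mean"])

# Compare the column against a scalar instead of against another frame.
threshold = scalar_from_1x1(avg_table)
expensive = paintings[paintings["listed_price"] > threshold]
print(expensive[["name", "listed_price"]])
```

Comparing against a plain float means the two frames never have to be aligned, which is presumably where the divisions error came from.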

sarahyurick · 2022-08-11T23:26:16Z

Fixes example in #320

df = pd.DataFrame({'id': [0, 1, 2], 'name': ['a', 'b', 'c'], 'val': [0, 1, 2]})

c.create_table('test', df)
c.sql("""
select name, val, id from test a
where val >
  (select avg(val) from test)
""").compute()

codecov-commenter · 2022-08-11T23:43:01Z

Codecov Report

❗ No coverage uploaded for pull request base (datafusion-sql-planner@c8259b9).
The diff coverage is n/a.

@@                    Coverage Diff                    @@
##             datafusion-sql-planner     #683   +/-   ##
=========================================================
  Coverage                          ?   66.95%           
=========================================================
  Files                             ?       73           
  Lines                             ?     3640           
  Branches                          ?      753           
=========================================================
  Hits                              ?     2437           
  Misses                            ?     1057           
  Partials                          ?      146

charlesbluca

Thanks for opening @sarahyurick! While running through the issue repro you shared, I notice we also get some DataFusion warnings / errors:

Skipping optimizer rule decorrelate_scalar_subquery due to unexpected error: scalar subqueries must have a filter to be correlated at /home/nfs/charlesb/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/6c32098/datafusion/optimizer/src/decorrelate_scalar_subquery.rs:177
caused by
Error during planning: Could not coerce into Filter! at /home/nfs/charlesb/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/6c32098/datafusion/expr/src/logical_plan/plan.rs:1127

cc @andygrove in case you have some thoughts on this

charlesbluca · 2022-08-25T15:28:53Z

dask_sql/physical/rex/core/call.py

+            except ValueError:
+                return reduce(
+                    partial(self.operation, **kwargs),
+                    (operands[0], float(operands[1][operands[1].columns[0]].loc[0].compute()))


I'd imagine we'd want to generalize the typecast here to handle other potential aggregating functions, but at the moment I can't think of an immediate way to do this.
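One possible direction, sketched under assumptions rather than proposed as the fix: instead of forcing float(), unwrap whatever scalar the 1x1 result holds, so that COUNT stays an integer and MAX over a string column stays a string. The helper name unwrap_scalar is hypothetical, and the example uses plain pandas for illustration; it has not been tried against dask_sql.

```python
import pandas as pd


def unwrap_scalar(table):
    """Return the single value of a 1x1 table without forcing a float cast."""
    # A lazy Dask frame would be materialized here; pandas has no compute().
    if hasattr(table, "compute"):
        table = table.compute()
    value = table.iloc[0, 0]
    # NumPy scalars expose item(), which yields the matching Python type;
    # values such as str are returned unchanged.
    return value.item() if hasattr(value, "item") else value


df = pd.DataFrame({"id": [0, 1, 2], "name": ["a", "b", "c"], "val": [0, 1, 2]})

avg_val = unwrap_scalar(df[["val"]].agg(["mean"]))     # float result
count_val = unwrap_scalar(df[["val"]].agg(["count"]))  # integer result
max_name = unwrap_scalar(df[["name"]].agg(["max"]))    # string result

print(type(avg_val), type(count_val), type(max_name))
```

This still computes the subquery eagerly, so it shares the main drawback of the quick fix above.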

andygrove · 2022-08-25T16:44:59Z

> cc @andygrove in case you have some thoughts on this

I filed an issue against DataFusion to add support for this type of query: apache/datafusion#3266

sarahyurick · 2022-09-08T05:25:13Z

Looks like this was resolved on the DataFusion side with apache/datafusion#3287!

initial fix

bc3cec6

sarahyurick requested review from ayushdg, charlesbluca and galipremsagar as code owners August 11, 2022 23:23

charlesbluca reviewed Aug 25, 2022

charlesbluca mentioned this pull request Aug 25, 2022

[DF] Implement subquery decorrelation optimizer rules #626

Open

sarahyurick closed this Sep 8, 2022

sarahyurick deleted the correlated_subqueries branch September 21, 2022 23:47
