You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
Currently, queries such as:
fromdask.datasetsimporttimeseriesfromdask_sqlimportContextc=Context()
c.create_table("df", timeseries(), gpu=True)
c.sql("""select * from dfunion allselect * from df where false""")
Will error out with TypeError: cannot concatenate object of type <class 'pandas.core.frame.DataFrame'>. This is because DataFusion simplifies select * from df where false to an EmptyRelation, and our EmptyRelation plugin only has support for CPU:
While this example is trivial, optimizations introduced by DataFusion 14.0.0 in #903 would allow us to simplify to EmptyRelation in less trivial cases, and is currently the cause of several query regressions (#903 (comment)).
Describe the solution you'd like
Handling should be added so plugins that create tables on the fly are aware of if the tables should be CPU or GPU-backed; this is easier said than done, and probably requires some refactoring either of how we parse queries on the DataFusion end or how we enable/disable GPU support on the Python end. Some solutions that come to mind (in order of perceived technical complexity):
have GPU support be enabled/disabled with a kwarg passed into Context at construction; since all plugins have access to their overarching Context object, it would be easy to poll this to see if a CPU or GPU table needs to be made
this would entail removing the gpu kwarg from create_table, and having it set implicitly based on if the context is CPU or GPU enabled; this would also mean we wouldn't be able to have CPU/GPU tables in the same context
have GPU support be enabled/disabled with a flag passed to Context.sql(); this flag could then be passed down through the relational plugins
we could still allow for mixed CPU/GPU tables in a context with this method, though it would then be up to users to enable or disable GPU support per-query if they run into failures
intelligently enable/disable GPU support for on the fly plugins based on if their results will later be used with a GPU table; this would probably require adding a GPU attribute to the Rust representation of a DaskTable, custom implementations of any DF-native on the fly plugins, and an optimizer rule to modify them accordingly
Describe alternatives you've considered
Our current workaround in #903 is to disable the associated optimizer rule causing the regressions; however, I imagine the performance gains associated with this optimization are significant enough that we would want to unblock this.
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem? Please describe.
Currently, queries such as:
Will error out with
TypeError: cannot concatenate object of type <class 'pandas.core.frame.DataFrame'>
. This is because DataFusion simplifiesselect * from df where false
to anEmptyRelation
, and ourEmptyRelation
plugin only has support for CPU:dask-sql/dask_sql/physical/rel/logical/empty.py
Lines 32 to 35 in 9f97cc7
While this example is trivial, optimizations introduced by DataFusion 14.0.0 in #903 would allow us to simplify to
EmptyRelation
in less trivial cases, and is currently the cause of several query regressions (#903 (comment)).Describe the solution you'd like
Handling should be added so plugins that create tables on the fly are aware of if the tables should be CPU or GPU-backed; this is easier said than done, and probably requires some refactoring either of how we parse queries on the DataFusion end or how we enable/disable GPU support on the Python end. Some solutions that come to mind (in order of perceived technical complexity):
Context
at construction; since all plugins have access to their overarchingContext
object, it would be easy to poll this to see if a CPU or GPU table needs to be madegpu
kwarg fromcreate_table
, and having it set implicitly based on if the context is CPU or GPU enabled; this would also mean we wouldn't be able to have CPU/GPU tables in the same contextContext.sql()
; this flag could then be passed down through the relational pluginsDaskTable
, custom implementations of any DF-native on the fly plugins, and an optimizer rule to modify them accordinglyDescribe alternatives you've considered
Our current workaround in #903 is to disable the associated optimizer rule causing the regressions; however, I imagine the performance gains associated with this optimization are significant enough that we would want to unblock this.
The text was updated successfully, but these errors were encountered: