[ENH] Add GPU support to plugins that create tables on the fly #914

Open
charlesbluca opened this issue Nov 14, 2022 · 0 comments
Labels: enhancement (New feature or request), needs triage (Awaiting triage by a dask-sql maintainer)

Comments

@charlesbluca
Collaborator

Is your feature request related to a problem? Please describe.
Currently, queries such as:

from dask.datasets import timeseries
from dask_sql import Context

c = Context()
c.create_table("df", timeseries(), gpu=True)

c.sql("""
select * from df
union all
select * from df where false
""")

will error out with TypeError: cannot concatenate object of type <class 'pandas.core.frame.DataFrame'>. This is because DataFusion simplifies select * from df where false to an EmptyRelation, and our EmptyRelation plugin only supports CPU, so it always returns a pandas-backed table regardless of how df was created:

return DataContainer(
    dd.from_pandas(pd.DataFrame(data, columns=col_names), npartitions=1),
    ColumnContainer(col_names),
)

While this example is trivial, optimizations introduced by DataFusion 14.0.0 in #903 allow queries to be simplified to EmptyRelation in less trivial cases, and this is currently the cause of several query regressions (#903 (comment)).
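To make the gap concrete, here is a minimal sketch of what a backend-aware version of that frame construction might look like. The names backend_for and empty_relation_frame are hypothetical and not part of dask-sql, the DataContainer/ColumnContainer wrapping is left out, and the GPU branch assumes cudf and dask_cudf are installed. How the plugin would learn the value of gpu is exactly the open question of this issue.

```python
def backend_for(gpu: bool) -> str:
    # Hypothetical helper: name of the dataframe library an on-the-fly
    # table should be built with.
    return "cudf" if gpu else "pandas"


def empty_relation_frame(data, col_names, gpu=False):
    """Build the single-partition frame for an EmptyRelation (sketch only)."""
    # Imports are local so the module loads without GPU libraries present.
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame(data, columns=col_names)
    if backend_for(gpu) == "cudf":
        # Assumption: a GPU environment with cudf and dask_cudf available.
        import cudf
        import dask_cudf

        return dask_cudf.from_cudf(cudf.from_pandas(pdf), npartitions=1)
    # CPU path: same construction the current plugin uses.
    return dd.from_pandas(pdf, npartitions=1)
```

With something along these lines, both sides of the union all above would be cudf-backed when df is a GPU table, so the concatenation would no longer mix frame types.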

Describe the solution you'd like
Plugins that create tables on the fly should know whether those tables need to be CPU- or GPU-backed. This is easier said than done, and probably requires refactoring either how we parse queries on the DataFusion end or how we enable/disable GPU support on the Python end. Some solutions that come to mind (in order of perceived technical complexity):

  • have GPU support be enabled/disabled with a kwarg passed into Context at construction; since all plugins have access to their overarching Context object, it would be easy to poll this to see whether a CPU or GPU table needs to be made
    • this would entail removing the gpu kwarg from create_table and setting it implicitly based on whether the context is CPU or GPU enabled; this would also mean we couldn't have both CPU and GPU tables in the same context
  • have GPU support be enabled/disabled with a flag passed to Context.sql(); this flag could then be passed down through the relational plugins
    • we could still allow mixed CPU/GPU tables in a context with this method, though it would then be up to users to enable or disable GPU support per query if they run into failures
  • intelligently enable/disable GPU support for on-the-fly plugins based on whether their results will later be used with a GPU table; this would probably require adding a GPU attribute to the Rust representation of a DaskTable, custom implementations of any DF-native on-the-fly plugins, and an optimizer rule to modify them accordingly
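The three options differ mainly in where the CPU/GPU decision comes from. The sketch below puts them side by side using stand-in classes; SketchContext, SketchTable, and resolve_gpu are hypothetical names for illustration and do not exist in dask-sql. The third option is reduced here to a plain Python check on the input tables, whereas the proposal above anticipates doing that work in the Rust planner via an optimizer rule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SketchTable:
    """Hypothetical stand-in for a registered table."""
    name: str
    gpu: bool = False


@dataclass
class SketchContext:
    """Hypothetical stand-in for dask_sql.Context (not the real class)."""
    gpu: bool = False  # option 1: backend fixed when the context is built

    def resolve_gpu(
        self,
        query_gpu: Optional[bool] = None,
        inputs: Iterable[SketchTable] = (),
    ) -> bool:
        # Option 2: an explicit per-query flag takes precedence.
        if query_gpu is not None:
            return query_gpu
        # Option 3 (simplified): follow the tables the result is combined with.
        if any(table.gpu for table in inputs):
            return True
        # Otherwise fall back to the context-wide default.
        return self.gpu


ctx = SketchContext()
df = SketchTable("df", gpu=True)
print(ctx.resolve_gpu())                # context default
print(ctx.resolve_gpu(query_gpu=True))  # per-query override
print(ctx.resolve_gpu(inputs=[df]))     # inferred from a GPU input table
```

An on-the-fly plugin would then ask the context for this single boolean before building its table. The precedence order shown (query flag, then inputs, then context default) is an assumption made for the sketch, not something settled in this issue.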

Describe alternatives you've considered
Our current workaround in #903 is to disable the associated optimizer rule causing the regressions; however, I imagine the performance gains associated with this optimization are significant enough that we would want to unblock this.

@charlesbluca charlesbluca added enhancement New feature or request needs triage Awaiting triage by a dask-sql maintainer labels Nov 14, 2022