[WIP] Adding pure SQL GPU-BDB Queries #235

Merged
merged 4 commits into from
Feb 15, 2022

DaceT
Contributor

@DaceT DaceT commented Feb 9, 2022

Fixes #230

All of the queries (01, 06, 07, 09, 11, 12, 13, 14, 15, 16, 17, 20, 21, 23, 24, 29) run successfully; the one exception is q22. Once that is up and running, I'll commit the changes here.

Member

@VibhuJawa VibhuJawa left a comment

Thanks for working on this, @DaceT.

I don't think we should rely on os.environ when we are running these queries. We should read the setting from a config file and then check based on the dataframe passed.

A big problem is readability: with import pandas as cudf, a traceback that claims to come from cuDF but actually comes from pandas (or vice versa) will be extremely hard to reason about.

Also, optional imports driven by environment variables look fragile to me, especially for code that gets bundled up as a package and then used by dask. I think it can cause a lot of problems downstream.
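One way to "check based on the dataframe passed" is to look at the package that defines the object's type instead of consulting an environment variable. The sketch below is only an illustration: the helper names backend_name and backend_module are hypothetical and do not come from this PR.

```python
import importlib


def backend_name(df):
    """Return the top-level package that defines df's type.

    For a pandas.DataFrame this is 'pandas'; for a cudf.DataFrame it is 'cudf'.
    """
    return type(df).__module__.split(".")[0]


def backend_module(df):
    """Import and return the package that df's type comes from.

    Hypothetical helper: lets a query build new objects with the same
    library as its input, without aliasing one library under another's name.
    """
    return importlib.import_module(backend_name(df))
```

A query could then write xdf = backend_module(df) and call xdf.DataFrame(...), so a traceback always names the library that actually ran. Note that dask collections would resolve to 'dask' or 'dask_cudf' rather than to the underlying dataframe library.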

gpu_bdb/bdb_tools/q20_utils.py (outdated)
Comment on lines 102 to 105
if os.getenv("CPU_ONLY") == 'True':
    import dask.dataframe as dask_cudf
else:
    import dask_cudf
Member

@VibhuJawa VibhuJawa Feb 9, 2022

I would have us pass a keyword argument and extend this class.

def __init__(
    self, basepath, split_row_groups=False,
):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups

So something like:

def __init__(self, basepath, split_row_groups=False, back_end='GPU'):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups
    if back_end == 'CPU':
        self.back_end = dask_cudf
    else:
        self.back_end = dask.dataframe

And call it like:

self.back_end.read_parquet

Contributor Author

Wouldn't we want the opposite, or did you mean to put 'GPU' instead of 'CPU', @VibhuJawa?

if back_end == 'CPU':
    self.back_end = dask.dataframe
else:
    self.back_end = dask_cudf

Member

Yup, exactly that.
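Putting the agreed version together, a minimal sketch of the reader could look like the following. The class name ParquetReader, the read method, the lazy import, and the two-entry TABLE_NAMES list are assumptions made for illustration; only the constructor arguments and the CPU/GPU mapping come from the discussion above.

```python
import importlib
import os

# Placeholder subset; the project defines the real TABLE_NAMES list.
TABLE_NAMES = ["store_sales", "item"]


class ParquetReader:
    """Hypothetical reader that picks its dask dataframe module from `backend`."""

    _BACKEND_MODULES = {"GPU": "dask_cudf", "CPU": "dask.dataframe"}

    def __init__(self, basepath, split_row_groups=False, backend="GPU"):
        if backend not in self._BACKEND_MODULES:
            raise ValueError(f"backend must be 'GPU' or 'CPU', got {backend!r}")
        self.table_path_mapping = {
            table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
        }
        self.split_row_groups = split_row_groups
        self.backend_name = self._BACKEND_MODULES[backend]
        self._back_end = None  # imported lazily on first use

    @property
    def back_end(self):
        # Import only the requested module, so a CPU-only environment
        # never needs dask_cudf to be installed.
        if self._back_end is None:
            self._back_end = importlib.import_module(self.backend_name)
        return self._back_end

    def read(self, table, **kwargs):
        # Both dask.dataframe and dask_cudf expose read_parquet.
        return self.back_end.read_parquet(
            self.table_path_mapping[table],
            split_row_groups=self.split_row_groups,
            **kwargs,
        )
```

Callers would then write ParquetReader(basepath, backend='CPU').read('item'), and the choice of library is visible at the call site rather than hidden in an environment variable.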

gpu_bdb/bdb_tools/utils.py (outdated)
Comment on lines 16 to 23
import os

from bdb_tools.cluster_startup import attach_to_cluster
import cudf

if os.getenv("CPU_ONLY") == 'True':
    import pandas as cudf
else:
    import cudf
Member

Just use to_frame below.

    sales_corr = result["x"].corr(result["y"]).compute()
    result_df = sales_corr.to_frame()

gpu_bdb/bdb_tools/utils.py (outdated)
DaceT and others added 2 commits February 10, 2022 12:37
Changing the parameter since dask_cudf.DataFrame inherits from dask.dataframe.DataFrame

Co-authored-by: Vibhu Jawa <[email protected]>
Member

@VibhuJawa VibhuJawa left a comment

LGTM!

One small style fix, but everything else looks good.

gpu_bdb/bdb_tools/q29_utils.py (outdated)
Member

@VibhuJawa VibhuJawa left a comment

LGTM

Successfully merging this pull request may close these issues.

[FEA] Add CPU version of pure SQL GPU-BDB Queries
2 participants