[WIP] Adding pure SQL GPU-BDB Queries #235
Conversation
Thanks for working on this @DaceT.
I don't think we should rely on os.environ when running these queries. We should read the setting from a config file and then dispatch based on the dataframe passed.
A big problem with it is the reduced readability of `import pandas as cudf`: when a traceback comes, it becomes insanely hard to reason about something that claims to come from cuDF but actually comes from pandas, and vice versa.
Also, environment-variable-based optional imports look fragile to me, especially for things that get bundled up as a package and then used by Dask. I think that can cause a lot of problems downstream.
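A config-driven alternative could look roughly like the sketch below, using only the standard library. The `[benchmark]` section and `backend` key are assumptions for illustration, not part of this PR:

```python
# Hedged sketch: read the CPU/GPU choice from a config file rather than
# os.environ. The [benchmark] section and "backend" key are assumptions.
import configparser

config = configparser.ConfigParser()
config.read_string("""
[benchmark]
backend = CPU
""")

# fallback keeps GPU as the default when the key is absent
backend = config.get("benchmark", "backend", fallback="GPU")
print(backend)  # CPU
```

Reading from a file keeps the choice out of the process environment, so it survives packaging and does not silently leak into worker processes.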
gpu_bdb/bdb_tools/readers.py
Outdated
```python
if os.getenv("CPU_ONLY") == 'True':
    import dask.dataframe as dask_cudf
else:
    import dask_cudf
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have us pass a `backend` kwarg and expand this class.
gpu-bdb/gpu_bdb/bdb_tools/readers.py
Lines 90 to 97 in db207a1
```python
def __init__(
    self, basepath, split_row_groups=False,
):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups
```
So something like:
```python
def __init__(self, basepath, split_row_groups=False, backend='GPU'):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups
    if backend == 'CPU':
        self.back_end = dask_cudf
    else:
        self.back_end = dask.dataframe
```
And call it like: `self.back_end.read_parquet`
Wouldn't we want the opposite, or did you mean to put 'GPU' instead of 'CPU'? @VibhuJawa

```python
if backend == 'CPU':
    self.back_end = dask.dataframe
else:
    self.back_end = dask_cudf
```
yup. Exactly that.
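The agreed mapping can be isolated in a tiny helper. The sketch below returns the module name rather than importing it, so the selection logic is testable even without dask or dask_cudf installed; the helper name is an assumption:

```python
# Hedged sketch of the corrected mapping from this thread:
# CPU -> dask.dataframe, anything else (GPU) -> dask_cudf.
def resolve_backend_name(backend="GPU"):
    if backend == "CPU":
        return "dask.dataframe"
    return "dask_cudf"

# importlib.import_module(resolve_backend_name(...)) would then load the
# module, and the result could be assigned to self.back_end.
print(resolve_backend_name("CPU"))  # dask.dataframe
print(resolve_backend_name("GPU"))  # dask_cudf
```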
```python
import os

from bdb_tools.cluster_startup import attach_to_cluster

if os.getenv("CPU_ONLY") == 'True':
    import pandas as cudf
else:
    import cudf
```
Just use `to_frame` below.

```python
sales_corr = result["x"].corr(result["y"]).compute()
result_df = sales_corr.to_frame()
```
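For reference, a plain-pandas analogue of the same idea; the data and names here are made up for illustration. In plain pandas, `corr` between two columns yields a scalar, so it needs to be wrapped in a `Series` before `to_frame` applies:

```python
# Hedged plain-pandas sketch; data and names are made up for illustration.
import pandas as pd

result = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
sales_corr = result["x"].corr(result["y"])  # Pearson correlation, a scalar
# Wrap the scalar in a Series so to_frame() produces a one-cell DataFrame:
result_df = pd.Series([sales_corr], name="sales_corr").to_frame()
print(result_df.shape)  # (1, 1)
```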
Changing the parameter, since `dask_cudf.DataFrame` imports from `dask.DataFrame`.

Co-authored-by: Vibhu Jawa <[email protected]>
LGTM!
One small style fix, but everything else looks good.
Co-authored-by: Vibhu Jawa <[email protected]>
LGTM
Fixes #230
All of the queries (01, 06, 07, 09, 11, 12, 13, 14, 15, 16, 17, 20, 21, 23, 24, 29) run successfully except for q22. Once that one is up and running, I'll commit the changes here.