[WIP] Adding pure SQL GPU-BDB Queries #235
Conversation
Thanks for working on this @DaceT.
I don't think we should rely on os.environ when running these queries. We should read the setting from a config file and then dispatch based on the dataframe passed.
A big problem with it is the reduced readability of `import pandas as cudf`: when a traceback comes, it becomes insanely hard to reason about something that claims to come from cuDF but actually comes from pandas, and vice versa.
Also, environment-variable-based optional imports look fragile to me, especially for things that get bundled up as a package and then used by Dask. I think that can cause a lot of problems downstream.
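A config-driven alternative could look roughly like the sketch below, using only the standard library. The `[benchmark]` section and `backend` key are assumptions for illustration, not part of this PR:

```python
# Hedged sketch: read the CPU/GPU choice from a config file rather than
# os.environ. The [benchmark] section and "backend" key are assumptions.
import configparser

config = configparser.ConfigParser()
config.read_string("""
[benchmark]
backend = CPU
""")

# fallback keeps GPU as the default when the key is absent
backend = config.get("benchmark", "backend", fallback="GPU")
print(backend)  # CPU
```

Reading from a file keeps the choice out of the process environment, so it survives packaging and does not silently leak into worker processes.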
gpu_bdb/bdb_tools/readers.py
Outdated
```python
if os.getenv("CPU_ONLY") == 'True':
    import dask.dataframe as dask_cudf
else:
    import dask_cudf
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have us pass a `backend` kwarg and expand this class.
gpu-bdb/gpu_bdb/bdb_tools/readers.py
Lines 90 to 97 in db207a1
```python
def __init__(
    self, basepath, split_row_groups=False,
):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups
```
So something like:
```python
def __init__(self, basepath, split_row_groups=False, backend='GPU'):
    self.table_path_mapping = {
        table: os.path.join(basepath, table, "*.parquet") for table in TABLE_NAMES
    }
    self.split_row_groups = split_row_groups
    if backend == 'CPU':
        self.back_end = dask_cudf
    else:
        self.back_end = dask.dataframe
```
And call it like: `self.back_end.read_parquet`
Wouldn't we want the opposite, or did you mean to put 'GPU' instead of 'CPU'? @VibhuJawa

```python
if backend == 'CPU':
    self.back_end = dask.dataframe
else:
    self.back_end = dask_cudf
```
yup. Exactly that.
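The agreed mapping can be isolated in a tiny helper. The sketch below returns the module name rather than importing it, so the selection logic is testable even without dask or dask_cudf installed; the helper name is an assumption:

```python
# Hedged sketch of the corrected mapping from this thread:
# CPU -> dask.dataframe, anything else (GPU) -> dask_cudf.
def resolve_backend_name(backend="GPU"):
    if backend == "CPU":
        return "dask.dataframe"
    return "dask_cudf"

# importlib.import_module(resolve_backend_name(...)) would then load the
# module, and the result could be assigned to self.back_end.
print(resolve_backend_name("CPU"))  # dask.dataframe
print(resolve_backend_name("GPU"))  # dask_cudf
```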
```python
import os

from bdb_tools.cluster_startup import attach_to_cluster

if os.getenv("CPU_ONLY") == 'True':
    import pandas as cudf
else:
    import cudf
```
Just use `to_frame` below.

```python
sales_corr = result["x"].corr(result["y"]).compute()
result_df = sales_corr.to_frame()
```
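For reference, a plain-pandas analogue of the same idea; the data and names here are made up for illustration. In plain pandas, `corr` between two columns yields a scalar, so it needs to be wrapped in a `Series` before `to_frame` applies:

```python
# Hedged plain-pandas sketch; data and names are made up for illustration.
import pandas as pd

result = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
sales_corr = result["x"].corr(result["y"])  # Pearson correlation, a scalar
# Wrap the scalar in a Series so to_frame() produces a one-cell DataFrame:
result_df = pd.Series([sales_corr], name="sales_corr").to_frame()
print(result_df.shape)  # (1, 1)
```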
Changing the parameter, since `dask_cudf.DataFrame` imports from `dask.DataFrame`.

Co-authored-by: Vibhu Jawa <[email protected]>
LGTM!
One small style fix, but everything else looks good.
Co-authored-by: Vibhu Jawa <[email protected]>
LGTM
Fixes #230
All of the queries (01, 06, 07, 09, 11, 12, 13, 14, 15, 16, 17, 20, 21, 23, 24, 29) run successfully except for q22. Once that one is up and running, I'll commit the changes here.