Revert "Adding queries 25, 26 and 30 to be reviewed" #244

Merged (1 commit) on Mar 14, 2022

@VibhuJawa VibhuJawa commented Mar 14, 2022

Reverts #241

We should revert the changes pushed in #241 for two reasons:

  1. The Q25 and Q26 changes are buggy and introduce the following error:
Traceback (most recent call last):
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/bdb_tools/utils.py", line 334, in run_sql_query
    results = benchmark(
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/bdb_tools/utils.py", line 57, in benchmark
    result = func(*args, **kwargs)
  File "gpu_bdb_query_25_dask_sql.py", line 188, in main
    results_dict = get_clusters(client=client, ml_input_df=cluster_input_ddf)
  File "gpu_bdb_query_25_dask_sql.py", line 58, in get_clusters
    pd.DataFrame(results_dict["cid_labels"]), npartitions=output.npartitions
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/pandas/core/frame.py", line 684, in __init__
    data = list(data)
  File "/home/nfs/vjawa/gpu_bdb_latest/dask-cuda/dask_cuda/proxy_object.py", line 560, in __iter__
    return iter(self._pxy_deserialize())
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/cudf/utils/utils.py", line 209, in __iter__
    raise TypeError(
TypeError: Series object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.

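This happens because the following check should test for a dask_cudf DataFrame, not a cudf DataFrame. The input is a dask_cudf collection, so the check fails and the code falls through to the pandas path while the labels are still a cudf Series: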

if isinstance(ml_input_df, cudf.DataFrame):
    labels_final = dask_cudf.from_cudf(
        results_dict["cid_labels"], npartitions=output.npartitions
    )
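A minimal sketch of what the corrected dispatch could look like, assuming the intent is to keep the labels on the GPU when the input is a dask_cudf collection and to use pandas otherwise. The helper names `is_dask_cudf_frame` and `wrap_labels` are hypothetical and do not exist in the repository, and the sketch assumes `dask_cudf.DataFrame` is exposed as it was in the environment from the traceback. Imports are deferred so a CPU-only environment never needs RAPIDS.

```python
def is_dask_cudf_frame(obj):
    """Return True only when obj is a dask_cudf DataFrame.

    Hypothetical helper: dask_cudf is imported lazily, so an environment
    without it simply reports False and takes the pandas path.
    """
    try:
        import dask_cudf
    except ImportError:
        return False
    return isinstance(obj, dask_cudf.DataFrame)


def wrap_labels(ml_input_df, labels, npartitions):
    """Hypothetical helper: turn the KMeans labels into a Dask collection."""
    if is_dask_cudf_frame(ml_input_df):
        import dask_cudf

        # GPU path: labels is assumed to be a cudf Series, so it stays on
        # the device and is never iterated on the host.
        return dask_cudf.from_cudf(labels, npartitions=npartitions)

    import dask.dataframe as dd
    import pandas as pd

    # CPU path: labels is assumed to be host data (for example a NumPy array).
    return dd.from_pandas(pd.DataFrame(labels), npartitions=npartitions)
```

With a check like this, a dask_cudf input would no longer reach `pd.DataFrame(...)` with a cudf Series, which is the call that raises the `TypeError` in the traceback.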

  2. The ML component still uses cuML, and therefore GPUs, so this is not really a legitimate CPU implementation of these queries.

def train_clustering_model(training_df, n_clusters, max_iter, n_init):
    """Trains a KMeans clustering model on the
    given dataframe and returns the resulting
    labels and WSSSE"""
    from cuml.cluster.kmeans import KMeans
    best_sse = 0
    best_model = None
    # Optimizing by doing multiple seeding iterations.
    for i in range(n_init):
        model = KMeans(
            oversampling_factor=0,
            n_clusters=n_clusters,
            max_iter=max_iter,
            random_state=np.random.randint(0, 500),
            init="k-means++",
        )
        model.fit(training_df)
        score = model.inertia_
        if best_model is None:
            best_sse = score
            best_model = model
        elif abs(score) < abs(best_sse):
            best_sse = score
            best_model = model
    return {
        "cid_labels": best_model.labels_,
        "wssse": best_model.inertia_,
        "cluster_centers": best_model.cluster_centers_,
        "nclusters": n_clusters,
    }
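For illustration only, here is one way a CPU path could avoid the hard-coded cuML import. It assumes scikit-learn's `KMeans` as the CPU stand-in, which this PR does not prescribe. The `backend` argument and the helpers `kmeans_kwargs` and `make_kmeans` are hypothetical names. The cuML-specific `oversampling_factor` is passed only on the GPU path because scikit-learn's constructor has no such parameter.

```python
def kmeans_kwargs(backend, n_clusters, max_iter, random_state):
    """Build constructor arguments for the chosen KMeans backend.

    Hypothetical helper; backend is assumed to be "gpu" or "cpu".
    """
    kwargs = {
        "n_clusters": n_clusters,
        "max_iter": max_iter,
        "random_state": random_state,
        "init": "k-means++",
    }
    if backend == "gpu":
        # cuML-specific option, copied from the snippet above.
        kwargs["oversampling_factor"] = 0
    elif backend == "cpu":
        # The outer seeding loop already retries, so request a single run.
        kwargs["n_init"] = 1
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return kwargs


def make_kmeans(backend, n_clusters, max_iter, random_state):
    """Instantiate KMeans from cuML or scikit-learn (assumed CPU stand-in)."""
    kwargs = kmeans_kwargs(backend, n_clusters, max_iter, random_state)
    if backend == "gpu":
        from cuml.cluster.kmeans import KMeans
    else:
        from sklearn.cluster import KMeans
    return KMeans(**kwargs)
```

Inside `train_clustering_model`, the `KMeans(...)` call could then become `make_kmeans(backend, n_clusters, max_iter, np.random.randint(0, 500))`, leaving the best-of-`n_init` selection by inertia unchanged.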

CC: @DaceT, @ayushdg

@VibhuJawa VibhuJawa requested a review from ayushdg March 14, 2022 17:51

@ayushdg ayushdg left a comment

lgtm! We should revisit enabling q30 separately if that doesn't break anything.

I'll open another issue to discuss some form of CI testing to catch these problems earlier.
