Revert "Adding queries 25, 26 and 30 to be reviewed" #244

VibhuJawa · 2022-03-14T17:47:26Z

Reverts #241

We should revert the changes pushed in #241 because:

The Q25 and Q26 changes are buggy and introduce the following error:

Traceback (most recent call last):
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/bdb_tools/utils.py", line 334, in run_sql_query
    results = benchmark(
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/bdb_tools/utils.py", line 57, in benchmark
    result = func(*args, **kwargs)
  File "gpu_bdb_query_25_dask_sql.py", line 188, in main
    results_dict = get_clusters(client=client, ml_input_df=cluster_input_ddf)
  File "gpu_bdb_query_25_dask_sql.py", line 58, in get_clusters
    pd.DataFrame(results_dict["cid_labels"]), npartitions=output.npartitions
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/pandas/core/frame.py", line 684, in __init__
    data = list(data)
  File "/home/nfs/vjawa/gpu_bdb_latest/dask-cuda/dask_cuda/proxy_object.py", line 560, in __iter__
    return iter(self._pxy_deserialize())
  File "/datasets/vjawa/miniconda3/envs/rapids-gpu-bdb-dask-sql-feb-11/lib/python3.8/site-packages/cudf/utils/utils.py", line 209, in __iter__
    raise TypeError(
TypeError: Series object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.

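This is because the following check should test for a dask_cudf DataFrame, not a cudf DataFrame. When the check fails on the GPU path, the cuDF labels fall through to the `pd.DataFrame(...)` branch shown in the traceback, which cannot iterate over a cuDF Series.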

gpu-bdb/gpu_bdb/queries/q25/gpu_bdb_query_25_dask_sql.py

Lines 52 to 55 in 9ae8a4d

    if isinstance(ml_input_df, cudf.DataFrame):
        labels_final = dask_cudf.from_cudf(
            results_dict["cid_labels"], npartitions=output.npartitions
        )
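A minimal sketch of the corrected check, under stated assumptions rather than the code that was eventually merged. The helper names `is_dask_cudf_frame` and `distribute_labels` are hypothetical, and the CPU branch assumes the labels are wrapped with `dd.from_pandas`, which the traceback only partially shows.

```python
def is_dask_cudf_frame(obj):
    """Return True only when obj is a dask_cudf DataFrame.

    dask_cudf is imported lazily so the check also works in a
    CPU-only environment where the GPU stack is not installed.
    """
    try:
        import dask_cudf
    except ImportError:
        return False  # no dask_cudf available, so obj cannot be one
    return isinstance(obj, dask_cudf.DataFrame)


def distribute_labels(ml_input_df, labels, npartitions):
    """Wrap the clustering labels in a distributed collection (hypothetical helper)."""
    if is_dask_cudf_frame(ml_input_df):
        import dask_cudf

        # GPU path: labels are a cuDF object, so stay on the GPU.
        return dask_cudf.from_cudf(labels, npartitions=npartitions)

    # CPU path (assumed): labels are host data that pandas can consume.
    import dask.dataframe as dd
    import pandas as pd

    return dd.from_pandas(pd.DataFrame(labels), npartitions=npartitions)
```

With this shape, a dask_cudf input never reaches the `pd.DataFrame(...)` call that raised the `TypeError` above.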

The ML component still uses cuML, and therefore GPUs, so this is not really a legitimate CPU implementation of these queries.

gpu-bdb/gpu_bdb/bdb_tools/utils.py

Lines 958 to 994 in 9ae8a4d

    def train_clustering_model(training_df, n_clusters, max_iter, n_init):
        """Trains a KMeans clustering model on the
        given dataframe and returns the resulting
        labels and WSSSE"""

        from cuml.cluster.kmeans import KMeans

        best_sse = 0
        best_model = None

        # Optimizing by doing multiple seeding iterations.
        for i in range(n_init):
            model = KMeans(
                oversampling_factor=0,
                n_clusters=n_clusters,
                max_iter=max_iter,
                random_state=np.random.randint(0, 500),
                init="k-means++",
            )
            model.fit(training_df)

            score = model.inertia_

            if best_model is None:
                best_sse = score
                best_model = model

            elif abs(score) < abs(best_sse):
                best_sse = score
                best_model = model

        return {
            "cid_labels": best_model.labels_,
            "wssse": best_model.inertia_,
            "cluster_centers": best_model.cluster_centers_,
            "nclusters": n_clusters,
        }
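One possible way to make the ML step backend-aware is to pass the estimator in rather than hard-coding the cuML import. The sketch below is an illustration only, not the approach the project adopted: the `make_model` parameter and the `make_gpu_kmeans` / `make_cpu_kmeans` factories are hypothetical, and the CPU factory assumes scikit-learn is installed and that `training_df` is host data it can consume.

```python
import random


def make_gpu_kmeans(n_clusters, max_iter, random_state):
    """Build the cuML estimator with the arguments used in the snippet above."""
    from cuml.cluster.kmeans import KMeans  # lazy import: needs the GPU stack

    return KMeans(
        oversampling_factor=0,
        n_clusters=n_clusters,
        max_iter=max_iter,
        random_state=random_state,
        init="k-means++",
    )


def make_cpu_kmeans(n_clusters, max_iter, random_state):
    """Build a scikit-learn estimator (assumed CPU stand-in)."""
    from sklearn.cluster import KMeans  # lazy import: CPU-only dependency

    return KMeans(
        n_clusters=n_clusters,
        max_iter=max_iter,
        random_state=random_state,
        init="k-means++",
        n_init=1,  # the outer loop below already handles repeated seeding
    )


def train_clustering_model(training_df, n_clusters, max_iter, n_init, make_model):
    """Fit n_init models built by make_model and keep the lowest-inertia one."""
    best_model = None
    for _ in range(n_init):
        model = make_model(
            n_clusters=n_clusters,
            max_iter=max_iter,
            random_state=random.randint(0, 499),  # same range as np.random.randint(0, 500)
        )
        model.fit(training_df)
        if best_model is None or abs(model.inertia_) < abs(best_model.inertia_):
            best_model = model

    return {
        "cid_labels": best_model.labels_,
        "wssse": best_model.inertia_,
        "cluster_centers": best_model.cluster_centers_,
        "nclusters": n_clusters,
    }
```

A CPU run would then call `train_clustering_model(df, n_clusters, max_iter, n_init, make_model=make_cpu_kmeans)`, while the GPU queries keep `make_gpu_kmeans`.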

CC: @DaceT, @ayushdg

This reverts commit 9ae8a4d.

ayushdg

lgtm! We should revisit enabling q30 separately if that doesn't break anything.

I'll open another issue to discuss some sort of CI testing to catch these changes earlier.

Revert "Adding queries 25, 26 and 30 to be reviewed (#241)"

dd002e8

This reverts commit 9ae8a4d.

VibhuJawa requested a review from ayushdg March 14, 2022 17:51

ayushdg approved these changes Mar 14, 2022

VibhuJawa merged commit 13987b4 into main Mar 14, 2022

ayushdg mentioned this pull request Mar 14, 2022

Add Ci testing to pr's #245

Open

This was referenced Mar 14, 2022

Adding query 2, 4 and 5 #243

Closed

CPU backend for Queries 25, 26 and 30 #246

Open

[CPU] ML Portion for GPU-BDB Queries #248

Open
