
BUG: spectral clustering always crashes with > 100 eigenvectors #710

Closed
joelb92 opened this issue Feb 10, 2020 · 3 comments
Labels: bug (Something isn't working)

Comments

@joelb92

joelb92 commented Feb 10, 2020

Describe the bug
I am attempting to partition a large sparse graph using cugraph's new spectralModularityMaximizationClustering method (the bug can also be reproduced with the balanced-cut version).

The graph I am clustering is roughly 100k x 100k, with 444,429 edges (quite sparse). No matter what I set the partition count to (5000 in my case), I cannot set the num_eigen_vects parameter above 100 without a crash:

To Reproduce
Steps to reproduce the behavior:

Graph building:

import json
import os

import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import lil_matrix
from tqdm import tqdm

def getedges(f):
    # imagenames, imagepaths and imagedb are global arrays built elsewhere
    fname = os.path.basename(f)[:-5]  # strip the .json suffix
    with open(f, 'r') as fp:
        results = json.load(fp)
    edges = []
    for r in results:
        score = r['searchscore']
        rname = os.path.basename(r['source'])
        row = np.where(imagenames == fname)[0][0]
        column = np.where(imagenames == rname)[0]
        if len(column) > 1:  # ambiguous name: fall back to matching the full path
            column = np.where(imagepaths == r['source'])[0]
        if len(column) > 0:
            column = column[0]
            edges.append((row, column, score))
    return edges

edgelist = Parallel(n_jobs=20)(delayed(getedges)(f) for f in tqdm(jsonfiles))

adj_mat = lil_matrix((len(imagedb), len(imagedb)), dtype='float32')
for edges in tqdm(edgelist):
    for row, column, score in edges:
        adj_mat[row, column] = score
adj_mat = adj_mat.tocsr()
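
For comparison, a minimal sketch of the same adjacency build done as a single COO construction instead of element-wise lil_matrix writes (this assumes edgelist and imagedb exactly as above and is only an illustration):

import numpy as np
from scipy.sparse import coo_matrix

# Flatten the per-file edge lists into parallel row/column/weight arrays.
rows, cols, vals = [], [], []
for edges in edgelist:
    for row, column, score in edges:
        rows.append(row)
        cols.append(column)
        vals.append(score)

# Build the sparse adjacency matrix in one shot, then convert to CSR.
# Note: duplicate (row, col) pairs are summed here, whereas lil assignment overwrites.
n = len(imagedb)
adj_mat = coo_matrix((np.asarray(vals, dtype='float32'),
                      (np.asarray(rows), np.asarray(cols))),
                     shape=(n, n)).tocsr()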

Converting to cugraph:

row_offsets = cudf.Series(adj_mat.indptr)
col_indices = cudf.Series(adj_mat.indices)
values = cudf.Series(adj_mat.data)
G = cugraph.Graph()
G.add_adj_list(row_offsets, col_indices, values)
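
A hedged sketch of an edge-list-based alternative, assuming a cuGraph version that provides Graph.from_cudf_edgelist (adj_mat is the CSR matrix built above):

import cudf
import cugraph

coo = adj_mat.tocoo()  # COO view of the SciPy adjacency matrix built above
gdf = cudf.DataFrame({'src': coo.row, 'dst': coo.col, 'weight': coo.data})

G = cugraph.Graph()
# Attach the weighted edge list; keyword names may differ slightly across cuGraph versions.
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='weight')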

Performing spectral clustering:

p = 5000
df = cugraph.spectralModularityMaximizationClustering(G,p,num_eigen_vects=100)#min(100,p))
score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
print(score)

Error:

GDFError                                  Traceback (most recent call last)
<ipython-input-15-9d098fb37fd7> in <module>
      1 p = 5000
----> 2 df = cugraph.spectralModularityMaximizationClustering(G,p,num_eigen_vects=101)#min(100,p))
      3 score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
      4 print(score)

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

GDFError: CUDA ERROR. b'cudaSuccess': b'no error'

Expected behavior
With a matrix of this size I would expect partitioning to allow more than 100 eigenvectors.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04 with NVIDIA 2080ti
  • Software: cuGraph installed via: conda install -y -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0

Can anyone else reproduce this issue with a large random graph?
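
One possible self-contained reproduction on a random graph of roughly the same size, sketched under the assumption that the add_adj_list path above is the trigger (size, edge count and seed are arbitrary):

import numpy as np
import scipy.sparse as sp
import cudf
import cugraph

n = 100_000       # ~100k vertices, as in the report
nnz = 444_429     # roughly the reported edge count
rng = np.random.default_rng(0)

# Random weighted edges; duplicate pairs are harmless for this illustration.
rows = rng.integers(0, n, size=nnz)
cols = rng.integers(0, n, size=nnz)
vals = rng.random(nnz).astype('float32')

A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))
A = (A + A.T).tocsr()   # symmetrize so the graph is undirected

G = cugraph.Graph()
G.add_adj_list(cudf.Series(A.indptr),
               cudf.Series(A.indices),
               cudf.Series(A.data))

# Same call as in the report; values of num_eigen_vects above 100 are what reportedly crash.
df = cugraph.spectralModularityMaximizationClustering(G, 5000, num_eigen_vects=101)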

Joel

joelb92 added the "? - Needs Triage" and "bug" labels on Feb 10, 2020
BradReesWork removed the "? - Needs Triage" label on Feb 18, 2020
@BradReesWork
Member

@joelb92 We currently limit the number of eigenvectors, since there is limited benefit to using more than the first 30. Is there an analytical problem you are trying to solve that depends on the 100th eigenvector?

@afender
Member

afender commented Feb 18, 2020

@joelb92
Hi Joel,

I'd recommend trying a smaller number of eigenpairs, based on the experiment in Fig. 2a of https://www.sciencedirect.com/science/article/pii/S1877050917307913:

In our experiments we have found that for most networks it is possible to maximize the modularity by varying the number of clusters independently of the number of eigenpairs, as shown in Fig. 2a. In these experiments we computed 2 to 8 eigenpairs, and afterward we continued to increase only the number of k-means clusters. Notice that the plot demonstrates how the choice of the number of eigenpairs impacts the modularity score.
Based on the formulation of the modularity problem, one can expect that the best-case scenario would be to compute a number of eigenpairs equal to the number of clusters. This is mostly, but not always, true, with the exceptions often due to loss of orthogonality or low-quality approximation of the last eigenvectors.
We also see that it is possible to compute fewer eigenpairs for an equivalent clustering quality, as shown in Fig. 2a. Indeed, modularity values seem to follow a trend set by the number of clusters, and the selected number of eigenpairs is secondary. Hence, given a constant number of eigenpairs, it is possible to maximize modularity by only increasing the number of k-means clusters.
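
For what it's worth, a minimal sketch of that experiment on the graph G from the report, assuming cugraph.analyzeClustering_modularity takes the same (G, n_clusters, clustering) arguments as the analyzeClustering_edge_cut call used above:

import cugraph

p = 5000  # keep the number of k-means clusters fixed

# Sweep a small number of eigenpairs and compare the resulting modularity scores.
for k in (2, 4, 8, 16, 32):
    df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=k)
    score = cugraph.analyzeClustering_modularity(G, p, df['cluster'])
    print(k, score)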

@BradReesWork
Member

Marking as closed.
