
BUG: spectral clustering always crashes with > 100 eigenvectors #710

Closed
joelb92 opened this issue Feb 10, 2020 · 3 comments
Labels: bug (Something isn't working)

Comments

@joelb92

joelb92 commented Feb 10, 2020

Describe the bug
I am attempting to partition a large sparse graph using cugraph's new spectralModularityMaximizationClustering method (the bug can also be reproduced with the balanced-cut version).

The graph I am clustering is roughly 100k x 100k, with 444,429 edges (quite sparse). No matter what I set the partition count to (5000 in my case), I cannot set the num_eigen_vects parameter above 100 without a crash:

To Reproduce
Steps to reproduce the behavior:

Graph building:

import json
import os

import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import lil_matrix
from tqdm import tqdm

def getedges(f):
    # imagenames, imagepaths and imagedb are global arrays built elsewhere
    fname = os.path.basename(f)[:-5]  # strip the .json suffix
    with open(f, 'r') as fp:
        results = json.load(fp)
    edges = []
    for r in results:
        score = r['searchscore']
        rname = os.path.basename(r['source'])
        row = np.where(imagenames == fname)[0][0]
        column = np.where(imagenames == rname)[0]
        if len(column) > 1:  # ambiguous name: fall back to matching the full path
            column = np.where(imagepaths == r['source'])[0]
        if len(column) > 0:
            column = column[0]
            edges.append((row, column, score))
    return edges

edgelist = Parallel(n_jobs=20)(delayed(getedges)(f) for f in tqdm(jsonfiles))

adj_mat = lil_matrix((len(imagedb), len(imagedb)), dtype='float32')
for edges in tqdm(edgelist):
    for row, column, score in edges:
        adj_mat[row, column] = score
adj_mat = adj_mat.tocsr()
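
For comparison, a minimal sketch of the same adjacency build done as a single COO construction instead of element-wise lil_matrix writes (this assumes edgelist and imagedb exactly as above and is only an illustration):

import numpy as np
from scipy.sparse import coo_matrix

# Flatten the per-file edge lists into parallel row/column/weight arrays.
rows, cols, vals = [], [], []
for edges in edgelist:
    for row, column, score in edges:
        rows.append(row)
        cols.append(column)
        vals.append(score)

# Build the sparse adjacency matrix in one shot, then convert to CSR.
# Note: duplicate (row, col) pairs are summed here, whereas lil assignment overwrites.
n = len(imagedb)
adj_mat = coo_matrix((np.asarray(vals, dtype='float32'),
                      (np.asarray(rows), np.asarray(cols))),
                     shape=(n, n)).tocsr()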

Converting to cugraph:

row_offsets = cudf.Series(adj_mat.indptr)
col_indices = cudf.Series(adj_mat.indices)
values = cudf.Series(adj_mat.data)
G = cugraph.Graph()
G.add_adj_list(row_offsets, col_indices, values)
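
A hedged sketch of an edge-list-based alternative, assuming a cuGraph version that provides Graph.from_cudf_edgelist (adj_mat is the CSR matrix built above):

import cudf
import cugraph

coo = adj_mat.tocoo()  # COO view of the SciPy adjacency matrix built above
gdf = cudf.DataFrame({'src': coo.row, 'dst': coo.col, 'weight': coo.data})

G = cugraph.Graph()
# Attach the weighted edge list; keyword names may differ slightly across cuGraph versions.
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='weight')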

Performing spectral clustering:

p = 5000
df = cugraph.spectralModularityMaximizationClustering(G,p,num_eigen_vects=100)#min(100,p))
score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
print(score)

Error:

GDFError                                  Traceback (most recent call last)
<ipython-input-15-9d098fb37fd7> in <module>
      1 p = 5000
----> 2 df = cugraph.spectralModularityMaximizationClustering(G,p,num_eigen_vects=101)#min(100,p))
      3 score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
      4 print(score)

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

GDFError: CUDA ERROR. b'cudaSuccess': b'no error'

Expected behavior
With a matrix of this size I would expect partitioning to allow more than 100 eigenvectors.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04 with NVIDIA 2080ti
  • Software: cuGraph installed via: conda install -y -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0

Can anyone else reproduce this issue with a large random graph?
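
One possible self-contained reproduction on a random graph of roughly the same size, sketched under the assumption that the add_adj_list path above is the trigger (size, edge count and seed are arbitrary):

import numpy as np
import scipy.sparse as sp
import cudf
import cugraph

n = 100_000       # ~100k vertices, as in the report
nnz = 444_429     # roughly the reported edge count
rng = np.random.default_rng(0)

# Random weighted edges; duplicate pairs are harmless for this illustration.
rows = rng.integers(0, n, size=nnz)
cols = rng.integers(0, n, size=nnz)
vals = rng.random(nnz).astype('float32')

A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))
A = (A + A.T).tocsr()   # symmetrize so the graph is undirected

G = cugraph.Graph()
G.add_adj_list(cudf.Series(A.indptr),
               cudf.Series(A.indices),
               cudf.Series(A.data))

# Same call as in the report; values of num_eigen_vects above 100 are what reportedly crash.
df = cugraph.spectralModularityMaximizationClustering(G, 5000, num_eigen_vects=101)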

Joel

joelb92 added the "? - Needs Triage" and "bug" labels on Feb 10, 2020
BradReesWork removed the "? - Needs Triage" label on Feb 18, 2020
@BradReesWork
Member

@joelb92 We currently limit the number of eigenvectors, since there is limited benefit to using more than the first 30. Is there an analytical problem you are trying to solve that depends on the 100th eigenvector?

@afender
Member

afender commented Feb 18, 2020

@joelb92
Hi Joel,

I'd recommend trying a smaller number of eigenpairs, based on the experiment in Fig. 2a of https://www.sciencedirect.com/science/article/pii/S1877050917307913:

In our experiments we have found that for most networks it is possible to maximize the modularity by varying the number of clusters independently of the number of eigenpairs, as shown in Fig. 2a. In these experiments we computed 2 to 8 eigenpairs, and afterward we continued to increase only the number of k-means clusters. Notice that the plot demonstrates how the choice of the number of eigenpairs impacts the modularity score.
Based on the formulation of the modularity problem, one can expect that the best-case scenario would be to compute a number of eigenpairs equal to the number of clusters. This is mostly, but not always, true, with the exceptions often due to loss of orthogonality or low-quality approximation of the last eigenvectors.
We also see that it is possible to compute fewer eigenpairs for an equivalent clustering quality, as shown in Fig. 2a. Indeed, modularity values seem to follow a trend set by the number of clusters, and the selected number of eigenpairs is secondary. Hence, given a constant number of eigenpairs, it is possible to maximize modularity by only increasing the number of k-means clusters.
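
For what it's worth, a minimal sketch of that experiment on the graph G from the report, assuming cugraph.analyzeClustering_modularity takes the same (G, n_clusters, clustering) arguments as the analyzeClustering_edge_cut call used above:

import cugraph

p = 5000  # keep the number of k-means clusters fixed

# Sweep a small number of eigenpairs and compare the resulting modularity scores.
for k in (2, 4, 8, 16, 32):
    df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=k)
    score = cugraph.analyzeClustering_modularity(G, p, df['cluster'])
    print(k, score)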

@BradReesWork
Member

Marking as closed.
