Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results return by the similarity algorithms #3708

Closed
2 tasks done
Tracked by #3260
jnke2016 opened this issue Jul 14, 2023 · 3 comments
Closed
2 tasks done
Tracked by #3260

Incorrect results return by the similarity algorithms #3708

jnke2016 opened this issue Jul 14, 2023 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@jnke2016
Copy link
Contributor

jnke2016 commented Jul 14, 2023

Version

23.08

Which installation method(s) does this occur on?

Docker, Conda, Pip, Source

Describe the bug.

All the Similarity algorithms return incorrect results when calling the method number_of_vertices() before calling the algorithm if the vertices are not numbered from 0 to number_of_vertices -1. In fact, Jaccard for instance gets the number of vertices from the offsets array since the legacy version creates a graph from an adjacency list internally But when we explicitly call the method number_of_vertices() prior , this sets the number of vertices to a value different than what the offsets counterpart would do if the vertices are not renumbered (the offsets and indices array are different based on the number of vertices provided when creating those arrays.

Minimum reproducible example

data = 'PATH_TO/karate.csv'
gdf = cudf.read_csv(data, delimiter=' ', names=['src', 'dst', 'wt'], dtype=['int32', 'int32', 'float']  )
gdf = gdf.add(1)
G = cugraph.Graph(directed=False)
G.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=True)
G.number_of_vertices()
sdf = cugraph.sorensen(G)
sdf.max()

Relevant log output

Case 1: don’t add 1 to gdf:
graph_offsets=[0,16,25,35,41,44,48,52,56,61,63,66,67,69,74,76,78,80,82,84,87,89,91,93,98,101,104,106,110,113,117,121,127,139];
graph_indices=[1,2,3,4,5,6,7,8,10,11,12,13,17,19,21,31,0,2,3,7,13,17,19,21,30,0,1,3,7,8,9,13,27,28,32,0,1,2,7,12,13,0,6,10,0,6,10,16,0,4,5,16,0,1,2,3,0,2,30,32,33,2,33,0,4,5,0,0,3,0,1,2,3,33,32,33,32,33,5,6,0,1,32,33,0,1,33,32,33,0,1,32,33,25,27,29,32,33,25,27,31,23,24,31,29,33,2,23,24,33,2,31,33,23,26,32,33,1,8,32,33,0,24,25,28,32,33,2,8,14,15,18,20,22,23,29,30,31,33,8,9,13,14,15,18,19,20,22,23,26,27,28,29,30,31,32];

Case 2: add 1 to gdf:
graph_offsets=[0,0,16,25,35,41,44,48,52,56,61,63,66,67,69,74,76,78,80,82,84,87,89,91,93,98,101,104,106,110,113,117,121,127];
graph_indices=[2,3,4,5,6,7,8,9,11,12,13,14,18,20,22,32,1,3,4,8,14,18,20,22,31,1,2,4,8,9,10,14,28,29,33,1,2,3,8,13,14,1,7,11,1,7,11,17,1,5,6,17,1,2,3,4,1,3,31,33,34,3,34,1,5,6,1,1,4,1,2,3,4,34,33,34,33,34,6,7,1,2,33,34,1,2,34,33,34,1,2,33,34,26,28,30,33,34,26,28,32,24,25,32,30,34,3,24,25,34,3,32,34,24,27,33,34,2,9,33,34,1,25,26,29,33,34,3,9,15,16,19,21,23,24,30,31,32,34,9,10,14,15,16,19,20,21,23,24,27,28,29,30,31,32,33];

The print from case 2 suggests that the graph is not, in fact, renumbered.  Thus there are 35 vertices (from 0 to 34).  The fact that we overwrite the number of vertices to 34 results in incorrect behavior

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@jnke2016 jnke2016 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jul 14, 2023
@BradReesWork BradReesWork removed the ? - Needs Triage Need team to review and classify label Jul 17, 2023
@kingmesal
Copy link

Is this resolved?

@BradReesWork BradReesWork changed the title [BUG]: Incorrect results return by the similarity algorithms Incorrect results return by the similarity algorithms Aug 3, 2023
@jnke2016
Copy link
Contributor Author

jnke2016 commented Aug 4, 2023

This issue is not resolved yet but we made progress in 23.08 to deliver a full fledged solution. In fact, we got the last piece that enables support for weights in the C++ layer which was merged during burndown. Prior to that as a safety guard, we merged a PR highlighting that the similarity algorithms don't currently support certain datasets and we throw an exception if this condition is not satisfied. Once the roadmap of 23.10 release is establish, we will have a better estimate as to when this will be resolved.

@jnke2016
Copy link
Contributor Author

Resolved by PR 3828.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants