Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MST optimization to guarantee the connectivity of CAGRA graphs #237

Merged
merged 6 commits into from
Jul 31, 2024

Conversation

anaruse
Copy link
Contributor

@anaruse anaruse commented Jul 22, 2024

This PR allows us to guarantee the connectivity of the CAGRA search graph using approximate MST.

It has been empirically shown that the graph indexes generated by CAGRA for search provide comparable search accuracy to other libraries, but reachability from any node to all nodes is not guaranteed. In fact, it has been confirmed that the number of strongly connected components (SCC) of graph indexes created by CAGRA is not 1 in some 100M scale datasets.

This problem can be alleviated by increasing the number of degrees in the search graph, but this would increase the size of the graph index. It is desirable to address this problem without increasing the number of degrees of the search graph.

Prior study has shown that this can be solved by using a Minimum Spanning Tree (MST)-like approach, but in general, MST calculation takes a long time. However, what is needed here is not an exact MST, but, for example, an approximate MST in which the total number of edges is not necessarily minimum. Such an approximate MST could be computed quickly on GPUs.

This PR contains implementation to create a approximate MST on the GPU at high speed based on the above policy and use it to guarantee the connectivity of the search graph.

This functionality is not always required, so it is considered an opt-in feature. A member variable named guarantee_connectivity is added to index_params, so set this variable to true if you wish to use this featgure.

cuvs::neighbors::cagra::index_params index_params;
index_params.guarantee_connectivity = true;
auto index = cuvs::neighbors::cagra::build(res, index_params, dataset_view);

@anaruse anaruse requested a review from a team as a code owner July 22, 2024 10:15
Copy link

copy-pr-bot bot commented Jul 22, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Jul 22, 2024
@tfeher tfeher added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 22, 2024
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Akira for porting the PR to cuVS! The original PR rapidsai/raft#2218 was already reviewed, and I see that most of the issues are addressed here. I have a marked a few smaller issues below.

There is some scope for improving code readability by relying more on mdspan, and adding a helper function to check duplicates (e.g. here). @anaruse I could easily apply the mdspan changes on the PR, if you are fine with that.

cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/common.hpp Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated Show resolved Hide resolved
@anaruse
Copy link
Contributor Author

anaruse commented Jul 23, 2024

Thanks for the review, Tamas. I have fixed the points you pointed out, and it would be helpful if you could make the mdspan related changes.

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Akira for the update. The PR looks good to me.

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

Before this is merged can we please have a GitHub issue created to expose this through the public APIs? We already have a use-case for this API in HDBSCAN and single linkage clustering.

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

Also, given that both of you have made changes to the PR, I’d like to have a 3rd person do a quick look over. I’ll take a look tomorrow.

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

Just noticed this issue #256 was created, so disregard my first comment.

@tfeher
Copy link
Contributor

tfeher commented Jul 29, 2024

Thanks Corey, more reviews are welcome! Note that I have only added a merge commit to this PR to bring it up to date with base branch. I planned to add mdspan usage related refactoring, but I decided to postpone that in a follow up (added a note in #256).

This PR is already useful in its current form: it helps to exclude/debug one issue that can lead to low recall. That is why I would prefer to get it merged even with some pending refactoring.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It would be nice to have a little more detail in the description of the guarantee_connectivity param, but i don't think that needs to hold up this PR.

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

@anaruse @tfeher it looks like a small failure in the style checker

@cjnolet
Copy link
Member

cjnolet commented Jul 30, 2024

/ok to test

@tfeher
Copy link
Contributor

tfeher commented Jul 31, 2024

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Jul 31, 2024

/merge

@rapids-bot rapids-bot bot merged commit e67caa5 into rapidsai:branch-24.08 Jul 31, 2024
54 checks passed
divyegala pushed a commit to divyegala/cuvs that referenced this pull request Jul 31, 2024
…apidsai#237)

This PR allows us to guarantee the connectivity of the CAGRA search graph using approximate MST.

It has been empirically shown that the graph indexes generated by CAGRA for search provide comparable search accuracy to other libraries, but reachability from any node to all nodes is not guaranteed. In fact, it has been confirmed that the number of strongly connected components (SCC) of graph indexes created by CAGRA is not 1 in some 100M scale datasets.

This problem can be alleviated by increasing the number of degrees in the search graph, but this would increase the size of the graph index. It is desirable to address this problem without increasing the number of degrees of the search graph.

Prior study has shown that this can be solved by using a Minimum Spanning Tree (MST)-like approach, but in general, MST calculation takes a long time. However, what is needed here is not an exact MST, but, for example, an approximate MST in which the total number of edges is not necessarily minimum. Such an approximate MST could be computed quickly on GPUs.

This PR contains implementation to create a approximate MST on the GPU at high speed based on the above policy and use it to guarantee the connectivity of the search graph.

This functionality is not always required, so it is considered an opt-in feature. A member variable named `guarantee_connectivity` is added to `index_params`, so set this variable to `true` if you wish to use this featgure.

> cuvs::neighbors::cagra::index_params index_params;
> index_params.guarantee_connectivity = true;
> auto index = cuvs::neighbors::cagra::build(res, index_params, dataset_view);

Authors:
  - Akira Naruse (https://github.com/anaruse)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#237
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp feature request New feature or request non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

3 participants