[BUG] Fix a bug in NN descent #1869

enp1s0 · 2023-10-03T07:46:28Z

I got errors when compiling a program using raft NN-descent. This PR fixes the bug.

Error

e.g.

/home/.../include/raft/neighbors/detail/nn_descent.cuh(1158): error: invalid narrowing conversion from "unsigned long" to "int"
      h_rev_graph_old_{static_cast<size_t>(nrow_ * NUM_SAMPLES)},

nvcc

Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

gcc

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

thrust : 2.2 (This is the cause of this error [detail])

Change

Use Class(...) instead of Class{...}.

Cause

The NN-descent code calls constructors of thrust::host_vector as shown below:

graph_host_buffer_{static_cast<size_t>(nrow_ * DEGREE_ON_DEVICE)},

However, this constructor is regarded as a list initialization.
This is the same as the following code outputting 1 instead of 2.

#include <iostream>
#include <vector>

int main() {
  std::vector<float> list{2};

  std::cout << list.size() << std::endl;
}

detail

copy-pr-bot · 2023-10-03T07:46:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

enp1s0 · 2023-10-03T07:51:16Z

Thank you, @divyegala and @RayWang96, for porting NN-descent to raft. Is this modification correct?

RayWang96 · 2023-10-03T08:17:53Z

I believe you are right. And I'm curious why it can pass the unit test with this bug.

cjnolet · 2023-10-03T13:51:58Z

/ok to test

divyegala · 2023-10-04T01:35:48Z

@enp1s0 thanks for this PR. Could you post your NVCC and GCC versions in the description of the PR for hindsight? It seems like RAFT's CI NVCC+GCC are able to compile fine, so we should make note of what causes this compilation failure

divyegala

raft::device_* variants of mdarray do not have an initializer-list based constructor, so I prefer to keep brace initialization where possible

divyegala · 2023-10-04T01:36:17Z

cpp/include/raft/neighbors/detail/nn_descent.cuh

-    h_graph_new{nrow * num_samples},
-    h_list_sizes_new{nrow},
-    h_graph_old{nrow * num_samples},
+    h_dists(raft::make_host_matrix<DistData_t, size_t, raft::row_major>(nrow, node_degree)),


This can keep brace initializer

divyegala · 2023-10-04T01:36:31Z

cpp/include/raft/neighbors/detail/nn_descent.cuh

+    d_data_(raft::make_device_matrix<__half, Index_t, raft::row_major>(
+      res, nrow_, build_config.dataset_dim)),
+    l2_norms_(raft::make_device_vector<DistData_t, Index_t>(res, nrow_)),
+    graph_buffer_(
+      raft::make_device_matrix<ID_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),
+    dists_buffer_(
+      raft::make_device_matrix<DistData_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),


Same as above

divyegala · 2023-10-04T01:36:56Z

cpp/include/raft/neighbors/detail/nn_descent.cuh

+      raft::make_device_matrix<DistData_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),
+    graph_host_buffer_(nrow_ * DEGREE_ON_DEVICE),
+    dists_host_buffer_(nrow_ * DEGREE_ON_DEVICE),
+    d_locks_(raft::make_device_vector<int, Index_t>(res, nrow_)),


Same as above

divyegala · 2023-10-04T01:37:04Z

cpp/include/raft/neighbors/detail/nn_descent.cuh

+    d_list_sizes_new_(raft::make_device_vector<int2, Index_t>(res, nrow_)),
+    d_list_sizes_old_(raft::make_device_vector<int2, Index_t>(res, nrow_))


Same as above

enp1s0 · 2023-10-04T04:36:51Z

Thank you, @divyegala, @RayWang96, and @cjnolet, for reviewing the PR. I have modified the code based on the comments from @divyegala.

The compilation of the unit test passes because it uses int64_t as the index type, while I use int32_t in my program. The error is invalid narrowing conversion from "unsigned long" to "int" and "int" here is the index type, int32_t. In the case of the unit test, no narrowing conversion occurs because the index type is int64_t. But I'm not sure why the unit test works well while the memory is not allocated correctly. Let me look into this a little more. And I think this PR can be merged without waiting for the investigation.

cjnolet · 2023-10-04T04:39:07Z

Thanks @enp1s0. Release 23.10 code freeze has passed, though, and the branch is being prepped for release. Can you base your branch on the 23.12? I'm in the process of fixing some merge conflicts to our forward merger, but we can get your changes merged shortly after.

divyegala · 2023-10-04T04:45:27Z

@enp1s0 I think it's quite likely that this issue is a result of your nvcc version. NN-Descent hard-codes internal index types to int32 so whether you use int64 or int32 index types for CAGRA it should not matter.

Are you saying that the original code compiles with int64 index type for CAGRA with your nvcc version?

enp1s0 · 2023-10-04T10:59:01Z

Thank you, @divyegala, for your comment. I have found that the difference in the thrust versions is causing the compilation error. Raft uses thrust 1.17.2 while my program uses thrust 2.2, which is included in the CUDA 12.2 toolkit. And the list initializer was introduced in thrust 2.1 [release note].

So, this is not a bug. Is it better to close this PR or merge it for the future thrust version update used in raft? (cc: @cjnolet)

divyegala · 2023-10-04T12:01:48Z

@enp1s0 thanks for the investigations! We should merge this PR. Can you also add the thrust version to the description of this PR?

divyegala · 2023-10-04T12:02:20Z

/ok to test

cjnolet · 2023-10-06T00:25:12Z

/ok to test

cjnolet · 2023-10-06T02:59:16Z

/merge

I got errors when compiling a program using raft NN-descent. This PR fixes the bug. ## Error e.g. ``` /home/.../include/raft/neighbors/detail/nn_descent.cuh(1158): error: invalid narrowing conversion from "unsigned long" to "int" h_rev_graph_old_{static_cast<size_t>(nrow_ * NUM_SAMPLES)}, ``` - nvcc ``` Built on Tue_Aug_15_22:02:13_PDT_2023 Cuda compilation tools, release 12.2, V12.2.140 Build cuda_12.2.r12.2/compiler.33191640_0 ``` - gcc ``` gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 ``` - thrust : 2.2 (This is the cause of this error [[detail](rapidsai#1869 (comment))]) # Change Use `Class(...)` instead of `Class{...}`. # Cause The NN-descent code calls constructors of `thrust::host_vector` as shown below: ```cpp graph_host_buffer_{static_cast<size_t>(nrow_ * DEGREE_ON_DEVICE)}, ``` However, this constructor is regarded as a list initialization. This is the same as the following code outputting 1 instead of 2. ```cpp #include <iostream> #include <vector> int main() { std::vector<float> list{2}; std::cout << list.size() << std::endl; } ``` [detail](https://en.cppreference.com/w/cpp/language/list_initialization) Authors: - tsuki (https://github.com/enp1s0) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Divye Gala (https://github.com/divyegala) URL: rapidsai#1869

enp1s0 requested a review from a team as a code owner October 3, 2023 07:46

github-actions bot added the cpp label Oct 3, 2023

enp1s0 added bug Something isn't working non-breaking Non-breaking change labels Oct 3, 2023

enp1s0 self-assigned this Oct 3, 2023

cjnolet approved these changes Oct 3, 2023

View reviewed changes

divyegala reviewed Oct 4, 2023

View reviewed changes

cjnolet changed the base branch from branch-23.10 to branch-23.12 October 4, 2023 04:44

cjnolet requested review from a team as code owners October 4, 2023 04:44

Fix braket

5ee1976

enp1s0 force-pushed the fix-nn_desccent-constructor-bug branch from 259e5a4 to 5ee1976 Compare October 4, 2023 06:57

divyegala approved these changes Oct 4, 2023

View reviewed changes

Merge branch 'branch-23.12' into fix-nn_desccent-constructor-bug

9134bdf

rapids-bot bot merged commit d9fde97 into rapidsai:branch-23.12 Oct 6, 2023

enp1s0 deleted the fix-nn_desccent-constructor-bug branch October 6, 2023 03:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Fix a bug in NN descent #1869

[BUG] Fix a bug in NN descent #1869

enp1s0 commented Oct 3, 2023 •

edited

Loading

copy-pr-bot bot commented Oct 3, 2023

enp1s0 commented Oct 3, 2023

RayWang96 commented Oct 3, 2023

cjnolet commented Oct 3, 2023

divyegala commented Oct 4, 2023

divyegala left a comment

divyegala Oct 4, 2023

divyegala Oct 4, 2023

divyegala Oct 4, 2023

divyegala Oct 4, 2023

enp1s0 commented Oct 4, 2023

cjnolet commented Oct 4, 2023

divyegala commented Oct 4, 2023

enp1s0 commented Oct 4, 2023 •

edited

Loading

divyegala commented Oct 4, 2023

divyegala commented Oct 4, 2023

cjnolet commented Oct 6, 2023

cjnolet commented Oct 6, 2023

		d_list_sizes_new_(raft::make_device_vector<int2, Index_t>(res, nrow_)),
		d_list_sizes_old_(raft::make_device_vector<int2, Index_t>(res, nrow_))

[BUG] Fix a bug in NN descent #1869

[BUG] Fix a bug in NN descent #1869

Conversation

enp1s0 commented Oct 3, 2023 • edited Loading

Error

Change

Cause

copy-pr-bot bot commented Oct 3, 2023

enp1s0 commented Oct 3, 2023

RayWang96 commented Oct 3, 2023

cjnolet commented Oct 3, 2023

divyegala commented Oct 4, 2023

divyegala left a comment

Choose a reason for hiding this comment

divyegala Oct 4, 2023

Choose a reason for hiding this comment

divyegala Oct 4, 2023

Choose a reason for hiding this comment

divyegala Oct 4, 2023

Choose a reason for hiding this comment

divyegala Oct 4, 2023

Choose a reason for hiding this comment

enp1s0 commented Oct 4, 2023

cjnolet commented Oct 4, 2023

divyegala commented Oct 4, 2023

enp1s0 commented Oct 4, 2023 • edited Loading

divyegala commented Oct 4, 2023

divyegala commented Oct 4, 2023

cjnolet commented Oct 6, 2023

cjnolet commented Oct 6, 2023

enp1s0 commented Oct 3, 2023 •

edited

Loading

enp1s0 commented Oct 4, 2023 •

edited

Loading