Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace csr_adj_graph functions with faster equivalent #746

Conversation

ahendriksen
Copy link
Contributor

@ahendriksen ahendriksen commented Jul 18, 2022

The csr_adj_graph functions are a performance bottleneck in the DBSCAN implementation in cuML. They are not used anywhere else.

This PR replaces the csr_adj_graph functions with the faster dense_bool_to_unsorted_csr function. It has the same functionality, but

  1. It requires the input adjacency matrix to be in row-major order (rather than column-major).
  2. The output column indices are not guaranteed to be in ascending order (hence unsorted).

@ahendriksen ahendriksen requested a review from a team as a code owner July 18, 2022 15:17
@github-actions github-actions bot added the cpp label Jul 18, 2022
@ahendriksen ahendriksen added 3 - Ready for Review improvement Improvement / enhancement to an existing function breaking Breaking change CMake labels Jul 18, 2022
@ahendriksen ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 99df1ef to 469cb13 Compare July 18, 2022 15:28
@ahendriksen ahendriksen requested a review from a team as a code owner July 18, 2022 15:28
@ahendriksen
Copy link
Contributor Author

Moving this functionality from cuML to RAFT as discussed in rapidsai/cuml#4803.

On an A10, this feature achieves +-90% of bandwidth

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
bench_base<int64_t>/0/manual_time       4.44 ms         5.43 ms          158 BW=450.363G/s BW read=225.182G/s BW write=225.182G/s Columns=1073.74M Fraction nz=12.5 Rows=1
bench_base<int64_t>/1/manual_time       4.59 ms         4.77 ms          151 BW=436.12G/s BW read=218.06G/s BW write=218.06G/s Columns=134.218M Fraction nz=12.5 Rows=8
bench_base<int64_t>/2/manual_time       4.51 ms         4.62 ms          150 BW=443.356G/s BW read=221.678G/s BW write=221.678G/s Columns=16.7772M Fraction nz=12.5 Rows=64
bench_base<int64_t>/3/manual_time       4.51 ms         4.61 ms          150 BW=443.751G/s BW read=221.875G/s BW write=221.875G/s Columns=32.768k Fraction nz=12.5 Rows=32.768k
bench_base<int64_t>/4/manual_time       2.43 ms         3.01 ms          279 BW=462.471G/s BW read=411.085G/s BW write=51.3857G/s Columns=1073.74M Fraction nz=1.5625 Rows=1
bench_base<int64_t>/5/manual_time       2.45 ms         2.57 ms          269 BW=459.154G/s BW read=408.137G/s BW write=51.0171G/s Columns=134.218M Fraction nz=1.5625 Rows=8
bench_base<int64_t>/6/manual_time       2.44 ms         2.53 ms          268 BW=460.197G/s BW read=409.064G/s BW write=51.133G/s Columns=16.7772M Fraction nz=1.5625 Rows=64
bench_base<int64_t>/7/manual_time       2.22 ms         2.75 ms          307 BW=451.95G/s BW read=450.192G/s BW write=1.75856G/s Columns=1073.74M Fraction nz=0.0488281 Rows=1
bench_base<int64_t>/8/manual_time       2.24 ms         2.35 ms          291 BW=448.73G/s BW read=446.984G/s BW write=1.74603G/s Columns=134.218M Fraction nz=0.0488281 Rows=8
bench_base<int64_t>/9/manual_time       2.26 ms         2.34 ms          288 BW=443.597G/s BW read=441.871G/s BW write=1.72606G/s Columns=16.7772M Fraction nz=0.0488281 Rows=64

Practical RW bandwidth is 492 GB/s on this machine.

Unrolling the loop in the kernel (#pragma unroll) resulted in exactly the same performance.
A smaller chunk size (8 bytes per loop iteration) resulted in degraded performance.

@ahendriksen
Copy link
Contributor Author

I have not removed the csr_row_op function. It is not used anywhere at this moment. It could be used for other purposes in the future. Shall I leave it in?

I have deduplicated the .hpp and .cuh header files in the sparse/convert directory tree in commit 469cb13. Is this the correct way to do it?

The csr_adj_graph functions are a performance bottleneck in the DBSCAN
implementation in cuML. They are not used anywhere else.

This commit replaces the csr_adj_graph functions with the
dense_bool_to_unsorted_csr function. It has the same
functionality, *but*

1. It requires the input adjacency matrix to be in row-major
   order (rather than column-major).
2. The output column indices are not guaranteed to be in ascending
   order (hence unsorted).
@ahendriksen ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 469cb13 to 598f77b Compare July 18, 2022 15:43
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Allard, thanks for the PR. In principle it looks good, but what if we go a step further?

Currently, the function that you are adding needs to receive the row_ind values. It would be more self contained if it only takes the boolean matrix, and calculates row_ind internally, and returns it together with the row_ind_ptr

Earlier these steps were separated, because the vertex degree was calculated in the epsilon neighborhood kernel. But I see that in rapidsai/cuml#4803 you have added an explicit vertex degree calculation. If that is necessary, then it would make sense to move all the steps of building the csr matrix here (vertex degree, scan, adj_to_csr). What is your opinion?

cpp/bench/sparse/convert_csr.cu Show resolved Hide resolved
cpp/bench/sparse/convert_csr.cu Outdated Show resolved Hide resolved
cpp/test/sparse/convert_csr.cu Outdated Show resolved Hide resolved
cpp/include/raft/sparse/convert/detail/dense_to_csr.cuh Outdated Show resolved Hide resolved
cpp/include/raft/sparse/convert/detail/dense_to_csr.cuh Outdated Show resolved Hide resolved
cpp/include/raft/sparse/convert/detail/dense_to_csr.cuh Outdated Show resolved Hide resolved
@ahendriksen
Copy link
Contributor Author

@tfeher I agree that calculating the vertex degrees internally would simplify the API. If we want to have a non-allocating API however, then it makes less sense to calculate the vertex degrees internally.

The caller has to pre-allocate out_col_ind, which requires knowing the number of nonzeros in the matrix. Thus, summing over the matrix is already done by the caller. In the tests and benchmarks, we gloss over this by pre-allocating out_col_ind with size num_rows x num_cols, which is probably not acceptable in production code.

Therefore, the function as it stands now strikes a decent balance between usability and not repeating calculations.

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ahendriksen for addressing the issues! The PR looks good to me.

You are right, about the allocations. In the future, if we see a need for it, we can still introduce an alternate version of adj_to_csr, that could do all the calculations and allocations internally. Right now, I believe we can keep the current form.

- Rename dense_bool_to_unsorted_csr to adj_to_csr
- Add grid-stride loops for test case generation (both bench and test)
- Remove overload

In addition:
- Add test case for empty input
- Fix behavior in case of empty input (return early)
@ahendriksen ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 07bdc85 to 4acee1a Compare July 21, 2022 08:06
@ahendriksen ahendriksen changed the title [Review] Replace csr_adj_graph functions with faster equivalent [ENH] Replace csr_adj_graph functions with faster equivalent Jul 21, 2022
@tfeher tfeher changed the title [ENH] Replace csr_adj_graph functions with faster equivalent Replace csr_adj_graph functions with faster equivalent Jul 21, 2022
@cjnolet
Copy link
Member

cjnolet commented Jul 21, 2022

rerun tests

@ahendriksen
Copy link
Contributor Author

@cjnolet Is this PR good to go now? If it's merged into 22.08 that would make finishing up the cuML follow up easier (rapidsai/cuml#4803).

@cjnolet
Copy link
Member

cjnolet commented Jul 22, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 362f91c into rapidsai:branch-22.08 Jul 22, 2022
@ahendriksen ahendriksen deleted the fea-replace-adj-to-csr-functionality branch March 17, 2023 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge breaking Breaking change CMake cpp improvement Improvement / enhancement to an existing function
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants