Replace csr_adj_graph functions with faster equivalent #746

ahendriksen · 2022-07-18T15:17:32Z

The csr_adj_graph functions are a performance bottleneck in the DBSCAN implementation in cuML. They are not used anywhere else.

This PR replaces the csr_adj_graph functions with the faster dense_bool_to_unsorted_csr function. It has the same functionality, but

It requires the input adjacency matrix to be in row-major order (rather than column-major).
The output column indices are not guaranteed to be in ascending order (hence unsorted).

ahendriksen · 2022-07-18T15:29:19Z

Moving this functionality from cuML to RAFT as discussed in rapidsai/cuml#4803.

On an A10, this feature achieves +-90% of bandwidth

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
bench_base<int64_t>/0/manual_time       4.44 ms         5.43 ms          158 BW=450.363G/s BW read=225.182G/s BW write=225.182G/s Columns=1073.74M Fraction nz=12.5 Rows=1
bench_base<int64_t>/1/manual_time       4.59 ms         4.77 ms          151 BW=436.12G/s BW read=218.06G/s BW write=218.06G/s Columns=134.218M Fraction nz=12.5 Rows=8
bench_base<int64_t>/2/manual_time       4.51 ms         4.62 ms          150 BW=443.356G/s BW read=221.678G/s BW write=221.678G/s Columns=16.7772M Fraction nz=12.5 Rows=64
bench_base<int64_t>/3/manual_time       4.51 ms         4.61 ms          150 BW=443.751G/s BW read=221.875G/s BW write=221.875G/s Columns=32.768k Fraction nz=12.5 Rows=32.768k
bench_base<int64_t>/4/manual_time       2.43 ms         3.01 ms          279 BW=462.471G/s BW read=411.085G/s BW write=51.3857G/s Columns=1073.74M Fraction nz=1.5625 Rows=1
bench_base<int64_t>/5/manual_time       2.45 ms         2.57 ms          269 BW=459.154G/s BW read=408.137G/s BW write=51.0171G/s Columns=134.218M Fraction nz=1.5625 Rows=8
bench_base<int64_t>/6/manual_time       2.44 ms         2.53 ms          268 BW=460.197G/s BW read=409.064G/s BW write=51.133G/s Columns=16.7772M Fraction nz=1.5625 Rows=64
bench_base<int64_t>/7/manual_time       2.22 ms         2.75 ms          307 BW=451.95G/s BW read=450.192G/s BW write=1.75856G/s Columns=1073.74M Fraction nz=0.0488281 Rows=1
bench_base<int64_t>/8/manual_time       2.24 ms         2.35 ms          291 BW=448.73G/s BW read=446.984G/s BW write=1.74603G/s Columns=134.218M Fraction nz=0.0488281 Rows=8
bench_base<int64_t>/9/manual_time       2.26 ms         2.34 ms          288 BW=443.597G/s BW read=441.871G/s BW write=1.72606G/s Columns=16.7772M Fraction nz=0.0488281 Rows=64

Practical RW bandwidth is 492 GB/s on this machine.

Unrolling the loop in the kernel (#pragma unroll) resulted in exactly the same performance.
A smaller chunk size (8 bytes per loop iteration) resulted in degraded performance.

ahendriksen · 2022-07-18T15:31:58Z

I have not removed the csr_row_op function. It is not used anywhere at this moment. It could be used for other purposes in the future. Shall I leave it in?

I have deduplicated the .hpp and .cuh header files in the sparse/convert directory tree in commit 469cb13. Is this the correct way to do it?

The csr_adj_graph functions are a performance bottleneck in the DBSCAN implementation in cuML. They are not used anywhere else. This commit replaces the csr_adj_graph functions with the dense_bool_to_unsorted_csr function. It has the same functionality, *but* 1. It requires the input adjacency matrix to be in row-major order (rather than column-major). 2. The output column indices are not guaranteed to be in ascending order (hence unsorted).

tfeher

Hi Allard, thanks for the PR. In principle it looks good, but what if we go a step further?

Currently, the function that you are adding needs to receive the row_ind values. It would be more self contained if it only takes the boolean matrix, and calculates row_ind internally, and returns it together with the row_ind_ptr

Earlier these steps were separated, because the vertex degree was calculated in the epsilon neighborhood kernel. But I see that in rapidsai/cuml#4803 you have added an explicit vertex degree calculation. If that is necessary, then it would make sense to move all the steps of building the csr matrix here (vertex degree, scan, adj_to_csr). What is your opinion?

cpp/bench/sparse/convert_csr.cu

cpp/test/sparse/convert_csr.cu

cpp/include/raft/sparse/convert/detail/dense_to_csr.cuh

ahendriksen · 2022-07-20T09:54:55Z

@tfeher I agree that calculating the vertex degrees internally would simplify the API. If we want to have a non-allocating API however, then it makes less sense to calculate the vertex degrees internally.

The caller has to pre-allocate out_col_ind, which requires knowing the number of nonzeros in the matrix. Thus, summing over the matrix is already done by the caller. In the tests and benchmarks, we gloss over this by pre-allocating out_col_ind with size num_rows x num_cols, which is probably not acceptable in production code.

Therefore, the function as it stands now strikes a decent balance between usability and not repeating calculations.

tfeher

Thanks @ahendriksen for addressing the issues! The PR looks good to me.

You are right, about the allocations. In the future, if we see a need for it, we can still introduce an alternate version of adj_to_csr, that could do all the calculations and allocations internally. Right now, I believe we can keep the current form.

- Rename dense_bool_to_unsorted_csr to adj_to_csr - Add grid-stride loops for test case generation (both bench and test) - Remove overload In addition: - Add test case for empty input - Fix behavior in case of empty input (return early)

cjnolet · 2022-07-21T20:32:34Z

rerun tests

ahendriksen · 2022-07-22T14:39:10Z

@cjnolet Is this PR good to go now? If it's merged into 22.08 that would make finishing up the cuML follow up easier (rapidsai/cuml#4803).

cjnolet · 2022-07-22T14:45:58Z

@gpucibot merge

ahendriksen requested a review from a team as a code owner July 18, 2022 15:17

ahendriksen assigned cjnolet and tfeher Jul 18, 2022

github-actions bot added the cpp label Jul 18, 2022

ahendriksen added 3 - Ready for Review improvement Improvement / enhancement to an existing function breaking Breaking change CMake labels Jul 18, 2022

ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 99df1ef to 469cb13 Compare July 18, 2022 15:28

ahendriksen requested a review from a team as a code owner July 18, 2022 15:28

ahendriksen added 2 commits July 18, 2022 17:43

sparse: Deprecate .hpp headers

598f77b

ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 469cb13 to 598f77b Compare July 18, 2022 15:43

tfeher requested changes Jul 19, 2022

View reviewed changes

tfeher approved these changes Jul 20, 2022

View reviewed changes

sparse: Implement review feedback

4acee1a

- Rename dense_bool_to_unsorted_csr to adj_to_csr - Add grid-stride loops for test case generation (both bench and test) - Remove overload In addition: - Add test case for empty input - Fix behavior in case of empty input (return early)

ahendriksen force-pushed the fea-replace-adj-to-csr-functionality branch from 07bdc85 to 4acee1a Compare July 21, 2022 08:06

ahendriksen changed the title ~~[Review] Replace csr_adj_graph functions with faster equivalent~~ [ENH] Replace csr_adj_graph functions with faster equivalent Jul 21, 2022

ahendriksen added 5 - Ready to Merge and removed 3 - Ready for Review labels Jul 21, 2022

tfeher changed the title ~~[ENH] Replace csr_adj_graph functions with faster equivalent~~ Replace csr_adj_graph functions with faster equivalent Jul 21, 2022

rapids-bot bot merged commit 362f91c into rapidsai:branch-22.08 Jul 22, 2022

ahendriksen deleted the fea-replace-adj-to-csr-functionality branch March 17, 2023 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace csr_adj_graph functions with faster equivalent #746

Replace csr_adj_graph functions with faster equivalent #746

ahendriksen commented Jul 18, 2022 •

edited

Loading

ahendriksen commented Jul 18, 2022

ahendriksen commented Jul 18, 2022

tfeher left a comment

ahendriksen commented Jul 20, 2022

tfeher left a comment

cjnolet commented Jul 21, 2022

ahendriksen commented Jul 22, 2022

cjnolet commented Jul 22, 2022

Replace csr_adj_graph functions with faster equivalent #746

Replace csr_adj_graph functions with faster equivalent #746

Conversation

ahendriksen commented Jul 18, 2022 • edited Loading

ahendriksen commented Jul 18, 2022

ahendriksen commented Jul 18, 2022

tfeher left a comment

Choose a reason for hiding this comment

ahendriksen commented Jul 20, 2022

tfeher left a comment

Choose a reason for hiding this comment

cjnolet commented Jul 21, 2022

ahendriksen commented Jul 22, 2022

cjnolet commented Jul 22, 2022

ahendriksen commented Jul 18, 2022 •

edited

Loading