[FEA] Improve perf of DBScan csr_row_op kernel #2387
Comments
Definitely a +1 on this, @MatthiasKohl. It would be valuable to find a better general technique for this. As many of the sparse prims are ultimately going to be moved to / consolidated with cuGraph's prims in RAFT, it would be great to see this make use of coalescing and better parallelism.
I've been thinking about this a little more as I've been working on some similar kernels that need to process individual rows of a CSR. Even for cases where the data is extremely sparse, we could use the average degree as a heuristic to set the block size, map each row of the CSR to an individual block, and have the threads process the columns in parallel. This would enable warp-level reductions and coalescing as well and, if the average degree grows well beyond the max number of threads, would still enable multi-stage reductions.
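Not the cuML code, but a minimal sketch of the block-per-row layout described above, assuming a simple per-row sum as the work; the kernel name and the choice of a sum reduction are hypothetical:

```cuda
#include <cuda_runtime.h>

// One block per CSR row; threads stride over the row's entries so that
// consecutive threads touch consecutive indices (coalesced loads), then a
// shared-memory tree reduction combines the partial results.
template <int TPB>
__global__ void row_per_block_sum(const int* row_ptr,   // CSR offsets, length n_rows + 1
                                  const float* values,  // CSR values
                                  float* row_sums,      // output, length n_rows
                                  int n_rows)
{
  int row = blockIdx.x;
  if (row >= n_rows) return;  // whole block exits together, no divergence

  int row_start = row_ptr[row];
  int row_end   = row_ptr[row + 1];

  // Each thread accumulates a strided slice of the row.
  float acc = 0.f;
  for (int i = row_start + threadIdx.x; i < row_end; i += TPB) acc += values[i];

  // Block-wide tree reduction in shared memory (TPB is a power of two).
  __shared__ float smem[TPB];
  smem[threadIdx.x] = acc;
  __syncthreads();
  for (int offset = TPB / 2; offset > 0; offset >>= 1) {
    if (threadIdx.x < offset) smem[threadIdx.x] += smem[threadIdx.x + offset];
    __syncthreads();
  }
  if (threadIdx.x == 0) row_sums[row] = smem[0];
}

// Launch sketch: one block per row, e.g.
//   row_per_block_sum<128><<<n_rows, 128, 0, stream>>>(row_ptr, values, row_sums, n_rows);
```

If the average degree grows well beyond the block size, the same pattern extends to multiple blocks per row plus a second reduction pass, as the comment above notes.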
+1: in general, I like the idea of assigning rows to warps, as it helps with coalescing.
At some point, I wouldn't mind doing some experiments with different ways to parallelize CSR arrays and put the results into a paper. I'm sure there are better heuristics for optimal block size than just the avg degree; I just figure it could be a reasonable starting place to add some more parallelism.
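For concreteness, a host-side version of that starting-point heuristic might look like the following (the function name and the clamping range are assumptions, not anything from the codebase):

```cuda
#include <algorithm>

// Pick a block size for the row-per-block kernel from the average degree:
// round nnz / n_rows up to the next power of two, clamped to [32, 1024]
// (one warp minimum, CUDA's max threads per block as the ceiling).
inline int block_size_from_avg_degree(long long nnz, int n_rows)
{
  long long avg = std::max(1LL, nnz / std::max(n_rows, 1));
  int tpb = 32;
  while (tpb < avg && tpb < 1024) tpb <<= 1;
  return tpb;
}
```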
Yes, that makes sense; it's what I thought about too. That's OK if the array doesn't turn out to be that sparse, but if it does turn out to be very sparse (say 10k-20k elements total), then you'd pay the full launch latency twice for an overall computation that is very small. I don't have a better solution right now, though, other than maybe being able to suggest an average degree if it's already available, such as in DBScan...
While benchmarking, I made the same observation. I have a few comments regarding the discussion above, though.
Now, we want to fill the GPU while using good access patterns, so I'm thinking of the following approaches:
Thanks for looking into this @Nyrio.
So, if I recap your idea of combining both approaches, we would:
Notes:
Could we discuss this live at some point? I got confused with how rows and columns are actually defined in the implementation. In any case, I agree with what you're suggesting and I think it is the way to go; there are just some subtleties I wanted to make sure I understood correctly.
Sure, it's better to discuss that live. The definition of rows and columns is indeed quite confusing in the code.
This issue has been labeled |
This is not an inactive issue. This optimization is important for DBSCAN's performance and it is in my backlog. |
@ahendriksen is working on this |
Fixes issue #2387.

For large data sizes, the batch size of the DBSCAN algorithm is small in order to fit the distance matrix in memory. This results in a matrix that has dimensions num_points x batch_size, both for the distance and adjacency matrix. The conversion of the boolean adjacency matrix to CSR format is performed in the 'adjgraph' step. This step was slow when the batch size was small, as described in issue #2387.

In this commit, the adjgraph step is sped up. This is done in two ways:
1. The adjacency matrix is now stored in row-major batch_size x num_points format --- it was transposed before. This required changes in the vertexdeg step.
2. The csr_row_op kernel has been replaced by the adj_to_csr kernel. This kernel can divide the work over multiple blocks even when the number of rows (batch size) is small. It makes optimal use of memory bandwidth because rows of the matrix are laid out contiguously in memory.

Authors:
- Allard Hendriksen (https://github.com/ahendriksen)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Tamas Bela Feher (https://github.com/tfeher)

URL: #4803
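The actual adj_to_csr kernel lives in the linked PR; purely as an illustration of the idea in the commit message, here is a rough sketch (all names hypothetical) of converting a row-major boolean adjacency matrix to CSR column indices with many blocks per row and a per-row atomic fill counter:

```cuda
#include <cuda_runtime.h>

// 2D grid over a row-major n_rows x n_cols adjacency matrix: blockIdx.y
// selects the row, and blocks along x stride over that row's columns, so
// loads stay coalesced even when the number of rows (the batch size) is
// tiny. A per-row atomic counter hands out output slots within the row's
// CSR segment.
__global__ void adj_to_csr_sketch(const bool* adj,     // row-major, n_rows x n_cols
                                  const int* row_ptr,  // precomputed CSR offsets
                                  int* row_fill,       // per-row counters, zero-initialized
                                  int* col_ind,        // output CSR column indices
                                  int n_rows, int n_cols)
{
  int row = blockIdx.y;
  if (row >= n_rows) return;
  const bool* adj_row = adj + (size_t)row * n_cols;

  for (int col = blockIdx.x * blockDim.x + threadIdx.x; col < n_cols;
       col += gridDim.x * blockDim.x) {
    if (adj_row[col]) {
      int slot = atomicAdd(&row_fill[row], 1);
      col_ind[row_ptr[row] + slot] = col;
    }
  }
}

// Launch sketch: dim3 grid(32, n_rows); adj_to_csr_sketch<<<grid, 256>>>(...);
// Caveat: plain atomics leave each row's column indices unordered; the real
// kernel would need warp-aggregated allocation or a per-row sort if sorted
// CSR rows are required.
```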
Is your feature request related to a problem? Please describe.
Scaling DBScan to large sizes requires small batch sizes to avoid running out of memory. The `csr_row_op` kernel has quite bad performance at small batch sizes.

Details:
If we look at this kernel https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/src_prims/sparse/csr.cuh#L649-L670, we can see it will issue `B` threads (`B` is the batch size), each with `N` work (`N` is the total number of points). For large `N`, if we want to keep memory constant as it is limited, `B` decreases proportionally to `N`. For example, with ~2G memory, for `N~1e6`: `B~2k`, and for `N~2e6`: `B~1k`. For large GPUs, having `B` around ~1k means that we don't fill the GPU anymore, but the work per thread still increases (it is in `O(N)`, see the lambda in the link above).

Describe the solution you'd like
A simple solution would be to check `N/B` and, if it is large enough, switch to a different kernel (possibly one which just goes over the dense adjacency matrix and updates a counter `k` with atomic ops, which would also allow coalesced access to the dense matrix).
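As a sketch of that proposed fallback (hypothetical names; not what was eventually merged), a grid-stride kernel can walk the dense B x N adjacency matrix linearly, so loads coalesce regardless of how small B gets, bumping each row's counter `k` with atomics:

```cuda
#include <cuda_runtime.h>

// Grid-stride loop over all B * N entries of the row-major adjacency
// matrix; the launch size can be chosen to fill the GPU independently of B.
// Each set entry increments its row's counter k with an atomic add.
__global__ void count_adj_per_row(const bool* adj,  // row-major, B x N
                                  int* k,           // per-row counters, zero-initialized
                                  int B, long long N)
{
  long long total  = (long long)B * N;
  long long stride = (long long)gridDim.x * blockDim.x;
  for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
       i < total; i += stride) {
    if (adj[i]) atomicAdd(&k[(int)(i / N)], 1);
  }
}
```

The host side would then compare `N/B` against some threshold to choose between this kernel and the one-thread-per-row path.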