
[FEA] Improve perf of DBScan csr_row_op kernel #2387

Open
MatthiasKohl opened this issue Jun 9, 2020 · 13 comments
Labels
? - Needs Triage · feature request

Comments

@MatthiasKohl

Is your feature request related to a problem? Please describe.
Scaling DBScan to large sizes requires small batch sizes to avoid running out of memory, but the csr_row_op kernel performs poorly at small batch sizes.

Details:
If we look at this kernel https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/src_prims/sparse/csr.cuh#L649-L670, we can see that it issues B threads (where B is the batch size), each doing O(N) work (where N is the total number of points).

Since memory is limited, keeping memory use constant for large N means B decreases proportionally to N.
For example, with ~2 GB of memory: for N ~ 1e6, B ~ 2k; and for N ~ 2e6, B ~ 1k.

On large GPUs, a B of around 1k means we no longer fill the GPU, while the work per thread keeps growing as O(N) (see the lambda in the link above).

Describe the solution you'd like
A simple solution would be to check N/B and, if it is large enough, switch to a different kernel (possibly one that just iterates over the dense adjacency matrix and updates the counter k with atomic ops, which would also allow coalesced access to the dense matrix).
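To make the idea concrete, here is a CPU-side sketch of that alternative (names like `dense_adj_to_csr` are hypothetical, not cuML's API): many workers walk the dense B x N adjacency matrix and append each non-zero's column index through a per-row atomic counter k.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// CPU-side sketch (names hypothetical) of the proposed alternative: instead
// of one thread per row doing O(N) work, walk the dense B x N adjacency
// matrix and append each non-zero's column index through a per-row atomic
// counter k. row_ind is the CSR indptr (exclusive prefix of row degrees);
// row_ind_ptr receives the column indices.
void dense_adj_to_csr(const std::vector<bool>& adj, int B, int N,
                      const std::vector<int>& row_ind,
                      std::vector<int>& row_ind_ptr) {
  std::vector<std::atomic<int>> k(B);
  for (auto& c : k) c.store(0);
  // In a real kernel this loop nest would be flattened over all B*N entries
  // so the grid fills the GPU even when B is small.
  for (int row = 0; row < B; ++row) {
    for (int col = 0; col < N; ++col) {
      if (adj[(std::size_t)row * N + col]) {
        int off = k[row].fetch_add(1);          // reserve a slot in this row
        row_ind_ptr[row_ind[row] + off] = col;  // write the column index
      }
    }
  }
}
```

Note that on a GPU, atomics give no ordering guarantee, so the column indices within a row would not necessarily come out sorted.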

@MatthiasKohl MatthiasKohl added the ? - Needs Triage and feature request labels Jun 9, 2020
@cjnolet
Member

cjnolet commented Jun 10, 2020

Definitely a +1 on this, @MatthiasKohl. It would be valuable to find a better general technique for this. Since many of the sparse prims are ultimately going to be moved to / consolidated with cuGraph's prims in RAFT, it would be great to see this make use of coalescing and better parallelism.

@cjnolet
Member

cjnolet commented Sep 13, 2020

I’ve been thinking about this a little more as I’ve been working on some similar kernels that need to process individual rows of a CSR.

I’m thinking even for cases where the data is extremely sparse, we could use the average degree as a heuristic to set the block size and then map each row of the CSR to individual blocks and have the threads each process the columns in parallel. This would enable warp-level reductions and coalescing as well and, if the average degree grows well beyond the max number of threads, would still enable multi-stage reductions.

@MatthiasKohl
Author

+1: in general, I like the idea of assigning rows to warps, as it helps with coalescing.
There would only be one extreme case left where the number of rows is very small and average degree very large, but there we can use multi-stage reductions as you mentioned.
Also, how would we determine average degree? In DBScan, we have it available "for free", but if we need to launch a separate kernel, that might not always be worth it.

@cjnolet
Member

cjnolet commented Sep 14, 2020

Also, how would we determine average degree?

I'm thinking the average of the differences of the indptr array might be the fastest way to compute this.

At some point, I wouldn't mind doing some experiments with different ways to parallelize CSR arrays and put the results into a paper. I'm sure there's better heuristics for optimal block size than just the avg degree, I just figure it could be a reasonable starting place to add some more parallelism.

@MatthiasKohl
Author

Yes, that makes sense; it's what I thought about too.
But in that case, imagine that indptr has 1k-10k elements, which I think is realistic (at least for DBScan with a small batch size, it definitely is): the average-degree kernel then costs basically only its launch latency (for the reduction, every thread on the GPU processes only ~1 element).

That's OK if the array doesn't turn out to be that sparse, but if it does turn out to be very sparse (say 10k-20k elements total), then you'd pay the full launch latency twice for an overall computation that is very small.

I don't have a better solution right now, other than maybe letting the caller suggest an average degree when it's already available, such as in DBScan...

@Nyrio
Contributor

Nyrio commented Jan 11, 2021

While benchmarking, I made the same observation that csr_row_op is a massive bottleneck for large datasets.

I have a few comments regarding the discussion above, though.

  • Why are you talking about a kernel to compute the average degree? Isn't that just nnz / n_rows?
  • Regarding coalesced memory accesses: if I understand correctly, the current code is coalesced with respect to the dense matrix, which is a column-major BxN matrix (or a row-major NxB one, as stated by the docs of epsUnexpL2SqNeighborhood). The accessed location in adj is B * col + row, where row is the thread ID and col the inner loop variable.
  • So, continuing the previous point: out of the box, assigning rows to warps doesn't help with coalescing as stated above.

Now, we want to fill the GPU while using good access patterns, so I'm thinking of the following approaches:

  • As suggested by @MatthiasKohl: tiled kernel over the adjacency matrix with atomics to update k. Good scalability and coalesced reads in the dense matrix.
  • There is another approach that avoids atomics but involves more memory operations: a first pass is a scan of the rows of the dense matrix, and a second pass writes indices of the non-zero elements at the positions computed in the first pass.
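A sequential sketch of that second, atomic-free approach (names are illustrative, not the actual cuML API) may help: pass 1 counts each row's non-zeros, a scan turns the counts into offsets, and pass 2 writes the indices.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of the atomic-free two-pass variant: pass 1 counts each dense
// row's non-zeros and a scan of the counts yields the CSR row offsets;
// pass 2 writes each non-zero's column index at its scanned position.
void dense_to_csr_two_pass(const std::vector<bool>& adj, int n_rows,
                           int n_cols, std::vector<int>& indptr,
                           std::vector<int>& indices) {
  indptr.assign(n_rows + 1, 0);
  // Pass 1: per-row non-zero counts, stored shifted by one.
  for (int r = 0; r < n_rows; ++r)
    for (int c = 0; c < n_cols; ++c)
      indptr[r + 1] += adj[(std::size_t)r * n_cols + c];
  // Scan turns counts into offsets (indptr[r] = start of row r's indices).
  std::partial_sum(indptr.begin(), indptr.end(), indptr.begin());
  indices.assign(indptr[n_rows], 0);
  // Pass 2: write column indices at the scanned positions.
  for (int r = 0; r < n_rows; ++r) {
    int pos = indptr[r];
    for (int c = 0; c < n_cols; ++c)
      if (adj[(std::size_t)r * n_cols + c]) indices[pos++] = c;
  }
}
```

The extra memory traffic is the second full read of the dense matrix, which is the trade-off discussed below.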

@MatthiasKohl
Author

MatthiasKohl commented Jan 11, 2021

Thanks for looking into this @Nyrio.

  • Why a separate kernel: you are right, nnz / n_rows is the average degree. I believe at the time I was thinking about a more generic case where we don't have nnz, but on second thought, you can always get it from the last element of the row_ind array.
  • You are right that the access to adj is coalesced but the access to row_ind_ptr is not. If you assign a warp to each row, for example, you can get coalescing there too, at least for all threads participating in the write op.
  • If I understand your second approach correctly, I agree that it could work. However, the kernel is of course memory-bound, so adding more memory ops is probably not going to pay off unless you can do them in a lower-level cache or gain some other large benefit (better coalescing or vectorized accesses). I think this could be interesting if done at a warp-wide / block-wide level with the scan result kept in shared memory, but for large degrees this could be problematic. In any case, I can see a combination of this (at a lower level) with the atomics (at a higher level) working well. Do you think this makes sense?

@Nyrio
Contributor

Nyrio commented Jan 12, 2021

So, if I recap your idea of combining both approaches, we would:

  • Tile the dense matrix in 2D.
  • Scan the tile in shared memory.
  • For each row, atomically add to k the number of nnz values found in the scan.
  • For each nnz element, write its id in the neighbor list at position old_k + offset, as computed in the scan.

Notes:

  • 32 rows for coalesced reads, and then we can play with the number of threads per block and the work per thread to use the most efficient tile size.
  • Though it is possible to have coalesced writes to the sparse matrix (by buffering in shared mem), one problem is that you don't know in advance how much shared memory you'd need for that, so it doubles the shared memory usage. We can start with uncoalesced writes to the sparse matrix, since this cost should be negligible if the matrix is sparse enough.
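A CPU emulation of the recapped scheme, to pin down the semantics (names hypothetical; the `local` vector stands in for the shared-memory scan, and each column tile plays the role of a block):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// CPU emulation of the combined tiled scheme: within a column tile, each
// row's non-zeros are scanned to local offsets (shared memory on the GPU),
// one atomic add per row reserves a contiguous output range starting at
// old_k, and the tile's column indices are written at old_k + offset.
void tiled_adj_to_csr(const std::vector<bool>& adj, int n_rows, int n_cols,
                      int tile_w, const std::vector<int>& indptr,
                      std::vector<int>& indices) {
  std::vector<std::atomic<int>> k(n_rows);
  for (auto& c : k) c.store(0);
  for (int t0 = 0; t0 < n_cols; t0 += tile_w) {  // one "block" per tile
    int t1 = std::min(t0 + tile_w, n_cols);
    for (int r = 0; r < n_rows; ++r) {
      std::vector<int> local;  // stands in for the shared-memory scan result
      for (int c = t0; c < t1; ++c)
        if (adj[(std::size_t)r * n_cols + c]) local.push_back(c);
      int old_k = k[r].fetch_add((int)local.size());  // one atomic per row/tile
      for (std::size_t i = 0; i < local.size(); ++i)
        indices[indptr[r] + old_k + (int)i] = local[i];
    }
  }
}
```

On the GPU, tiles of the same row may complete in any order, so a row's indices are grouped by tile but not necessarily sorted overall.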

@MatthiasKohl
Author

Could we discuss this live at some point? I got confused with how rows and columns are actually defined in the implementation.

In any case, I agree with what you're suggesting, and I think it is the way to go; there are just some subtleties I wanted to make sure I understood correctly.

@Nyrio
Contributor

Nyrio commented Jan 12, 2021

Sure, it's better to discuss that live. The definition of rows and columns is quite confusing in the code indeed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@Nyrio
Contributor

Nyrio commented Mar 8, 2021

This is not an inactive issue. This optimization is important for DBSCAN's performance and it is in my backlog.

@tfeher
Contributor

tfeher commented Jun 15, 2022

@ahendriksen is working on this

ahendriksen added a commit to ahendriksen/cuml that referenced this issue Jul 6, 2022
For large data sizes, the batch size of the DBSCAN algorithm is small in
order to fit the distance matrix in memory.

This results in a matrix that has dimensions num_points x batch_size,
both for the distance and adjacency matrix.

The conversion of the boolean adjacency matrix to CSR format is
performed in the 'adjgraph' step. This step was slow when the batch size
was small, as described in issue rapidsai#2387.

In this commit, the adjgraph step is sped up. This is done in two ways:

1. The adjacency matrix is now stored in row-major batch_size x
   num_points format --- it was transposed before. This required changes
   in the vertexdeg step.

2. The csr_row_op kernel has been replaced by the adj_to_csr kernel.
   This kernel can divide the work over multiple blocks even when the
   number of rows (batch size) is small. It makes optimal use of memory
   bandwidth because rows of the matrix are laid out contiguously in
   memory.
ahendriksen added a commit to ahendriksen/cuml that referenced this issue Jul 8, 2022
ahendriksen added a commit to ahendriksen/cuml that referenced this issue Jul 12, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 22, 2022
Fixes issue #2387.

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #4803
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023