Rewriting row/column conversions for Spark <-> cudf data conversions #8444
Conversation
Note that I copied a file from the Java side to add my changes so I could benchmark it. This will not live here at the end. This makes the entire file look new, but in reality the only new things are related to the kernel.
@hyperbolic2346 can you give this a more specific title?
Also please use draft PRs rather than "[WIP]" to reduce reviewer notification noise.
Updated to fix corner cases found. Current benchmarks are not great. The kernel performance according to nsight is actually on par with or better than the existing case until the sizes get large and performance falls off a cliff. Unsure yet what is going on there, but it could be a worst-case memory access pattern. Investigation is ongoing. There is a pile of work done before the kernel launch to produce some fairly large data arrays with things like row sizes. The original row conversion didn't require this work since it didn't support variable-width data. Each row can be a different size now, so that information must be created and passed to the kernel. For very small tables this overpowers the real work. A potential optimization is to check for variable-width data and only calculate and send the row sizes if variable-width data actually exists.
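For what it's worth, that check itself is cheap. A minimal sketch, assuming a `cudf::table_view` and the existing `cudf::is_fixed_width` trait (the helper name is illustrative, not from this PR):

```cpp
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/traits.hpp>

// Hypothetical helper: true if any column in the table is variable width.
bool has_variable_width_columns(cudf::table_view const& tbl)
{
  for (auto const& col : tbl) {
    if (not cudf::is_fixed_width(col.type())) { return true; }
  }
  return false;
}
```

If it returns false, every row is the same size and the per-row size arrays could be skipped entirely.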
More work is needed for validity. An interesting idea came up from Bobby, who pointed out that the validity data is just another table to copy. The window sizes may need to be limited to line up the validity bits with byte boundaries, but this should be pursued for sure. Speaking of window sizes, the window is currently arbitrarily sized to 1024 rows and then as many columns as will fit into shared memory. Thinking about this more, I believe it would be best to have a "square" window. I put that in quotes because it isn't the same number of rows and columns, but instead the same number of bytes in each direction. This is another potential optimization on the horizon.
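A rough sketch of what "square in bytes" could mean in practice; the sizing heuristic and all names here are illustrative, not from this PR:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

struct window_size {
  std::size_t num_rows;   // rows covered by the window
  std::size_t row_bytes;  // bytes of each row covered by the window
};

// Pick a window that is roughly the same number of bytes in each direction,
// bounded by the shared memory available to the block.
window_size pick_square_window(std::size_t shmem_bytes, std::size_t total_row_bytes)
{
  auto const side      = static_cast<std::size_t>(std::sqrt(static_cast<double>(shmem_bytes)));
  auto const row_bytes = std::max<std::size_t>(1, std::min(side, total_row_bytes));
  auto const num_rows  = std::max<std::size_t>(1, shmem_bytes / row_bytes);
  return {num_rows, row_bytes};
}
```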
Squared up the incoming windows for the GPU kernel operations. This made large improvements in throughput, as there was more data to write out per row. Found out that there is an issue with shared-memory writes striding in such a way that they produce bank conflicts. Need to think about how to get around that while still maintaining 8-byte writes out of shared memory.
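One common workaround for this kind of conflict is to pad each row of the shared-memory tile by one element. A minimal single-tile illustration (not this PR's kernel, and using 4-byte elements; the padding amount would need tuning for the 8-byte writes discussed here):

```cuda
constexpr int TILE_DIM = 32;

// Launch with a dim3(TILE_DIM, TILE_DIM) block; handles a single 32x32 tile.
__global__ void transpose_tile(int const* in, int* out)
{
  // The "+ 1" offsets each row by one bank, so a warp reading down a tile
  // column hits 32 different banks instead of the same bank 32 times.
  __shared__ int tile[TILE_DIM][TILE_DIM + 1];

  int const x = threadIdx.x;  // varies fastest within a warp
  int const y = threadIdx.y;

  tile[y][x] = in[y * TILE_DIM + x];   // coalesced global read, conflict-free shared write
  __syncthreads();
  out[y * TILE_DIM + x] = tile[x][y];  // transposed shared read stays conflict-free
}
```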
// Because shared memory is limited we copy a subset of the rows at a time.
// For simplicity we will refer to this as a row_group
This would be a perfect algorithm for memcpy_async, setting up a 2-stage pipeline that copies data into shared memory asynchronously while the other stage is being written out. See https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
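For reference, a minimal sketch of that two-stage pattern using `cooperative_groups::memcpy_async`; the tile size, names, and the trivial "processing" step are illustrative, not taken from this PR:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE_BYTES = 4096;

__global__ void double_buffered_copy(char const* in, char* out, int num_tiles)
{
  __shared__ char tiles[2][TILE_BYTES];
  auto block = cg::this_thread_block();

  // Prime the pipeline with the first tile.
  cg::memcpy_async(block, tiles[0], in, TILE_BYTES);

  for (int t = 0; t < num_tiles; ++t) {
    if (t + 1 < num_tiles) {
      // Kick off the copy of the next tile before waiting on the current one.
      cg::memcpy_async(block, tiles[(t + 1) % 2], in + (t + 1) * TILE_BYTES, TILE_BYTES);
      cg::wait_prior<1>(block);  // wait for tile t only
    } else {
      cg::wait(block);           // last tile: wait for everything outstanding
    }

    // "Process" the current tile (here just a copy back out).
    for (int i = block.thread_rank(); i < TILE_BYTES; i += block.size()) {
      out[t * TILE_BYTES + i] = tiles[t % 2][i];
    }
    block.sync();
  }
}
```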
// In practice we have found writing more than 4 columns of data per thread
// results in performance loss. As such we are using a 2 dimensional
// kernel in terms of threads, but not in terms of blocks. Columns are
// controlled by the y dimension (there is no y dimension in blocks). Rows
// are controlled by the x dimension (there are multiple blocks in the x
// dimension).
You can just say you're using a 1D grid of 2D blocks.
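In other words, the launch shape looks roughly like this (the dimensions, kernel name, and helper are placeholders, not this PR's values):

```cuda
#include <cuda_runtime.h>

__global__ void copy_to_rows_kernel(int num_rows) { /* kernel body elided in this sketch */ }

void launch_copy_to_rows(int num_rows, size_t shmem_bytes, cudaStream_t stream)
{
  // 2D block: x indexes rows, y indexes the small group of columns per thread.
  dim3 threads(64, 4);
  // 1D grid: only the x dimension has more than one block.
  dim3 blocks((num_rows + threads.x - 1) / threads.x);
  copy_to_rows_kernel<<<blocks, threads, shmem_bytes, stream>>>(num_rows);
}
```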
Performance issues with to_column are almost entirely in validity calculations. The current method grabs a bit from each row and builds up a 32-bit column of validity data, but there is a lot of duplicated fetching of row offsets and validity data since we're only using a single bit out of each fetch. Cache alone isn't enough to save us here. Will change this to read a byte of validity and work on 8 columns per thread to coalesce the reads better.
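A simplified sketch of that access pattern; names are illustrative, and the output is left unpacked (one byte per bit), whereas the real kernel would assemble 32-bit validity words:

```cuda
#include <cstddef>
#include <cstdint>

// Each thread reads one byte of a row's validity area and fans the eight bits
// out to eight columns, instead of fetching the row once per single bit.
__global__ void read_validity_bytes(uint8_t const* row_validity,  // validity area of each row
                                    size_t validity_bytes_per_row,
                                    size_t num_rows,
                                    size_t num_columns,
                                    uint8_t* column_valid)         // [column][row], unpacked
{
  size_t const row  = blockIdx.x * blockDim.x + threadIdx.x;
  size_t const byte = blockIdx.y;  // which group of 8 columns this thread covers
  if (row >= num_rows) { return; }

  uint8_t const bits = row_validity[row * validity_bytes_per_row + byte];
  for (int bit = 0; bit < 8; ++bit) {
    size_t const column = byte * 8 + bit;
    if (column < num_columns) { column_valid[column * num_rows + row] = (bits >> bit) & 1; }
  }
}
```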
Performance has improved on the row-to-column front. The new code is faster once the table size gets over 1 million rows in the benchmark data set. Working on
Moving to 21.10
Sorry for the large diff here, but lots of things shuffled and were renamed when I removed the block nomenclature, which was a great idea.
column_sizes.reserve(num_columns);
column_starts.reserve(num_columns + 1);  // we add a final offset for validity data start

auto schema_column_iter =
This could alternatively have been a zip_iterator, avoiding a counting iterator.
I like this idea, but we need to reach into the table column with the call to `tbl.column(i).type()`. Is there a way to make this work that I don't understand yet? The only way I can think of is to make a counting transform iterator for the type, which somewhat defeats the goal of fewer iterators.
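For reference, the counting-transform-iterator version being discussed would look roughly like this (a sketch, not the code that was merged):

```cpp
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cudf/table/table_view.hpp>

void example(cudf::table_view const& tbl)
{
  // The counting iterator supplies the column index; the transform reaches into
  // the table for that column's type, which is what a plain zip_iterator can't do.
  auto schema_column_iter = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [&tbl](int i) { return tbl.column(i).type(); });

  auto first_type = *schema_column_iter;  // type of column 0
  (void)first_type;
}
```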
Co-authored-by: MithunR <[email protected]>
I'm 👍. Thank you for taking the time to go over this with me, @hyperbolic2346. There is a minor sticking point regarding the `constant_iterator` comment. We might explore this shortly, but we needn't hold up the PR over it.
Just a couple of tiny things.
Final performance result of initial PR:
@gpucibot merge
#8444 modified JCUDF transcoding logic (in Java/JNI) to use `cudaMemcpyAsync()` and `cuda::barrier` to allow for asynchronous memcpy on GPUs that support it. While this works for `__CUDA_ARCH__ >= 700`, for older GPUs (e.g. Pascal), JCUDF conversions cause CUDA errors and failures. E.g.

```
ai.rapids.cudf.CudfException: after reduction step 2: cudaErrorInvalidDeviceFunction: invalid device function
```

`cudaMemcpyAsync()` is not supported on Pascal GPUs or prior. (They lack the hardware, apparently.) For older GPUs, rather than fail spectacularly, it would be good to provide a more stable (if less efficient) fallback implementation, via `memcpy()`. This commit adds code to conditionally use `cudaMemcpyAsync()` or `memcpy()`, depending on the GPU in play.

Authors:
- MithunR (https://github.com/mythrocks)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)

URL: #10329
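In spirit, the dispatch reduces to a compile-time architecture check inside the kernel code. A minimal sketch using the `cuda::memcpy_async` / `cuda::barrier` pair named in the description; the helper name and shape are illustrative, and the merged code in #10329 may be structured differently:

```cuda
#include <cuda/barrier>

// Copy `bytes` from `src` to `dst` inside a kernel: use the asynchronous path
// only on architectures that support it; Pascal and older fall back to memcpy.
__device__ void copy_chunk(void* dst,
                           void const* src,
                           size_t bytes,
                           cuda::barrier<cuda::thread_scope_block>& barrier)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
  cuda::memcpy_async(dst, src, bytes, barrier);  // asynchronous copy, completion tracked by the barrier
#else
  memcpy(dst, src, bytes);                       // synchronous fallback for older GPUs
#endif
}
```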
Row to column and column to row conversions changed to support large numbers of columns and variable-width data.
So far this covers the column-to-row work; the variable-width work is not completed yet.
This code is currently copied over to the cudf side for benchmarking, but will not remain there.