
Rewriting row/column conversions for Spark <-> cudf data conversions #8444

Merged

Conversation

hyperbolic2346
Contributor

Row-to-column and column-to-row conversions are being rewritten to support large numbers of columns and variable-width data.

So far this covers the column-to-row work; the variable-width support is not complete yet.

This code is currently copied over to the cudf side for benchmarking, but it will not remain there.

@hyperbolic2346 hyperbolic2346 added 2 - In Progress Currently a work in progress 5 - DO NOT MERGE Hold off on merging; see PR for details labels Jun 7, 2021
@hyperbolic2346 hyperbolic2346 requested review from a team as code owners June 7, 2021 08:19
@hyperbolic2346 hyperbolic2346 requested a review from mythrocks June 7, 2021 08:19
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jun 7, 2021
@hyperbolic2346
Contributor Author

Note that I copied a file from the java side to add my changes so I could benchmark it. This will not live here at the end.

This makes the entire file look new, but in reality the only new things are related to the kernel `copy_from_columns` and the function `convert_to_rows2`.

@harrism
Member

harrism commented Jun 15, 2021

@hyperbolic2346 can you give this a more specific title?

@harrism
Member

harrism commented Jun 15, 2021

Also please use draft PRs rather than "[WIP]" to reduce reviewer notification noise.

@hyperbolic2346 hyperbolic2346 marked this pull request as draft June 15, 2021 18:26
@hyperbolic2346 hyperbolic2346 changed the title [WIP] Row/column conversion changes Rewriting row/column conversions for Spark <-> cudf data conversions Jun 15, 2021
@hyperbolic2346
Contributor Author

Updated to fix the corner cases found. Current benchmarks are not great. According to Nsight, the kernel performance is actually on par with or better than the existing code until the sizes get large, at which point performance falls off a cliff. Unsure yet what is going on there, but it could be a worst-case memory access pattern. Investigation ongoing.

There is a pile of work done before the kernel launch to produce some fairly large data arrays with things like row sizes. The original row conversion didn't require this work since it didn't support variable-width data. Each row can be a different size now, so that information must be created and passed to the kernel. For very small tables this overpowers the real work. A potential optimization is to check for variable-width data and only calculate and send the row sizes if such data actually exists; see the sketch below.
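For illustration, a minimal sketch of that check, assuming libcudf's `cudf::is_fixed_width()` and `cudf::size_of()` traits; the function names here are made up:

```
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <algorithm>
#include <cstddef>

// If every column is fixed-width, each row has the same byte size, so the
// per-row size arrays (and the pre-launch work to build them) can be skipped.
bool only_fixed_width(cudf::table_view const& tbl)
{
  return std::all_of(tbl.begin(), tbl.end(), [](cudf::column_view const& col) {
    return cudf::is_fixed_width(col.type());
  });
}

// Constant row size for the fixed-width case: the summed column widths
// (validity and alignment handling omitted for brevity).
std::size_t fixed_row_size(cudf::table_view const& tbl)
{
  std::size_t size = 0;
  for (auto const& col : tbl) { size += cudf::size_of(col.type()); }
  return size;
}
```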

Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time           0.037 ms        0.054 ms        19211 bytes_per_second=2.36861G/s
RowConversion/old_to_row_conversion/512/manual_time          0.038 ms        0.054 ms        13712 bytes_per_second=18.3985G/s
RowConversion/old_to_row_conversion/4096/manual_time         0.053 ms        0.067 ms        13043 bytes_per_second=105.713G/s
RowConversion/old_to_row_conversion/32768/manual_time        0.194 ms        0.211 ms         3552 bytes_per_second=230.037G/s
RowConversion/old_to_row_conversion/262144/manual_time        1.33 ms         1.35 ms          514 bytes_per_second=268.457G/s
RowConversion/old_to_row_conversion/1048576/manual_time       5.33 ms         5.35 ms          124 bytes_per_second=268.378G/s
RowConversion/new_to_row_conversion/64/manual_time            1.89 ms         1.91 ms          361 bytes_per_second=47.1822M/s
RowConversion/new_to_row_conversion/512/manual_time          0.558 ms        0.575 ms         1145 bytes_per_second=1.25006G/s
RowConversion/new_to_row_conversion/4096/manual_time         0.305 ms        0.320 ms         2250 bytes_per_second=18.3105G/s
RowConversion/new_to_row_conversion/32768/manual_time         1.52 ms         1.53 ms          455 bytes_per_second=29.4834G/s
RowConversion/new_to_row_conversion/262144/manual_time        38.8 ms         38.8 ms           18 bytes_per_second=9.21313G/s
RowConversion/new_to_row_conversion/1048576/manual_time        156 ms          156 ms            4 bytes_per_second=9.14925G/s

More work is needed for validity. An interesting idea came up from Bobby, who pointed out that the validity data is just another table to copy. The window sizes may need to be limited so the validity bits line up with byte boundaries, but this should be pursued for sure.

Speaking of window sizes, the window is currently arbitrarily sized at 1024 rows by as many columns as will fit into shared memory. Thinking about this more, I believe it would be best to have a "square" window. I put that in quotes because it isn't the same number of rows and columns, but rather the same number of bytes in each direction. This is another potential optimization on the horizon; a sketch of the idea follows.
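One way to read the "square in bytes" idea (a hypothetical sketch, not code from this PR): aim for roughly sqrt(budget) bytes of columns across each row, which leaves roughly sqrt(budget) rows inside the shared-memory budget:

```
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct window_shape {
  int rows;
  int cols;
};

// Pack columns until a row of the window is ~sqrt(budget) bytes wide, then
// take as many rows as still fit; both directions then cover a similar
// number of bytes. All names here are illustrative.
window_shape square_window(std::vector<std::size_t> const& col_widths,
                           std::size_t shmem_budget)  // e.g. 48 * 1024 bytes
{
  auto const edge =
    static_cast<std::size_t>(std::sqrt(static_cast<double>(shmem_budget)));

  int cols              = 0;
  std::size_t row_bytes = 0;
  while (cols < static_cast<int>(col_widths.size()) &&
         row_bytes + col_widths[cols] <= edge) {
    row_bytes += col_widths[cols++];
  }
  cols = std::max(cols, 1);

  auto const rows = shmem_budget / std::max<std::size_t>(row_bytes, 1);
  return {static_cast<int>(rows), cols};
}
```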

@hyperbolic2346
Contributor Author

Squared up the incoming windows for the GPU kernel operations. This made large improvements in throughput, since there is more data to write out per row. Found an issue with shared-memory writes striding in such a way that they produce bank conflicts. Need to think about how to get around that while still maintaining 8-byte writes out of shared memory.
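For context, the classic form of this problem and its usual fix (a generic padding sketch, not this PR's kernel): pad the shared-memory tile by one element per row so that a warp reading or writing a column of 8-byte values no longer lands on the same banks.

```
constexpr int TILE = 32;  // 32x32 tile of 8-byte elements; launch with dim3(32, 32) blocks

// Transpose through shared memory, assuming n is a multiple of TILE.
__global__ void transpose_tile(double const* in, double* out, int n)
{
  // The +1 column of padding skews each row by 8 bytes (two 4-byte banks).
  // Without it, the column-wise access below hits the same banks in every lane.
  __shared__ double tile[TILE][TILE + 1];

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced global read
  __syncthreads();

  x = blockIdx.y * TILE + threadIdx.x;
  y = blockIdx.x * TILE + threadIdx.y;
  out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read from shared
}
```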

Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time           0.037 ms        0.054 ms        18742 bytes_per_second=2.35329G/s
RowConversion/old_to_row_conversion/512/manual_time          0.039 ms        0.055 ms        16916 bytes_per_second=17.9473G/s
RowConversion/old_to_row_conversion/4096/manual_time         0.053 ms        0.067 ms        10089 bytes_per_second=105.421G/s
RowConversion/old_to_row_conversion/32768/manual_time        0.195 ms        0.213 ms         3462 bytes_per_second=228.885G/s
RowConversion/old_to_row_conversion/262144/manual_time        1.33 ms         1.34 ms          521 bytes_per_second=269.189G/s
RowConversion/old_to_row_conversion/1048576/manual_time       5.33 ms         5.35 ms          125 bytes_per_second=268.217G/s
RowConversion/new_to_row_conversion/64/manual_time            1.87 ms         1.89 ms          371 bytes_per_second=47.663M/s
RowConversion/new_to_row_conversion/512/manual_time          0.298 ms        0.315 ms         2284 bytes_per_second=2.33964G/s
RowConversion/new_to_row_conversion/4096/manual_time         0.177 ms        0.192 ms         3865 bytes_per_second=31.4794G/s
RowConversion/new_to_row_conversion/32768/manual_time        0.973 ms        0.991 ms          654 bytes_per_second=45.9143G/s
RowConversion/new_to_row_conversion/262144/manual_time        7.08 ms         7.10 ms           87 bytes_per_second=50.4549G/s
RowConversion/new_to_row_conversion/1048576/manual_time       28.4 ms         28.4 ms           25 bytes_per_second=50.3041G/s

Comment on lines 82 to 86
// Because shared memory is limited we copy a subset of the rows at a time.
// For simplicity we will refer to this as a row_group
Contributor

This would be a perfect algorithm for `memcpy_async`: set up a 2-stage pipeline that copies data into shared memory asynchronously while the other stage is being written out. See https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
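For illustration, a minimal double-buffered form of that suggestion using the cooperative-groups `memcpy_async` API from the linked post; the tile size, names, and single-block byte-wise copy are placeholders, not this PR's kernel:

```
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr size_t TILE_BYTES = 8192;  // placeholder stage size

// Single-block sketch: launch as two_stage_copy<<<1, 256>>>(in, out, total).
__global__ void two_stage_copy(char const* in, char* out, size_t total)
{
  __shared__ char stage[2][TILE_BYTES];
  auto block = cg::this_thread_block();
  size_t const num_tiles = (total + TILE_BYTES - 1) / TILE_BYTES;

  // Prime the pipeline: start fetching tile 0 before entering the loop.
  cg::memcpy_async(block, stage[0], in, total < TILE_BYTES ? total : TILE_BYTES);

  for (size_t t = 0; t < num_tiles; ++t) {
    if (t + 1 < num_tiles) {
      // Kick off the copy of tile t+1 into the other buffer...
      size_t const off  = (t + 1) * TILE_BYTES;
      size_t const next = total - off < TILE_BYTES ? total - off : TILE_BYTES;
      cg::memcpy_async(block, stage[(t + 1) & 1], in + off, next);
      // ...then wait for everything except that newest copy, i.e. for tile t.
      cg::wait_prior<1>(block);
    } else {
      cg::wait(block);  // last tile: drain all outstanding copies
    }

    // Consume tile t from shared memory while tile t+1 is still in flight.
    size_t const off = t * TILE_BYTES;
    size_t const len = total - off < TILE_BYTES ? total - off : TILE_BYTES;
    for (size_t i = block.thread_rank(); i < len; i += block.size()) {
      out[off + i] = stage[t & 1][i];
    }
    block.sync();  // everyone done reading before this buffer is reused
  }
}
```

The `wait_prior<1>` is what lets consuming tile `t` overlap with the in-flight fetch of tile `t+1`.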

// In practice we have found writing more than 4 columns of data per thread
// results in performance loss. As such we are using a 2 dimensional
// kernel in terms of threads, but not in terms of blocks. Columns are
// controlled by the y dimension (there is no y dimension in blocks). Rows
// are controlled by the x dimension (there are multiple blocks in the x
// dimension).
Contributor

You can just say you're using a 1D grid of 2D blocks.
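Concretely, something like this launch shape (sizes illustrative):

```
__global__ void copy_from_columns(/* column and row buffers */) { /* ... */ }

void launch(int num_rows)
{
  // 2D block: x walks rows within a tile, y walks the few columns per thread.
  dim3 const block(64, 4);
  // 1D grid: only the x dimension tiles the row count; there is no y dimension.
  dim3 const grid((num_rows + block.x - 1) / block.x);
  copy_from_columns<<<grid, block>>>();
}
```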

@hyperbolic2346
Contributor Author

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time             0.038 ms        0.055 ms        18801 bytes_per_second=2.3232G/s
RowConversion/old_to_row_conversion/512/manual_time            0.038 ms        0.054 ms        18703 bytes_per_second=18.4552G/s
RowConversion/old_to_row_conversion/4096/manual_time           0.052 ms        0.065 ms        10013 bytes_per_second=107.946G/s
RowConversion/old_to_row_conversion/32768/manual_time          0.195 ms        0.212 ms         3514 bytes_per_second=228.956G/s
RowConversion/old_to_row_conversion/262144/manual_time          1.30 ms         1.32 ms          522 bytes_per_second=274.572G/s
RowConversion/old_to_row_conversion/1048576/manual_time         5.31 ms         5.32 ms          125 bytes_per_second=269.388G/s
RowConversion/new_to_row_conversion/64/manual_time             0.128 ms        0.145 ms         5403 bytes_per_second=698.176M/s
RowConversion/new_to_row_conversion/512/manual_time            0.066 ms        0.083 ms        10124 bytes_per_second=10.5311G/s
RowConversion/new_to_row_conversion/4096/manual_time           0.128 ms        0.143 ms         5444 bytes_per_second=43.5364G/s
RowConversion/new_to_row_conversion/32768/manual_time          0.883 ms        0.900 ms          692 bytes_per_second=50.6077G/s
RowConversion/new_to_row_conversion/262144/manual_time          6.74 ms         6.76 ms          100 bytes_per_second=53.0137G/s
RowConversion/new_to_row_conversion/1048576/manual_time         26.8 ms         26.9 ms           25 bytes_per_second=53.2619G/s
RowConversion/old_from_row_conversion/64/manual_time           0.179 ms        0.196 ms         3647 bytes_per_second=601.202M/s
RowConversion/old_from_row_conversion/512/manual_time          0.181 ms        0.196 ms         3936 bytes_per_second=4.65231G/s
RowConversion/old_from_row_conversion/4096/manual_time         0.183 ms        0.196 ms         3880 bytes_per_second=36.8696G/s
RowConversion/old_from_row_conversion/32768/manual_time        0.201 ms        0.217 ms         3472 bytes_per_second=267.588G/s
RowConversion/new_from_row_conversion/64/manual_time           0.179 ms        0.195 ms         3876 bytes_per_second=603.682M/s
RowConversion/new_from_row_conversion/512/manual_time          0.183 ms        0.198 ms         3270 bytes_per_second=4.60054G/s
RowConversion/new_from_row_conversion/4096/manual_time          4.74 ms         4.75 ms          148 bytes_per_second=1.42085G/s
RowConversion/new_from_row_conversion/32768/manual_time          304 ms          304 ms            2 bytes_per_second=181.646M/s

@hyperbolic2346
Contributor Author

Performance issues with `to_column` are almost entirely in the validity calculations. The current method grabs a bit from each row and builds up a 32-bit word of validity data for the column, but there is a lot of duplicated fetching of row offsets and validity data, since we only use a single bit out of each fetch. Cache alone isn't enough to save us here. Will change this to read a byte of validity and work on 8 columns per thread to coalesce the reads better; a sketch follows.
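Roughly the access pattern being described, as a sketch with assumed names and layouts (`row_validity`, `row_pitch`, `col_validity`, and `byte_idx` are hypothetical, and `blockDim.x` is assumed to be a multiple of 32): each thread reads one byte of row-major validity, covering 8 columns of its row, and a warp then ballots each bit across 32 rows to build one 32-bit word per column.

```
#include <cstdint>

__global__ void validity_bytes_to_columns(uint8_t const* row_validity,   // packed row validity
                                          size_t row_pitch,              // bytes per row
                                          uint32_t* const* col_validity, // per-column bitmasks
                                          int num_rows,
                                          int byte_idx)                  // which validity byte
{
  int const row     = blockIdx.x * blockDim.x + threadIdx.x;
  int const lane    = threadIdx.x % 32;
  bool const active = row < num_rows;

  // One coalesced byte read per thread covers 8 columns' bits for this row.
  uint8_t const vbyte = active ? row_validity[row * row_pitch + byte_idx] : 0;

  for (int c = 0; c < 8; ++c) {
    // Ballot gathers the same column's bit from 32 consecutive rows,
    // producing a finished word of that column's bitmask.
    uint32_t const word = __ballot_sync(0xffffffff, (vbyte >> c) & 1);
    if (active && lane == 0) { col_validity[byte_idx * 8 + c][row / 32] = word; }
  }
}
```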

@hyperbolic2346
Contributor Author

Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time             0.037 ms        0.054 ms        19092 bytes_per_second=2.37984G/s
RowConversion/old_to_row_conversion/512/manual_time            0.040 ms        0.056 ms        13312 bytes_per_second=17.3898G/s
RowConversion/old_to_row_conversion/4096/manual_time           0.052 ms        0.066 ms        10309 bytes_per_second=106.437G/s
RowConversion/old_to_row_conversion/32768/manual_time          0.195 ms        0.213 ms         3492 bytes_per_second=228.534G/s
RowConversion/old_to_row_conversion/262144/manual_time          1.30 ms         1.32 ms          527 bytes_per_second=273.955G/s
RowConversion/old_to_row_conversion/1048576/manual_time         5.31 ms         5.32 ms          127 bytes_per_second=269.39G/s
RowConversion/new_to_row_conversion/64/manual_time             0.130 ms        0.147 ms         5316 bytes_per_second=689.682M/s
RowConversion/new_to_row_conversion/512/manual_time            0.068 ms        0.084 ms         9935 bytes_per_second=10.2622G/s
RowConversion/new_to_row_conversion/4096/manual_time           0.136 ms        0.150 ms         4503 bytes_per_second=41.1081G/s
RowConversion/new_to_row_conversion/32768/manual_time          0.948 ms        0.966 ms          672 bytes_per_second=47.1212G/s
RowConversion/new_to_row_conversion/262144/manual_time          7.35 ms         7.37 ms           82 bytes_per_second=48.6385G/s
RowConversion/new_to_row_conversion/1048576/manual_time         28.7 ms         28.8 ms           25 bytes_per_second=49.7624G/s
RowConversion/old_from_row_conversion/64/manual_time           0.190 ms        0.207 ms         3828 bytes_per_second=568.013M/s
RowConversion/old_from_row_conversion/512/manual_time          0.184 ms        0.199 ms         3787 bytes_per_second=4.58102G/s
RowConversion/old_from_row_conversion/4096/manual_time         0.188 ms        0.201 ms         3421 bytes_per_second=35.8498G/s
RowConversion/old_from_row_conversion/32768/manual_time        0.216 ms        0.232 ms         3054 bytes_per_second=249.749G/s
RowConversion/old_from_row_conversion/262144/manual_time        1.33 ms         1.35 ms          522 bytes_per_second=323.235G/s
RowConversion/old_from_row_conversion/1048576/manual_time       5.08 ms         5.14 ms          137 bytes_per_second=339.467G/s
RowConversion/new_from_row_conversion/64/manual_time           0.187 ms        0.204 ms         3130 bytes_per_second=576.807M/s
RowConversion/new_from_row_conversion/512/manual_time          0.198 ms        0.213 ms         3649 bytes_per_second=4.25398G/s
RowConversion/new_from_row_conversion/4096/manual_time         0.201 ms        0.214 ms         3561 bytes_per_second=33.4906G/s
RowConversion/new_from_row_conversion/32768/manual_time        0.332 ms        0.346 ms         2126 bytes_per_second=162.438G/s
RowConversion/new_from_row_conversion/262144/manual_time        2.38 ms         2.40 ms          295 bytes_per_second=181.301G/s
RowConversion/new_from_row_conversion/1048576/manual_time       4.18 ms         4.23 ms          166 bytes_per_second=412.599G/s

Performance has improved on the row-to-column front. The new code is faster once the table size gets over 1 million rows in the benchmark data set. Working on `cuda::memcpy_async` now, hoping to see more gains.

@harrism
Member

harrism commented Jul 21, 2021

Moving to 21.10

@harrism harrism changed the base branch from branch-21.08 to branch-21.10 July 21, 2021 23:30
@github-actions github-actions bot added conda Java Affects Java cuDF API. labels Aug 26, 2021
@hyperbolic2346
Contributor Author

Sorry for the large diff here, but lots of things shuffled around and were renamed when I removed the block nomenclature, which was a great idea.

@hyperbolic2346
Contributor Author

Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time             0.041 ms        0.058 ms        17483 bytes_per_second=2.14732G/s
RowConversion/old_to_row_conversion/512/manual_time            0.040 ms        0.055 ms        17466 bytes_per_second=17.6611G/s
RowConversion/old_to_row_conversion/4096/manual_time           0.053 ms        0.067 ms         9545 bytes_per_second=104.782G/s
RowConversion/old_to_row_conversion/32768/manual_time          0.195 ms        0.212 ms         3543 bytes_per_second=229.09G/s
RowConversion/old_to_row_conversion/262144/manual_time          1.33 ms         1.34 ms          521 bytes_per_second=269.09G/s
RowConversion/old_to_row_conversion/1048576/manual_time         5.28 ms         5.30 ms          127 bytes_per_second=270.805G/s
RowConversion/new_to_row_conversion/64/manual_time             0.263 ms        0.280 ms         2714 bytes_per_second=340.026M/s
RowConversion/new_to_row_conversion/512/manual_time            0.583 ms        0.598 ms         1093 bytes_per_second=1.1968G/s
RowConversion/new_to_row_conversion/4096/manual_time           0.932 ms        0.946 ms          711 bytes_per_second=5.99446G/s
RowConversion/new_to_row_conversion/32768/manual_time           1.34 ms         1.36 ms          517 bytes_per_second=33.3962G/s
RowConversion/new_to_row_conversion/262144/manual_time          4.18 ms         4.20 ms          166 bytes_per_second=85.4348G/s
RowConversion/new_to_row_conversion/1048576/manual_time         15.2 ms         15.2 ms           45 bytes_per_second=94.2443G/s
RowConversion/old_from_row_conversion/64/manual_time           0.195 ms        0.212 ms         3824 bytes_per_second=551.947M/s
RowConversion/old_from_row_conversion/512/manual_time          0.182 ms        0.198 ms         3811 bytes_per_second=4.62471G/s
RowConversion/old_from_row_conversion/4096/manual_time         0.185 ms        0.198 ms         3569 bytes_per_second=36.4532G/s
RowConversion/old_from_row_conversion/32768/manual_time        0.203 ms        0.218 ms         3432 bytes_per_second=266.06G/s
RowConversion/old_from_row_conversion/262144/manual_time        1.24 ms         1.25 ms          550 bytes_per_second=348.962G/s
RowConversion/old_from_row_conversion/1048576/manual_time       4.79 ms         4.84 ms          146 bytes_per_second=360.062G/s
RowConversion/new_from_row_conversion/64/manual_time           0.382 ms        0.398 ms         1818 bytes_per_second=282.453M/s
RowConversion/new_from_row_conversion/512/manual_time          0.662 ms        0.677 ms         1030 bytes_per_second=1.27215G/s
RowConversion/new_from_row_conversion/4096/manual_time          1.10 ms         1.12 ms          627 bytes_per_second=6.11477G/s
RowConversion/new_from_row_conversion/32768/manual_time         2.20 ms         2.21 ms          318 bytes_per_second=24.5208G/s
RowConversion/new_from_row_conversion/262144/manual_time        9.33 ms         9.36 ms           75 bytes_per_second=46.2188G/s
RowConversion/new_from_row_conversion/1048576/manual_time       37.3 ms         37.7 ms           19 bytes_per_second=46.1899G/s

column_sizes.reserve(num_columns);
column_starts.reserve(num_columns + 1); // we add a final offset for validity data start

auto schema_column_iter =
Contributor

This could alternatively have been a `zip_iterator`, avoiding a counting iterator.

Contributor Author

I like this idea, but there is the need to reach into the table column with the call to `tbl.column(i).type()`. Is there a way to make this work that I don't understand yet? The only way I can think of is to make a counting transform iterator for the type, which somewhat defeats the goal of fewer iterators.
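For reference, the counting transform iterator being described would look roughly like this (a sketch; `tbl` is the table in question):

```
#include <cudf/table/table_view.hpp>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// A counting iterator transformed to pull each column's type out of the
// table, so no separate container of types has to be materialized.
auto make_type_iterator(cudf::table_view const& tbl)
{
  return thrust::make_transform_iterator(
    thrust::make_counting_iterator(cudf::size_type{0}),
    [tbl](cudf::size_type i) { return tbl.column(i).type(); });
}
```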

Contributor

@mythrocks mythrocks left a comment

I'm 👍. Thank you for taking the time to go over this with me, @hyperbolic2346.

There is a minor sticking point regarding the constant_iterator comment. We might explore this shortly, but we needn't hold up the PR over it.

Contributor

@nvdbaranec nvdbaranec left a comment

Just a couple of tiny things.

@hyperbolic2346
Contributor Author

Final performance results of the initial PR:

Benchmark                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------
RowConversion/old_to_row_conversion/64/manual_time             0.038 ms        0.055 ms        18268 bytes_per_second=2.29941G/s
RowConversion/old_to_row_conversion/512/manual_time            0.039 ms        0.055 ms        16679 bytes_per_second=17.7435G/s
RowConversion/old_to_row_conversion/4096/manual_time           0.054 ms        0.067 ms        11740 bytes_per_second=104.368G/s
RowConversion/old_to_row_conversion/32768/manual_time          0.195 ms        0.212 ms         3531 bytes_per_second=229.531G/s
RowConversion/old_to_row_conversion/262144/manual_time          1.32 ms         1.34 ms          517 bytes_per_second=270.185G/s
RowConversion/old_to_row_conversion/1048576/manual_time         5.28 ms         5.30 ms          125 bytes_per_second=270.541G/s
RowConversion/new_to_row_conversion/64/manual_time             0.254 ms        0.270 ms         2599 bytes_per_second=352.454M/s
RowConversion/new_to_row_conversion/512/manual_time            0.583 ms        0.598 ms         1094 bytes_per_second=1.19763G/s
RowConversion/new_to_row_conversion/4096/manual_time           0.932 ms        0.946 ms          719 bytes_per_second=5.99434G/s
RowConversion/new_to_row_conversion/32768/manual_time           1.32 ms         1.34 ms          516 bytes_per_second=33.8862G/s
RowConversion/new_to_row_conversion/262144/manual_time          4.20 ms         4.22 ms          168 bytes_per_second=85.0243G/s
RowConversion/new_to_row_conversion/1048576/manual_time         15.2 ms         15.2 ms           46 bytes_per_second=94.0081G/s
RowConversion/old_from_row_conversion/64/manual_time           0.174 ms        0.191 ms         3632 bytes_per_second=617.714M/s
RowConversion/old_from_row_conversion/512/manual_time          0.176 ms        0.192 ms         3920 bytes_per_second=4.78463G/s
RowConversion/old_from_row_conversion/4096/manual_time         0.182 ms        0.196 ms         3870 bytes_per_second=36.9156G/s
RowConversion/old_from_row_conversion/32768/manual_time        0.202 ms        0.218 ms         3440 bytes_per_second=266.348G/s
RowConversion/old_from_row_conversion/262144/manual_time        1.24 ms         1.26 ms          551 bytes_per_second=348.74G/s
RowConversion/old_from_row_conversion/1048576/manual_time       4.78 ms         4.84 ms          146 bytes_per_second=360.752G/s
RowConversion/new_from_row_conversion/64/manual_time           0.381 ms        0.398 ms         1862 bytes_per_second=282.912M/s
RowConversion/new_from_row_conversion/512/manual_time          0.639 ms        0.654 ms         1023 bytes_per_second=1.31774G/s
RowConversion/new_from_row_conversion/4096/manual_time          1.09 ms         1.11 ms          551 bytes_per_second=6.15463G/s
RowConversion/new_from_row_conversion/32768/manual_time         2.20 ms         2.21 ms          318 bytes_per_second=24.5236G/s
RowConversion/new_from_row_conversion/262144/manual_time        9.32 ms         9.31 ms           75 bytes_per_second=46.2711G/s
RowConversion/new_from_row_conversion/1048576/manual_time       37.1 ms         37.4 ms           19 bytes_per_second=46.5101G/s

@hyperbolic2346
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dd390a2 into rapidsai:branch-22.02 Jan 10, 2022
mythrocks added a commit to mythrocks/cudf that referenced this pull request Feb 18, 2022
rapidsai#8444 modified JCUDF transcoding logic (in Java/JNI) to use `cudaMemcpyAsync()`
and `cuda::barrier` to allow for asynchronous memcpy on GPUs that support it.
While this works for `__CUDA_ARCH__ >= 700`, for older GPUs (e.g. Pascal),
JCUDF conversions cause CUDA errors and failures. E.g.
```
ai.rapids.cudf.CudfException: after reduction step 2: cudaErrorInvalidDeviceFunction:
invalid device function
```
For older GPUs, rather than fail spectacularly, it would be good to provide
a more stable (if less efficient) fallback implementation, via `memcpy()`.

This commit adds code to conditionally use `cudaMemcpyAsync()` or `memcpy()`,
depending on the GPU in play.
rapids-bot bot pushed a commit that referenced this pull request Feb 21, 2022
#8444 modified JCUDF transcoding logic (in Java/JNI) to use `cudaMemcpyAsync()`
and `cuda::barrier` to allow for asynchronous memcpy on GPUs that support it.
While this works for `__CUDA_ARCH__ >= 700`, for older GPUs (e.g. Pascal),
JCUDF conversions cause CUDA errors and failures. E.g.
```
ai.rapids.cudf.CudfException: after reduction step 2: cudaErrorInvalidDeviceFunction:
invalid device function
```
`cudaMemcpyAsync()` is not supported on Pascal GPUs or prior. (They lack the hardware,
apparently.)
For older GPUs, rather than fail spectacularly, it would be good to provide
a more stable (if less efficient) fallback implementation, via `memcpy()`.

This commit adds code to conditionally use `cudaMemcpyAsync()` or `memcpy()`,
depending on the GPU in play.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10329
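For reference, the general shape of such a guard (a hypothetical sketch, not the commit's actual code):

```
#include <cstring>
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
#include <cuda/barrier>
#endif

// On Volta and newer, copy asynchronously and let a block-scoped barrier
// track completion; on Pascal and older, degrade to a plain synchronous memcpy.
__device__ void copy_chunk(void* dst, void const* src, size_t bytes
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
                           , cuda::barrier<cuda::thread_scope_block>& bar
#endif
) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
  cuda::memcpy_async(dst, src, bytes, bar);  // overlapped; caller waits on `bar`
#else
  memcpy(dst, src, bytes);                   // slower, but works on any GPU
#endif
}
```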
@hyperbolic2346 hyperbolic2346 deleted the mwilson/row_conversion branch May 17, 2022 21:01
Labels
3 - Ready for Review (Ready for review by team)
4 - Needs Review (Waiting for reviewer to review or respond)
breaking (Breaking change)
feature request (New feature or request)
Java (Affects Java cuDF API.)
libcudf (Affects libcudf (C++/CUDA) code.)