
[FEA] Add string support to row/column conversion #10033

Closed
3 tasks done
hyperbolic2346 opened this issue Jan 13, 2022 · 4 comments
Labels: feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)


hyperbolic2346 commented Jan 13, 2022

Spark currently doesn't have a GPU-accelerated way to convert from the unsafe row format to a cudf table if the table includes strings. This is the feature request to add that support.

Describe the solution you'd like
Row/column conversions should occur on the GPU even if the table contains strings.

Describe alternatives you've considered
There is a CPU-based path for this today that works, but it is not as performant as desired.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@hyperbolic2346
Contributor Author

Still in progress.

@revans2
Contributor

revans2 commented May 19, 2022

All of the dependencies for this are close; we are just waiting for one more from CUDF. I took a quick look at what changes we would need to make in the plugin, and it is a bit more complex than I had hoped to do in a few hours :). I put as much of it as I could here, but I might have missed something.

  1. Enable Strings as a supported type for these transitions
  2. Update CudfUnsafeRow to include
    1. size estimates for Strings (these do not have to be perfect); Spark uses 20 bytes as an estimate for the size of the String itself.
    2. Add in an implementation for getUTF8String to read the offset and size. I think the code that is commented out is close, but needs to be modified because we don't guarantee 64-bit alignment for this.
  3. Update AcceleratedColumnarToRowIterator to support strings. Specifically
    1. drop the assert about all the types being fixed width
    2. update packMap so it knows about Strings being int32 aligned and not 20 bytes.
    3. figure out when we call convertToRowsFixedWidthOptimized vs convertToRows, because now we might have Strings in the data.
  4. Update GeneratedInternalRowToCudfRowIterator. Be careful, though: this is the hard part because it does code generation to speed things up. You might also want to rename it, because it is not going to CudfRow; it is going to a ColumnarBatch.
    1. This means we are going to need a version of copyData for Strings that puts the offset and length at the fixed-width location, writes the string data at an offset after the fixed-width portion, and then updates the running offset so later Strings can use it.
  5. Do some performance testing with this and the old code to see what the difference is, and see if we should adjust any of the heuristics from before that decide when we should do it one way vs another.
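The layout implied by steps 2 and 4 can be sketched as follows. This is a hypothetical illustration, not the actual CudfUnsafeRow or copyData API: the fixed-width region of the row stores an (offset, length) pair of int32s for each string column, the UTF-8 bytes are appended after the fixed-width region, and later entries stay int32-aligned as step 3.2 requires.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class StringRowSketch {
    // Write one string column value: store (offset, length) at fixedPos in the
    // fixed-width region, copy the UTF-8 bytes to varOffset in the variable-width
    // region, and return the next int32-aligned variable-width offset.
    static int writeString(ByteBuffer row, int fixedPos, int varOffset, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        row.putInt(fixedPos, varOffset);       // offset of the string data
        row.putInt(fixedPos + 4, utf8.length); // length in bytes
        row.position(varOffset);
        row.put(utf8);
        // round up so the next variable-width entry stays int32-aligned
        return (varOffset + utf8.length + 3) & ~3;
    }

    // Read the value back, mirroring what a getUTF8String implementation
    // would do: fetch the (offset, length) pair, then the bytes it points at.
    static String readString(ByteBuffer row, int fixedPos) {
        int offset = row.getInt(fixedPos);
        int length = row.getInt(fixedPos + 4);
        byte[] utf8 = new byte[length];
        row.position(offset);
        row.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer row = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        // one string column, so the fixed-width region is 8 bytes (offset + length)
        int next = writeString(row, 0, 8, "hello");
        System.out.println(readString(row, 0)); // hello
        System.out.println(next);               // 16 (8 + 5 bytes, rounded up to 4)
    }
}
```

Note that only 4-byte alignment is guaranteed here, which is why the commented-out getUTF8String code mentioned in step 2.2 cannot assume 64-bit alignment.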

@hyperbolic2346
Contributor Author

Only spark-rapids work is left to do on this; closing this issue.
