
[FEA] Add string support to row/column conversion #10033

Closed
3 tasks done
hyperbolic2346 opened this issue Jan 13, 2022 · 4 comments
Labels: feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)


hyperbolic2346 commented Jan 13, 2022

Spark currently doesn't have a GPU-accelerated way to convert from the unsafe row format to a cudf table if the table includes strings. This is the feature request to add that support.

Describe the solution you'd like
Row/column conversions should occur on the GPU even if the table contains strings.

Describe alternatives you've considered
There is a CPU-based path for this today that works, but it is not as performant as desired.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@hyperbolic2346
Contributor Author

Still in progress.

@revans2
Contributor

revans2 commented May 19, 2022

All of the dependencies for this are close; we are just waiting for one more from CUDF. I took a quick look at what changes we would need to make in the plugin, and it is a bit more complex than I had hoped to do in a few hours :). I put as much of it as I could here, but I might have missed something.

  1. Enable Strings as a supported type for these transitions
  2. Update CudfUnsafeRow to include
    1. size estimates for Strings (these do not have to be perfect); Spark uses 20 bytes as an estimate for the size of the String itself.
    2. Add in an implementation for getUTF8String to read the offset and size. I think the code that is commented out is close, but needs to be modified because we don't guarantee 64-bit alignment for this.
  3. Update AcceleratedColumnarToRowIterator to support strings. Specifically
    1. drop the assert about all the types being fixed width
    2. update packMap so it knows about Strings being int32 aligned and not 20 bytes.
    3. figure out when we call convertToRowsFixedWidthOptimized vs convertToRows, because now we might have Strings in the data.
  4. Update GeneratedInternalRowToCudfRowIterator. Be careful, though: this is the hard part because it does code generation to speed things up. You might also want to rename it, because it is not going to CudfRow; it is going to a ColumnarBatch.
    1. This means we are going to need a version of copyData for Strings that puts the offset and length at the fixed-width location, writes the string data at an offset after the fixed-width portion, and then updates the running offset so later Strings can use it.
  5. Do some performance testing with this and the old code to see what the difference is, and see if we should adjust any of the heuristics from before that decide when we should do it one way vs another.
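The layout implied by steps 2 and 4 can be sketched as follows. This is a hypothetical illustration, not the actual CudfUnsafeRow or copyData API: the fixed-width region of the row stores an (offset, length) pair of int32s for each string column, the UTF-8 bytes are appended after the fixed-width region, and later entries stay int32-aligned as step 3.2 requires.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class StringRowSketch {
    // Write one string column value: store (offset, length) at fixedPos in the
    // fixed-width region, copy the UTF-8 bytes to varOffset in the variable-width
    // region, and return the next int32-aligned variable-width offset.
    static int writeString(ByteBuffer row, int fixedPos, int varOffset, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        row.putInt(fixedPos, varOffset);       // offset of the string data
        row.putInt(fixedPos + 4, utf8.length); // length in bytes
        row.position(varOffset);
        row.put(utf8);
        // round up so the next variable-width entry stays int32-aligned
        return (varOffset + utf8.length + 3) & ~3;
    }

    // Read the value back, mirroring what a getUTF8String implementation
    // would do: fetch the (offset, length) pair, then the bytes it points at.
    static String readString(ByteBuffer row, int fixedPos) {
        int offset = row.getInt(fixedPos);
        int length = row.getInt(fixedPos + 4);
        byte[] utf8 = new byte[length];
        row.position(offset);
        row.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer row = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        // one string column, so the fixed-width region is 8 bytes (offset + length)
        int next = writeString(row, 0, 8, "hello");
        System.out.println(readString(row, 0)); // hello
        System.out.println(next);               // 16 (8 + 5 bytes, rounded up to 4)
    }
}
```

Note that only 4-byte alignment is guaranteed here, which is why the commented-out getUTF8String code mentioned in step 2.2 cannot assume 64-bit alignment.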

@hyperbolic2346
Contributor Author

Only spark-rapids work is left to do on this; closing this issue.
