JCUDF row to cuDF column for tables with strings #10286
Comments
Still needed.
Still needed.
Is your feature request related to a problem? Please describe.
Now that the row offset iterator is written, the next step toward converting strings in the row-to-column and column-to-row code is to implement one side. This issue tracks the row-to-column portion of the work: it will accept JCUDF rows containing strings and produce a table with string columns for use by the spark-rapids plugin.
Describe the solution you'd like
The plan is to use the existing fixed-width code to fill in a device_uvector with length and source offset values, since that data is written inside the fixed-width section. Scanning the length vector will produce an offset column. The string data itself then needs to be copied, which should be fairly trivial given the length, source offset, and destination offset arrays. The first pass will assign one string per warp, but this will scale poorly if the strings are drastically different sizes.
Describe alternatives you've considered
Other methods of parallelizing the work were considered, including having each thread copy a fixed number of bytes to the proper destination. The complexity of that approach led us to the current solution in the interest of time.
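For illustration only, here is a minimal host-side C++ sketch of the two copy strategies discussed in this issue. It is not the cuDF implementation and uses no cuDF or RMM API: every name in it (`make_offsets`, `copy_per_string`, `copy_chunk`, and the flat `row_data` buffer) is hypothetical, and plain `std::vector` stands in for device memory. It assumes the lengths and source offsets have already been extracted from the fixed-width section. The outer loop in `copy_per_string` corresponds to the string-per-warp mapping, while each call to `copy_chunk` corresponds to one unit of work in the fixed-bytes-per-thread alternative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// All names below are hypothetical and exist only for this sketch.

// Scan the per-string lengths into num_strings + 1 destination offsets.
// offsets[0] stays 0; an inclusive scan written one slot to the right
// yields the exclusive scan plus the total size in the last entry.
std::vector<int32_t> make_offsets(std::vector<int32_t> const& lengths)
{
  std::vector<int32_t> offsets(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), offsets.begin() + 1);
  return offsets;
}

// Strategy 1: one unit of work per string (one warp per string on a GPU).
// Work per unit is proportional to the string length, so very uneven
// string sizes lead to uneven load.
std::string copy_per_string(std::vector<uint8_t> const& row_data,
                            std::vector<int32_t> const& src_offsets,
                            std::vector<int32_t> const& lengths,
                            std::vector<int32_t> const& dst_offsets)
{
  assert(src_offsets.size() == lengths.size());
  assert(dst_offsets.size() == lengths.size() + 1);
  std::string chars(static_cast<std::size_t>(dst_offsets.back()), '\0');
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    std::copy_n(row_data.begin() + src_offsets[i], lengths[i],
                chars.begin() + dst_offsets[i]);
  }
  return chars;
}

// Strategy 2: one unit of work per fixed-size chunk of the output.
// Each unit must first locate the string that owns its first byte,
// which is where the extra complexity comes from.
void copy_chunk(std::size_t chunk, std::size_t chunk_bytes,
                std::vector<uint8_t> const& row_data,
                std::vector<int32_t> const& src_offsets,
                std::vector<int32_t> const& dst_offsets,
                std::string& chars)
{
  assert(dst_offsets.size() == src_offsets.size() + 1);
  assert(chars.size() == static_cast<std::size_t>(dst_offsets.back()));
  std::size_t const total = chars.size();
  std::size_t const begin = std::min(chunk * chunk_bytes, total);
  std::size_t const end   = std::min(begin + chunk_bytes, total);
  if (begin == end) return;  // chunk lies past the end of the data

  // Binary search for the last string whose destination offset is <= begin.
  auto it = std::upper_bound(dst_offsets.begin(), dst_offsets.end(),
                             static_cast<int32_t>(begin));
  std::size_t s = static_cast<std::size_t>(it - dst_offsets.begin()) - 1;

  for (std::size_t pos = begin; pos < end; ++pos) {
    // Step over strings that end at or before pos (this also skips empty strings).
    while (pos >= static_cast<std::size_t>(dst_offsets[s + 1])) ++s;
    std::size_t const within = pos - static_cast<std::size_t>(dst_offsets[s]);
    chars[pos] = static_cast<char>(row_data[src_offsets[s] + within]);
  }
}
```

Both strategies produce the same chars buffer. The chunked version balances work regardless of string sizes but needs a search per chunk and boundary handling when a chunk spans several strings, which is consistent with the complexity concern raised above.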
Additional context
This is part of the larger feature of #10033