Enable the spark.sql.parquet.binaryAsString=true configuration option on the GPU #5830

Merged (3 commits) on Jun 14, 2022


NVnavkumar
Collaborator

Fixes #4040.

Now that rapidsai/cudf/pull/10750 has been merged, Parquet files with binary columns can be read on the GPU with those values as strings. This makes it possible to support the Spark configuration option spark.sql.parquet.binaryAsString=true on the GPU with the expected behavior. Note that cuDF and the plugin still need to support binary types properly, that is, reading binary as binary (see #5416 and rapidsai/cudf#11044 for more context).

This pull request removes the configuration check that disabled running on the GPU. It also adds an integration test that covers the end-to-end path: reading strings encoded as binary in Parquet (as is the case in some systems outside of Spark) and converting those columns back to strings in Spark when spark.sql.parquet.binaryAsString=true.
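For readers unfamiliar with the option, the following is a minimal sketch of the scenario described above, not the integration test added in this PR. The helper names and the column name `s` are hypothetical. It assumes `pyarrow` and `pyspark` are installed; both are imported lazily, so the read helper can be used with any existing session.

```python
# Illustrative sketch only; helper names and the column name are hypothetical
# and are not taken from this pull request.
BINARY_AS_STRING = "spark.sql.parquet.binaryAsString"


def write_binary_encoded_strings(path):
    """Write a Parquet file whose string data is stored as plain binary.

    This mimics systems outside of Spark that omit the string annotation.
    """
    import pyarrow as pa  # assumption: pyarrow is installed
    import pyarrow.parquet as pq

    # pa.binary() yields a BYTE_ARRAY column without a string logical type.
    column = pa.array([b"alpha", b"beta", None], type=pa.binary())
    pq.write_table(pa.table({"s": column}), path)


def read_parquet_binary_as_string(spark, path):
    """Read a Parquet file with un-annotated binary columns treated as strings."""
    # Session-level SQL conf, consulted when Spark converts the Parquet schema.
    spark.conf.set(BINARY_AS_STRING, "true")
    return spark.read.parquet(path)


def demo(path):
    """End-to-end run: write binary-encoded strings, then read them as strings."""
    from pyspark.sql import SparkSession  # assumption: pyspark is installed

    spark = SparkSession.builder.appName("binaryAsString-demo").getOrCreate()
    write_binary_encoded_strings(path)
    df = read_parquet_binary_as_string(spark, path)
    df.printSchema()  # column "s" should be reported as string, not binary
    return df
```

With the option left at its default of `false`, the same file would be read with `s` as a binary column. To exercise this on the GPU, the session would also need the RAPIDS Accelerator plugin configured, which this sketch leaves out.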

@NVnavkumar
Collaborator Author

build

@jlowe jlowe added this to the Jun 6 - Jun 17 milestone Jun 14, 2022
@NVnavkumar NVnavkumar merged commit f59a12b into NVIDIA:branch-22.08 Jun 14, 2022
@NVnavkumar NVnavkumar self-assigned this Jun 15, 2022
@sameerz sameerz added the feature request New feature or request label Jun 16, 2022
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support spark.sql.parquet.binaryAsString=true
3 participants