Enable the spark.sql.parquet.binaryAsString=true configuration option on the GPU #5830
Fixes #4040.
Now that rapidsai/cudf/pull/10750 has been merged, Parquet files with binary columns can be read on the GPU with those values as strings. This enables the Spark configuration option `spark.sql.parquet.binaryAsString=true`, so the expected behavior now works on the GPU. Note that cuDF (and the plugin) still need to support binary types properly (that is, reading binary as binary; see #5416 and rapidsai/cudf#11044 for more context). This pull request removes the check for this configuration that disabled running on the GPU, and adds an integration test covering the end-to-end path: reading strings encoded as binary in Parquet (as is the case in some systems outside of Spark) and converting those columns back to strings when read in Spark with `spark.sql.parquet.binaryAsString=true`.
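For reviewers who want to try the option by hand, here is a minimal sketch of setting it from PySpark before reading a file. It is not code from this pull request or its integration test: the helper names are hypothetical, and `spark` is assumed to be an existing `pyspark.sql.SparkSession` (with the RAPIDS plugin loaded if the GPU path is of interest).

```python
# Sketch only: the helper names below are hypothetical and are not part of
# this pull request or of the plugin.

BINARY_AS_STRING_KEY = "spark.sql.parquet.binaryAsString"


def binary_as_string_conf(enabled=True):
    """Return the Spark conf entry for the option as a string-valued dict."""
    return {BINARY_AS_STRING_KEY: "true" if enabled else "false"}


def read_parquet_binary_as_string(spark, path):
    """Enable the option on an existing session, then read the Parquet data.

    `spark` is assumed to be a pyspark.sql.SparkSession; `path` is whatever
    Parquet location the caller wants to read.
    """
    for key, value in binary_as_string_conf().items():
        spark.conf.set(key, value)
    return spark.read.parquet(path)
```

Calling `read_parquet_binary_as_string(spark, path).printSchema()` on a file written by a system that stores strings as plain binary is one way to check whether those columns come back as strings.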