Enable the `spark.sql.parquet.binaryAsString=true` configuration option on the GPU #5830

NVnavkumar · 2022-06-14T19:26:11Z

Now that rapidsai/cudf/pull/10750 has been merged, binary columns in Parquet files can be read as strings on the GPU. This makes it possible to support the Spark configuration option `spark.sql.parquet.binaryAsString=true` on the GPU with the expected behavior. Note that cuDF (and the plugin) still need to support binary types properly, that is, reading binary as binary (see #5416 and rapidsai/cudf#11044 for more context). This pull request removes the check that disabled running on the GPU when this configuration was set, and adds an integration test covering the end-to-end path: reading strings encoded as binary in Parquet (as written by some systems outside of Spark) and converting those columns back to strings in Spark when `spark.sql.parquet.binaryAsString=true`.
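
For illustration only, here is a minimal PySpark sketch of how a user might turn this option on together with the RAPIDS Accelerator. It is not the integration test added in this PR. The helper names `binary_as_string_conf` and `read_parquet_binary_as_string` and the application name are hypothetical, and the sketch assumes the plugin jar is already on the classpath.

```python
def binary_as_string_conf(enabled=True):
    """Build the Spark settings for reading Parquet binary columns as strings.

    Hypothetical helper for illustration; not part of the plugin.
    """
    return {
        # Standard Spark SQL option discussed in this PR.
        "spark.sql.parquet.binaryAsString": "true" if enabled else "false",
        # RAPIDS Accelerator settings (assumes the plugin jar is on the classpath).
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
    }


def read_parquet_binary_as_string(path, enabled=True):
    """Read a Parquet file with the settings above applied (hypothetical helper)."""
    # Imported lazily so the config helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("binary-as-string-sketch")
    for key, value in binary_as_string_conf(enabled).items():
        builder = builder.config(key, value)
    # Note: spark.plugins only takes effect when a new session is created here.
    spark = builder.getOrCreate()
    return spark.read.parquet(path)
```

Calling `read_parquet_binary_as_string(path).printSchema()` on a file whose string data was written as plain binary should then report those columns as `string` rather than `binary`, provided the file really was written that way.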

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar · 2022-06-14T19:26:44Z

build

NVnavkumar added 3 commits June 13, 2022 10:05

Add support for spark.sql.parquet.binaryAsString

a046c9d

Update integration test to perform end-to-end test of reading back bi…

2cb1b75

…nary as string

update comment copy

00cf779

Signed-off-by: Navin Kumar <[email protected]>

jlowe added this to the Jun 6 - Jun 17 milestone Jun 14, 2022

jlowe approved these changes Jun 14, 2022

NVnavkumar merged commit f59a12b into NVIDIA:branch-22.08 Jun 14, 2022

NVnavkumar self-assigned this Jun 15, 2022

sameerz added the "feature request" label Jun 16, 2022

tgravescs mentioned this pull request Jul 19, 2022

[FEA] Fully support reading parquet binary as string #5417

Closed
