Fix parquet binary reads to do the transformation in the plugin [databricks] #6292

revans2 · 2022-08-11T14:09:45Z

Testing turned up a number of issues with reading binary data inside nested types. There were bugs in the cudf code, and in the API as well, that made this unworkable in the short term. So this PR does all of the string-to-binary transformation in the plugin instead of in cudf. Hopefully in 22.10, once the cudf code and API are fixed, we can switch over to using them, which should be cleaner and more efficient.
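To illustrate the general idea only (this is not the plugin code), here is a minimal Python sketch of retagging string leaves as binary while walking a nested column. It assumes an Arrow-style layout where a string leaf and a binary leaf share the same offsets and byte buffer. `Column`, `Expected`, `fix_binary`, and `rows` are hypothetical names invented for this sketch; they are not cudf or spark-rapids APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    """Toy stand-in for a columnar vector (hypothetical, not a cudf class)."""
    dtype: str                            # e.g. "string", "binary", "struct"
    offsets: Optional[List[int]] = None   # row boundaries for string/binary leaves
    data: bytes = b""                     # raw bytes for string/binary leaves
    children: List["Column"] = field(default_factory=list)


@dataclass
class Expected:
    """Toy stand-in for the type requested at the same position in the schema."""
    dtype: str
    children: List["Expected"] = field(default_factory=list)


def fix_binary(col: Column, want: Expected) -> Column:
    """Return col with string leaves retagged as binary where want asks for binary."""
    if col.dtype == "string" and want.dtype == "binary":
        # Same offsets and bytes; only the logical type changes.
        return Column("binary", col.offsets, col.data)
    if col.children:
        fixed = [fix_binary(c, w) for c, w in zip(col.children, want.children)]
        return Column(col.dtype, col.offsets, col.data, fixed)
    return col


def rows(col: Column) -> List[bytes]:
    """Slice a string/binary leaf into per-row byte strings."""
    return [col.data[s:e] for s, e in zip(col.offsets, col.offsets[1:])]


# struct<name: string, payload: binary>, where payload came back as a string
read = Column("struct", children=[
    Column("string", [0, 2, 5], b"abcde"),
    Column("string", [0, 1, 3], b"\x00\xff\x01"),
])
want = Expected("struct", [Expected("string"), Expected("binary")])

fixed = fix_binary(read, want)
print([c.dtype for c in fixed.children])  # ['string', 'binary']
print(rows(fixed.children[1]))            # [b'\x00', b'\xff\x01']
```

The sketch only changes the leaf that the requested schema marks as binary, and it leaves the input column untouched.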

This replaces #6283.

This fixes #6281.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2022-08-11T14:09:55Z

build

revans2 · 2022-08-11T14:44:05Z

build

revans2 · 2022-08-11T14:44:36Z

@jlowe please take another look

Fix parquet binary reads to do the transformation in the plugin

f2c9136

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 added the bug, feature request, SQL, P0, and task labels Aug 11, 2022

revans2 added this to the Aug 8 - Aug 19 milestone Aug 11, 2022

revans2 self-assigned this Aug 11, 2022

revans2 mentioned this pull request Aug 11, 2022

Revert support for binary reads in parquet [databricks] #6283

Closed

jlowe reviewed Aug 11, 2022

sql-plugin/src/main/scala/com/nvidia/spark/rapids/ParquetSchemaUtils.scala (outdated, resolved)

integration_tests/src/main/python/data_gen.py (outdated, resolved)

Addressed review comments

f72e5fb

jlowe approved these changes Aug 11, 2022

revans2 merged commit 8c0e81e into NVIDIA:branch-22.08 Aug 11, 2022

revans2 linked an issue Aug 11, 2022 that may be closed by this pull request

[BUG] Reading binary columns from nested types does not work. #6281

Closed

jlowe mentioned this pull request Aug 11, 2022

Fix merge conflict with branch-22.08 #6297

Merged

This was referenced Sep 1, 2022

[FEA] Read Parquet binary data directly from cudf #6480

Open

[FEA] Support reading binary data types from Parquet as binary (not strings) #5416

Closed
