[FEA] Support spark.sql.parquet.binaryAsString=true #4040

Closed
viadea opened this issue Nov 5, 2021 · 3 comments · Fixed by #5830
Labels: feature request

Comments

viadea commented Nov 5, 2021

Is your feature request related to a problem? Please describe.
Support spark.sql.parquet.binaryAsString=true.

!Exec <FileSourceScanExec> cannot run on GPU because GpuParquetScan does not support spark.sql.parquet.binaryAsString
viadea added the "feature request" and "? - Needs Triage" labels on Nov 5, 2021
viadea commented Nov 5, 2021

Mini repro:

Reading a hive parquet table:

  1. Create a Hive parquet table with only an int and a string column:
spark-sql> create table teststring2 stored as parquet as select * from teststring;
spark-sql> select * from teststring2 ;
1	abcd

spark-sql> desc teststring2 ;
x	int	NULL
y	string	NULL
  2. Enable spark.sql.parquet.binaryAsString and run the query; it will fall back.
spark-sql> set spark.sql.parquet.binaryAsString=true;
spark-sql> select * from teststring2 ;
21/11/05 07:01:37 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because GpuParquetScan does not support spark.sql.parquet.binaryAsString

1	abcd
Time taken: 0.146 seconds, Fetched 1 row(s)

Or, reading a parquet file with a string column directly:

Seq("a", "b").toDF("name").write.format("parquet").mode("overwrite").save("/tmp/testparquet")
spark.read.parquet("/tmp/testparquet").createTempView("df")
spark.sql("select * from df").collect
spark.conf.set("spark.sql.parquet.binaryAsString",true)
spark.sql("select * from df").collect

SurajAralihalli commented Jul 14, 2022

Adding a BinaryType test:

val ipaddr = Array[Byte](1,3,4)
Seq(ipaddr).toDF("name").write.format("parquet").mode("overwrite").save("/local/saralihalli/tmp/testParquet4/")
spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df4")
spark.conf.set("spark.sql.parquet.binaryAsString",false)
spark.sql("select * from df4").collect
spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df5")
spark.conf.set("spark.sql.parquet.binaryAsString",true)
spark.sql("select * from df5").collect

For BinaryType data:

Summary
With rapids-4-spark_2.12-22.08.0-20220714.123703-28.jar and spark.sql.parquet.binaryAsString set to

  1. false: yes, there is a fallback
  2. true: yes, there is a fallback
scala> val ipaddr = Array[Byte](1,3,4)
ipaddr: Array[Byte] = Array(1, 3, 4)

scala> Seq(ipaddr).toDF("name").write.format("parquet").mode("overwrite").save("/local/saralihalli/tmp/testParquet4/")
22/07/14 18:57:18 WARN GpuOverrides:
!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced; unsupported data types in input: BinaryType [name#21]
  !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because unsupported data types BinaryType [name] in write for Parquet
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec; not all expressions can be replaced
    !Expression <AttributeReference> name#21 cannot run on GPU because expression AttributeReference name#21 produces an unsupported type BinaryType


scala> spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df4")

scala> spark.conf.set("spark.sql.parquet.binaryAsString",false)

scala> spark.sql("select * from df4").collect
22/07/14 18:57:18 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported data types in output: BinaryType [name#24]; unsupported data types BinaryType [name] in read for Parquet

res10: Array[org.apache.spark.sql.Row] = Array([[B@21249add])

scala> spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df5")

scala> spark.conf.set("spark.sql.parquet.binaryAsString",true)

scala> spark.sql("select * from df5").collect
22/07/14 18:57:20 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported data types in output: BinaryType [name#29]; unsupported data types BinaryType [name] in read for Parquet

res13: Array[org.apache.spark.sql.Row] = Array([[B@3e94ff8a])

NVnavkumar commented Jul 14, 2022

(Quoting the BinaryType test and transcript from the previous comment.)

The way this flag works in Spark is that it exists specifically for backwards compatibility with older versions of Parquet (or rather, the versions of Parquet used in other systems like Hive). When I ran this in Spark on the CPU, a Parquet file written from Spark with a BINARY column was always read back as a BINARY column, even with spark.sql.parquet.binaryAsString=true. I would suggest running this test on Spark without the plugin to confirm the schema when you read the Parquet back.
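
A minimal sketch of that CPU-only check (plain Spark, plugin not on the classpath), using placeholder /tmp paths; based on the behavior described above, a Spark-written BINARY column is expected to read back as binary whether the flag is on or off, and a Spark-written string column reads back as string either way:

// CPU-only schema check; paths below are placeholders.
// In spark-shell, spark.implicits._ is already imported (needed for toDF).
Seq(Array[Byte](1, 3, 4)).toDF("b").write.mode("overwrite").parquet("/tmp/binCol")
Seq("a", "b").toDF("s").write.mode("overwrite").parquet("/tmp/strCol")
for (flag <- Seq(false, true)) {
  spark.conf.set("spark.sql.parquet.binaryAsString", flag)
  println(s"binaryAsString=$flag")
  spark.read.parquet("/tmp/binCol").printSchema()  // expected: b stays binary in both cases
  spark.read.parquet("/tmp/strCol").printSchema()  // expected: s is string in both cases
}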

This is why we still need #5416
