[FEA] Support spark.sql.parquet.binaryAsString=true #4040

Closed
viadea opened this issue Nov 5, 2021 · 3 comments · Fixed by #5830
Labels: feature request

Comments

viadea commented Nov 5, 2021

Is your feature request related to a problem? Please describe.
Support spark.sql.parquet.binaryAsString=true.

!Exec <FileSourceScanExec> cannot run on GPU because GpuParquetScan does not support spark.sql.parquet.binaryAsString
viadea added the "feature request" and "? - Needs Triage" labels on Nov 5, 2021
viadea commented Nov 5, 2021

Mini repro:

Reading a hive parquet table:

  1. Create a Hive parquet table with only an int and a string column:
spark-sql> create table teststring2 stored as parquet as select * from teststring;
spark-sql> select * from teststring2 ;
1	abcd

spark-sql> desc teststring2 ;
x	int	NULL
y	string	NULL
  2. Enable spark.sql.parquet.binaryAsString and run the query; it will fall back.
spark-sql> set spark.sql.parquet.binaryAsString=true;
spark-sql> select * from teststring2 ;
21/11/05 07:01:37 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because GpuParquetScan does not support spark.sql.parquet.binaryAsString

1	abcd
Time taken: 0.146 seconds, Fetched 1 row(s)

Or, reading a parquet file with a string column directly:

Seq("a", "b").toDF("name").write.format("parquet").mode("overwrite").save("/tmp/testparquet")
spark.read.parquet("/tmp/testparquet").createTempView("df")
spark.sql("select * from df").collect
spark.conf.set("spark.sql.parquet.binaryAsString",true)
spark.sql("select * from df").collect

SurajAralihalli commented Jul 14, 2022

Adding a BinaryType test:

val ipaddr = Array[Byte](1,3,4)
Seq(ipaddr).toDF("name").write.format("parquet").mode("overwrite").save("/local/saralihalli/tmp/testParquet4/")
spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df4")
spark.conf.set("spark.sql.parquet.binaryAsString",false)
spark.sql("select * from df4").collect
spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df5")
spark.conf.set("spark.sql.parquet.binaryAsString",true)
spark.sql("select * from df5").collect

For BinaryType data:

Summary
With rapids-4-spark_2.12-22.08.0-20220714.123703-28.jar and spark.sql.parquet.binaryAsString set to

  1. false: yes, there is a fallback
  2. true: yes, there is a fallback
scala> val ipaddr = Array[Byte](1,3,4)
ipaddr: Array[Byte] = Array(1, 3, 4)

scala> Seq(ipaddr).toDF("name").write.format("parquet").mode("overwrite").save("/local/saralihalli/tmp/testParquet4/")
22/07/14 18:57:18 WARN GpuOverrides:
!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced; unsupported data types in input: BinaryType [name#21]
  !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because unsupported data types BinaryType [name] in write for Parquet
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec; not all expressions can be replaced
    !Expression <AttributeReference> name#21 cannot run on GPU because expression AttributeReference name#21 produces an unsupported type BinaryType


scala> spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df4")

scala> spark.conf.set("spark.sql.parquet.binaryAsString",false)

scala> spark.sql("select * from df4").collect
22/07/14 18:57:18 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported data types in output: BinaryType [name#24]; unsupported data types BinaryType [name] in read for Parquet

res10: Array[org.apache.spark.sql.Row] = Array([[B@21249add])

scala> spark.read.parquet("/local/saralihalli/tmp/testParquet4/").createTempView("df5")

scala> spark.conf.set("spark.sql.parquet.binaryAsString",true)

scala> spark.sql("select * from df5").collect
22/07/14 18:57:20 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported data types in output: BinaryType [name#29]; unsupported data types BinaryType [name] in read for Parquet

res13: Array[org.apache.spark.sql.Row] = Array([[B@3e94ff8a])

NVnavkumar commented Jul 14, 2022

(Quoting the BinaryType test and transcript from the previous comment.)

The way this flag works in Spark is that it exists specifically for backwards compatibility with older versions of Parquet (or rather, the versions of Parquet used in other systems like Hive). When I ran this in Spark on the CPU, a Parquet file written from Spark with a BINARY column was always read back as a BINARY column, even with spark.sql.parquet.binaryAsString=true. I would suggest running this test on Spark without the plugin to confirm the schema when you read the Parquet back.
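
A minimal sketch of that CPU-only check (plain Spark, plugin not on the classpath), using placeholder /tmp paths; based on the behavior described above, a Spark-written BINARY column is expected to read back as binary whether the flag is on or off, and a Spark-written string column reads back as string either way:

// CPU-only schema check; paths below are placeholders.
// In spark-shell, spark.implicits._ is already imported (needed for toDF).
Seq(Array[Byte](1, 3, 4)).toDF("b").write.mode("overwrite").parquet("/tmp/binCol")
Seq("a", "b").toDF("s").write.mode("overwrite").parquet("/tmp/strCol")
for (flag <- Seq(false, true)) {
  spark.conf.set("spark.sql.parquet.binaryAsString", flag)
  println(s"binaryAsString=$flag")
  spark.read.parquet("/tmp/binCol").printSchema()  // expected: b stays binary in both cases
  spark.read.parquet("/tmp/strCol").printSchema()  // expected: s is string in both cases
}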

This is why we still need #5416
