Added binary read support for Parquet [Databricks] #6161
Conversation
I think this looks okay, but I really would like to hear from others too. @jlowe you have been looking at the schema code for parquet lately. What do you think?
import org.apache.spark.sql.catalyst.expressions.Attribute;
import org.apache.spark.sql.internal.SQLConf;
nit: can we revert these import changes that don't appear to be used?
val clippedSchemaTmp = sparkToParquetSchema.convert(
  parquetToSparkSchema.convert(ParquetSchemaClipShims.clipSchema(fileSchema,
    readDataSchema, isCaseSensitive, readUseFieldId, timestampNTZEnabled)))
This seems like a roundabout way to solve a very specific problem, which is that if, and only if, spark.sql.parquet.binaryAsString is true, we should treat all occurrences of BinaryType in the file schema as if they were StringType.
Note that the comment above seems incorrect, as the file schema-as-spark-schema being passed into the schema evolution is a subset of the read schema, not the file schema. If the data is truly being loaded as a string by libcudf, there should have been no issue trying to apply the Spark read schema to those columns.
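For anyone skimming, a minimal sketch of the conf's behavior (the file name is hypothetical): with spark.sql.parquet.binaryAsString enabled, Spark infers Parquet BINARY columns in files lacking Spark's own schema metadata as string rather than binary.

import org.apache.spark.sql.SparkSession

// Minimal sketch: with binaryAsString set, plain BINARY columns in files
// without Spark's stored schema metadata are inferred as string.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.parquet.binaryAsString", "true")
  .getOrCreate()

// "binary.parquet" is a hypothetical file containing a plain BINARY column.
spark.read.parquet("binary.parquet").printSchema()
// the BINARY column prints as `string` instead of `binary`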
@jlowe perhaps I misunderstood this code, but I thought that spark.sql.parquet.binaryAsString primarily impacted the read schema, and only gets applied if the file that was written out does not already have the Spark schema metadata in it.
Because of this I thought that we had to go off of the read schema, not the file schema, and the config. What I think we want to do is to go through the read schema when setting up the columns to read in and see if the file schema says a column is binary but the read schema says it is a String. Then we would add that column name to the set of columns that are binary but should be read as a String.
I agree that this is a bit convoluted (updating the metadata in the modified file to do what we want), but it appears to be working, which is why I wanted your opinion on it.
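The check described above could look roughly like this sketch (not the PR's actual code; it ignores case sensitivity and nested fields, and the helper name is illustrative):

import org.apache.parquet.schema.MessageType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.types.{StringType, StructType}

// Collect the names of columns that are plain BINARY in the Parquet file
// schema but StringType in the Spark read schema: binary on disk, string
// to the reader.
def binaryReadAsString(fileSchema: MessageType, readSchema: StructType): Set[String] =
  readSchema.fields.collect {
    case f if f.dataType == StringType &&
        fileSchema.containsField(f.name) &&
        fileSchema.getType(f.name).isPrimitive &&
        fileSchema.getType(f.name).asPrimitiveType()
          .getPrimitiveTypeName == PrimitiveTypeName.BINARY =>
      f.name
  }.toSet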
Because of this I thought that we had to go off of the read schema not the file schema and the config
That's my point. If libcudf was truly loading the binary column as a string, as the comment states it does, then trying to apply the read schema to that column should not be throwing an exception in GpuColumnVector.from as stated.
Maybe I'm missing something, but the comment doesn't seem to correctly capture what's actually going on without this workaround.
I also don't like the round-trip-the-entire-schema-through-spark-and-back, since it seems ripe to cause unintended consequences. For example, the Parquet schema may state that there are unsigned types, but Spark doesn't have an unsigned type. Is it OK to silently update the file metadata stating what was a UINT32 is now an INT64?
If we need to bash the Parquet schema to change binaries to strings, then I'd rather see a targeted approach to doing that unless we're convinced doing a roundtrip through Spark types and back won't cause other problems.
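A hedged sketch of what such a targeted approach might look like, handling top-level fields only (the helper name is illustrative, not the plugin's API): re-annotate only un-annotated BINARY primitives as STRING and pass everything else through unchanged.

import scala.collection.JavaConverters._
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// Rewrite only plain BINARY primitives to carry the STRING logical type;
// unsigned ints, nested groups, and Parquet field IDs pass through untouched.
def binaryFieldsAsString(fileSchema: MessageType): MessageType = {
  val fields = fileSchema.getFields.asScala.map { f =>
    if (f.isPrimitive &&
        f.asPrimitiveType().getPrimitiveTypeName == PrimitiveTypeName.BINARY &&
        f.getLogicalTypeAnnotation == null) {
      var b = Types.primitive(PrimitiveTypeName.BINARY, f.getRepetition)
        .as(LogicalTypeAnnotation.stringType())
      // keep the field ID, if any, so ID-based schema evolution still lines up
      if (f.getId != null) b = b.id(f.getId.intValue())
      b.named(f.getName)
    } else {
      f
    }
  }
  new MessageType(fileSchema.getName, fields.asJava)
}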
Thinking about this a bit more, I believe round-tripping through the Spark types will also strip out any field IDs being used in the Parquet schema. The Spark-to-Parquet converter will translate field IDs found in the Spark schema into Parquet field IDs, but I didn't see any evidence that the Parquet-to-Spark converter would convey Parquet field IDs into Spark. Losing field IDs will break some schema evolution cases that require field IDs to line up the Parquet schema with the Spark read schema.
Thanks for the feedback, Jason.
I misunderstood the problem. The reason I was running into the issue was that I was reading the conf directly. Instead, what we should do is let SparkToParquetSchema tell us what the column should be read as by cudf.
To explain a bit further: the binary file that I was reading is written by Spark, and Spark stores some metadata that was overriding the binaryAsString flag, so while I was trying to read the file as a String, Spark was expecting the file to be read in as binary.
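For reference, a small sketch of the metadata being described (file path hypothetical): Spark persists its schema in the Parquet footer under the key org.apache.spark.sql.parquet.row.metadata, and when that key is present the stored schema takes precedence over binaryAsString on read.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Peek at the footer key Spark uses to persist its schema; when present,
// Spark rebuilds the schema from this JSON and binaryAsString never applies.
val reader = ParquetFileReader.open(
  HadoopInputFile.fromPath(new Path("binary_as_string.parquet"), new Configuration()))
try {
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  println(kv.get("org.apache.spark.sql.parquet.row.metadata"))
} finally {
  reader.close()
}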
build
Is this no longer intended for 22.08? The commit history shows you merged in branch-22.10, but the base branch on this PR is still 22.08, which is why this PR is suddenly much larger, both in terms of commits and changes.
build
build
t._2.dataType == BinaryType &&
  sparkToParquetSchema.convertField(t._2).asPrimitiveType().getPrimitiveTypeName ==
    PrimitiveTypeName.BINARY)
Build is failing here:
Error: ] /home/runner/work/spark-rapids/spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala:1399: type mismatch;
found : Boolean
required: String
Error: one error found
Where? I just built it again locally and it's passing.
Do you have the cudf changes to ParquetOptions?
It's failing in CI (see failed runs below), although it looks like that was run against 22.10? It was trying to download spark-rapids-jni-22.10-SNAPSHOT.
It looks like the normal CI is running against 22.08 correctly, so I'll try to manually re-kick the ones that failed here.
build
On a side note, could you test what happens if you pass in a schema that switches StringType to BinaryType or BinaryType to StringType?

val baseWithBothBinary = spark.read.schema(StructType(Seq(StructField("a", LongType),
  StructField("b", BinaryType), StructField("c", BinaryType)))).parquet("binary_as_string.parquet")
val baseWithBothString = spark.read.schema(StructType(Seq(StructField("a", LongType),
  StructField("b", StringType), StructField("c", StringType)))).parquet("binary_as_string.parquet")

It appears to work on the CPU, but I want to know if it is going to work on the GPU or if we are going to run into some odd errors that we didn't expect.
I did some manual testing with this patch: baseWithBothString works, but baseWithBothBinary does not. This is a good enough improvement that I will merge this in as is, and then do follow-on work for the failing use case.
…)" This reverts commit 8d14f8c.
This PR adds Binary type to ParquetSourceScanExec:
- allows BinaryType to be supported for FileSourceScanExec
- honors the binaryAsString flag if set

depends on rapidsai/cudf#11410
fixes #5416

Signed-off-by: Raza Jafri [email protected]