[SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata #28142
Conversation
Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`:

```
parameters:{
  spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]},
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version`, because we already use the `org.apache.` prefix in Parquet metadata. This differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property for backward compatibility.

After this PR, ORC and Parquet files generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**

```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**

```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

Passes Jenkins with the newly added test cases.
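The key-naming convention described above can be illustrated with a small Python sketch. This is illustrative only: the helper name is hypothetical, and Spark's actual implementation lives in its Scala ORC/Parquet write paths; only the metadata key string comes from the PR itself.

```python
# Illustrative sketch of the metadata key convention from this PR.
# The function name `with_spark_version` is hypothetical; Spark's real
# writers record this key when producing ORC/Parquet footers.

SPARK_VERSION_METADATA_KEY = "org.apache.spark.sql.create.version"

def with_spark_version(metadata: dict, spark_version: str) -> dict:
    """Return a copy of file-level metadata with the Spark version recorded
    under the org.apache.-prefixed key used for ORC/Parquet file metadata."""
    merged = dict(metadata)
    merged[SPARK_VERSION_METADATA_KEY] = spark_version
    return merged

meta = with_spark_version(
    {"org.apache.spark.sql.parquet.row.metadata": '{"type":"struct","fields":[]}'},
    "3.0.0",
)
print(meta[SPARK_VERSION_METADATA_KEY])  # 3.0.0
```

Note how the new key reuses the `org.apache.` prefix already present in the Parquet row-metadata key, rather than the bare `spark.sql.create.version` key used for Hive table properties.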
```
@@ -36,6 +37,19 @@ private[spark] object VersionUtils {
   */
  def minorVersion(sparkVersion: String): Int = majorMinorVersion(sparkVersion)._2

  /**
   * Given a Spark version string, return the short version string.
   * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.
```
I didn't change this example in order to minimize the diff between branch-2.4 and master.
```
test("Return short version number") {
  assert(shortVersion("3.0.0") === "3.0.0")
  assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")
}
```
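The `shortVersion` helper exercised by the test above can be sketched as a rough Python analogue. The actual implementation is Scala code in Spark's `VersionUtils`; the regex and error handling here are assumptions chosen to match the documented behavior (strip a `-SNAPSHOT`-style suffix from a `major.minor.patch` version string).

```python
import re

# Rough Python analogue of VersionUtils.shortVersion (the real code is Scala).
# Matches "major.minor.patch" optionally followed by a "-..." suffix.
_SHORT_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)(-.*)?$")

def short_version(spark_version: str) -> str:
    """Given a Spark version string, return the short version string.
    E.g., for '3.0.0-SNAPSHOT', return '3.0.0'."""
    m = _SHORT_VERSION_RE.match(spark_version)
    if m is None:
        raise ValueError(
            f"Cannot extract the short version from '{spark_version}'")
    return m.group(1)

print(short_version("3.0.0"))           # 3.0.0
print(short_version("3.0.0-SNAPSHOT"))  # 3.0.0
```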
I didn't change the version 3.0.x in order to minimize the diff between master and branch-2.4.
cc @cloud-fan and @HyukjinKwon
```
@@ -243,6 +245,22 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
      checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
    }
  }
```
Please note that the following test case is executed twice: once in OrcSourceSuite and once in HiveOrcSourceSuite.
Thanks for the backport!
Test build #120897 has finished for PR 28142 at commit
Looks like the file statistics are a bit different between 2.4 and 3.0, which causes some test failures.
Oops. Thank you! I'll fix them.
Test build #120921 has finished for PR 28142 at commit
The last commit only updates the constant inside the test cases.
### What changes were proposed in this pull request?

This is a backport of #22932 .

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`:

```
parameters:{
  spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]},
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata with `org.apache.spark.sql.create.version`, because we already use the `org.apache.` prefix in Parquet metadata. This differs from the Hive table property key `spark.sql.create.version`, but it seems we cannot change the Hive table property for backward compatibility.

After this PR, ORC and Parquet files generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**

```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**

```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Passes Jenkins with the newly added test cases.

Closes #28142 from dongjoon-hyun/SPARK-25102-2.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Backport of 6b1ca88, similar to #28142.

### What changes were proposed in this pull request?

Write the Spark version into Avro file metadata.

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in Parquet/ORC.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New test.

Closes #28150 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>