
[SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata #28142

Closed
wants to merge 2 commits into from

Conversation

dongjoon-hyun (Member) commented Apr 7, 2020

What changes were proposed in this pull request?

This is a backport of #22932 .

Currently, Spark writes the Spark version number into Hive table properties as `spark.sql.create.version`:

```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write the Spark version to ORC/Parquet file metadata as `org.apache.spark.sql.create.version`, since we already use the `org.apache.` prefix in Parquet metadata. This differs from the Hive table property key `spark.sql.create.version`, but we cannot change that key for backward-compatibility reasons.

After this PR, ORC and Parquet files generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**

```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**

```
$ parquet-tools meta /tmp/p
...
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra:       org.apache.spark.sql.create.version = 3.0.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```
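The new file-level key can be checked mechanically from tool output. Below is a hypothetical Python helper (not part of this PR; the function name and sample string are illustrative only) that pulls the recorded Spark version out of `parquet-tools meta` text output like the sample above:

```python
import re

# Hypothetical helper (not part of this PR): extract the Spark version
# recorded under org.apache.spark.sql.create.version from the text
# output of `parquet-tools meta`.
SPARK_VERSION_KEY = "org.apache.spark.sql.create.version"

def spark_version_from_meta(meta_output):
    """Return the recorded Spark version string, or None if absent."""
    pattern = re.compile(
        r"extra:\s+" + re.escape(SPARK_VERSION_KEY) + r"\s*=\s*(\S+)")
    match = pattern.search(meta_output)
    return match.group(1) if match else None

sample = (
    "creator:     parquet-mr version 1.10.0 "
    "(build 031a6654009e3b82020012a18434c582bd74c73a)\n"
    "extra:       org.apache.spark.sql.create.version = 3.0.0\n"
)
print(spark_version_from_meta(sample))  # 3.0.0
```

A consumer that needs to treat pre-2.4.4 files differently could branch on whether this key is present at all, since older writers never set it.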

Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the Jenkins with newly added test cases.

```
@@ -36,6 +37,19 @@ private[spark] object VersionUtils {
   */
  def minorVersion(sparkVersion: String): Int = majorMinorVersion(sparkVersion)._2

  /**
   * Given a Spark version string, return the short version string.
   * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.
```
Member Author
I didn't change this example in order to minimize the diff between branch-2.4 and master.


```scala
test("Return short version number") {
  assert(shortVersion("3.0.0") === "3.0.0")
  assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")
```
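As a rough illustration of the `shortVersion` behavior exercised above, here is a Python analogue — a sketch, not Spark's actual Scala implementation in `VersionUtils` — that keeps only the leading `major.minor.patch` digits of a version string such as `3.0.0-SNAPSHOT`:

```python
import re

def short_version(spark_version):
    """Sketch of shortVersion: '3.0.0-SNAPSHOT' -> '3.0.0'."""
    # Match the leading major.minor.patch prefix and drop any suffix
    # such as -SNAPSHOT or -rc1.
    match = re.match(r"\d+\.\d+\.\d+", spark_version)
    if match is None:
        raise ValueError(
            f"'{spark_version}' does not start with major.minor.patch numbers")
    return match.group(0)

print(short_version("3.0.0"))           # 3.0.0
print(short_version("3.0.0-SNAPSHOT"))  # 3.0.0
```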
@dongjoon-hyun (Member Author) commented Apr 7, 2020

I didn't change the version 3.0.x in order to minimize the diff between master and branch-2.4.

@dongjoon-hyun (Member Author)

cc @cloud-fan and @HyukjinKwon
Also, cc @gatorsmile

```
@@ -243,6 +245,22 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
      checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
    }
  }
```

Member Author
Please note that the following test case is executed twice: once in OrcSourceSuite and once in HiveOrcSourceSuite.

@cloud-fan (Contributor) left a comment

Thanks for the backport!


SparkQA commented Apr 7, 2020

Test build #120897 has finished for PR 28142 at commit 18e6932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Looks like the file statistics are a bit different between 2.4 and 3.0, which causes some test failures.

@dongjoon-hyun
Copy link
Member Author

Oops. Thank you! I'll fix them.


SparkQA commented Apr 7, 2020

Test build #120921 has finished for PR 28142 at commit 65c2a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


dongjoon-hyun commented Apr 7, 2020

The last commit is only updating the constant inside test cases.
Merged to branch-2.4. Thank you, @cloud-fan and @HyukjinKwon .

dongjoon-hyun added a commit that referenced this pull request Apr 7, 2020

Closes #28142 from dongjoon-hyun/SPARK-25102-2.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-25102-2.4 branch April 7, 2020 20:10
dongjoon-hyun pushed a commit that referenced this pull request Apr 8, 2020
Backport 6b1ca88, similar to #28142

### What changes were proposed in this pull request?

Write Spark version into Avro file metadata

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done for Parquet/ORC.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #28150 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>