[SPARK-43880][BUILD] Organize `hadoop-cloud` in standard maven project structure #41380

pan3793 · 2023-05-30T10:04:07Z

What changes were proposed in this pull request?

Since Spark no longer supports Hadoop 2, we can merge the hadoop3-specific code into the standard Maven project folders, which simplifies hadoop-cloud/pom.xml.

I also checked and removed the unnecessary Hadoop-related dependency exclusions in hadoop-cloud/pom.xml.

Why are the changes needed?

Simplify Maven configuration files.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA.

Compared the dependency resolution before and after this change; the resolved dependencies are identical.

build/mvn clean install -DskipTests -pl :spark-hadoop-cloud_2.13 -am -Phadoop-3,hadoop-cloud
build/mvn dependency:list -pl :spark-hadoop-cloud_2.13 -am -Phadoop-3,hadoop-cloud

[INFO] --------------< org.apache.spark:spark-hadoop-cloud_2.13 >--------------
[INFO] Building Spark Project Hadoop Cloud Integration 3.5.0-SNAPSHOT   [13/13]
[INFO]   from hadoop-cloud/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-dependency-plugin:3.5.0:list (default-cli) @ spark-hadoop-cloud_2.13 ---
[INFO] 
[INFO] The following files have been resolved:
[INFO]    org.apache.orc:orc-core:jar:shaded-protobuf:1.8.3:compile
[INFO]    org.apache.orc:orc-shims:jar:1.8.3:compile
[INFO]    io.airlift:aircompressor:jar:0.21:compile
[INFO]    org.jetbrains:annotations:jar:17.0.0:compile
[INFO]    org.threeten:threeten-extra:jar:1.7.1:compile
[INFO]    org.apache.orc:orc-mapreduce:jar:shaded-protobuf:1.8.3:compile
[INFO]    org.apache.hive:hive-storage-api:jar:2.8.1:compile
[INFO]    org.apache.parquet:parquet-column:jar:1.13.1:compile
[INFO]    org.apache.parquet:parquet-common:jar:1.13.1:compile
[INFO]    org.apache.parquet:parquet-encoding:jar:1.13.1:compile
[INFO]    org.apache.yetus:audience-annotations:jar:0.13.0:compile
[INFO]    org.apache.parquet:parquet-hadoop:jar:1.13.1:compile
[INFO]    org.apache.parquet:parquet-format-structures:jar:1.13.1:compile
[INFO]    org.apache.parquet:parquet-jackson:jar:1.13.1:runtime
[INFO]    org.apache.avro:avro:jar:1.11.1:compile
[INFO]    org.apache.avro:avro-mapred:jar:1.11.1:compile
[INFO]    org.apache.avro:avro-ipc:jar:1.11.1:compile
[INFO]    org.tukaani:xz:jar:1.9:compile
[INFO]    javax.activation:activation:jar:1.1.1:compile
[INFO]    org.apache.curator:curator-recipes:jar:2.13.0:compile
[INFO]    org.apache.curator:curator-framework:jar:2.13.0:compile
[INFO]    org.apache.curator:curator-client:jar:2.13.0:compile
[INFO]    org.apache.zookeeper:zookeeper:jar:3.6.3:compile
[INFO]    org.apache.zookeeper:zookeeper-jute:jar:3.6.3:compile
[INFO]    commons-codec:commons-codec:jar:1.15:compile
[INFO]    org.apache.commons:commons-compress:jar:1.23.0:compile
[INFO]    org.apache.commons:commons-lang3:jar:3.12.0:compile
[INFO]    com.google.code.findbugs:jsr305:jar:3.0.0:runtime
[INFO]    org.slf4j:slf4j-api:jar:2.0.7:compile
[INFO]    org.xerial.snappy:snappy-java:jar:1.1.10.0:compile
[INFO]    com.github.luben:zstd-jni:jar:1.5.5-3:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.3.5:compile
[INFO]    commons-logging:commons-logging:jar:1.1.3:compile
[INFO]    org.apache.hadoop:hadoop-aws:jar:3.3.5:compile
[INFO]    com.amazonaws:aws-java-sdk-bundle:jar:1.12.316:compile
[INFO]    org.wildfly.openssl:wildfly-openssl:jar:1.1.3.Final:compile
[INFO]    com.google.cloud.bigdataoss:gcs-connector:jar:shaded:hadoop3-2.2.14:compile
[INFO]    joda-time:joda-time:jar:2.12.5:compile
[INFO]    com.fasterxml.jackson.core:jackson-databind:jar:2.15.1:compile
[INFO]    com.fasterxml.jackson.core:jackson-core:jar:2.15.1:compile
[INFO]    com.fasterxml.jackson.core:jackson-annotations:jar:2.15.1:compile
[INFO]    com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:jar:2.15.1:compile
[INFO]    org.apache.httpcomponents:httpclient:jar:4.5.14:compile
[INFO]    org.apache.httpcomponents:httpcore:jar:4.4.16:compile
[INFO]    org.apache.hadoop:hadoop-azure:jar:3.3.5:compile
[INFO]    com.microsoft.azure:azure-storage:jar:7.0.1:compile
[INFO]    com.microsoft.azure:azure-keyvault-core:jar:1.0.0:compile
[INFO]    org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO]    org.apache.hadoop:hadoop-cloud-storage:jar:3.3.5:compile
[INFO]    org.apache.hadoop:hadoop-annotations:jar:3.3.5:compile
[INFO]    org.apache.hadoop:hadoop-aliyun:jar:3.3.5:compile
[INFO]    com.aliyun.oss:aliyun-sdk-oss:jar:3.13.0:compile
[INFO]    org.jdom:jdom2:jar:2.0.6:compile
[INFO]    com.aliyun:aliyun-java-sdk-core:jar:4.5.10:compile
[INFO]    org.ini4j:ini4j:jar:0.5.4:compile
[INFO]    io.opentracing:opentracing-api:jar:0.33.0:compile
[INFO]    io.opentracing:opentracing-util:jar:0.33.0:compile
[INFO]    io.opentracing:opentracing-noop:jar:0.33.0:compile
[INFO]    com.aliyun:aliyun-java-sdk-ram:jar:3.1.0:compile
[INFO]    com.aliyun:aliyun-java-sdk-kms:jar:2.11.0:compile
[INFO]    org.codehaus.jettison:jettison:jar:1.5.3:compile
[INFO]    org.apache.hadoop:hadoop-azure-datalake:jar:3.3.5:compile
[INFO]    com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile
[INFO]    org.eclipse.jetty:jetty-util:jar:9.4.51.v20230217:compile
[INFO]    org.eclipse.jetty:jetty-util-ajax:jar:9.4.51.v20230217:compile
[INFO]    org.spark-project.spark:unused:jar:1.0.0:compile
[INFO] 
[INFO] ------------------------------------------------------------------------
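The before/after comparison can also be checked mechanically. Below is a minimal sketch, not part of this patch, that assumes the `dependency:list` output of each build has been captured as text. The helper names `parse_dependencies` and `diff_dependencies` are hypothetical, and the regular expression only assumes the `[INFO]    group:artifact:type[:classifier]:version:scope` line shape shown above.

```python
import re

# Matches one resolved-dependency line as printed above, e.g.
# "[INFO]    org.apache.orc:orc-shims:jar:1.8.3:compile"
# Coordinates have 5 parts, or 6 when a classifier is present.
DEP_LINE = re.compile(r"^\[INFO\]\s+([\w.\-]+(?::[\w.\-]+){4,5})\s*$")


def parse_dependencies(log_text):
    """Return the set of dependency coordinates found in the log text."""
    deps = set()
    for line in log_text.splitlines():
        match = DEP_LINE.match(line)
        if match:
            deps.add(match.group(1))
    return deps


def diff_dependencies(before_text, after_text):
    """Return (removed, added) coordinates, each as a sorted list."""
    before = parse_dependencies(before_text)
    after = parse_dependencies(after_text)
    return sorted(before - after), sorted(after - before)
```

If both returned lists are empty, the two builds resolved the same set of dependencies. Note that a version bump shows up as one removed and one added coordinate.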


pan3793 · 2023-05-30T10:05:10Z

cc @LuciferYang @srowen

hadoop-cloud/src/hadoop-3/test/resources/log4j2.properties

LuciferYang · 2023-05-30T14:11:39Z

Merged to master. Thanks @srowen and @pan3793

Closes apache#41380 from pan3793/SPARK-43880.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: yangjie01 <[email protected]>

### What changes were proposed in this pull request?

This PR aims to downgrade the Apache Hadoop dependency to 3.3.4 in `Apache Spark 3.5` in order to prevent any regression from `Apache Spark 3.4.x`. In other words, although `Apache Spark 3.5.x` will lose many bug fixes from Apache Hadoop 3.3.5 and 3.3.6, it will be in the same situation as `Apache Spark 3.4.x`.

- SPARK-44197 Upgrade Hadoop to 3.3.6 (#41744)
- SPARK-42913 Upgrade Hadoop to 3.3.5 (#39124)
- SPARK-43448 Remove dummy dependency `hadoop-openstack` (#41133)

On top of reverting SPARK-44197 and SPARK-42913, this PR has an additional dependency exclusion change due to the following.

- SPARK-43880 Organize `hadoop-cloud` in standard maven project structure (#41380)

### Why are the changes needed?

There is a community report on an S3A committer performance regression. Although it is a one-line fix, there is no available Hadoop release with that fix at this time.

- HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer (apache/hadoop#5706)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42345 from dongjoon-hyun/SPARK-44678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-43880][BUILD] Organize hadoop-cloud in standard maven project structure

9969cbc

github-actions bot added the BUILD label May 30, 2023

nit

9df37c0

srowen approved these changes May 30, 2023


LuciferYang reviewed May 30, 2023


hadoop-cloud/src/hadoop-3/test/resources/log4j2.properties

LuciferYang approved these changes May 30, 2023


LuciferYang closed this in 4006559 May 30, 2023

dongjoon-hyun mentioned this pull request Aug 4, 2023

[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4 #42345

Closed
