[SPARK-43448][BUILD] Remove dummy dependency `hadoop-openstack` #41133

pan3793 · 2023-05-11T07:21:22Z

What changes were proposed in this pull request?

Remove the dummy dependency hadoop-openstack from Spark binary artifacts.

Why are the changes needed?

HADOOP-18442 removed the hadoop-openstack module and temporarily retained a dummy jar for downstream projects that consume it.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA.

pan3793 · 2023-05-11T08:38:08Z

cc @dongjoon-hyun @sunchao @steveloughran

dongjoon-hyun

It's really interesting because that kind of dependency deletion happened in the maintenance release (Apache Hadoop 3.3.5).

steveloughran · 2023-05-11T08:58:08Z

we emptied the jar but left the stub artifact there so that things which did explicitly pull it in wouldn't start breaking.

Now that Spark is Hadoop 3.3.5+ only, most of the hadoop-cloud-storage dependencies can be reworked down to just:

- import hadoop-cloud-storage
- cut aliyun-sdk if you don't want it (the 3.3.5 version isn't breaking s3 any more, FWIW)
- add google gcs

pan3793 · 2023-05-11T08:58:31Z

@steveloughran does Hadoop 3.3.5 guarantee compatibility w/ previous versions? e.g. is it OK to use Hadoop 3.3.5 client to access Hadoop 3.3.0~3.3.4 server?

steveloughran · 2023-05-11T09:01:09Z

SPARK-42537 covers the full cleanup.

w.r.t this patch: LGTM.

dongjoon-hyun

Thank you for the context, @pan3793 and @steveloughran .

dongjoon-hyun · 2023-05-11T09:04:17Z

Merged to master for Apache Spark 3.5.0.

steveloughran · 2023-05-11T10:52:19Z

> is it OK to use Hadoop 3.3.5 client to access Hadoop 3.3.0~3.3.4 server?

Should be. IPC is all based on protobuf and we try not to remove things to avoid breaking existing code. HDFS compatibility across major versions is something which matters a lot; I believe webhdfs has the strongest guarantees.

What does break, guaranteed, is mixing hadoop libraries from different versions on the classpath. Avoid that. And for cloud stuff, openssl/wildfly is a source of extreme brittleness, even though when it works it's often faster than JVM ssl.
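To illustrate the classpath warning above, here is a minimal sketch (not part of this PR, Spark, or Hadoop) that scans a list of jar file names and reports whether more than one Hadoop version is present. The function names and the jar naming pattern are assumptions made for illustration. `hadoop-shaded-*` jars are skipped on the assumption that they come from hadoop-thirdparty and follow their own version line.

```python
import re
from collections import defaultdict

# Assumed naming pattern: "<artifact>-<version>.jar", e.g. "hadoop-aws-3.3.4.jar".
# hadoop-shaded-* jars are excluded because they are versioned separately
# (assumption: they belong to hadoop-thirdparty, not to the Hadoop release).
HADOOP_JAR = re.compile(
    r"^(hadoop-(?!shaded-)[a-z0-9-]+?)-(\d+\.\d+\.\d+(?:-[A-Za-z0-9.]+)?)\.jar$"
)


def hadoop_versions(jar_names):
    """Map each Hadoop version found to the sorted artifact names that use it."""
    found = defaultdict(list)
    for name in jar_names:
        match = HADOOP_JAR.match(name)
        if match:
            artifact, version = match.groups()
            found[version].append(artifact)
    return {version: sorted(artifacts) for version, artifacts in found.items()}


def has_mixed_hadoop_versions(jar_names):
    """Return True when jars from more than one Hadoop version are listed."""
    return len(hadoop_versions(jar_names)) > 1


if __name__ == "__main__":
    # Hypothetical classpath listing, for demonstration only.
    jars = [
        "hadoop-client-api-3.3.5.jar",
        "hadoop-client-runtime-3.3.5.jar",
        "hadoop-aws-3.3.4.jar",
        "hadoop-shaded-guava-1.1.1.jar",
        "spark-core_2.12-3.5.0.jar",
    ]
    for version, artifacts in sorted(hadoop_versions(jars).items()):
        print(version, ", ".join(artifacts))
    print("mixed versions:", has_mixed_hadoop_versions(jars))
```

A check like this could be pointed at the file names in a distribution's `jars/` directory; it only inspects names, so it will not catch classes that were shaded or repackaged into other jars.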

pan3793 · 2023-05-11T11:00:30Z

Got it, thanks @steveloughran

### What changes were proposed in this pull request?

This PR aims to downgrade the Apache Hadoop dependency to 3.3.4 in `Apache Spark 3.5` in order to prevent any regression from `Apache Spark 3.4.x`. In other words, although `Apache Spark 3.5.x` will lose many bug fixes of Apache Hadoop 3.3.5 and 3.3.6, it will be in the same situation with `Apache Spark 3.4.x`.

- SPARK-44197 Upgrade Hadoop to 3.3.6 (#41744)
- SPARK-42913 Upgrade Hadoop to 3.3.5 (#39124)
- SPARK-43448 Remove dummy dependency `hadoop-openstack` (#41133)

On top of reverting SPARK-44197 and SPARK-42913, this PR has additional dependency exclusion change due to the following.

- SPARK-43880 Organize `hadoop-cloud` in standard maven project structure (#41380)

### Why are the changes needed?

There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time.

- HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer (apache/hadoop#5706)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #42345 from dongjoon-hyun/SPARK-44678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-43448][BUILD] Remove dummy hadoop-openstack

7750b50

github-actions bot added the BUILD label May 11, 2023

dongjoon-hyun reviewed May 11, 2023

dongjoon-hyun approved these changes May 11, 2023

dongjoon-hyun closed this in e62aab2 May 11, 2023

dongjoon-hyun mentioned this pull request Aug 4, 2023

[SPARK-44678][BUILD][3.5] Downgrade Hadoop to 3.3.4 #42345

Closed
