Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Track pipeline options revision for idempotent initialization of file systems #26694

Merged
merged 2 commits into from
Aug 2, 2023

Conversation

mosche
Copy link
Member

@mosche mosche commented May 15, 2023

Current initialization of FileSystems through FileSystems.setDefaultPipelineOptions is problematic and prone to race conditions, especially when triggered on deserialization of SerializablePipelineOptions (see #18430, BEAM-14465, BEAM-14355).

Particularly with S3FileSystem this is bad as it can easily leak resources (threads!) if used this way (see #26321).

Observations

  • Removing FileSystems.setDefaultPipelineOptions from deserialization in SerializablePipelineOptions (SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions. #18430) is unlikely to happen any time soon as it requires a coordinated push across various runners and it’s not obvious where and when initialization is supposed to happen for each runner. As a consequence we must expect FileSystems.setDefaultPipelineOptions to be called any number of times.
  • Looking into previous related issues (BEAM-14465, BEAM-14355), logs give a clear hint how often the file systems got re-initialized. The logs mentioned in the tickets happen(ed) once per initialization of the S3 FS: In the case of the Go Flink integration tests that was almost 80k times, running Python wordcount on Beam it was happening > 600 times.
  • I spend quite some time investigating if Pipeline options could be locked during execution. That requires a lot of effort changing runners, but that part is possible. However, test cases strongly suggest being able to mutate pipeline options while running is an intended feature.

Changes in this PR

  • Extend pipeline options to track a monotonically increasing revision with each update.
  • Change the initialization of file systems to be idempotent by means of that revision. Additionally this is scoped to a particular pipeline using the options id.

Closes #27535, fixes #26321

(cc @robertwb Thanks for your reply on the email thread)


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@github-actions github-actions bot added the java label May 15, 2023
@mosche mosche force-pushed the pipeline_opts_revision branch from 02022b7 to ecc7ee6 Compare May 15, 2023 14:53
@github-actions
Copy link
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 15, 2023
@mosche mosche removed the stale label Jul 18, 2023
@mosche mosche force-pushed the pipeline_opts_revision branch from ecc7ee6 to d017628 Compare July 18, 2023 08:52
@mosche mosche marked this pull request as ready for review July 18, 2023 09:10
@mosche mosche changed the title [Proposal] Track pipeline options revision for idempotent initialization of file systems [Java] Track pipeline options revision for idempotent initialization of file systems Jul 18, 2023
@mosche
Copy link
Member Author

mosche commented Jul 18, 2023

Run Java_IOs_Direct PreCommit

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@mosche
Copy link
Member Author

mosche commented Jul 18, 2023

Run Java PreCommit

2 similar comments
@mosche
Copy link
Member Author

mosche commented Jul 19, 2023

Run Java PreCommit

@mosche
Copy link
Member Author

mosche commented Jul 20, 2023

Run Java PreCommit

@mosche mosche force-pushed the pipeline_opts_revision branch from d017628 to d7872b9 Compare July 20, 2023 11:46
@mosche
Copy link
Member Author

mosche commented Jul 26, 2023

R: @aromanenko-dev

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@aromanenko-dev
Copy link
Contributor

CC: @robertwb
PTAL

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Before merging, I'd ask @robertwb to take a look for additional comments

Also, it would be helpful to add some unit tests for this new "revision" functionality.

@mosche mosche force-pushed the pipeline_opts_revision branch from d7872b9 to f3bdc8d Compare August 1, 2023 10:39
@@ -103,6 +103,18 @@ public class PipelineOptionsFactoryTest {
@Rule public TestRule restoreSystemProperties = new RestoreSystemProperties();
@Rule public ExpectedLogs expectedLogs = ExpectedLogs.none(PipelineOptionsFactory.class);

@Test
public void testRevision() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aromanenko-dev I've added a basic unit test

@aromanenko-dev
Copy link
Contributor

Run Java PreCommit

@aromanenko-dev
Copy link
Contributor

Failed test of Java PreCommit is not related.
Merging.

@aromanenko-dev aromanenko-dev merged commit 663bd52 into apache:master Aug 2, 2023
@mosche
Copy link
Member Author

mosche commented Aug 2, 2023

Thanks @aromanenko-dev

@mosche mosche deleted the pipeline_opts_revision branch August 2, 2023 09:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants