Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#27839] Write PipelineOptions to a file instead of an environment variable. #27842

Merged
merged 7 commits into from
Aug 16, 2023

Conversation

lostluck
Copy link
Contributor

@lostluck lostluck commented Aug 3, 2023

Writes the JSON pipeline options to a local in container file and sets a new env variable PIPELINE_OPTIONS_FILE

Sibling to #27841 and handles the Java variant of #27839.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@codecov
Copy link

codecov bot commented Aug 4, 2023

Codecov Report

Merging #27842 (4d0342b) into master (f388f3c) will decrease coverage by 0.03%.
Report is 15 commits behind head on master.
The diff coverage is 6.25%.

@@            Coverage Diff             @@
##           master   #27842      +/-   ##
==========================================
- Coverage   72.32%   72.30%   -0.03%     
==========================================
  Files         678      678              
  Lines       99726    99740      +14     
==========================================
- Hits        72130    72116      -14     
- Misses      26032    26063      +31     
+ Partials     1564     1561       -3     
Flag Coverage Δ
go 53.66% <6.25%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
sdks/java/container/boot.go 16.15% <6.25%> (-1.06%) ⬇️

... and 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2023

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@lostluck
Copy link
Contributor Author

lostluck commented Aug 4, 2023

TBH I expected more jobs to fail, since this basically removes Pipeline options from the harness until the SDK side change is fixed.

@lostluck
Copy link
Contributor Author

lostluck commented Aug 7, 2023

Run Java_Examples_Dataflow_Java17 PreCommit

@lostluck
Copy link
Contributor Author

lostluck commented Aug 7, 2023

Run Flink ValidatesRunner

@lostluck
Copy link
Contributor Author

lostluck commented Aug 7, 2023

Run Java Dataflow V2 ValidatesRunner

1 similar comment
@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run Java Dataflow V2 ValidatesRunner

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run Spark ValidatesRunner Java 11

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run PortableJar_Flink PostCommit

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run PortableJar_Spark PostCommit

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run Samza ValidatesRunner

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Java_PVR_Flink_Docker

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

beam_PostCommit_Java_PVR_Spark3_Streaming

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

beam_PostCommit_Java_PVR_Spark_Batch

@lostluck
Copy link
Contributor Author

lostluck commented Aug 8, 2023

Run Samza ValidatesRunner

@lostluck
Copy link
Contributor Author

lostluck commented Aug 9, 2023

Run Java Dataflow V2 ValidatesRunner

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks reasonable, but I'm curious why it's needed. Do environment variables have line length limits similar to command lines?

@lostluck
Copy link
Contributor Author

The change looks reasonable, but I'm curious why it's needed. Do environment variables have line length limits similar to command lines?

Per the issue (#27839), as far as the linux kernel is concerned, Environment Variables are Command Line Arguments. They consume the same "resource". See this stack overflow about it:
https://stackoverflow.com/questions/28865473/setting-environment-variable-to-a-large-value-argument-list-too-long

As a result, the only way to avoid Argument Too Long errors from the OS when starting a worker is to not serialize the whole pipeline options into the command line too.

This has affected Dataflow customers on RunnerV2 which use the portable SDK containers, when the filesToStage value becomes excessively long due to Java reasons.

The Legacy Java Dataflow Worker (runnerv1) has done this since 2018, and the fix was never backported to the containers, since we mistakenly believed that Environment Variables would get around the problem.

They don't.

@lostluck
Copy link
Contributor Author

I haven't been able to find a sufficiently indicative Test that would validate that Pipeline Options are being successfully passed around and available or not yet. Basically no tests were failing due to this change even before Byron's PR #27841 was put in.

So I'm not yet confident this is working as intended yet.

@robertwb
Copy link
Contributor

The change looks reasonable, but I'm curious why it's needed. Do environment variables have line length limits similar to command lines?

Per the issue (#27839), as far as the linux kernel is concerned, Environment Variables are Command Line Arguments. They consume the same "resource". See this stack overflow about it: https://stackoverflow.com/questions/28865473/setting-environment-variable-to-a-large-value-argument-list-too-long

As a result, the only way to avoid Argument Too Long errors from the OS when starting a worker is to not serialize the whole pipeline options into the command line too.

This has affected Dataflow customers on RunnerV2 which use the portable SDK containers, when the filesToStage value becomes excessively long due to Java reasons.

The Legacy Java Dataflow Worker (runnerv1) has done this since 2018, and the fix was never backported to the containers, since we mistakenly believed that Environment Variables would get around the problem.

They don't.

Interesting. Thanks for digging into this.

@lostluck
Copy link
Contributor Author

I've asked @damondouglas to help me validate, since I'm not confident I'll be able to get a functional Java/Maven pipeline authored without spending all day at it.

@lostluck
Copy link
Contributor Author

Specifically, the smoke test would need to

  1. Build a fresh container from this branch.
  2. Run a pipeline using the fresh container. Must have a DoFn that fails if some pipeline option is not set as expected.
  3. Run it against a portable runner in docker mode (eg. Flink or Spark in portable mode, or PyPortable, or Dataflow w/RunnerV2).

If we can manage that in a timely fashion I'd be up to cherry pick this to get it into the release.

@lostluck
Copy link
Contributor Author

I have validated the change with a little guidance from @damondouglas in testing live changes on the Java SDK.

I did have a bug in my write error handling causing the container to hard fail 100% of the time.

It's concerning that I couldn't find any Gradle tests that quickly smoke check the Java Portable Containers. I'm a little surprised that there wasn't any against the Python Portable runner that rely on containers. (Or if there is, it's not obvious and I couldn't find it among our Actions or Jenkins tasks, or gradle commands).

I'll eventually be getting such a suite up for Prism however when I get to seeing other SDKs run on it. Hopefully I can get back to that soon.

@lostluck
Copy link
Contributor Author

Run Java Dataflow V2 ValidatesRunner

lostluck added a commit to lostluck/beam that referenced this pull request Aug 15, 2023
lostluck added a commit that referenced this pull request Aug 15, 2023
Co-authored-by: lostluck <[email protected]>
@lostluck lostluck merged commit 1981cb0 into apache:master Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants