Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Large PIPELINE_OPTIONS can lead to command line args list too long errors in SDK containers #27839

Open
2 of 15 tasks
lostluck opened this issue Aug 3, 2023 · 6 comments

Comments

@lostluck
Copy link
Contributor

lostluck commented Aug 3, 2023

What happened?

A Dataflow customer with a large number of --filesToStage leads to workers unable to boot up, failing with Java exited: fork/exec /opt/java/openjdk/bin/java: argument list too long.

After some investigation, it's revealed that in Linux, Environment variables take up command line length apparently:

https://stackoverflow.com/questions/28865473/setting-environment-variable-to-a-large-value-argument-list-too-long

And Beam Java serializes the pipeline options in JSON format to an evironement variable.

https://github.com/apache/beam/blob/release-2.49.0/sdks/java/container/boot.go#L128

This also happens for Python:

os.Setenv("PIPELINE_OPTIONS", options)
but no reports for this as of yet.

Previous work to resolve this was here, focused on the Java class path: #25582

While that certainly helped the issue, large Pipeline options remain an issue.

The proposed fix for Java at least is to write another environment variable PIPELINE_OPTIONS_FILE, which will contain the file location for a json encoded version of the pipeline options, similar to how we've done the pathing jar.

The behavior from the portable SDK harness should be to look at this environment variable, and if it exists, read the JSON pipeline options from them. Otherwise, fall back to the existing behavior.

This allows for slight mismatch in container versions vs Beam versions for users who aren't experiencing this issue.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@lostluck
Copy link
Contributor Author

lostluck commented Aug 3, 2023

pipeline_options_dict = _load_pipeline_options(
is where the Python stuff primarily happens.

and

private static final String PIPELINE_OPTIONS = "PIPELINE_OPTIONS";

is the same for Java.

@lostluck
Copy link
Contributor Author

lostluck commented Aug 9, 2023

I'm unable to gain confidence that this PR doesn't simply break things, as tests weren't failing without the java change, or with the java change.

Since I don't have the ability to do anything quickly with the Java SDK, I'm unable to validate this before the 2.50 cut. However, users could build a custom container in the mean time, based on the beam branch once we do successfully get this in, but are unable to cherry pick it.

lostluck added a commit that referenced this issue Aug 16, 2023
…riable. (#27842)

* remove use of deprecated io/ioutil package.

* [#27839] Write pipeline options to a local file for SDK use.

* finish comment

* Use guave 32.1.2

* fix bad merges

* further merge fixes

* Fix write error handling.

---------

Co-authored-by: lostluck <[email protected]>
Co-authored-by: Yi Hu <[email protected]>
@lostluck
Copy link
Contributor Author

This should be resolved for Java in 2.50+. It's lower priority to do the same for Java and Go, but applying this protocol consistently across SDKs will prevent future issues.

@damccorm
Copy link
Contributor

@lostluck I noticed we've bumped this back several milestones - any reason to keep a milestone on it vs just removing it entirely and letting it get autotagged when we get to it?

Guessing its not actually a 2.52 release blocker

@lostluck lostluck removed this from the 2.52.0 Release milestone Oct 27, 2023
@lostluck
Copy link
Contributor Author

Agreed.

Java was the only one impacted, but migrating to this approach for all SDKs would be for parity.

@lostluck
Copy link
Contributor Author

lostluck commented Jun 3, 2024

A Dataflow Go SDK user has run into an issue that seems like this approach would resolve, so I'm going to see about completing the change for the Go SDK. A volunteer for the Python SDK would be appreciated, so we can close the issue (though ensuring Typescript and Swift aren't blockers, they may as well move to the paradigm to avoid future issues, and assure consistency among SDKs).

lostluck added a commit that referenced this issue Jun 4, 2024
…om a flag. (#31482)

* [#27839] Move pipeline options file creation to tools package.

* Write options to a file in the container instead of burdening the command line.

---------

Co-authored-by: lostluck <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants