[Failing Test]: :sdks:go:test:ulrValidatesRunner appears to be very flaky at master #26061

Closed
1 of 15 tasks
tvalentyn opened this issue Mar 31, 2023 · 7 comments · Fixed by #26062
Labels
bug done & done Issue has been reviewed after it was closed for verification, followups, etc. failing test flake go P1 python tests

Comments

@tvalentyn
Contributor

What happened?

Sample failure runs: https://ci-beam.apache.org/job/beam_PreCommit_GoPortable_Cron/2996

Also failed on some in-flight PRs.

13:19:29 --- PASS: TestEmitParDoAfterGBK (4.50s)
13:19:29 PASS
13:19:29 ok  	github.com/apache/beam/sdks/v2/go/test/regression	124.686s
13:19:29 FAIL
13:19:29 
13:19:29 > Task :sdks:go:test:ulrValidatesRunner FAILED
13:19:29 
13:19:29 FAILURE: Build failed with an exception.
13:19:29 
13:19:29 * Where:
13:19:29 Build file '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Cron/src/sdks/go/test/build.gradle' line: 152
13:19:29 
13:19:29 * What went wrong:
13:19:29 Execution failed for task ':sdks:go:test:ulrValidatesRunner'.
13:19:29 > Process 'command 'sh'' finished with non-zero exit value 1
13:19:29 
13:19:29 * Try:
13:19:29 > Run with --stacktrace option to get the stack trace.
13:19:29 > Run with --info or --debug option to get more log output.

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 1 (unhealthy code / failing or flaky postcommit so we cannot be sure the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn
Contributor Author

cc: @lostluck

@lostluck
Contributor

Note that this is happening on the Python Portable Runner, but not on any of the other validates runner suites, so it doesn't seem to be on the Go side.

I was chasing this down from a different series. It's failing on artifact upload, which doesn't have any recent work on the Go side (https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/artifact).

I'd be concerned it was the datalayer rewrite (#25982), but that wasn't merged until the 28th, and this started happening on the 25th. (The datalayer rewrite does have a flake, but it's in a unit test #26057, not a runner test).

But Python's portable runner doesn't seem to have anything new there either.

Very confusing. But then I'd expect to see non-infra flakes in the other runner suite tests.

@lostluck
Contributor

2023/03/31 20:19:00 Prepared job with id: go-testxlang_multi-841-7e5f7f7d-e29d-41a5-85b4-c701695d54d4 and staging token: go-testxlang_multi-841-7e5f7f7d-e29d-41a5-85b4-c701695d54d4
    ptest.go:108: Failed to execute job: 	staging artifacts
        failed to stage /tmp/worker-7-1680293938316973953 in 3 attempts: failed to send chunks for /tmp/worker-7-1680293938316973953
        	caused by:
        chunk send failed
        	caused by:
        EOF; failed to send chunks for /tmp/worker-7-1680293938316973953
        	caused by:
        chunk send failed
        	caused by:
        EOF; failed to send chunks for /tmp/worker-7-1680293938316973953
        	caused by:
        chunk send failed
        	caused by:
        EOF; failed to send chunks for /tmp/worker-7-1680293938316973953
        	caused by:
        chunk send failed
        	caused by:
        EOF
--- FAIL: TestXLang_Multi (12.99s)

Same thing for some of the others. But it's very odd that it's happening to the Python runner and not Flink/Spark/Samza.

Feels sort of like a gRPC thing, but again, not sure why it only started recently.

@lostluck
Contributor

That path isn't doing the correct thing WRT the error on Send. An EOF from Send means the client should close the stream and check what the server is returning. Typically the EOF means that the server side closed the stream for some reason.
The upload is retrying 3 times and still getting failures like that, so let's try following the proper protocol WRT errors on sends...

@lostluck
Contributor

lostluck commented Apr 1, 2023

Well that's unexpected:

 ptest.go:108: Failed to execute job: 	staging artifacts
        failed to stage /tmp/worker-5-1680307335622714242 in 3 attempts: failed to send chunks for /tmp/worker-5-1680307335622714242; close error: rpc error: code = Unimplemented desc = Method not found!
        	caused by:
        EOF; failed to send chunks for /tmp/worker-5-1680307335622714242; close error: rpc error: code = Unimplemented desc = Method not found!
        	caused by:
        EOF; failed to send chunks for /tmp/worker-5-1680307335622714242; close error: rpc error: code = Unimplemented desc = Method not found!
        	caused by:
        EOF; failed to send chunks for /tmp/worker-5-1680307335622714242; close error: rpc error: code = Unimplemented desc = Method not found!
        	caused by:
        EOF

So, the Server doesn't seem to have it implemented when the SDK connects? Very strange.
https://ci-beam.apache.org/job/beam_PreCommit_GoPortable_Phrase/176/consoleText

@lostluck
Contributor

lostluck commented Apr 1, 2023

That's because if the "Portable" artifact upload fails, we don't log any of those errors; we just log that the old legacy method doesn't exist. Adding in that logging to see why it's failing...

@lostluck
Contributor

lostluck commented Apr 1, 2023

I'm not having any luck replicating that failure right now; it always goes that way once debugging is added. I suspect it's related to Jenkins machine load, so this will have to wait until next week.

@github-actions github-actions bot added this to the 2.47.0 Release milestone Apr 4, 2023
@tvalentyn tvalentyn added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Apr 11, 2023