Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The PostCommit Go VR Flink job is flaky #31122

Closed
github-actions bot opened this issue Apr 26, 2024 · 10 comments
Closed

The PostCommit Go VR Flink job is flaky #31122

github-actions bot opened this issue Apr 26, 2024 · 10 comments

Comments

@github-actions
Copy link
Contributor

The PostCommit Go VR Flink is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Go_VR_Flink.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=10&var-Workflow=PostCommit%20Go%20VR%20Flink

@Abacn
Copy link
Contributor

Abacn commented Jun 6, 2024

The failing tests are

2024-06-06T15:49:25.5514012Z     teststream_test.go:82: Failed to execute job: job go0testteststreamsimple0infinitydefault0645-runner-0606154852-104bc19c_903ad8b0-583f-49fa-9c25-7de441e9645e failed:
2024-06-06T15:49:25.5516578Z         org.apache.beam.sdk.coders.CoderException: `UnknownCoderWrapper` was used to perform an actual decoding in the Java SDK. Potentially a Java transform is being followed by a cross-language transform that uses a coder that is not available in the Java SDK. Please make sure that Python transforms at the multi-language boundary use Beam portable coders.
2024-06-06T15:49:25.5518501Z --- FAIL: TestTestStreamSimple_InfinityDefault (36.04s)
2024-06-06T15:48:49.4744385Z     teststream_test.go:77: Failed to execute job: job go0testteststreamsimple0300-runner-0606154839-3044966_f7c93228-67e5-4745-8d98-66cf379629eb failed:
2024-06-06T15:48:49.4748151Z         org.apache.beam.sdk.coders.CoderException: `UnknownCoderWrapper` was used to perform an actual decoding in the Java SDK. Potentially a Java transform is being followed by a cross-language transform that uses a coder that is not available in the Java SDK. Please make sure that Python transforms at the multi-language boundary use Beam portable coders.
2024-06-06T15:48:49.4751550Z --- FAIL: TestTestStreamSimple (13.58s)
2024-06-06T15:49:38.9044506Z     teststream_test.go:87: Failed to execute job: job go0testteststreamtogbk02-runner-0606154928-bee4a43d_deed8bb6-6a4b-42bf-9d42-2bd1911d0bfe failed:
2024-06-06T15:49:38.9048296Z         org.apache.beam.sdk.coders.CoderException: `UnknownCoderWrapper` was used to perform an actual decoding in the Java SDK. Potentially a Java transform is being followed by a cross-language transform that uses a coder that is not available in the Java SDK. Please make sure that Python transforms at the multi-language boundary use Beam portable coders.
2024-06-06T15:49:38.9050512Z --- FAIL: TestTestStreamToGBK (13.42s)
2024-06-06T15:49:48.8207809Z     teststream_test.go:92: Failed to execute job: job go0testteststreamtimerseventtime055-runner-0606154942-51c50c75_ae30c901-6f2f-4981-a593-956e0a3a0b79 failed:
2024-06-06T15:49:48.8209479Z         java.lang.NullPointerException
2024-06-06T15:49:48.8210362Z --- FAIL: TestTestStreamTimersEventTime (9.83s)

These tests were added in #31046 . Shall we fix the test or disable them on Flink VR test suites? @lostluck

@Abacn
Copy link
Contributor

Abacn commented Jun 6, 2024

cc: @kennknowles another currently permared PostCommit on release-2.57.0 branch

@Abacn
Copy link
Contributor

Abacn commented Jun 6, 2024

The error also sounds like #30994. The error message notes "Python transforms" which is not accurate either

@lostluck
Copy link
Contributor

lostluck commented Jun 6, 2024

Agreed, that filtering is the right move here. Those tests do use strings which is one of the affected datatypes Flink's TestStream corrupts.


At some point we determined that Flink does with TestStream and mutates those coders (adding length prefixes where they weren't previously) without making the equivalent mutations to the equivalent bytes, but we weren't able to pin down where it was coming from.

This we filtered out those tests here:

https://github.com/apache/beam/blob/master/sdks/go/test/integration/integration.go#L181

Ideally we fix the Flink test stream implementation, but until then, we filter, since it's not commonly used.

@lostluck
Copy link
Contributor

lostluck commented Jun 6, 2024

Actually, "TestTestStreamSimple" and "TestTestStreamToGBK" should be working, so those are new failures. "TestTestStreamTimersEventTime" I'd expect to fail based on previous behaviour.

@Abacn
Copy link
Contributor

Abacn commented Jun 6, 2024

Actually, "TestTestStreamSimple" and "TestTestStreamToGBK" should be working

All these 4 tests were added in #31046 and failing since first run. Or do you suggest the newly added test reveals some underlying bug/gap ?

@lostluck
Copy link
Contributor

lostluck commented Jun 6, 2024

Ah! Right I recall now. So those were added because they did reveal a gap in Prism's test stream implementation.

They're likely revealing one in Flink, so agreed they should be filtered.

The simple ones are pipelines without any Impulse transform, so the runner's TestStream must be capable of kicking off the pipeline.

@Abacn
Copy link
Contributor

Abacn commented Jun 6, 2024

so agreed they should be filtered.

yeah, thanks, I am trying to do that. However it seems not obvious how can I filter out a specific test for a specific runner in VR test suite.

In Java this was done by excludeCategories https://github.com/apache/beam/blob/master/runners/flink/flink_runner.gradle#L298 (as gradle is built for java)

@lostluck
Copy link
Contributor

lostluck commented Jun 6, 2024

We filter out those tests flink here:

https://github.com/apache/beam/blob/master/sdks/go/test/integration/integration.go#L181

Each runner has its own list of tests it can't run.

Wildcards or the raw test name can be used too.

@kennknowles
Copy link
Member

Ah, this is great news. I had not dug very deeply into the failures. The whole branch looked a mess so I assumed that our GHA just was broken in a big way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants