[Failing Test]: Dataflow V2 VR Streaming Flakiness due to timeout close worker pool #29644

Open · 1 of 16 tasks

Abacn opened this issue Dec 6, 2023 · 0 comments

Abacn (Contributor) commented Dec 6, 2023

What happened?

This is not an SDK bug; an internal ticket will also be filed.

Some tests have been observed failing at random on https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow_V2_Streaming.yml?query=event%3Aschedule

For example,

https://github.com/apache/beam/runs/19288524813 with

java.lang.RuntimeException: Dataflow job 2023-12-04_17_25_11-8098838800222529628 terminated in state RUNNING but did not return a failure reason.
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:141)
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:93)
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:55)
	at org.apache.beam.sdk.Pipeline.run(Pipeline.java:321)
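Presumably the exception above comes from the test runner's wait loop: it polls the job state until the job leaves RUNNING (or the wait gives up), and if the job still looks RUNNING with no reported failure reason, it throws. A minimal sketch of that logic, with hypothetical names rather than Beam's actual implementation:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of how a test runner blocks waiting for a Dataflow job. */
public class JobWaitSketch {
  public enum State { RUNNING, DONE, FAILED, CANCELLED }

  public interface Job {
    State currentState();
    String failureReason(); // null when the service reported no failure
  }

  /** Polls until the job leaves RUNNING or the deadline passes. */
  public static State waitForTerminalState(Job job, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    State state = job.currentState();
    while (state == State.RUNNING && System.nanoTime() < deadline) {
      try {
        TimeUnit.MILLISECONDS.sleep(5); // short poll interval for the sketch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
      state = job.currentState();
    }
    if (state == State.RUNNING && job.failureReason() == null) {
      // The situation in the flaky runs: the wait gives up while the worker
      // pool is still tearing down, so the job looks RUNNING with no error.
      throw new RuntimeException(
          "Dataflow job terminated in state RUNNING but did not return a failure reason.");
    }
    return state;
  }
}
```

Under this reading, a slow worker-pool teardown alone is enough to trigger the failure even though the pipeline itself succeeded.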

Taking a look at the worker log for job 2023-12-04_01_06_49-4587814450665214461:

DEBUG 2023-12-04T09:10:15.680161433Z Cleaning up.
DEBUG 2023-12-04T09:10:15.758333246Z Starting worker pool teardown.
INFO 2023-12-04T09:10:15.782743429Z Stopping worker pool...
DEBUG 2023-12-04T09:10:15.860076821Z Starting worker pool teardown.
INFO 2023-12-04T09:10:15.884898782Z Stopping worker pool...
INFO 2023-12-04T09:10:22.584818Z *** SIGTERM received by PID 14 (TID 14) on cpu 0 from PID 0; stack trace: ***
INFO 2023-12-04T09:10:22.584901Z PC: @ 0x55e6a3671f96 (unknown) (unknown)
...
INFO 2023-12-04T09:10:23.448826420Z ****************************************
DEBUG 2023-12-04T09:15:59.183629917Z Cleaning up.
DEBUG 2023-12-04T09:16:00.002974729Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
DEBUG 2023-12-04T09:16:05.045251507Z Starting worker pool teardown.
INFO 2023-12-04T09:16:10.060122904Z Stopping worker pool...
DEBUG 2023-12-04T09:16:10.114936278Z Starting worker pool teardown.
INFO 2023-12-04T09:16:10.144000861Z Stopping worker pool...
INFO 2023-12-04T09:16:10.179066170Z Worker pool stopped.
DEBUG 2023-12-04T09:16:10.204084191Z Tearing down pending resources...
INFO 2023-12-04T09:16:10.792660720Z Your project already contains 100 Dataflow-created metric descriptors...

In these flaky runs it takes about 6 minutes to stop the worker pool, during which TestPipeline.run() is still blocking.
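The "6 minutes" figure is the gap between the first "Stopping worker pool..." line (09:10:15) and "Worker pool stopped." (09:16:10) in the log above; it can be checked directly from the ISO-8601 timestamps:

```java
import java.time.Duration;
import java.time.Instant;

/** Computes the teardown gap from two log-line timestamps. */
public class TeardownGap {
  public static Duration gap(String stoppingTs, String stoppedTs) {
    return Duration.between(Instant.parse(stoppingTs), Instant.parse(stoppedTs));
  }
}
```

For the run above, gap("2023-12-04T09:10:15.782743429Z", "2023-12-04T09:16:10.179066170Z") is just under 6 minutes.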

Another occurrence, job 2023-12-04_17_25_11-8098838800222529628:

INFO 2023-12-05T01:27:58.530670Z Successfully sampled resources
DEBUG 2023-12-05T01:28:21.709062648Z Cleaning up.
DEBUG 2023-12-05T01:28:21.755866390Z Starting worker pool teardown.
INFO 2023-12-05T01:28:21.778534382Z Stopping worker pool...
DEBUG 2023-12-05T01:28:21.806963604Z Starting worker pool teardown.
INFO 2023-12-05T01:28:21.827367725Z Stopping worker pool...
ERROR 2023-12-05T01:28:28.218612Z *** SIGTERM received by PID 14 (TID 14) on cpu 0 from PID 0; stack trace: ***
ERROR 2023-12-05T01:28:28.219090Z PC: @ 0x5621e4e71f96 (unknown) (unknown) @ 0x5621e6f3de89 1344 (unknown) @ 0x7f9168c099a0 631864672 (unknown) @ 0x5621e4e71f96 96 (unknown) @ 0x5621e4cafcca 208 (unknown) @ 0x5621e5df4d30 64 (unknown) @ 0x5621e52547b7 1664 (unknown) @ 0x7f9168a77bbd 208 __libc_start_main @ 0x5621e4c13599 (unknown) (unknown)
WARNING 2023-12-05T01:28:28.223847Z --- CPU registers: --- r8=0 r9=ffffffffffffffff r10=0 r11=246 r12=7f91689c9100 r13=7f91689c9180 r14=881 r15=ffffffffffffffff rdi=7f91689c9180 rsi=189 rbp=7ffd8e451870 rbx=7f91689c9100 rdx=0 rax=fffffffffffffffc rcx=5621e4e71f96 rsp=7ffd8e451820 rip=5621e4e71f96 efl=246 cgf=2b000000000033 err=0 trp=0 msk=0 cr2=0
...
INFO 2023-12-05T01:28:28.674663925Z *********************
DEFAULT 2023-12-05T01:28:28.674675645Z {"logging.googleapis.com/diagnostic":{…}}
DEBUG 2023-12-05T01:33:47.725042327Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
DEBUG 2023-12-05T01:33:47.946791112Z Cleaning up.
DEBUG 2023-12-05T01:33:48.026216727Z Starting worker pool teardown.
INFO 2023-12-05T01:33:48.042428706Z Stopping worker pool...
DEBUG 2023-12-05T01:33:48.064501675Z Starting worker pool teardown.
INFO 2023-12-05T01:33:48.086649213Z Stopping worker pool...
INFO 2023-12-05T01:33:48.118769551Z Worker pool stopped.
DEBUG 2023-12-05T01:33:48.138574980Z Tearing down pending resources...
INFO 2023-12-05T01:33:49.164659964Z Your project already contains 100 Dataflow-created metric descriptors...

This run also took about 6 minutes to stop the worker pool.

It is strange that the last log line is "Your project already contains 100 Dataflow-created metric descriptors."; that message is usually seen during setup, not teardown.

For comparison, a normally shut-down worker logs something like this:

DEBUG 2023-12-06T21:04:26.941513220Z Cleaning up.
DEBUG 2023-12-06T21:04:26.980247165Z Starting worker pool teardown.
INFO 2023-12-06T21:04:27.007520790Z Stopping worker pool...
DEBUG 2023-12-06T21:04:27.036155756Z Starting worker pool teardown.
INFO 2023-12-06T21:04:27.057789052Z Stopping worker pool...
INFO 2023-12-06T21:04:34.289906Z *** SIGTERM received by PID 13 (TID 13) on cpu 1 from PID 0; stack trace: ***
ERROR 2023-12-06T21:04:34.293400Z *** SIGTERM received by PID 13 (TID 13) on cpu 1 from PID 0; stack trace: ***
DEFAULT 2023-12-06T21:04:34.718670731Z {"logging.googleapis.com/diagnostic":{…}}
DEBUG 2023-12-06T21:06:48.144509841Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
INFO 2023-12-06T21:06:48.181400397Z Worker pool stopped.
DEBUG 2023-12-06T21:06:48.208832034Z Tearing down pending resources...

Likely there is a race condition when stopping the worker pool that causes the inconsistent teardown time.
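One hint is that every excerpt shows the "Starting worker pool teardown." / "Stopping worker pool..." pair twice, which suggests two paths initiating teardown concurrently. If that is the race, an idempotent guard would serialize them; a hedged sketch (hypothetical, not the actual Dataflow service code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical idempotent teardown guard: only the first caller runs teardown. */
public class TeardownGuard {
  private final AtomicBoolean tornDown = new AtomicBoolean(false);
  private final Runnable teardown;

  public TeardownGuard(Runnable teardown) {
    this.teardown = teardown;
  }

  /** Returns true only for the caller that actually performed teardown. */
  public boolean tearDownOnce() {
    if (tornDown.compareAndSet(false, true)) {
      teardown.run();
      return true;
    }
    return false; // a concurrent caller already started teardown
  }
}
```

With such a guard, a second teardown request becomes a no-op instead of a second "Stopping worker pool..." pass that may interleave with the first.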

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner