[Failing Test]: Dataflow V2 VR Streaming Flakiness due to timeout close worker pool #29644

Open · 1 of 16 tasks

Abacn opened this issue Dec 6, 2023 · 0 comments

Abacn (Contributor) commented Dec 6, 2023

What happened?

This is not an SDK bug; an internal ticket will also be filed.

Some tests have been observed failing at random on https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow_V2_Streaming.yml?query=event%3Aschedule

For example,

https://github.com/apache/beam/runs/19288524813 with

java.lang.RuntimeException: Dataflow job 2023-12-04_17_25_11-8098838800222529628 terminated in state RUNNING but did not return a failure reason.
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:141)
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:93)
	at org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:55)
	at org.apache.beam.sdk.Pipeline.run(Pipeline.java:321)
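Presumably the exception above comes from the test runner's wait loop: it polls the job state until the job leaves RUNNING (or the wait gives up), and if the job still looks RUNNING with no reported failure reason, it throws. A minimal sketch of that logic, with hypothetical names rather than Beam's actual implementation:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of how a test runner blocks waiting for a Dataflow job. */
public class JobWaitSketch {
  public enum State { RUNNING, DONE, FAILED, CANCELLED }

  public interface Job {
    State currentState();
    String failureReason(); // null when the service reported no failure
  }

  /** Polls until the job leaves RUNNING or the deadline passes. */
  public static State waitForTerminalState(Job job, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    State state = job.currentState();
    while (state == State.RUNNING && System.nanoTime() < deadline) {
      try {
        TimeUnit.MILLISECONDS.sleep(5); // short poll interval for the sketch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
      state = job.currentState();
    }
    if (state == State.RUNNING && job.failureReason() == null) {
      // The situation in the flaky runs: the wait gives up while the worker
      // pool is still tearing down, so the job looks RUNNING with no error.
      throw new RuntimeException(
          "Dataflow job terminated in state RUNNING but did not return a failure reason.");
    }
    return state;
  }
}
```

Under this reading, a slow worker-pool teardown alone is enough to trigger the failure even though the pipeline itself succeeded.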

Taking a look at the worker log for job 2023-12-04_01_06_49-4587814450665214461:

DEBUG 2023-12-04T09:10:15.680161433Z Cleaning up.
DEBUG 2023-12-04T09:10:15.758333246Z Starting worker pool teardown.
INFO 2023-12-04T09:10:15.782743429Z Stopping worker pool...
DEBUG 2023-12-04T09:10:15.860076821Z Starting worker pool teardown.
INFO 2023-12-04T09:10:15.884898782Z Stopping worker pool...
INFO 2023-12-04T09:10:22.584818Z *** SIGTERM received by PID 14 (TID 14) on cpu 0 from PID 0; stack trace: ***
INFO 2023-12-04T09:10:22.584901Z PC: @ 0x55e6a3671f96 (unknown) (unknown)
...
INFO 2023-12-04T09:10:23.448826420Z ****************************************
DEBUG 2023-12-04T09:15:59.183629917Z Cleaning up.
DEBUG 2023-12-04T09:16:00.002974729Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
DEBUG 2023-12-04T09:16:05.045251507Z Starting worker pool teardown.
INFO 2023-12-04T09:16:10.060122904Z Stopping worker pool...
DEBUG 2023-12-04T09:16:10.114936278Z Starting worker pool teardown.
INFO 2023-12-04T09:16:10.144000861Z Stopping worker pool...
INFO 2023-12-04T09:16:10.179066170Z Worker pool stopped.
DEBUG 2023-12-04T09:16:10.204084191Z Tearing down pending resources...
INFO 2023-12-04T09:16:10.792660720Z Your project already contains 100 Dataflow-created metric descriptors...

In these flaky runs it takes about 6 minutes to stop the worker pool, during which TestPipeline.run() is still blocking.
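The "6 minutes" figure is the gap between the first "Stopping worker pool..." line (09:10:15) and "Worker pool stopped." (09:16:10) in the log above; it can be checked directly from the ISO-8601 timestamps:

```java
import java.time.Duration;
import java.time.Instant;

/** Computes the teardown gap from two log-line timestamps. */
public class TeardownGap {
  public static Duration gap(String stoppingTs, String stoppedTs) {
    return Duration.between(Instant.parse(stoppingTs), Instant.parse(stoppedTs));
  }
}
```

For the run above, gap("2023-12-04T09:10:15.782743429Z", "2023-12-04T09:16:10.179066170Z") is just under 6 minutes.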

Another occurrence, job 2023-12-04_17_25_11-8098838800222529628:

INFO 2023-12-05T01:27:58.530670Z Successfully sampled resources
DEBUG 2023-12-05T01:28:21.709062648Z Cleaning up.
DEBUG 2023-12-05T01:28:21.755866390Z Starting worker pool teardown.
INFO 2023-12-05T01:28:21.778534382Z Stopping worker pool...
DEBUG 2023-12-05T01:28:21.806963604Z Starting worker pool teardown.
INFO 2023-12-05T01:28:21.827367725Z Stopping worker pool...
ERROR 2023-12-05T01:28:28.218612Z *** SIGTERM received by PID 14 (TID 14) on cpu 0 from PID 0; stack trace: ***
ERROR 2023-12-05T01:28:28.219090Z PC: @ 0x5621e4e71f96 (unknown) (unknown) @ 0x5621e6f3de89 1344 (unknown) @ 0x7f9168c099a0 631864672 (unknown) @ 0x5621e4e71f96 96 (unknown) @ 0x5621e4cafcca 208 (unknown) @ 0x5621e5df4d30 64 (unknown) @ 0x5621e52547b7 1664 (unknown) @ 0x7f9168a77bbd 208 __libc_start_main @ 0x5621e4c13599 (unknown) (unknown)
WARNING 2023-12-05T01:28:28.223847Z --- CPU registers: --- r8=0 r9=ffffffffffffffff r10=0 r11=246 r12=7f91689c9100 r13=7f91689c9180 r14=881 r15=ffffffffffffffff rdi=7f91689c9180 rsi=189 rbp=7ffd8e451870 rbx=7f91689c9100 rdx=0 rax=fffffffffffffffc rcx=5621e4e71f96 rsp=7ffd8e451820 rip=5621e4e71f96 efl=246 cgf=2b000000000033 err=0 trp=0 msk=0 cr2=0
...
INFO 2023-12-05T01:28:28.674663925Z *********************
DEFAULT 2023-12-05T01:28:28.674675645Z {"logging.googleapis.com/diagnostic":{…}}
DEBUG 2023-12-05T01:33:47.725042327Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
DEBUG 2023-12-05T01:33:47.946791112Z Cleaning up.
DEBUG 2023-12-05T01:33:48.026216727Z Starting worker pool teardown.
INFO 2023-12-05T01:33:48.042428706Z Stopping worker pool...
DEBUG 2023-12-05T01:33:48.064501675Z Starting worker pool teardown.
INFO 2023-12-05T01:33:48.086649213Z Stopping worker pool...
INFO 2023-12-05T01:33:48.118769551Z Worker pool stopped.
DEBUG 2023-12-05T01:33:48.138574980Z Tearing down pending resources...
INFO 2023-12-05T01:33:49.164659964Z Your project already contains 100 Dataflow-created metric descriptors...

This run also took about 6 minutes to stop the worker pool.

It is strange that the last log line is "Your project already contains 100 Dataflow-created metric descriptors."; that message is usually seen during setup, not teardown.

For comparison, a normally shut-down worker logs something like this:

DEBUG 2023-12-06T21:04:26.941513220Z Cleaning up.
DEBUG 2023-12-06T21:04:26.980247165Z Starting worker pool teardown.
INFO 2023-12-06T21:04:27.007520790Z Stopping worker pool...
DEBUG 2023-12-06T21:04:27.036155756Z Starting worker pool teardown.
INFO 2023-12-06T21:04:27.057789052Z Stopping worker pool...
INFO 2023-12-06T21:04:34.289906Z *** SIGTERM received by PID 13 (TID 13) on cpu 1 from PID 0; stack trace: ***
ERROR 2023-12-06T21:04:34.293400Z *** SIGTERM received by PID 13 (TID 13) on cpu 1 from PID 0; stack trace: ***
DEFAULT 2023-12-06T21:04:34.718670731Z {"logging.googleapis.com/diagnostic":{…}}
DEBUG 2023-12-06T21:06:48.144509841Z Autoscaling: Reduced the number of workers to 0 based on low average worker CPU utilization, and the pipeline having sufficiently low backlog and keeping up with input rate.
INFO 2023-12-06T21:06:48.181400397Z Worker pool stopped.
DEBUG 2023-12-06T21:06:48.208832034Z Tearing down pending resources...

Likely there is a race condition when stopping the worker pool that causes the inconsistent teardown time.
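One hint is that every excerpt shows the "Starting worker pool teardown." / "Stopping worker pool..." pair twice, which suggests two paths initiating teardown concurrently. If that is the race, an idempotent guard would serialize them; a hedged sketch (hypothetical, not the actual Dataflow service code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical idempotent teardown guard: only the first caller runs teardown. */
public class TeardownGuard {
  private final AtomicBoolean tornDown = new AtomicBoolean(false);
  private final Runnable teardown;

  public TeardownGuard(Runnable teardown) {
    this.teardown = teardown;
  }

  /** Returns true only for the caller that actually performed teardown. */
  public boolean tearDownOnce() {
    if (tornDown.compareAndSet(false, true)) {
      teardown.run();
      return true;
    }
    return false; // a concurrent caller already started teardown
  }
}
```

With such a guard, a second teardown request becomes a no-op instead of a second "Stopping worker pool..." pass that may interleave with the first.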

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner