beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms #21645

Closed
damccorm opened this issue Jun 5, 2022 · 14 comments

@damccorm
Contributor

damccorm commented Jun 5, 2022

Example failure: https://ci-beam.apache.org/job/beam_PostCommit_XVR_GoUsingJava_Dataflow/7/

I couldn't find definitive details about why the tests are failing, but TestXLang_Prefix, TestXLang_Multi, and TestXLang_Partition all fail while running. Investigating the Dataflow logs, we can see that SDK harnesses are failing to connect. For example:


"getPodContainerStatuses for pod "df-go-testxlang-multi-03300551-62xv-harness-3msv_default(a7f1d8dfb2c3d2b4e80f5d92c1728787)"
failed: rpc error: code = Unknown desc = Error: No such container: bea0d9bde42bf890f6fe1d4f589932471037a5948fb9588d01a06425cd14c177"

However, I haven't been able to find any further details showing why the harness fails, and the tests keep running beyond that point for a while, producing other errors that are equally inscrutable.
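
For context, the failing tests are minimal Go pipelines that call a Java transform through a cross-language expansion service, roughly along the lines of the sketch below. This is illustrative, not the actual test code: the URN, payload struct, and expansion service address are assumptions.

```go
package main

import (
	"context"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/core/typex"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/core/util/reflectx"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

// Illustrative payload for the Java prefix transform; the field name is an
// assumption, not copied from the actual test.
type prefixPayload struct {
	Data string
}

func main() {
	beam.Init()
	p, s := beam.NewPipelineWithRoot()

	in := beam.Create(s, "a", "b", "c")

	// Expansion happens at pipeline-construction time; at run time Dataflow
	// starts a separate Java SDK harness container for the expanded
	// transform. That container is the harness failing to come up in the
	// logs above.
	payload := beam.CrossLanguagePayload(prefixPayload{Data: "prefix_"})
	outs := beam.CrossLanguage(s,
		"beam:transforms:xlang:test:prefix", // illustrative URN
		payload,
		"localhost:8097", // illustrative expansion service address
		beam.UnnamedInput(in),
		beam.UnnamedOutput(typex.New(reflectx.String)),
	)

	passert.Equals(s, outs[beam.UnnamedOutputTag()], "prefix_a", "prefix_b", "prefix_c")
	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("pipeline failed: %v", err)
	}
}
```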

Imported from Jira BEAM-14214. Original Jira may contain additional context.
Reported by: danoliveira.

@jrmccluskey
Contributor

The previous error message appears to be a red herring: it also shows up in the logs of successful runs, which simply keep executing until the container comes up. This message, on the other hand, appears to be the common denominator in failing runs:

"Container not found in pod's containers" containerID="5fa844e1dea9ec63b43e5e42422c6370dd0f746a3b00f496f6397a7db955fd98"

The containerID in question is reported as started in the immediately preceding log message.

@kennknowles
Member

@jrmccluskey so is this still an issue? Are you working on it, or should we unassign?

@jrmccluskey
Contributor

This went on the back burner in favor of other work; I'm hoping to dedicate some time to it next week.

@kennknowles
Member

Is this fixed? Or is it so far on the back burner that it should be unassigned for someone else to grab? Is there a mitigation so that it doesn't impact test signal and sit as a stale P1?

@kennknowles
Member

This suite is perma-red. Should we disable a test? I don't think we should just burn Jenkins CPU and person time triaging this repeatedly, considering how long it has been going on.

@jrmccluskey jrmccluskey removed their assignment Mar 23, 2023
@jrmccluskey
Contributor

Yeah, we're probably at that point.

@tvalentyn
Contributor

@chamikaramj - could you please triage this? Thanks!

@Abacn
Contributor

Abacn commented Jun 27, 2023

There are currently 3 failing tests:

--- FAIL: TestXLang_Prefix
--- FAIL: TestXLang_Multi
--- FAIL: TestXLang_Partition

In particular, the worker log for TestXLang_Partition contains:

Harness ID sdk-1-0
...
2023-06-27 11:23:08.154 EDT Connecting via grpc @ localhost:12371 for data ...
2023-06-27 11:23:11.692 EDT Worker failed: control.Recv failed
caused by:
rpc error: code = Unavailable desc = error reading from server: EOF
2023/06/27 15:23:11 boot.go: error logging message over FnAPI. endpoint localhost:12370 error: EOF message follows
2023/06/27 15:23:11 CRITICAL User program exited: exit status 1

java.util.logging.LogManager$RootLogger log
SEVERE: *~*~*~ Previous channel ManagedChannelImpl{logId=17, target=localhost:12371} was not shutdown properly!!! ~*~*~*
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
java.lang.Throwable: java.lang.RuntimeException: ManagedChannel allocation site
at org.apache.beam.vendor.grpc.v1p54p0.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.<init>(ManagedChannelOrphanWrapper.java:102)
at org.apache.beam.vendor.grpc.v1p54p0.io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:60)
at org.apache.beam.vendor.grpc.v1p54p0.io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:51)
at org.apache.beam.vendor.grpc.v1p54p0.io.grpc.internal.ManagedChannelImplBuilder.build(ManagedChannelImplBuilder.java:631)
at org.apache.beam.vendor.grpc.v1p54p0.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:297)
at org.apache.beam.sdk.fn.channel.ManagedChannelFactory.forDescriptor(ManagedChannelFactory.java:101)
at org.apache.beam.fn.harness.data.BeamFnDataGrpcClient.lambda$getClientFor$0(BeamFnDataGrpcClient.java:105)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at org.apache.beam.fn.harness.data.BeamFnDataGrpcClient.getClientFor(BeamFnDataGrpcClient.java:99)
at org.apache.beam.fn.harness.data.BeamFnDataGrpcClient.createOutboundAggregator(BeamFnDataGrpcClient.java:93)
at org.apache.beam.fn.harness.control.ProcessBundleHandler$1.lambda$addOutgoingDataEndpoint$0(ProcessBundleHandler.java:394)
at java.util.HashMap.computeIfAbsent(HashMap.java:1128)
at org.apache.beam.fn.harness.control.ProcessBundleHandler$1.addOutgoingDataEndpoint(ProcessBundleHandler.java:391)
at org.apache.beam.fn.harness.BeamFnDataWriteRunner$Factory.createRunnerForPTransform(BeamFnDataWriteRunner.java:94)
at org.apache.beam.fn.harness.BeamFnDataWriteRunner$Factory.createRunnerForPTransform(BeamFnDataWriteRunner.java:62)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.createRunnerAndConsumersForPTransformRecursively(ProcessBundleHandler.java:307)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.createRunnerAndConsumersForPTransformRecursively(ProcessBundleHandler.java:261)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.createRunnerAndConsumersForPTransformRecursively(ProcessBundleHandler.java:261)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.createBundleProcessor(ProcessBundleHandler.java:858)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.lambda$processBundle$0(ProcessBundleHandler.java:511)
at org.apache.beam.fn.harness.control.ProcessBundleHandler$BundleProcessorCache.get(ProcessBundleHandler.java:969)
at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:507)
at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
at org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
INFO 2023-06-27T15:23:12.749328456Z
at org.apache.beam.fn.harness.logging.BeamFnLoggingClient.flushFinalLogs(BeamFnLoggingClient.java:385)
at org.apache.beam.fn.harness.logging.BeamFnLoggingClient.lambda$new$0(BeamFnLoggingClient.java:168)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

In short, the xlang harness fails while trying to log messages over the FnAPI (the "error logging message over FnAPI" line above).
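
For readers unfamiliar with that step: the container boot code dials the runner's FnAPI logging endpoint and opens a bidirectional Logging stream. A minimal sketch of that handshake follows, assuming the generated fnexecution_v1 client from the Go SDK; the endpoint address is the one from the log above, everything else is illustrative. An EOF at either step means the runner side hung up, which is the failure mode boot.go reports here.

```go
package main

import (
	"context"
	"log"
	"time"

	fnpb "github.com/apache/beam/sdks/v2/go/pkg/beam/model/fnexecution_v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Dial the runner's logging endpoint (address taken from the log above).
	conn, err := grpc.DialContext(ctx, "localhost:12370",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		log.Fatalf("dial logging endpoint: %v", err)
	}
	defer conn.Close()

	// Open the Logging stream and send one entry. If the runner closes the
	// connection, Send/Recv surface the EOF seen in the worker log.
	stream, err := fnpb.NewBeamFnLoggingClient(conn).Logging(ctx)
	if err != nil {
		log.Fatalf("open Logging stream: %v", err)
	}
	if err := stream.Send(&fnpb.LogEntry_List{
		LogEntries: []*fnpb.LogEntry{{
			Severity: fnpb.LogEntry_Severity_INFO,
			Message:  "hello from the SDK harness",
		}},
	}); err != nil {
		log.Fatalf("send log entry: %v", err)
	}
	_ = stream.CloseSend()
}
```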

CC: @lostluck @riteshghorse

@kennknowles
Member

Cham - should this be disabled? I am happy to do the PR if you want to comment and assign to me.

@chamikaramj
Contributor

An example failed job:

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-07-13_07_35_25-3670419731370821100;bottomTab=WORKER_LOGS;logsSeverity=INFO;graphView=0?project=apache-beam-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))&e=13802955&jsmode=O&mods=logs_tg_staging

Check failed: absl::OkStatus() == ::dist_proc::dax::PrintableStatus(status) (OK vs. generic::failed_precondition: PaneInfo truncated
passed through:
==>
dist_proc/dax/workflow/worker/io/fnapi_coders.cc:122
==>
dist_proc/dax/workflow/worker/io/fnapi_coders.cc:1590)
*** Check failure stack trace: ***
@ 0x557732592b44 absl::log_internal::LogMessage::SendToLog()
@ 0x557732592762 absl::log_internal::LogMessage::Flush()
@ 0x557732592e89 absl::log_internal::LogMessageFatal::~LogMessageFatal()
@ 0x5577313803fd dist_proc::dax::workflow::FnApiProcessBundleOperator::ProcessOutputElements()
@ 0x5577317739fb dist_proc::dax::workflow::FnDataService::DataHandle::HandleInboundData()
@ 0x557731772f8b dist_proc::dax::workflow::FnDataService::HandleInboundData()
@ 0x5577313f86c3 dist_proc::dax::workflow::FnApiServiceImpl::BundleHandleImpl::WaitToFinish()
@ 0x557731384cb6 dist_proc::dax::workflow::FnApiProcessBundleOperator::Finish()
@ 0x5577313ea4fd dist_proc::dax::workflow::GraphWorkExecutor::Execute()
@ 0x5577313e28db dist_proc::dax::workflow::InstructionGraphExecutor::Run()
@ 0x5577315e037a dist_proc::dax::workflow::ParallelWorkflowWorkerTask::ProcessWork()
@ 0x5577315e4da4 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x5577315e43b4 absl::internal_any_invocable::LocalInvoker<>()
@ 0x55773246e59f Thread::ThreadBody()
@ 0x7f9af6ecc4e8 start_thread
@ 0x7f9af6d4122d clone

Seems like UW (Dataflow's unified worker) is failing.

I think there was a recent fix to UW related to this that is not in prod yet.
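
For background on the "PaneInfo truncated" message: per my reading of the Beam standard coder spec, PaneInfo is encoded as a one-byte header whose high nibble says how many varint pane indices follow. The sketch below (illustrative Go, not Dataflow's actual C++; the bit layout is my understanding of the spec) shows where a short stream would trip exactly this kind of check.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
)

type paneInfo struct {
	isFirst, isLast bool
	timing          byte  // 0=EARLY, 1=ON_TIME, 2=LATE, 3=UNKNOWN
	index, onTime   int64 // pane index and on-time pane index
}

func decodePane(r *bufio.Reader) (paneInfo, error) {
	b, err := r.ReadByte()
	if err != nil {
		return paneInfo{}, fmt.Errorf("PaneInfo truncated: %w", err)
	}
	p := paneInfo{
		isFirst: b&0x01 != 0,
		isLast:  b&0x02 != 0,
		timing:  (b >> 2) & 0x03,
	}
	readIdx := func() (int64, error) {
		v, err := binary.ReadUvarint(r)
		return int64(v), err
	}
	switch b >> 4 { // encoding nibble: how many indices follow
	case 0: // no indices; both implicitly zero
	case 1: // one varint index follows
		if p.index, err = readIdx(); err != nil {
			return paneInfo{}, fmt.Errorf("PaneInfo truncated: %w", err)
		}
	case 2: // two varint indices follow
		if p.index, err = readIdx(); err == nil {
			p.onTime, err = readIdx()
		}
		if err != nil {
			return paneInfo{}, fmt.Errorf("PaneInfo truncated: %w", err)
		}
	default:
		return paneInfo{}, fmt.Errorf("unknown PaneInfo encoding %d", b>>4)
	}
	return p, nil
}

func main() {
	// Header byte 0x15: encoding nibble 1 (one index should follow),
	// timing ON_TIME, isFirst set -- but the promised varint is missing.
	_, err := decodePane(bufio.NewReader(bytes.NewReader([]byte{0x15})))
	fmt.Println(err) // reports truncation, analogous to the UW check failure
}
```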

cc: @robertwb

@kennknowles
Member

Any update on this P1?

@chamikaramj
Contributor

It seems like just the Kafka test is failing now, while the other tests are passing. We have to check closely to see why the Kafka Go test is failing.

14:25:33 --- PASS: TestBigtableIO_BasicWriteRead (632.92s)
14:25:33 PASS
14:25:33 ok github.com/apache/beam/sdks/v2/go/test/integration/io/xlang/bigtable 635.947s
14:25:33 === RUN TestDebeziumIO_BasicRead
14:25:33 integration.go:364: Test TestDebeziumIO_BasicRead is currently filtered for runner dataflow
14:25:33 --- SKIP: TestDebeziumIO_BasicRead (0.00s)
14:25:33 PASS
14:25:33 ok github.com/apache/beam/sdks/v2/go/test/integration/io/xlang/debezium 3.026s
14:25:33 === RUN TestJDBCIO_BasicReadWrite
14:25:33 integration.go:364: Test TestJDBCIO_BasicReadWrite is currently filtered for runner dataflow
14:25:33 --- SKIP: TestJDBCIO_BasicReadWrite (0.00s)
14:25:33 === RUN TestJDBCIO_PostgresReadWrite
14:25:33 integration.go:364: Test TestJDBCIO_PostgresReadWrite is currently filtered for runner dataflow
14:25:33 --- SKIP: TestJDBCIO_PostgresReadWrite (0.00s)
14:25:33 PASS
14:25:33 ok github.com/apache/beam/sdks/v2/go/test/integration/io/xlang/jdbc 3.026s
14:25:33 === RUN TestKafkaIO_BasicReadWrite
14:25:33 integration.go:364: Test TestKafkaIO_BasicReadWrite is currently filtered for runner dataflow
14:25:33 --- SKIP: TestKafkaIO_BasicReadWrite (0.00s)
14:25:33 PASS
14:25:33 ok github.com/apache/beam/sdks/v2/go/test/integration/io/xlang/kafka 6.027s
14:25:33 FAIL
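
For reference, the SKIP lines above come from the Go integration test framework's per-runner filters (the real lists live in sdks/go/test/integration/integration.go, as the `integration.go:364` log lines indicate). A simplified sketch of the mechanism, with names and filter entries illustrative rather than copied from the source:

```go
package integration

import (
	"regexp"
	"testing"
)

// Per-runner lists of test-name patterns to skip.
var filters = map[string][]string{
	"dataflow": {
		"TestDebeziumIO_BasicRead",
		"TestJDBCIO_.*",
		"TestKafkaIO_.*",
	},
}

var runner = "dataflow" // normally taken from the --runner flag

// checkFilters skips the calling test when its name matches a filter for
// the selected runner; every integration test calls this first.
func checkFilters(t *testing.T) {
	t.Helper()
	for _, f := range filters[runner] {
		if regexp.MustCompile("^" + f + "$").MatchString(t.Name()) {
			t.Skipf("Test %s is currently filtered for runner %s", t.Name(), runner)
		}
	}
}

func TestKafkaIO_BasicReadWrite(t *testing.T) {
	checkFilters(t) // skipped on dataflow, as in the log above
	// The actual read/write pipeline would run here on unfiltered runners.
}
```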

@kennknowles
Member

It has been perma-red for a long time though (https://ci-beam.apache.org/job/beam_PostCommit_XVR_GoUsingJava_Dataflow/)

I think we should downgrade this to P2 since it seems to be more of a "new feature" and is not monitored by a human.

@kennknowles kennknowles added P2 and removed P1 labels Oct 4, 2023
@lostluck
Contributor

lostluck commented Feb 6, 2024

Closing this issue since we're off Jenkins and different issues occur now; #28339 tracks the more recent failures.

@lostluck lostluck closed this as completed Feb 6, 2024
@github-actions github-actions bot added this to the 2.55.0 Release milestone Feb 6, 2024