Timeout during Worker channel restart leaves host in bad state #10683

mathewc · 2024-12-10T00:53:19Z

In a recent CRI (568174889), the Functions Host got into a state where no workers were running, but the host did not try to start or restart a worker. As a result, all function invocations failed with the error "Did not find any initialized language workers".

Below is a Kusto query showing the sequence of events that led to the app getting into this broken state. Here's the relevant timeline:

At 2024-11-19 23:36:29.396 a concurrency bug happened during function loading
- "System.InvalidOperationException : Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct."
- I've logged a separate bug for that here: GrpcWorkerChannel concurrency bugs #10682
HandleWorkerFunctionLoadError gets triggered to handle this exception
This causes RpcFunctionInvocationDispatcher.WorkerError to invoke DisposeAndRestartWorkerChannel
As part of this method, ShouldRestartWorkerChannel determines whether to restart the worker
We see the log "Restarting worker channel for runtime: 'python'" after this
However, the worker startup timed out at 2024-11-19 23:36:59.395 with message "Initializing worker process failed" error "System.TimeoutException : The operation has timed out."
After this point, no further attempts are made to restart the worker, and the host stays in a broken state. All function invocations fail until the Function App is restarted by the customer. Perhaps when timeouts happen, our logic to restart the worker doesn't kick in?
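
To make the suspected failure mode concrete, here is a minimal Python sketch of the difference between a single restart attempt and one that retries after a timeout. It is a toy model only, not the host's actual code: `restart_once`, `restart_with_retry`, `make_flaky_start` and the `max_attempts` parameter are all hypothetical names, and whether the real restart path behaves like `restart_once` is exactly the open question above.

```python
def restart_once(start_worker):
    """Suspected behaviour (assumption): one restart attempt, no follow-up."""
    try:
        return start_worker()
    except TimeoutError:
        # Nothing schedules another attempt, so the caller ends up with
        # zero workers and every later invocation fails.
        return None


def restart_with_retry(start_worker, max_attempts=3):
    """Possible alternative (assumption): treat a startup timeout as retryable."""
    for _ in range(max_attempts):
        try:
            return start_worker()
        except TimeoutError:
            continue  # try again until the attempt budget is used up
    return None  # budget exhausted; caller should surface a hard failure


def make_flaky_start(timeouts_before_success):
    """Hypothetical stand-in for worker startup that times out N times first."""
    remaining = [timeouts_before_success]

    def start_worker():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TimeoutError("The operation has timed out.")
        return "python-worker"

    return start_worker
```

With a startup that times out once, `restart_once(make_flaky_start(1))` yields no worker, while `restart_with_retry(make_flaky_start(1))` recovers on the second attempt. A real fix would also need to decide how many attempts to allow and what to do when they run out.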

FunctionsLogs
| where PreciseTimeStamp between (datetime(2024-11-19 20:00) .. datetime(2024-11-20))
| where Host == "pl1mdlwk000BLC"
| where RoleInstance == "pl1MediumDedicatedLinuxWebWorkerRole_IN_15024"
| project PreciseTimeStamp, Level, AppName, FunctionName, Source, EventName, HostInstanceId, Summary, Details, HostVersion
| order by PreciseTimeStamp asc


microsoft-github-policy-service bot added the Needs: Triage (Functions) label Dec 10, 2024
