Timeout during Worker channel restart leaves host in bad state #10683

Open
mathewc opened this issue Dec 10, 2024 · 0 comments


In a recent CRI (568174889), the Functions Host got into a state where no workers were running, yet the host did not try to start or restart a worker. As a result, all function invocations failed with the error "Did not find any initialized language workers".

Below is a Kusto query showing the sequence of events that led to the app getting into this broken state. Here's the relevant timeline:

  • At 2024-11-19 23:36:29.396 a concurrency bug happened during function loading
    • "System.InvalidOperationException : Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct."
    • I've logged a separate bug for that here: GrpcWorkerChannel concurrency bugs #10682
  • HandleWorkerFunctionLoadError gets triggered to handle this exception
  • This causes RpcFunctionInvocationDispatcher.WorkerError to invoke DisposeAndRestartWorkerChannel
  • As part of this method, ShouldRestartWorkerChannel determines whether to restart the worker
  • We see the log "Restarting worker channel for runtime: 'python'" after this
  • However, the worker startup timed out at 2024-11-19 23:36:59.395 (roughly 30 seconds after the concurrency error) with the message "Initializing worker process failed" and the error "System.TimeoutException : The operation has timed out."
  • After this point, no further attempts are made to restart the worker, and the host stays in a broken state. All function invocations fail until the Function App is restarted by the customer. Perhaps when a timeout happens during worker startup, our logic to restart the worker doesn't kick in?
FunctionsLogs
| where PreciseTimeStamp between (datetime(2024-11-19 20:00) .. datetime(2024-11-20))
| where Host == "pl1mdlwk000BLC"
| where RoleInstance == "pl1MediumDedicatedLinuxWebWorkerRole_IN_15024"
| project PreciseTimeStamp, Level, AppName, FunctionName, Source, EventName, HostInstanceId, Summary, Details, HostVersion
| order by PreciseTimeStamp asc
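
To make the suspected gap concrete, here is a minimal Python sketch of the behavior one might expect instead: a startup timeout is treated like any other worker error and leads to another restart attempt. This is a toy model only, not the host's actual code. `WorkerSupervisor`, `make_flaky_start`, `init_timeout`, and `max_attempts` are hypothetical names invented for illustration, and the retry limit is an assumption rather than anything the host is known to implement.

```python
import asyncio


class WorkerSupervisor:
    """Toy model of a dispatcher that restarts a worker after an error."""

    def __init__(self, start_worker, init_timeout, max_attempts=3):
        self.start_worker = start_worker  # hypothetical coroutine function
        self.init_timeout = init_timeout  # seconds allowed for worker init
        self.max_attempts = max_attempts  # assumed retry budget
        self.worker = None
        self.attempts = 0

    async def restart(self):
        """Retry startup until a worker initializes or attempts run out."""
        self.worker = None
        for attempt in range(1, self.max_attempts + 1):
            self.attempts = attempt
            try:
                self.worker = await asyncio.wait_for(
                    self.start_worker(), self.init_timeout
                )
                return True
            except asyncio.TimeoutError:
                # Treat a startup timeout like any other worker error:
                # try again instead of leaving the host with no worker.
                continue
        return False


def make_flaky_start(slow_calls):
    """Return a start function whose first `slow_calls` calls hang."""
    calls = {"n": 0}

    async def start():
        calls["n"] += 1
        if calls["n"] <= slow_calls:
            await asyncio.sleep(1.0)  # simulates a hung initialization
        return f"worker-{calls['n']}"

    return start


def demo():
    """First startup times out; the second attempt succeeds."""
    supervisor = WorkerSupervisor(make_flaky_start(slow_calls=1), init_timeout=0.05)
    ok = asyncio.run(supervisor.restart())
    return ok, supervisor.worker, supervisor.attempts
```

In this model, the state described in the issue corresponds to leaving `restart` after the first `TimeoutError` with `worker` still `None` and nothing scheduled to try again. Whether the host should retry immediately, back off, or escalate to a full host restart is a separate design question that the sketch does not address.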