SocketHttpHandler set up with MaximumConnectionsPerServer could deadlock on concurrent request cancellation #27381
Comments
@geoffkizer is it a known problem? cc @davidsh

Thanks @baal2000 for your great and detailed analysis. This may be worth fixing in 2.1 servicing ...

Not a known issue. Interesting catch.

@karelz cc: @davidsh, @geoffkizer RE: How did you discover the problem? We hit it in production after:

Could this be the same problem as #27256?
What's the behavior you see when this happens? Client requests start timing out, because the server is no longer responding? How were you dealing with this with WinHttp? Seems like you'd see the same behavior there, no?
How do you decide to cancel the "zombie" requests? Is it just some sort of timeout mechanism? We have request timeouts on HttpClient, is that what you use? That said, the request timeout ultimately just wires up a CancellationToken, so it's still going to hit the same issue.
Not sure I understand this. You stopped cancelling zombie requests to work around a different issue you saw with WinHttp, is that correct?
Yes.
Yes, a timeout mechanism outside of HttpClient's own timed CancellationTokenSource. For a particular timeout mechanism to play this role, it must be able to line up multiple cancellations (as the test application does, in a somewhat exaggerated manner) for the deadlock to happen.
Correct; #24641 is mentioned for historical perspective only. We don't know whether not cancelling the requests helped or created more pain, as we have since moved on to SocketsHttpHandler and are not planning to go back to WinHttp.
I think we should just leave the waiter in the waiter queue and not try to remove it on cancellation. When a connection becomes available, we can dequeue the next waiter and see if its task has been cancelled. If so, we can just discard it and get the next waiter. We may want to try to fix this at the same time as #27153 since they affect the same code.
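The "leave the waiter in the queue and discard it on dequeue" idea can be sketched as follows. This is a minimal, single-threaded Python model, not the actual SocketsHttpHandler code (the Waiter and WaiterQueue names are hypothetical): cancellation only flags the waiter and never touches the queue, so the cancellation path needs no queue lock.

```python
from collections import deque

class Waiter:
    """A pending request waiting for a pooled connection (hypothetical model)."""
    def __init__(self):
        self.cancelled = False

    def cancel(self):
        # Cancellation only flags the waiter; it never touches the queue,
        # so no queue lock is needed on the cancellation path.
        self.cancelled = True

class WaiterQueue:
    def __init__(self):
        self._q = deque()

    def enqueue(self, waiter):
        self._q.append(waiter)

    def dequeue_next_live(self):
        # Lazily discard cancelled waiters at dequeue time instead of
        # removing them eagerly on cancellation.
        while self._q:
            w = self._q.popleft()
            if not w.cancelled:
                return w
        return None

q = WaiterQueue()
a, b, c = Waiter(), Waiter(), Waiter()
for w in (a, b, c):
    q.enqueue(w)
a.cancel()
b.cancel()
assert q.dequeue_next_live() is c  # a and b are silently discarded
```

The trade-off is that cancelled waiters linger in the queue until a connection frees up, but the dequeue path already holds the lock, so no extra synchronization is needed.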
FYI, there's a typo in your repro app here: https://github.com/baal2000/DeadlockInSocketsHandler/blob/master/Program.cs#L147
thank you @geoffkizer |
Another option is to keep the current logic but make the waiter queue lockless and concurrency-tolerant, unless there is a good reason to always serve the requests in perfect order. IMO there is nothing wrong with a request getting to the connection pool ahead of another "unfairly" due to a small timing difference.
I don't think this addresses the underlying issue. The reason we need to take the connection lock currently is that we are trying to remove the cancelled waiter from the queue. I don't know a good way to do this in a lock-free manner, and it seems much easier to just not do it and instead discard cancelled waiters when we dequeue them. Changing to a lock-free queue (e.g. ConcurrentQueue) might be a good thing to do in the future, but it doesn't directly solve the problem at hand.
If there is no synchronization between the waiter cancellations ("thread B") and the waiter queue access, then we have a race condition that, while not leading to the deadlock, could make the cancellation check on "thread A" unreliable: a waiter that tests negative for cancellation at one instant could become cancelled at the next. That might cascade into an unexpected chain of events. Just pointing out this possibility; it may very well be handled already.
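One standard way to tolerate that race is to make the waiter's state transition atomic, so a late cancellation and a connection hand-off cannot both win. Below is a hedged Python sketch of that pattern (hypothetical names and per-waiter lock; the actual SocketsHttpHandler fix may handle this differently):

```python
import threading

class Waiter:
    """Hypothetical waiter whose state moves pending -> cancelled | completed
    via a transition guarded by a per-waiter lock, so only one side wins."""
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "pending"
        self.connection = None

    def try_cancel(self):
        with self._lock:
            if self.state == "pending":
                self.state = "cancelled"
                return True
            return False

    def try_complete(self, conn):
        # The pool calls this after dequeuing; if cancellation won the race,
        # the connection is simply kept for the next waiter in line.
        with self._lock:
            if self.state == "pending":
                self.state = "completed"
                self.connection = conn
                return True
            return False

w = Waiter()
assert w.try_cancel() is True
assert w.try_complete("conn-1") is False  # cancellation won; conn is reusable
```

With this shape, the dequeue-time cancellation check does not need to be a final answer: even if the waiter is cancelled just after the check, try_complete fails cleanly and the connection goes to the next waiter.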
First of all, thanks for resolving the issue. Secondly, could you please share the reason for
Forced async means more overhead and worse performance, whether justified or not. For this class's usage, the most common outcome would be the continuation immediately reaching another native async call such as SendAsync. If there are scenarios where that would not happen and that could lead to blocked upstream execution or "stack dives", those should be called out and dealt with individually, if possible.
If that flag isn't used, the code calling SetResult will likely end up invoking the continuation as part of the SetResult call. If that's done while holding a lock, we could end up invoking a lot of code unexpectedly while the lock is held. If that's done in response to code the user is executing, their code could be stalled for longer than expected running that continuation. Etc. When this code was initially written, both of those situations were possible. To remove this flag, we would need to audit all places where the task could be completed to validate that such issues no longer exist, at which point the flag could be removed. In short, it's a small potential expense to pay for safety/reliability.
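The behavior being described has a direct Python analogue: concurrent.futures.Future invokes done callbacks inline on whatever thread calls set_result, so completing a future while holding a lock would run arbitrary continuation code under that lock. Deferring the continuation to another thread, which is roughly what TaskCreationOptions.RunContinuationsAsynchronously does for a TaskCompletionSource, avoids this. A small demonstration:

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Inline completion: the callback runs on the thread that calls set_result,
# potentially while that thread still holds a lock.
f = Future()
ran_on = []
f.add_done_callback(lambda fut: ran_on.append(threading.current_thread()))
f.set_result(42)
assert ran_on[0] is threading.current_thread()  # continuation ran inline

# Deferred completion: hand the continuation to a worker thread, the rough
# analogue of RunContinuationsAsynchronously.
pool = ThreadPoolExecutor(max_workers=1)
g = Future()
deferred_thread = []
g.add_done_callback(
    lambda fut: pool.submit(
        lambda: deferred_thread.append(threading.current_thread())))
g.set_result(42)
pool.shutdown(wait=True)  # wait for the deferred continuation to finish
assert deferred_thread[0] is not threading.current_thread()
```

The cost of the deferred form is an extra thread hop per completion, which matches the overhead concern raised above.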
@karelz @stephentoub
After the deadlock hits, the process has to be restarted. If it continues to be used, the visible symptoms are an inability to communicate with the affected endpoint, and the process may eventually run out of available threads.
Repro project: DeadlockInSocketsHandler
Tested on Windows with SDK 2.1.301.
Compile the console app and run it. It will produce output similar to:
The deadlock is caused by a race condition, meaning it strikes after a random number of test repetitions on each new application run. The constant values MaximumConnectionsPerServer and MaxRequestCount can be modified to increase or decrease the probability of the deadlock, but MaxRequestCount must be higher than MaximumConnectionsPerServer to force some requests into the ConnectionWaiter queue. The current values of 1 and 2 are the lowest possible; they still reliably reproduce the issue and produce a clean threads picture.

One may then attach to the running process, or dump it, to investigate the threads.
There will be two deadlocked threads, here named "A" and "B".
Thread A
Thread B
Explanation
Thread A
lock(SyncObj)
Thread B
Conclusion
Neither thread can make progress, which confirms the deadlock.
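Assuming the circular wait has the shape suggested by the discussion above, with thread A holding lock(SyncObj) while waiting for the cancellation callback to finish, and thread B inside the cancellation callback waiting for lock(SyncObj), the cycle can be modeled in Python. This is a model of the wait cycle, not the actual .NET code; it uses timed waits so the demo terminates, whereas with untimed waits both threads would block forever:

```python
import threading

sync_obj = threading.Lock()        # models the pool's lock(SyncObj)
callback_done = threading.Event()  # "cancellation callback has finished"
a_holds_lock = threading.Event()
a_result, b_result = [], []

def thread_a():
    # Thread A: under lock(SyncObj), waits for the cancellation callback
    # to finish (e.g. while disposing the cancellation registration).
    with sync_obj:
        a_holds_lock.set()
        a_result.append(callback_done.wait(timeout=1.0))

def thread_b():
    # Thread B: the cancellation callback tries to take lock(SyncObj)
    # to remove its waiter from the queue.
    a_holds_lock.wait()
    b_result.append(sync_obj.acquire(timeout=0.3))
    if b_result[-1]:
        sync_obj.release()
    callback_done.set()  # only reached because the wait above is timed

ta = threading.Thread(target=thread_a)
tb = threading.Thread(target=thread_b)
ta.start(); tb.start()
ta.join(); tb.join()
assert b_result == [False]  # B could not take the lock while A held it
```

The lazy-removal fix discussed in the comments breaks this cycle by removing thread B's need to take the lock on the cancellation path.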
Workarounds