
SocketHttpHandler set up with MaximumConnectionsPerServer could deadlock on concurrent request cancellation #27381

Closed
baal2000 opened this issue Sep 12, 2018 · 14 comments · Fixed by dotnet/corefx#32297
Labels: bug, tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)

Comments

@baal2000 (Author)

@karelz @stephentoub

After the deadlock hits, the process has to be restarted. If it continues to run, the visible symptoms are an inability to communicate with the affected endpoint, and the process may eventually run out of available threads.

Repro project: DeadlockInSocketsHandler
Tested on Windows with SDK 2.1.301

Compile the console app and run it. It produces output similar to:

Running the test...
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
Deadlock detected: 2 requests are not completed
Finished the test. Press any key to exit.

The deadlock is caused by a race condition, so it strikes after a random number of test repetitions on each run of the application. The constants MaximumConnectionsPerServer and MaxRequestCount can be modified to increase or decrease the probability of the deadlock, but MaxRequestCount must be higher than MaximumConnectionsPerServer to force some requests into the ConnectionWaiter queue. The current values, 1 and 2, are the lowest possible; they still reliably reproduce the issue and keep the thread picture clean.
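
For illustration only, a minimal sketch of the repro shape (not the linked project itself; the endpoint, iteration count, and the 30-second detection threshold are placeholders): the pool is limited to one connection, two requests are started so that one ends up in the waiter queue, and both are then cancelled from separate thread-pool threads.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class DeadlockRepro
{
    static async Task Main()
    {
        var handler = new SocketsHttpHandler { MaxConnectionsPerServer = 1 };
        var client = new HttpClient(handler);

        for (int i = 0; i < 10; i++)
        {
            var cts1 = new CancellationTokenSource();
            var cts2 = new CancellationTokenSource();

            // Two requests against a one-connection pool: the second lands in the waiter queue.
            Task r1 = client.GetAsync("http://example.com/", cts1.Token);
            Task r2 = client.GetAsync("http://example.com/", cts2.Token);

            // Cancel both concurrently from separate thread-pool threads.
            await Task.WhenAll(Task.Run(() => cts1.Cancel()), Task.Run(() => cts2.Cancel()));

            // If the cancellation callbacks deadlock inside the handler, the requests never complete.
            Task requests = Task.WhenAll(r1, r2).ContinueWith(_ => { /* ignore cancellation */ });
            if (await Task.WhenAny(requests, Task.Delay(TimeSpan.FromSeconds(30))) != requests)
            {
                Console.WriteLine("Deadlock detected: 2 requests are not completed");
                return;
            }
            Console.WriteLine("No deadlocks detected: all requests completed.");
        }
    }
}
```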

One may then attach to the running process or dump it to investigate the threads.

There will be two deadlocked threads, referred to below as "A" and "B".

Thread A

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.WaitForCallbackToComplete(long id)
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.DecrementConnectionCount()
System.Net.Http.dll!System.Net.Http.HttpConnection.Dispose(bool disposing)
System.Net.Http.dll!System.Net.Http.HttpConnection.RegisterCancellation.AnonymousMethod__65_0(object s)
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
DeadlockInSocketsHandler.dll!DeadlockInSocketsHandler.Program.DeadlockTestCore.AnonymousMethod__0() Line 83
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()

Thread B

System.Net.Http.dll!System.Net.Http.HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0(object s)
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
DeadlockInSocketsHandler.dll!DeadlockInSocketsHandler.Program.DeadlockTestCore.AnonymousMethod__0() Line 83
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()

Explanation

Thread A

  1. HttpConnectionPool.DecrementConnectionCount() entered lock(SyncObj)
  2. Spin-waits in CancellationTokenSource.WaitForCallbackToComplete for Thread B to complete the HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0 callback

Thread B

  1. The HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0 callback waits to enter lock(SyncObj), which is held by Thread A
  2. SyncObj can never be released by Thread A, because Thread A spin-waits indefinitely until Thread B makes progress.

Conclusion
Neither thread can make progress, which confirms the deadlock.
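
The pattern generalizes beyond HttpClient: holding a lock while waiting for a cancellation callback to finish deadlocks whenever that callback needs the same lock. A self-contained sketch of the same interaction (intentionally deadlocks when run; all names are illustrative, not the corefx internals):

```csharp
using System;
using System.Threading;

class LockVsCancellationCallbackDeadlock
{
    // Plays the role of HttpConnectionPool.SyncObj.
    static readonly object SyncObj = new object();

    static void Main()
    {
        var cts = new CancellationTokenSource();

        // "Thread B": a cancellation callback that needs the pool lock.
        CancellationTokenRegistration registration = cts.Token.Register(() =>
        {
            lock (SyncObj) { /* e.g. remove this waiter from the queue */ }
        });

        // Fire the cancellation (and therefore the callback) on another thread.
        new Thread(() => cts.Cancel()).Start();

        // "Thread A": takes the lock, then waits for the in-flight callback to finish.
        lock (SyncObj)
        {
            Thread.Sleep(100);       // let the callback start and block on SyncObj
            registration.Dispose();  // blocks until the callback completes -> deadlock
        }

        Console.WriteLine("Never reached: the program deadlocks above.");
    }
}
```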

Workarounds

  1. Cancel requests to the same endpoint serially in the application. Request cancellations could be queued and processed sequentially on a single worker thread, or the cancelling threads could be synchronized with a lock (see the sketch after this list).
  2. If possible, do not set the MaxConnectionsPerServer property.
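
A minimal sketch of the first workaround, assuming the application owns the CancellationTokenSource instances (the class and method names are made up for illustration): funnel all Cancel calls through one gate so that no two handler cancellation callbacks run concurrently.

```csharp
using System.Threading;

static class SerializedCancellation
{
    // One gate per endpoint would be more precise; a single global gate is shown for brevity.
    private static readonly object CancelGate = new object();

    public static void Cancel(CancellationTokenSource cts)
    {
        // Cancel() runs the registered callbacks synchronously on this thread, so
        // serializing the Cancel() calls also serializes the handler's cancellation
        // callbacks, and the two deadlocking threads can no longer overlap.
        lock (CancelGate)
        {
            cts.Cancel();
        }
    }
}
```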
@baal2000 baal2000 changed the title SocketHttpHandler set up with connection pool limit deadlocks on concurrent request cancellation SocketHttpHandler set up with MaximumConnectionsPerServer could deadlock on concurrent request cancellation Sep 12, 2018
@karelz (Member) commented Sep 12, 2018

@geoffkizer is it a known problem? cc @davidsh

Thanks @baal2000 for your great and detailed analysis.
How did you discover the problem? Did you hit it in production? Did some of your tests hit the problem?

This may be worth fixing in 2.1 servicing ...

@geoffkizer (Contributor)

Not a known issue. Interesting catch.

@baal2000 (Author)

@karelz cc: @davidsh, @geoffkizer

RE: How did you discover the problem?

We hit it in production after:

  1. Setting MaximumConnectionsPerServer to 64 led to connection-pool exhaustion with zombie TCP connections when the server at the other end crapped out unexpectedly.
  2. Learning that the pool exhaustion on the client side was due to not canceling the zombie requests explicitly. Note that we specifically stopped doing that while troubleshooting an AV crash in the WinHttp-based client, to reduce the possibility of a race condition, but then switched to .NET Core 2.1 without changing that back.
  3. Restoring the request-cancellation logic on the sockets handler re-introduced the race condition, this time as this deadlock.

@filipnavara (Member)

Could this be the same problem as #27256?

@geoffkizer (Contributor)

Setting MaximumConnectionsPerServer to 64 lead to connection pool exhaustion with zombie TCP connections when the server on another end would crap out unexpectedly.

What's the behavior you see when this happens? Client requests start timing out, because the server is no longer responding? How were you dealing with this with WinHttp? Seems like you'd see the same behavior there, no?

Learning that the pool exhaustion on the client side is due to not canceling zombie requests explicitly.

How do you decide to cancel the "zombie" requests? Is it just some sort of timeout mechanism? We have request timeouts on HttpClient, is that what you use? That said, the request timeout ultimately just wires up a CancellationToken, so it's still going to hit the same issue.

Note that we specifically stopped doing that during winhttp - based client AV crash troubleshooting to reduce possibility of the race condition, but then switched to fw 2.1 without changing that back.

Not sure I understand this. You stopped cancelling zombie requests to work around a different issue you saw with Winhttp, is that correct?

@baal2000 (Author)

@geoffkizer

Client requests start timing out

Yes.

Is it just some sort of timeout mechanism?

Yes, a timeout mechanism outside of HttpClient's own timed CancellationTokenSource. For the deadlock to happen, that timeout mechanism has to be able to line up multiple concurrent cancellations, as the test application does in a somewhat exaggerated manner.
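
For context, the general shape of such an external timeout (illustrative, not our production code): a timer-driven CancellationTokenSource per request, which means many cancellations can fire at nearly the same moment when an endpoint goes dark.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ExternalTimeout
{
    // A per-request watchdog outside HttpClient.Timeout. When a server stops
    // responding, many of these sources fire at nearly the same moment, which is
    // exactly the concurrent-cancellation pattern that can trigger the deadlock.
    public static async Task<HttpResponseMessage> GetWithWatchdogAsync(
        HttpClient client, string url, TimeSpan watchdog)
    {
        using (var cts = new CancellationTokenSource(watchdog))
        {
            return await client.GetAsync(url, cts.Token);
        }
    }
}
```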

Not sure I understand this. You stopped canceling zombie requests to work around a different issue you saw with Winhttp, is that correct?

Correct; that was a different issue, #24641, mentioned for historical perspective only. We don't know whether not canceling the requests helped or created more pain. We have since moved on to SocketsHttpHandler and are not planning to go back to WinHttp.

@geoffkizer (Contributor)

I think we should just leave the waiter in the waiter queue and not try to remove it on cancellation. When a connection becomes available, we can dequeue the next waiter and see if its task has been cancelled. If so, we can just discard it and get the next waiter.

We may want to try to fix this at the same time as #27153 since they affect the same code.
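
A rough sketch of that approach (the types and names are illustrative, not the actual corefx internals): cancellation only transitions the waiter's task and leaves the stale entry in the queue, and the dispatcher skips already-cancelled waiters when a connection frees up, so the cancellation callback never needs the pool lock.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-in for the pool's waiter bookkeeping.
sealed class ConnectionWaiter
{
    public readonly TaskCompletionSource<bool> Tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
}

sealed class WaiterQueue
{
    private readonly object _syncObj = new object();
    private readonly Queue<ConnectionWaiter> _waiters = new Queue<ConnectionWaiter>();

    public ConnectionWaiter Enqueue(CancellationToken token)
    {
        var waiter = new ConnectionWaiter();
        // Cancellation does NOT touch the queue (and thus never needs _syncObj):
        // it only transitions the waiter's task, leaving the stale entry behind.
        token.Register(() => waiter.Tcs.TrySetCanceled(token));
        lock (_syncObj) { _waiters.Enqueue(waiter); }
        return waiter;
    }

    // Called when a connection becomes available: discard waiters that were
    // already cancelled and hand the connection to the first live one.
    public bool TryDispatch()
    {
        lock (_syncObj)
        {
            while (_waiters.Count > 0)
            {
                ConnectionWaiter waiter = _waiters.Dequeue();
                if (waiter.Tcs.TrySetResult(true))
                {
                    return true; // this waiter gets the connection
                }
                // Already cancelled: skip it and try the next one.
            }
        }
        return false;
    }
}
```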

@geoffkizer (Contributor)

FYI, there's a typo in your repro app here: https://github.com/baal2000/DeadlockInSocketsHandler/blob/master/Program.cs#L147

@baal2000 (Author)

Thank you, @geoffkizer.

@baal2000 (Author)

@geoffkizer

we should just leave the waiter in the waiter queue

That is one option.

Another is to keep the current logic but make the waiter queue lockless and concurrency-tolerant, unless there is a good reason to always serve the requests in perfect order. IMO there is nothing wrong with one request reaching the connection pool ahead of another "unfairly" due to a small timing difference.

@geoffkizer (Contributor)

Another is to keep the current logic but make the waiter queue lockless, concurrency-tolerant unless there is a good reason to always serve the requests in a perfect order

I don't think this addresses the underlying issue. The reason we need to take the connection lock currently is that we are trying to remove the cancelled waiter from the queue. I don't know a good way to do this in a lock-free manner, and it seems much easier to just not do it and instead discard cancelled waiters when we dequeue them.

Changing to a lock-free queue (e.g. ConcurrentQueue) might be a good thing to do in the future, but it doesn't directly solve the problem at hand.

@baal2000 (Author) commented Sep 20, 2018

@geoffkizer

much easier to just not do it and instead discard cancelled waiters when we dequeue them.

If there is no synchronization between the waiter cancellations ("Thread B") and the waiter-queue access, then we still have a race condition that, while not leading to a deadlock, could make the cancellation check on "Thread A" unreliable: a waiter that tests negative for cancellation at one instant could become canceled the next. That might cascade into an unexpected chain of events. I am just pointing out this possibility; it may very well be handled already.
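
For what it's worth, that race can be made benign by making the hand-off a single atomic transition on the waiter's task rather than a separate check-then-act: a TaskCompletionSource transitions at most once, so whichever side wins decides the outcome and the loser backs off. A sketch of the idea, with hypothetical helper names:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class WaiterHandOff
{
    // The dispatcher and the cancellation callback may race, but a
    // TaskCompletionSource completes exactly once, so the outcome is well
    // defined even without a shared lock between the two sides.

    // Dispatcher side: returns false if cancellation already won.
    public static bool TryGiveConnection(TaskCompletionSource<bool> waiter)
        => waiter.TrySetResult(true);

    // Cancellation side: a no-op if the connection was already handed off.
    public static void OnCancellation(TaskCompletionSource<bool> waiter, CancellationToken token)
        => waiter.TrySetCanceled(token);
}
```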

@baal2000 (Author) commented Sep 28, 2018

@geoffkizer

First of all, thanks for resolving the issue.

Secondly, could you please share the reason for RunContinuationsAsynchronously being used here:

public TaskCompletionSourceWithCancellation() : base(TaskCreationOptions.RunContinuationsAsynchronously)

Forcing asynchronous continuations means more overhead and worse performance, justified or not. For the way this class is used, the most common outcome is that the continuation immediately reaches another natively asynchronous call such as SendAsync. If there are scenarios where that does not happen and that could lead to blocked upstream execution or "stack dives", they should be called out and dealt with individually, if possible.

@stephentoub (Member)

could you please share the reason for RunContinuationsAsynchronously to be placed here at

If that flag isn't used, the code calling SetResult will likely end up invoking the continuation as part of the SetResult call. If that's done while holding a lock, we could end up invoking a lot of code unexpectedly while the lock is held. If that's done in response to code the user is executing, their code could be stalled for longer than expected running that continuation. Etc. When this code was initially written, both of those situations were possible. To remove this flag, we would need to audit all places where the task could be completed to validate that such issues no longer existed, at which point the flag could be removed. In short, it's a small potential expense to pay for safety/reliability.
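
A small standalone sketch of the behavior being guarded against (not the handler code itself): without the flag, SetResult will typically run an awaiting continuation inline on the completing thread, including while that thread still holds a lock.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class InlineContinuationDemo
{
    static readonly object SyncObj = new object();

    static async Task Main()
    {
        // Add TaskCreationOptions.RunContinuationsAsynchronously here to see the difference.
        var tcs = new TaskCompletionSource<bool>();

        Task consumer = AwaitAndReport(tcs.Task);

        lock (SyncObj)
        {
            // Without the flag, the continuation in AwaitAndReport typically runs
            // synchronously inside this SetResult call, i.e. while SyncObj is still held.
            tcs.SetResult(true);
        }

        await consumer;
    }

    static async Task AwaitAndReport(Task task)
    {
        await task;
        Console.WriteLine($"Continuation ran on thread {Environment.CurrentManagedThreadId}; " +
                          $"SyncObj held by this thread: {Monitor.IsEntered(SyncObj)}");
    }
}
```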

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020