
SocketHttpHandler set up with MaximumConnectionsPerServer could deadlock on concurrent request cancellation #27381

Closed
baal2000 opened this issue Sep 12, 2018 · 14 comments · Fixed by dotnet/corefx#32297
Labels: bug, tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)

Comments

@baal2000 (Author)

@karelz @stephentoub

After the deadlock hits, the process has to be restarted. If it continues to run, the visible symptoms are an inability to communicate with the affected endpoint, and the process may eventually run out of available threads.

Repro project: DeadlockInSocketsHandler
Tested on Windows with SDK 2.1.301

Compile the console app and run it. It produces output similar to:

Running the test...
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
No deadlocks detected: all requests completed.
Deadlock detected: 2 requests are not completed
Finished the test. Press any key to exit.

The deadlock is caused by a race condition, so it strikes after a random number of test repetitions on each run of the application. The constants MaximumConnectionsPerServer and MaxRequestCount can be modified to increase or decrease the probability of the deadlock, but MaxRequestCount must be higher than MaximumConnectionsPerServer to force some requests into the ConnectionWaiter queue. The current values, 1 and 2, are the lowest possible; they still reliably reproduce the issue and keep the thread picture clean.
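
For illustration only, a minimal sketch of the repro shape (not the linked project itself; the endpoint, iteration count, and the 30-second detection threshold are placeholders): the pool is limited to one connection, two requests are started so that one ends up in the waiter queue, and both are then cancelled from separate thread-pool threads.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class DeadlockRepro
{
    static async Task Main()
    {
        var handler = new SocketsHttpHandler { MaxConnectionsPerServer = 1 };
        var client = new HttpClient(handler);

        for (int i = 0; i < 10; i++)
        {
            var cts1 = new CancellationTokenSource();
            var cts2 = new CancellationTokenSource();

            // Two requests against a one-connection pool: the second lands in the waiter queue.
            Task r1 = client.GetAsync("http://example.com/", cts1.Token);
            Task r2 = client.GetAsync("http://example.com/", cts2.Token);

            // Cancel both concurrently from separate thread-pool threads.
            await Task.WhenAll(Task.Run(() => cts1.Cancel()), Task.Run(() => cts2.Cancel()));

            // If the cancellation callbacks deadlock inside the handler, the requests never complete.
            Task requests = Task.WhenAll(r1, r2).ContinueWith(_ => { /* ignore cancellation */ });
            if (await Task.WhenAny(requests, Task.Delay(TimeSpan.FromSeconds(30))) != requests)
            {
                Console.WriteLine("Deadlock detected: 2 requests are not completed");
                return;
            }
            Console.WriteLine("No deadlocks detected: all requests completed.");
        }
    }
}
```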

One may then attach to the running process or dump it to investigate the threads.

There will be two deadlocked threads, referred to below as "A" and "B".

Thread A

System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.WaitForCallbackToComplete(long id)
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.DecrementConnectionCount()
System.Net.Http.dll!System.Net.Http.HttpConnection.Dispose(bool disposing)
System.Net.Http.dll!System.Net.Http.HttpConnection.RegisterCancellation.AnonymousMethod__65_0(object s)
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
DeadlockInSocketsHandler.dll!DeadlockInSocketsHandler.Program.DeadlockTestCore.AnonymousMethod__0() Line 83
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()

Thread B

System.Net.Http.dll!System.Net.Http.HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0(object s)
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
System.Private.CoreLib.dll!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(bool throwOnFirstException)
DeadlockInSocketsHandler.dll!DeadlockInSocketsHandler.Program.DeadlockTestCore.AnonymousMethod__0() Line 83
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()

Explanation

Thread A

  1. HttpConnectionPool.DecrementConnectionCount() entered lock(SyncObj)
  2. Spin-waits in CancellationTokenSource.WaitForCallbackToComplete for Thread B to complete the HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0 callback

Thread B

  1. The HttpConnectionPool.GetConnectionAsync.AnonymousMethod__38_0 callback waits to enter lock(SyncObj), which is held by Thread A
  2. SyncObj can never be released by Thread A, because Thread A spin-waits indefinitely until Thread B makes progress.

Conclusion
Neither thread can make progress, which confirms the deadlock.
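
The pattern generalizes beyond HttpClient: holding a lock while waiting for a cancellation callback to finish deadlocks whenever that callback needs the same lock. A self-contained sketch of the same interaction (intentionally deadlocks when run; all names are illustrative, not the corefx internals):

```csharp
using System;
using System.Threading;

class LockVsCancellationCallbackDeadlock
{
    // Plays the role of HttpConnectionPool.SyncObj.
    static readonly object SyncObj = new object();

    static void Main()
    {
        var cts = new CancellationTokenSource();

        // "Thread B": a cancellation callback that needs the pool lock.
        CancellationTokenRegistration registration = cts.Token.Register(() =>
        {
            lock (SyncObj) { /* e.g. remove this waiter from the queue */ }
        });

        // Fire the cancellation (and therefore the callback) on another thread.
        new Thread(() => cts.Cancel()).Start();

        // "Thread A": takes the lock, then waits for the in-flight callback to finish.
        lock (SyncObj)
        {
            Thread.Sleep(100);       // let the callback start and block on SyncObj
            registration.Dispose();  // blocks until the callback completes -> deadlock
        }

        Console.WriteLine("Never reached: the program deadlocks above.");
    }
}
```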

Workarounds

  1. Cancel requests to the same endpoint serially in the application. Request cancellations could be queued and processed sequentially on a single worker thread, or the cancelling threads could be synchronized with a lock (see the sketch after this list).
  2. If possible, do not set the MaxConnectionsPerServer property.
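
A minimal sketch of the first workaround, assuming the application owns the CancellationTokenSource instances (the class and method names are made up for illustration): funnel all Cancel calls through one gate so that no two handler cancellation callbacks run concurrently.

```csharp
using System.Threading;

static class SerializedCancellation
{
    // One gate per endpoint would be more precise; a single global gate is shown for brevity.
    private static readonly object CancelGate = new object();

    public static void Cancel(CancellationTokenSource cts)
    {
        // Cancel() runs the registered callbacks synchronously on this thread, so
        // serializing the Cancel() calls also serializes the handler's cancellation
        // callbacks, and the two deadlocking threads can no longer overlap.
        lock (CancelGate)
        {
            cts.Cancel();
        }
    }
}
```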
@baal2000 baal2000 changed the title SocketHttpHandler set up with connection pool limit deadlocks on concurrent request cancellation SocketHttpHandler set up with MaximumConnectionsPerServer could deadlock on concurrent request cancellation Sep 12, 2018
@karelz (Member) commented Sep 12, 2018

@geoffkizer is it a known problem? cc @davidsh

Thanks @baal2000 for your great and detailed analysis.
How did you discover the problem? Did you hit it in production? Did some of your tests hit the problem?

This may be worth fixing in 2.1 servicing ...

@geoffkizer (Contributor)

Not a known issue. Interesting catch.

@baal2000 (Author)

@karelz cc: @davidsh, @geoffkizer

RE: How did you discover the problem?

We hit it in production after:

  1. Setting MaximumConnectionsPerServer to 64 led to connection-pool exhaustion with zombie TCP connections when the server at the other end crapped out unexpectedly.
  2. Learning that the pool exhaustion on the client side was due to not canceling the zombie requests explicitly. Note that we specifically stopped doing that while troubleshooting an AV crash in the WinHttp-based client, to reduce the possibility of a race condition, but then switched to .NET Core 2.1 without changing that back.
  3. Restoring the request-cancellation logic on the sockets handler re-introduced the race condition, this time as this deadlock.

@filipnavara (Member)

Could this be the same problem as #27256?

@geoffkizer (Contributor)

Setting MaximumConnectionsPerServer to 64 lead to connection pool exhaustion with zombie TCP connections when the server on another end would crap out unexpectedly.

What's the behavior you see when this happens? Client requests start timing out, because the server is no longer responding? How were you dealing with this with WinHttp? Seems like you'd see the same behavior there, no?

Learning that the pool exhaustion on the client side is due to not canceling zombie requests explicitly.

How do you decide to cancel the "zombie" requests? Is it just some sort of timeout mechanism? We have request timeouts on HttpClient, is that what you use? That said, the request timeout ultimately just wires up a CancellationToken, so it's still going to hit the same issue.

Note that we specifically stopped doing that during winhttp - based client AV crash troubleshooting to reduce possibility of the race condition, but then switched to fw 2.1 without changing that back.

Not sure I understand this. You stopped cancelling zombie requests to work around a different issue you saw with Winhttp, is that correct?

@baal2000 (Author)

@geoffkizer

Client requests start timing out

Yes.

Is it just some sort of timeout mechanism?

Yes, a timeout mechanism outside of HttpClient's own timed CancellationTokenSource. For the deadlock to happen, that timeout mechanism has to be able to line up multiple concurrent cancellations, as the test application does in a somewhat exaggerated manner.
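
For context, the general shape of such an external timeout (illustrative, not our production code): a timer-driven CancellationTokenSource per request, which means many cancellations can fire at nearly the same moment when an endpoint goes dark.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ExternalTimeout
{
    // A per-request watchdog outside HttpClient.Timeout. When a server stops
    // responding, many of these sources fire at nearly the same moment, which is
    // exactly the concurrent-cancellation pattern that can trigger the deadlock.
    public static async Task<HttpResponseMessage> GetWithWatchdogAsync(
        HttpClient client, string url, TimeSpan watchdog)
    {
        using (var cts = new CancellationTokenSource(watchdog))
        {
            return await client.GetAsync(url, cts.Token);
        }
    }
}
```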

Not sure I understand this. You stopped canceling zombie requests to work around a different issue you saw with Winhttp, is that correct?

Correct; that was a different issue, #24641, mentioned for historical perspective only. We don't know whether not canceling the requests helped or created more pain. We have since moved on to SocketsHttpHandler and are not planning to go back to WinHttp.

@geoffkizer (Contributor)

I think we should just leave the waiter in the waiter queue and not try to remove it on cancellation. When a connection becomes available, we can dequeue the next waiter and see if its task has been cancelled. If so, we can just discard it and get the next waiter.

We may want to try to fix this at the same time as #27153 since they affect the same code.
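
A rough sketch of that approach (the types and names are illustrative, not the actual corefx internals): cancellation only transitions the waiter's task and leaves the stale entry in the queue, and the dispatcher skips already-cancelled waiters when a connection frees up, so the cancellation callback never needs the pool lock.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-in for the pool's waiter bookkeeping.
sealed class ConnectionWaiter
{
    public readonly TaskCompletionSource<bool> Tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
}

sealed class WaiterQueue
{
    private readonly object _syncObj = new object();
    private readonly Queue<ConnectionWaiter> _waiters = new Queue<ConnectionWaiter>();

    public ConnectionWaiter Enqueue(CancellationToken token)
    {
        var waiter = new ConnectionWaiter();
        // Cancellation does NOT touch the queue (and thus never needs _syncObj):
        // it only transitions the waiter's task, leaving the stale entry behind.
        token.Register(() => waiter.Tcs.TrySetCanceled(token));
        lock (_syncObj) { _waiters.Enqueue(waiter); }
        return waiter;
    }

    // Called when a connection becomes available: discard waiters that were
    // already cancelled and hand the connection to the first live one.
    public bool TryDispatch()
    {
        lock (_syncObj)
        {
            while (_waiters.Count > 0)
            {
                ConnectionWaiter waiter = _waiters.Dequeue();
                if (waiter.Tcs.TrySetResult(true))
                {
                    return true; // this waiter gets the connection
                }
                // Already cancelled: skip it and try the next one.
            }
        }
        return false;
    }
}
```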

@geoffkizer (Contributor)

FYI, there's a typo in your repro app here: https://github.com/baal2000/DeadlockInSocketsHandler/blob/master/Program.cs#L147

@baal2000 (Author)

Thank you, @geoffkizer.

@baal2000 (Author)

@geoffkizer

we should just leave the waiter in the waiter queue

That is one option.

Another is to keep the current logic but make the waiter queue lockless and concurrency-tolerant, unless there is a good reason to always serve the requests in perfect order. IMO there is nothing wrong with one request reaching the connection pool ahead of another "unfairly" due to a small timing difference.

@geoffkizer (Contributor)

Another is to keep the current logic but make the waiter queue lockless, concurrency-tolerant unless there is a good reason to always serve the requests in a perfect order

I don't think this addresses the underlying issue. The reason we need to take the connection lock currently is that we are trying to remove the cancelled waiter from the queue. I don't know a good way to do this in a lock-free manner, and it seems much easier to just not do it and instead discard cancelled waiters when we dequeue them.

Changing to a lock-free queue (e.g. ConcurrentQueue) might be a good thing to do in the future, but it doesn't directly solve the problem at hand.

@baal2000 (Author) commented Sep 20, 2018

@geoffkizer

much easier to just not do it and instead discard cancelled waiters when we dequeue them.

If there is no synchronization between the waiter cancellations ("Thread B") and the waiter-queue access, then we still have a race condition that, while not leading to a deadlock, could make the cancellation check on "Thread A" unreliable: a waiter that tests negative for cancellation at one instant could become canceled the next. That might cascade into an unexpected chain of events. I am just pointing out this possibility; it may very well be handled already.
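
For what it's worth, that race can be made benign by making the hand-off a single atomic transition on the waiter's task rather than a separate check-then-act: a TaskCompletionSource transitions at most once, so whichever side wins decides the outcome and the loser backs off. A sketch of the idea, with hypothetical helper names:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class WaiterHandOff
{
    // The dispatcher and the cancellation callback may race, but a
    // TaskCompletionSource completes exactly once, so the outcome is well
    // defined even without a shared lock between the two sides.

    // Dispatcher side: returns false if cancellation already won.
    public static bool TryGiveConnection(TaskCompletionSource<bool> waiter)
        => waiter.TrySetResult(true);

    // Cancellation side: a no-op if the connection was already handed off.
    public static void OnCancellation(TaskCompletionSource<bool> waiter, CancellationToken token)
        => waiter.TrySetCanceled(token);
}
```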

@baal2000 (Author) commented Sep 28, 2018

@geoffkizer

First of all, thanks for resolving the issue.

Secondly, could you please share the reason for RunContinuationsAsynchronously being used here:

public TaskCompletionSourceWithCancellation() : base(TaskCreationOptions.RunContinuationsAsynchronously)

Forcing asynchronous continuations means more overhead and worse performance, justified or not. For the way this class is used, the most common outcome is that the continuation immediately reaches another natively asynchronous call such as SendAsync. If there are scenarios where that does not happen and that could lead to blocked upstream execution or "stack dives", they should be called out and dealt with individually, if possible.

@stephentoub (Member)

could you please share the reason for RunContinuationsAsynchronously to be placed here at

If that flag isn't used, the code calling SetResult will likely end up invoking the continuation as part of the SetResult call. If that's done while holding a lock, we could end up invoking a lot of code unexpectedly while the lock is held. If that's done in response to code the user is executing, their code could be stalled for longer than expected running that continuation. Etc. When this code was initially written, both of those situations were possible. To remove this flag, we would need to audit all places where the task could be completed to validate that such issues no longer existed, at which point the flag could be removed. In short, it's a small potential expense to pay for safety/reliability.
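
A small standalone sketch of the behavior being guarded against (not the handler code itself): without the flag, SetResult will typically run an awaiting continuation inline on the completing thread, including while that thread still holds a lock.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class InlineContinuationDemo
{
    static readonly object SyncObj = new object();

    static async Task Main()
    {
        // Add TaskCreationOptions.RunContinuationsAsynchronously here to see the difference.
        var tcs = new TaskCompletionSource<bool>();

        Task consumer = AwaitAndReport(tcs.Task);

        lock (SyncObj)
        {
            // Without the flag, the continuation in AwaitAndReport typically runs
            // synchronously inside this SetResult call, i.e. while SyncObj is still held.
            tcs.SetResult(true);
        }

        await consumer;
    }

    static async Task AwaitAndReport(Task task)
    {
        await task;
        Console.WriteLine($"Continuation ran on thread {Environment.CurrentManagedThreadId}; " +
                          $"SyncObj held by this thread: {Monitor.IsEntered(SyncObj)}");
    }
}
```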

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020