deadlock with TCPConnector limit after timeout #9670
It would be great if you could come up with a reproducer without external dependencies, as we will need to be able to create a test for this at some point.
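For what it's worth, below is a rough sketch of the kind of dependency-free reproducer being asked for, using only `asyncio` and `aiohttp`: a local server that stalls its responses, and a client that issues more requests than `TCPConnector(limit=...)` allows under a short total timeout. The port, limit, and timings are made up for illustration and may need tuning to actually hit the race; this is not the script from the original report.

```python
# Hypothetical dependency-free reproducer sketch (not the original report's
# scripts): a deliberately slow local HTTP server plus an aiohttp client with
# a small TCPConnector limit and a short total timeout.
import asyncio

import aiohttp


async def slow_handler(reader: asyncio.StreamReader,
                       writer: asyncio.StreamWriter) -> None:
    """Read one request, then stall long enough that the client times out."""
    try:
        await reader.readuntil(b"\r\n\r\n")
        await asyncio.sleep(5)
        writer.write(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
        )
        await writer.drain()
    except (asyncio.IncompleteReadError, ConnectionError):
        pass  # client gave up; nothing to do
    finally:
        writer.close()


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    try:
        async with session.get(url) as resp:
            await resp.read()
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"


async def main() -> None:
    server = await asyncio.start_server(slow_handler, "127.0.0.1", 8123)
    url = "http://127.0.0.1:8123/"
    connector = aiohttp.TCPConnector(limit=2)  # small limit to force waiters
    timeout = aiohttp.ClientTimeout(total=1)   # shorter than the server delay
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        # First batch: more requests than the limit, so most of them are
        # queued as waiters; all of them time out.
        results = await asyncio.gather(*(fetch(session, url) for _ in range(10)))
        print("first batch:", results)
        # Second batch: with the deadlock described in this issue, the waiters
        # are never woken again and these keep timing out as well.
        results = await asyncio.gather(*(fetch(session, url) for _ in range(10)))
        print("second batch:", results)
    server.close()
    await server.wait_closed()


if __name__ == "__main__":
    asyncio.run(main())
```

With the bug present, the expectation is that the second batch keeps timing out even though all connections from the first batch have been abandoned.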
It looks like there are two race points: the available … and the …
Looks like it's been a problem for a long time. Reproducible on 3.9.5 as well.
I'm not sure it's fixable with the current design.
Synchronization between all the limits and waiters is turning out to be really hard to get right. I made an attempt in #9671, and it does fix the issue, but now the limit can be exceeded.
#9671 passes all the tests, works in production testing, and fixes this issue. Sadly it's quite a bit larger than I had hoped, because there were so many places that were not cancellation-safe.
Pretty happy with #9671 now. Need to write some tests for it.
Hi, from my ignorance, couldn't we just use an `asyncio.Semaphore`? To be honest, I didn't follow this part:
Another thing, regarding:
Yeah, I saw that too. Although I started looking into this because, for some unknown reason, the update to … Thanks for looking into it!
A Semaphore would have some of the same challenges. We have two limits we need to contend with: the overall limit and the per-host limit. We also need to drop the key from the dict when its value reaches zero so we don't leak memory.

It might work with two Semaphores, one for the overall limit and one per host; however, we would probably need an `asyncio.Condition` instead. I expect it could be made to work with a lot of refactoring, but the performance would likely be much worse, given how `Condition` is implemented or the need to check two Semaphores. It would also have some of the same problems with how the bookkeeping is decremented when a connection is released, which would add more complexity with the additional level of abstraction.
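To make that bookkeeping concrete, here is a rough sketch of what the two-Semaphore idea could look like. This is not aiohttp's implementation, and all names (`TwoLevelLimiter`, `acquire`, `release`) are made up; it only illustrates the two points above: two separate limits have to be acquired (which reopens the cancellation-safety problem), and the per-host entries have to be dropped when unused to avoid leaking memory.

```python
# Hypothetical sketch of the "two Semaphores" idea; not aiohttp code.
import asyncio
from typing import Dict, Tuple

ConnectionKey = Tuple[str, int]  # simplified stand-in for (host, port)


class TwoLevelLimiter:
    def __init__(self, limit: int, limit_per_host: int) -> None:
        self._total = asyncio.Semaphore(limit)
        self._limit_per_host = limit_per_host
        self._per_host: Dict[ConnectionKey, asyncio.Semaphore] = {}
        self._users: Dict[ConnectionKey, int] = {}  # waiters + holders per key

    async def acquire(self, key: ConnectionKey) -> None:
        sem = self._per_host.setdefault(key, asyncio.Semaphore(self._limit_per_host))
        self._users[key] = self._users.get(key, 0) + 1
        try:
            # Two separate awaits: if we get cancelled between them (e.g. by a
            # timeout) we must undo the first acquire, which is exactly the
            # kind of cancellation handling this issue is about.
            await sem.acquire()
            try:
                await self._total.acquire()
            except BaseException:
                sem.release()
                raise
        except BaseException:
            self._drop_if_unused(key)
            raise

    def release(self, key: ConnectionKey) -> None:
        self._total.release()
        self._per_host[key].release()
        self._drop_if_unused(key)

    def _drop_if_unused(self, key: ConnectionKey) -> None:
        # Drop the per-host entry once nobody holds or waits on it, so the
        # dict doesn't grow without bound (the memory-leak concern above).
        self._users[key] -= 1
        if self._users[key] <= 0:
            del self._users[key]
            self._per_host.pop(key, None)
```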
I will ship another 3.11 beta soon, hopefully today. If all goes well with it, I'll get a 3.10.11 out a few days after.
@davidmanzanares 3.11.0b3 has been published with the fix. Are you able to give it a test?
I've just tested it, and it seems to work well, thanks. By the way, I took a quick look at …
Thanks for testing
If you can come up with another solution that is less complicated, performs similarly, passes all the tests, and doesn't cause a regression, we would be more than happy to accept a PR to streamline the current implementation. Thanks!
3.10.11rc0 has been shipped with the fix. Hopefully a stable release tomorrow.
3.10.11 has been published with the fix |
Describe the bug
When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.
After debugging, I believe the problem occurs in this line: https://github.com/aio-libs/aiohttp/blob/v3.10.10/aiohttp/connector.py#L541

When a `ValueError` is raised there, it is the result of a `CancelledError`: a timeout happened, the `_release_waiter` method was called, and it tried to wake up a coroutine waiting for this connection. But since that coroutine also suffered the timeout, it won't proceed. Since it doesn't proceed, it won't wake up other potentially waiting coroutines. If the number of waiters never gets to zero, the cycle never ends, and no coroutine wakes up to proceed. All coroutines wake up with a `CancelledError`.
Proposed fix:

Replace the `pass` in the mentioned line with `self._release_waiter()`. This seems to resolve the issue by awakening another coroutine, which was the original intent of the previous `self._release_waiter()` call.

Another way to fix it might be to use `asyncio.Semaphore` with a context manager.
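To illustrate the mechanism and the proposed change, here is a toy model of the waiter bookkeeping with made-up names (`LimitedPool`, `acquire`, `release`); it is a paraphrase of the pattern described above, not aiohttp's actual code.

```python
# Toy model of the waiter handling described above; hypothetical names,
# not aiohttp's implementation.
import asyncio
from collections import deque
from typing import Deque


class LimitedPool:
    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_use = 0
        self._waiters: Deque[asyncio.Future] = deque()

    def _release_waiter(self) -> None:
        # Wake exactly one pending waiter and remove it from the queue.
        for fut in self._waiters:
            if not fut.done():
                self._waiters.remove(fut)
                fut.set_result(None)
                return

    async def acquire(self) -> None:
        if self._in_use >= self._limit:
            fut: asyncio.Future = asyncio.get_running_loop().create_future()
            self._waiters.append(fut)
            try:
                await fut
            except BaseException:
                try:
                    self._waiters.remove(fut)
                except ValueError:
                    # We were already woken by _release_waiter() but got
                    # cancelled (e.g. by a timeout) before we could run, so
                    # the wake-up would otherwise be lost and the remaining
                    # waiters would hang. The fix proposed above: pass the
                    # wake-up along instead of `pass`-ing silently.
                    self._release_waiter()
                raise
        self._in_use += 1

    def release(self) -> None:
        self._in_use -= 1
        self._release_waiter()
```

In this model, an `except ValueError: pass` drops the wake-up on the floor whenever a timeout races with `_release_waiter()`, while calling `_release_waiter()` there hands the freed slot to the next live waiter.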
To Reproduce

Run the server with `fastapi dev aio_bug_server.py`.

Expected behavior
All requests should be sent, previous timeouts shouldn't change that.
Logs/tracebacks
Python Version
aiohttp Version
multidict Version
propcache Version
yarl Version
OS
Linux
Related component
Client
Additional context
No response