
deadlock with TCPConnector limit after timeout #9670

Closed
1 task done
davidmanzanares opened this issue Nov 4, 2024 · 15 comments · Fixed by #9671 or #9600
Labels
bug · reproducer: present (This PR or issue contains code which reproduces the problem described, or a clearly understandable STR)

Comments

@davidmanzanares

Describe the bug

When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.

After debugging, I believe the problem occurs in this line: https://github.com/aio-libs/aiohttp/blob/v3.10.10/aiohttp/connector.py#L541

When a ValueError is raised there, it is the result of:

  • Awaiting the future failed, typically with a CancelledError from a timeout.
  • The _release_waiter method was called and tried to wake up a coroutine waiting for this connection. But that coroutine also hit the timeout, so it won't proceed, and since it doesn't proceed it won't wake up any other waiting coroutines.

If the number of waiters never gets to zero, the cycle never ends and no coroutine wakes up to proceed: all coroutines wake up with a CancelledError.

Proposed fix:
Replace the pass in the mentioned line with self._release_waiter(). This seems to resolve the issue by waking another coroutine, which was the original intent of the earlier self._release_waiter() call.
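For illustration, here is a simplified sketch of that waiter pattern (hypothetical names, not aiohttp's actual code) showing where the swallowed wake-up causes the deadlock and how re-releasing breaks it:

import asyncio

class ConnectorSketch:
    """Toy model of the limited-connector wait loop."""

    def __init__(self):
        self._waiters: list[asyncio.Future] = []

    def _release_waiter(self) -> None:
        # Wake exactly one pending waiter, removing it from the list.
        for fut in self._waiters:
            if not fut.done():
                self._waiters.remove(fut)
                fut.set_result(None)
                return

    async def _wait_for_available_connection(self) -> None:
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        try:
            await fut
        except BaseException:
            try:
                self._waiters.remove(fut)
            except ValueError:
                # The future is already gone from the list, which means
                # _release_waiter() picked this coroutine to wake up just
                # as it was being cancelled. With a bare `pass` the wake-up
                # is swallowed and every remaining waiter sleeps forever;
                # handing it to another waiter breaks the cycle:
                self._release_waiter()
            raise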

Another way to fix it might be to use asyncio.Semaphore with a context manager.
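A minimal sketch of that idea (a hypothetical helper, ignoring the overall limit and the per-host cleanup discussed further down), using one semaphore per connection key and the connector limit of 10 from the reproducer below:

import asyncio
from collections import defaultdict

# Hypothetical: one semaphore per connection key (host, port, ssl, ...).
_per_key: defaultdict = defaultdict(lambda: asyncio.Semaphore(10))

async def limited_request(key, do_request):
    # The async context manager releases the slot even when the body is
    # cancelled by a timeout, which is the property the hand-rolled
    # waiter list loses here.
    async with _per_key[key]:
        return await do_request()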

To Reproduce

  1. Start a small server script that responds slowly only some of the time. For example, run this with fastapi dev aio_bug_server.py:
from fastapi import FastAPI
import asyncio
from time import time

app = FastAPI()

sleep_time = 0.3
timestamp_last_change = time()
last_recv = time()

@app.get("/getme")
async def getme():
    global sleep_time, timestamp_last_change, last_recv
    if time() - last_recv > 8:
        # Go "up", forcing zero sleep time if we haven't heard from clients in a while
        sleep_time = 0
    elif time() - timestamp_last_change > 2:
        # Toggle between fast and slow responses roughly every 2 seconds
        sleep_time = 0 if sleep_time else 0.3
        timestamp_last_change = time()
    print(f"recv {sleep_time=:.3f}")
    last_recv = time()

    await asyncio.sleep(sleep_time)
    print(f"fin {sleep_time=:.3f}")
    return {
        "asdf": "asdf",
    }
  2. Start a script that uses aiohttp to make concurrent requests to the server and count the timeout/OK results. For example:
import uvloop
uvloop.install()

import asyncio
import aiohttp
from time import time

timeouts = 0
errors = 0
oks = 0

async def do_request(session: aiohttp.ClientSession):
    global errors, timeouts, oks
    try:
        async with session.get("http://127.0.0.1:8000/getme") as resp:
            await resp.json()
        oks += 1
    except TimeoutError:
        timeouts += 1
    except BaseException as exc:
        print(f"UNEXPECTED EXCEPTION {exc=}")
        errors += 1



async def main():
    global timeouts, oks
    async with aiohttp.ClientSession(
                    connector=aiohttp.TCPConnector(limit=10),
                    timeout=aiohttp.ClientTimeout(total=0.2), 
                ) as session:
        tasks = set()
        done_reqs = 0
        i = 0
        t = time()
        last_ok = t
        while True:
            i += 1
            if i % 100 == 0:
                dt = time() - t
                print(f"{done_reqs/dt:.1f}req/s {oks=} {timeouts=}")
                if oks:
                    last_ok = time()
                if time() - last_ok > 10:
                    # We likely hit the issue; at this point the server
                    # should not have received anything in a while
                    print("ISSUE DETECTED")
                    t = time()
                    try:
                        async with session.get("http://127.0.0.1:8000/getme?q=a") as resp:
                            await resp.json()
                    except Exception as exc:
                        print(f"LAST REQUEST ERROR: EXC{exc=} {time()-t}")
                    for task in tasks:
                        await task
                    print(f"NEW REQUEST AFTER AWAITING ALL TASKS")
                    t = time()
                    try:
                        async with session.get("http://127.0.0.1:8000/getme?q=b") as resp:
                            await resp.json()
                    except Exception:
                        raise
                    finally:
                        print("FINAL DT:", time()-t)
                    return

                timeouts = 0
                oks = 0

            if len(tasks) >= 200:
                done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
                done_reqs += len(done)
            tasks.add(asyncio.create_task(do_request(session)))

asyncio.run(main())
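With the server from step 1 running, the script prints ISSUE DETECTED once no request has succeeded for 10 seconds, then issues two more requests (one immediately, and one after awaiting every outstanding task) to show whether the connector ever recovers.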

Expected behavior

All requests should be sent; previous timeouts shouldn't change that.

Logs/tracebacks

no tracebacks

Python Version

$ python --version
Python 3.12.7

aiohttp Version

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.10.10
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: 
Author-email: 
License: Apache 2
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: aiohappyeyeballs, aiosignal, attrs, frozenlist, multidict, yarl
Required-by:

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.1.0
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache 2
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: 
Required-by: aiohttp, yarl

propcache Version

$ python -m pip show propcache
Name: propcache
Version: 0.2.0
Summary: Accelerated property cache
Home-page: https://github.com/aio-libs/propcache
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache-2.0
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: 
Required-by: yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.17.1
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache-2.0
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: idna, multidict, propcache
Required-by: aiohttp

OS

Linux

Related component

Client

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
bdraco added the reproducer: present label Nov 4, 2024
bdraco changed the title from "sticky" timeouts with TCPConnector limit to deadlock with TCPConnector limit after timeout Nov 4, 2024
@bdraco (Member) commented Nov 4, 2024

It would be great if you could come up with a reproducer without external dependencies, as we will need to be able to create a test for this at some point.
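For what it's worth, the server half could probably be rewritten with aiohttp's own web server to drop the FastAPI/uvicorn dependency; a rough, untested sketch of the same toggling behavior:

import asyncio
from time import time

from aiohttp import web

sleep_time = 0.3
timestamp_last_change = time()
last_recv = time()

async def getme(request: web.Request) -> web.Response:
    global sleep_time, timestamp_last_change, last_recv
    now = time()
    if now - last_recv > 8:
        # No clients for a while: force zero sleep time again
        sleep_time = 0
    elif now - timestamp_last_change > 2:
        # Toggle between fast and slow responses roughly every 2 seconds
        sleep_time = 0 if sleep_time else 0.3
        timestamp_last_change = now
    last_recv = now
    await asyncio.sleep(sleep_time)
    return web.json_response({"asdf": "asdf"})

app = web.Application()
app.router.add_get("/getme", getme)

if __name__ == "__main__":
    web.run_app(app, port=8000)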

@bdraco (Member) commented Nov 4, 2024

It looks like there are two race points:

The available check and the key in self._waiters check could have cancelled futures there.

The ValueError suppression is a race point as well.

@bdraco (Member) commented Nov 4, 2024

Looks like it's been a problem for a long time; it's reproducible on 3.9.5 as well.

@bdraco (Member) commented Nov 4, 2024

I'm not sure it's fixable with the current design.

key in self._waiters is not atomic with respect to whether the connection is available, so it's always going to be out of sync.

@bdraco (Member) commented Nov 4, 2024

Synchronization between all the limits and waiters is turning out to be really hard to get right.

I made an attempt in #9671, and it does fix the issue, but now the limit can be exceeded.

@bdraco (Member) commented Nov 5, 2024

#9671 passes all the tests, works in production testing, and fixes this issue.

Sadly it's quite a bit larger than I had hoped, because there were so many places that were not cancellation-safe.

@bdraco (Member) commented Nov 5, 2024

Pretty happy with #9671 now. I still need to write some tests for it.

@davidmanzanares (Author):

Hi, forgive my ignorance, but couldn't we just use asyncio.Semaphore (one per connection key, in a dictionary)? It looks to me like this is quite hard to get right.

To be honest, I didn't follow this part:

The available check and the key in self._waiters check could have cancelled futures there.

Another thing, regarding:

reproducible on 3.9.5 as well

Yeah, I saw that too. Although I started looking into this because, for some unknown reason, the update to 3.10.10 surfaced this in a production environment.

Thanks for looking into it!

@bdraco (Member) commented Nov 5, 2024

Hi, forgive my ignorance, but couldn't we just use asyncio.Semaphore (one per connection key, in a dictionary)? It looks to me like this is quite hard to get right.

A Semaphore would have some of the same challenges:

We have two limits we need to contend with:

  • Overall limit
  • Limit per host

We also need to drop the key from the dict when its value reaches zero so we don't leak memory.

It might work with two Semaphores: one for the overall limit, and one per host. However, we would probably need an asyncio.Condition instead. I expect we could get it to work with a lot of refactoring, but the performance would likely be much worse, given how Condition is implemented, or the cost of having to check two Semaphores.

It would also have some of the same problems with how the bookkeeping is decremented when a connection is released, which would add more complexity given the additional level of abstraction.
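To make that bookkeeping concern concrete, here is a sketch of the two-Semaphore idea (a hypothetical class, not a proposed aiohttp API), including the per-host cleanup needed to avoid leaking one dict entry per host:

import asyncio
from contextlib import asynccontextmanager

class TwoLevelLimiter:
    def __init__(self, limit: int = 100, limit_per_host: int = 10):
        self._total = asyncio.Semaphore(limit)
        self._limit_per_host = limit_per_host
        self._per_host: dict[str, asyncio.Semaphore] = {}
        self._users: dict[str, int] = {}  # acquirers + waiters per host

    @asynccontextmanager
    async def limit(self, host: str):
        sem = self._per_host.setdefault(
            host, asyncio.Semaphore(self._limit_per_host)
        )
        self._users[host] = self._users.get(host, 0) + 1
        try:
            # Acquire in a consistent order (total first, then per-host)
            # to avoid lock-ordering deadlocks.
            async with self._total, sem:
                yield
        finally:
            self._users[host] -= 1
            if not self._users[host]:
                # Last user for this host: drop the per-host entries so
                # the dicts don't grow forever.
                del self._users[host], self._per_host[host]

Usage would be async with limiter.limit(host): ... around each connection attempt; the hard part described above is doing the equivalent release bookkeeping when a pooled connection, rather than a context manager, is handed back.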


@bdraco (Member) commented Nov 5, 2024

I will ship another 3.11 beta soon, hopefully today.

If all goes well with it, I'll get a 3.10.11 out a few days after

@bdraco (Member) commented Nov 6, 2024

@davidmanzanares 3.11.0b3 has been published with the fix.

Are you able to give it a test?

@davidmanzanares (Author):

I've just tested it; it seems to work well, thanks.

Btw, I took a quick look at asyncio.Semaphore performance. Under no contention it seems extremely fast, but it's fairly slow under high contention. I didn't dig deeper, and I didn't compare it to the existing mechanism in aiohttp, though.
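For reference, a rough micro-benchmark of that contention effect could look like this (a hypothetical sketch, not the exact measurement used; numbers vary by machine and Python version):

import asyncio
from time import perf_counter

async def bench(n_tasks: int, limit: int) -> float:
    sem = asyncio.Semaphore(limit)

    async def worker() -> None:
        for _ in range(100):
            async with sem:
                await asyncio.sleep(0)  # yield while holding a slot

    start = perf_counter()
    await asyncio.gather(*(worker() for _ in range(n_tasks)))
    return perf_counter() - start

async def main() -> None:
    print("low contention: ", await bench(n_tasks=10, limit=1000))
    print("high contention:", await bench(n_tasks=1000, limit=10))

asyncio.run(main())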

@bdraco (Member) commented Nov 6, 2024

I've just tested it; it seems to work well, thanks.

Thanks for testing

Btw, I took a quick look at asyncio.Semaphore performance. Under no contention it seems extremely fast, but it's fairly slow under high contention. I didn't dig deeper, and I didn't compare it to the existing mechanism in aiohttp, though.

If you can come up with another solution that is less complicated, performs similarly, passes all the tests, and doesn't cause a regression, we would be more than happy to accept a PR to streamline the current implementation.

Thanks

@bdraco (Member) commented Nov 13, 2024

3.10.11rc0 has been shipped with the fix. Hopefully a stable release will follow tomorrow.

@bdraco (Member) commented Nov 13, 2024

3.10.11 has been published with the fix.
3.11.0 is in the process of publishing.
