
deadlock with TCPConnector limit after timeout #9670

Closed
1 task done
davidmanzanares opened this issue Nov 4, 2024 · 15 comments · Fixed by #9671 or #9600
Labels
bug · reproducer: present (This PR or issue contains code which reproduces the problem described, or a clearly understandable STR)

Comments

@davidmanzanares

Describe the bug

When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.

After debugging, I believe the problem occurs in this line: https://github.com/aio-libs/aiohttp/blob/v3.10.10/aiohttp/connector.py#L541

When a ValueError is raised there, it is the result of:

  • Awaiting the future failed, typically with a CancelledError from a timeout.
  • The _release_waiter method was called and tried to wake up a coroutine waiting for this connection. But that coroutine also hit the timeout, so it won't proceed, and since it doesn't proceed it won't wake up any other waiting coroutines.

If the number of waiters never gets to zero, the cycle never ends and no coroutine wakes up to proceed: all coroutines wake up with a CancelledError.

Proposed fix:
Replace the pass in the mentioned line with self._release_waiter(). This seems to resolve the issue by waking another coroutine, which was the original intent of the earlier self._release_waiter() call.
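For illustration, here is a simplified sketch of that waiter pattern (hypothetical names, not aiohttp's actual code) showing where the swallowed wake-up causes the deadlock and how re-releasing breaks it:

import asyncio

class ConnectorSketch:
    """Toy model of the limited-connector wait loop."""

    def __init__(self):
        self._waiters: list[asyncio.Future] = []

    def _release_waiter(self) -> None:
        # Wake exactly one pending waiter, removing it from the list.
        for fut in self._waiters:
            if not fut.done():
                self._waiters.remove(fut)
                fut.set_result(None)
                return

    async def _wait_for_available_connection(self) -> None:
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        try:
            await fut
        except BaseException:
            try:
                self._waiters.remove(fut)
            except ValueError:
                # The future is already gone from the list, which means
                # _release_waiter() picked this coroutine to wake up just
                # as it was being cancelled. With a bare `pass` the wake-up
                # is swallowed and every remaining waiter sleeps forever;
                # handing it to another waiter breaks the cycle:
                self._release_waiter()
            raise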

Another way to fix it might be to use asyncio.Semaphore with a context manager.
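A minimal sketch of that idea (a hypothetical helper, ignoring the overall limit and the per-host cleanup discussed further down), using one semaphore per connection key and the connector limit of 10 from the reproducer below:

import asyncio
from collections import defaultdict

# Hypothetical: one semaphore per connection key (host, port, ssl, ...).
_per_key: defaultdict = defaultdict(lambda: asyncio.Semaphore(10))

async def limited_request(key, do_request):
    # The async context manager releases the slot even when the body is
    # cancelled by a timeout, which is the property the hand-rolled
    # waiter list loses here.
    async with _per_key[key]:
        return await do_request()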

To Reproduce

  1. Start a small server script that responds slowly only some of the time. For example, run this with fastapi dev aio_bug_server.py:
from fastapi import FastAPI
import asyncio
from time import time

app = FastAPI()

sleep_time = 0.3
timestamp_last_change = time()
last_recv = time()

@app.get("/getme")
async def getme():
    global sleep_time, timestamp_last_change, last_recv
    if time() - last_recv > 8:
        # Go "up", forcing zero sleep time if we haven't heard from clients in a while
        sleep_time = 0
    elif time() - timestamp_last_change > 2:
        # Toggle between fast and slow responses roughly every 2 seconds
        sleep_time = 0 if sleep_time else 0.3
        timestamp_last_change = time()
    print(f"recv {sleep_time=:.3f}")
    last_recv = time()

    await asyncio.sleep(sleep_time)
    print(f"fin {sleep_time=:.3f}")
    return {
        "asdf": "asdf",
    }
  2. Start a script that uses aiohttp to make concurrent requests to the server and count the timeout/OK results. For example:
import uvloop
uvloop.install()

import asyncio
import aiohttp
from time import time

timeouts = 0
errors = 0
oks = 0

async def do_request(session: aiohttp.ClientSession):
    global errors, timeouts, oks
    try:
        async with session.get("http://127.0.0.1:8000/getme") as resp:
            await resp.json()
        oks += 1
    except TimeoutError:
        timeouts += 1
    except BaseException as exc:
        print(f"UNEXPECTED EXCEPTION {exc=}")
        errors += 1



async def main():
    global timeouts, oks
    async with aiohttp.ClientSession(
                    connector=aiohttp.TCPConnector(limit=10),
                    timeout=aiohttp.ClientTimeout(total=0.2), 
                ) as session:
        tasks = set()
        done_reqs = 0
        i = 0
        t = time()
        last_ok = t
        while True:
            i += 1
            if i % 100 == 0:
                dt = time() - t
                print(f"{done_reqs/dt:.1f}req/s {oks=} {timeouts=}")
                if oks:
                    last_ok = time()
                if time() - last_ok > 10:
                    # We likely hit the issue; at this point the server
                    # should not have received anything in a while
                    print("ISSUE DETECTED")
                    t = time()
                    try:
                        async with session.get("http://127.0.0.1:8000/getme?q=a") as resp:
                            await resp.json()
                    except Exception as exc:
                        print(f"LAST REQUEST ERROR: EXC{exc=} {time()-t}")
                    for task in tasks:
                        await task
                    print(f"NEW REQUEST AFTER AWAITING ALL TASKS")
                    t = time()
                    try:
                        async with session.get("http://127.0.0.1:8000/getme?q=b") as resp:
                            await resp.json()
                    except Exception:
                        raise
                    finally:
                        print("FINAL DT:", time()-t)
                    return

                timeouts = 0
                oks = 0

            if len(tasks) >= 200:
                done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
                done_reqs += len(done)
            tasks.add(asyncio.create_task(do_request(session)))

asyncio.run(main())
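With the server from step 1 running, the script prints ISSUE DETECTED once no request has succeeded for 10 seconds, then issues two more requests (one immediately, and one after awaiting every outstanding task) to show whether the connector ever recovers.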

Expected behavior

All requests should be sent; previous timeouts shouldn't change that.

Logs/tracebacks

no tracebacks

Python Version

$ python --version
Python 3.12.7

aiohttp Version

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.10.10
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: 
Author-email: 
License: Apache 2
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: aiohappyeyeballs, aiosignal, attrs, frozenlist, multidict, yarl
Required-by:

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.1.0
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache 2
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: 
Required-by: aiohttp, yarl

propcache Version

$ python -m pip show propcache
Name: propcache
Version: 0.2.0
Summary: Accelerated property cache
Home-page: https://github.com/aio-libs/propcache
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache-2.0
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: 
Required-by: yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.17.1
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache-2.0
Location: /home/dv/code/search/services/local/venv/lib/python3.12/site-packages
Requires: idna, multidict, propcache
Required-by: aiohttp

OS

Linux

Related component

Client

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
bdraco added the reproducer: present label Nov 4, 2024
bdraco changed the title from "sticky" timeouts with TCPConnector limit to deadlock with TCPConnector limit after timeout Nov 4, 2024
@bdraco (Member) commented Nov 4, 2024

It would be great if you could come up with a reproducer without external dependencies, as we will need to be able to create a test for this at some point.
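For what it's worth, the server half could probably be rewritten with aiohttp's own web server to drop the FastAPI/uvicorn dependency; a rough, untested sketch of the same toggling behavior:

import asyncio
from time import time

from aiohttp import web

sleep_time = 0.3
timestamp_last_change = time()
last_recv = time()

async def getme(request: web.Request) -> web.Response:
    global sleep_time, timestamp_last_change, last_recv
    now = time()
    if now - last_recv > 8:
        # No clients for a while: force zero sleep time again
        sleep_time = 0
    elif now - timestamp_last_change > 2:
        # Toggle between fast and slow responses roughly every 2 seconds
        sleep_time = 0 if sleep_time else 0.3
        timestamp_last_change = now
    last_recv = now
    await asyncio.sleep(sleep_time)
    return web.json_response({"asdf": "asdf"})

app = web.Application()
app.router.add_get("/getme", getme)

if __name__ == "__main__":
    web.run_app(app, port=8000)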

@bdraco (Member) commented Nov 4, 2024

It looks like there are two race points:

The available check and the key in self._waiters check could have cancelled futures there.

The ValueError suppression is a race point as well.

@bdraco (Member) commented Nov 4, 2024

Looks like it's been a problem for a long time; it's reproducible on 3.9.5 as well.

@bdraco (Member) commented Nov 4, 2024

I'm not sure it's fixable with the current design.

key in self._waiters is not atomic with respect to whether the connection is available, so it's always going to be out of sync.

@bdraco (Member) commented Nov 4, 2024

Synchronization between all the limits and waiters is turning out to be really hard to get right.

I made an attempt in #9671, and it does fix the issue, but now the limit can be exceeded.

@bdraco (Member) commented Nov 5, 2024

#9671 passes all the tests, works in production testing, and fixes this issue.

Sadly it's quite a bit larger than I had hoped, because there were so many places that were not cancellation-safe.

@bdraco (Member) commented Nov 5, 2024

Pretty happy with #9671 now. I still need to write some tests for it.

@davidmanzanares (Author):

Hi, forgive my ignorance, but couldn't we just use asyncio.Semaphore (one per connection key, in a dictionary)? It looks to me like this is quite hard to get right.

To be honest, I didn't follow this part:

The available check and the key in self._waiters check could have cancelled futures there.

Another thing, regarding:

reproducible on 3.9.5 as well

Yeah, I saw that too. Although I started looking into this because, for some unknown reason, the update to 3.10.10 surfaced this in a production environment.

Thanks for looking into it!

@bdraco (Member) commented Nov 5, 2024

Hi, forgive my ignorance, but couldn't we just use asyncio.Semaphore (one per connection key, in a dictionary)? It looks to me like this is quite hard to get right.

A Semaphore would have some of the same challenges:

We have two limits we need to contend with:

  • Overall limit
  • Limit per host

We also need to drop the key from the dict when its value reaches zero so we don't leak memory.

It might work with two Semaphores: one for the overall limit, and one per host. However, we would probably need an asyncio.Condition instead. I expect we could get it to work with a lot of refactoring, but the performance would likely be much worse, given how Condition is implemented, or the cost of having to check two Semaphores.

It would also have some of the same problems with how the bookkeeping is decremented when a connection is released, which would add more complexity given the additional level of abstraction.
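To make that bookkeeping concern concrete, here is a sketch of the two-Semaphore idea (a hypothetical class, not a proposed aiohttp API), including the per-host cleanup needed to avoid leaking one dict entry per host:

import asyncio
from contextlib import asynccontextmanager

class TwoLevelLimiter:
    def __init__(self, limit: int = 100, limit_per_host: int = 10):
        self._total = asyncio.Semaphore(limit)
        self._limit_per_host = limit_per_host
        self._per_host: dict[str, asyncio.Semaphore] = {}
        self._users: dict[str, int] = {}  # acquirers + waiters per host

    @asynccontextmanager
    async def limit(self, host: str):
        sem = self._per_host.setdefault(
            host, asyncio.Semaphore(self._limit_per_host)
        )
        self._users[host] = self._users.get(host, 0) + 1
        try:
            # Acquire in a consistent order (total first, then per-host)
            # to avoid lock-ordering deadlocks.
            async with self._total, sem:
                yield
        finally:
            self._users[host] -= 1
            if not self._users[host]:
                # Last user for this host: drop the per-host entries so
                # the dicts don't grow forever.
                del self._users[host], self._per_host[host]

Usage would be async with limiter.limit(host): ... around each connection attempt; the hard part described above is doing the equivalent release bookkeeping when a pooled connection, rather than a context manager, is handed back.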


@bdraco (Member) commented Nov 5, 2024

I will ship another 3.11 beta soon, hopefully today.

If all goes well with it, I'll get a 3.10.11 out a few days after

@bdraco (Member) commented Nov 6, 2024

@davidmanzanares 3.11.0b3 has been published with the fix.

Are you able to give it a test?

@davidmanzanares (Author):

I've just tested it; it seems to work well, thanks.

Btw, I took a quick look at asyncio.Semaphore performance. Under no contention it seems extremely fast, but it's fairly slow under high contention. I didn't dig deeper, and I didn't compare it to the existing mechanism in aiohttp, though.
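For reference, a rough micro-benchmark of that contention effect could look like this (a hypothetical sketch, not the exact measurement used; numbers vary by machine and Python version):

import asyncio
from time import perf_counter

async def bench(n_tasks: int, limit: int) -> float:
    sem = asyncio.Semaphore(limit)

    async def worker() -> None:
        for _ in range(100):
            async with sem:
                await asyncio.sleep(0)  # yield while holding a slot

    start = perf_counter()
    await asyncio.gather(*(worker() for _ in range(n_tasks)))
    return perf_counter() - start

async def main() -> None:
    print("low contention: ", await bench(n_tasks=10, limit=1000))
    print("high contention:", await bench(n_tasks=1000, limit=10))

asyncio.run(main())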

@bdraco (Member) commented Nov 6, 2024

I've just tested it; it seems to work well, thanks.

Thanks for testing

Btw, I took a quick look at asyncio.Semaphore performance. Under no contention it seems extremely fast, but it's fairly slow under high contention. I didn't dig deeper, and I didn't compare it to the existing mechanism in aiohttp, though.

If you can come up with another solution that is less complicated, performs similarly, passes all the tests, and doesn't cause a regression, we would be more than happy to accept a PR to streamline the current implementation.

Thanks

@bdraco (Member) commented Nov 13, 2024

3.10.11rc0 has been shipped with the fix. Hopefully a stable release will follow tomorrow.

@bdraco (Member) commented Nov 13, 2024

3.10.11 has been published with the fix.
3.11.0 is in the process of publishing.
