fix: set timeout on socket connect in order to prevent infinite block #35
base: main
Conversation
Hi @jafar-atili, if this were TCP, then I would understand that a timeout is needed. But only UDP is used here, so I'm not sure why the connect would ever block. Or maybe in other words: how can a test be constructed where this timeout is hit?
One fact for sure is that this thing is stuck on our platform. Here is the backtrace of the hung thread:
See this official documentation of the
I don't think that the only reason for And if the user already passed a
That does look like it's stuck on connect...
... I would like to know, though: what, more specifically? And how can a test be constructed for this?
(I asked on SO about this https://stackoverflow.com/questions/76980757/can-a-udp-socket-hang-on-connect-in-python)
Do you have more details of the setup here? Where is this running, and what's the distro in the container, for example?
Actually, if not a test, then a way to reproduce this manually somehow?
Actually, also... here the socket is already non-blocking... so no matter what, it shouldn't block? Does this speak to a bug somewhere at a lower level, as the answer at https://stackoverflow.com/a/76981006/1319998 suggests? Which also suggests that setting a timeout might not do anything?
Please note that in our code we have requested to resolve a hostname and not an IP address.
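For what it's worth, here is a minimal illustration of the distinction being discussed. The addresses are just placeholders, and the hostname is the one mentioned later in this thread; the point is that a UDP connect to an IP literal returns immediately, while a connect to a hostname first goes through a synchronous getaddrinfo() that the non-blocking flag does not cover:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)

# Connecting a UDP socket to an IP literal only records the peer address;
# no packets are sent, so this returns immediately even though the socket
# is non-blocking.
sock.connect(('8.8.8.8', 53))
sock.close()

# Connecting to a hostname is different: CPython resolves the name with a
# synchronous getaddrinfo() call before the actual connect, and that call
# is not affected by setblocking(False). If the resolver path inside the
# container misbehaves, this could, in principle, stall for a long time.
# sock.connect(('apac.ko.com', 53))
```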
The code is running on a large-scale production system inside a container. It does not reproduce easily, but it happens once in a while, and it has already struck more than once. I think the real issue resides in the container (cgroup) networking stack. We have applied the patch locally and are now monitoring. At this stage, I don't have any further information, but we'll keep monitoring.
I think I have a lead: We have

```python
import asyncio
from aiodnsresolver import Resolver, TYPES

resolve, clear_cache = Resolver()
ip_addresses = await resolve('apac.ko.com', TYPES.A)
```

and as a result, this created a When this issue reproduces, we have had many We will have to check this tomorrow.
Oh! The upstream DNS resolver is specified by hostname and not IP address? So aiodnsresolver isn't really written for this, I think. The connect call is likely to use the OS to resolve this hostname to an IP. I have a vague memory that this can maybe block even if the socket is non-blocking? And so then maybe hang forever too… Not sure… It's a slightly unfamiliar setup to me, I have to admit.

But do you need two instances of Resolver? One to find the IP address of the upstream resolver using the default servers in resolv.conf, and then another for your client code that uses this discovered IP address to resolve other hostnames? (See the sketch below.)
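Something along these lines is what I have in mind. This is only a rough sketch following the get_nameservers override pattern from the aiodnsresolver README (if I remember it correctly); 'www.example.com', the port, timeout, and retry count are placeholder assumptions:

```python
import asyncio
from aiodnsresolver import Resolver, TYPES

async def main():
    # First resolver: uses the default nameservers from /etc/resolv.conf
    # to discover the IP address of the upstream resolver's hostname.
    resolve_bootstrap, _clear_bootstrap = Resolver()
    upstream_ips = await resolve_bootstrap('apac.ko.com', TYPES.A)

    # Second resolver: configured with the discovered IP, so later queries
    # always connect to an address literal rather than a hostname.
    async def get_nameservers(_, __):
        for _ in range(3):
            yield (0.5, (str(upstream_ips[0]), 53))

    resolve, clear_cache = Resolver(get_nameservers=get_nameservers)
    ip_addresses = await resolve('www.example.com', TYPES.A)
    print(ip_addresses)

asyncio.run(main())
```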
Background
We had a situation where this `sock.connect` call blocked forever inside a container. In order to avoid such a situation in the future, we ensured the connect is covered by the same timeout we grant the request.
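For readers of the thread, a rough sketch of the kind of guard described above. This is not necessarily the code in this PR's diff; `connect_with_timeout` is a hypothetical helper, and the host, port, and timeout in the usage comment are placeholders:

```python
import asyncio
import socket

async def connect_with_timeout(host, port, timeout):
    # loop.sock_connect resolves the hostname via getaddrinfo in the
    # default executor, so the event loop is not blocked and wait_for can
    # give up after `timeout` seconds even if resolution hangs.
    loop = asyncio.get_running_loop()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    try:
        await asyncio.wait_for(loop.sock_connect(sock, (host, port)), timeout)
    except asyncio.TimeoutError:
        sock.close()
        raise
    return sock

# Usage (inside a coroutine):
#     sock = await connect_with_timeout('apac.ko.com', 53, timeout=5.0)
```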