fix: set timeout on socket connect in order to prevent infinite block #35

Open · wants to merge 2 commits into main
Conversation

@jafar-atili commented Aug 25, 2023

Background

We had a situation where this sock.connect blocked forever inside a container. To avoid such a situation in the future, we ensured the connect call is covered by the same timeout we granted the request.
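
For context, a minimal standalone sketch of the idea (not necessarily the exact diff in this PR, and outside the library's non-blocking setup): apply the request's timeout to the socket before calling connect, so connect cannot outlive the deadline the caller granted to the request.

import socket

def connected_udp_socket(addr, timeout):
    # Hypothetical helper for illustration only: the request's timeout is
    # set on the socket before connect, so a blocking connect is bounded
    # by the same deadline as the rest of the request.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
    except Exception:
        sock.close()
        raise
    return sock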

@michalc (Owner) commented Aug 25, 2023

Hi @jafar-atili,

If this were TCP, then I would understand that a timeout is needed. But only UDP is used here, so a connect is more of an artefact of the socket API. Nothing is sent out as part of connect, and so it's not waiting for a response?

Or maybe in other words - how can a test be constructed where this timeout is hit?

@jafar-atili (Author) commented Aug 25, 2023

One thing is for sure: it is stuck on our platform. Here is the backtrace of the hung thread:

Thread 0x7F30AB7FE700 (active): "ThreadPoolExecutor-6_0"
    request (aiodnsresolver.py:513)
    request_with_timeout (aiodnsresolver.py:481)
    request_until_response (aiodnsresolver.py:454)
    request_and_cache (aiodnsresolver.py:421)
    runner (aiodnsresolver.py:648)
    request_memoized (aiodnsresolver.py:418)
    resolve (aiodnsresolver.py:389)
   ****

See the official documentation of Python's socket module:

The connect() operation is also subject to the timeout setting, and in general it is recommended to call settimeout() before calling connect() or pass a timeout parameter to create_connection(). However, the system network stack may also return a connection timeout error of its own regardless of any Python socket timeout setting.

I don't think the only reason for connect to get stuck is SYN sent but no SYN-ACK (as you described for TCP); this system call is subject to other things related to the networking stack.

And if the user already passed a timeout parameter, it is good practice to respect that timeout across the whole request operation (i.e. including setting up the socket).

@michalc (Owner) commented Aug 25, 2023

Here is the backtrace of the hung thread:

That does look like it's stuck on connect...

This System Call is subject to other things related to the networking stack.

... I would like to know though, what more specifically? And how can a test be constructed for this?

@michalc (Owner) commented Aug 26, 2023

(I asked on SO about this https://stackoverflow.com/questions/76980757/can-a-udp-socket-hang-on-connect-in-python)

blocked forever inside a container

Do you have more details of the setup here? Where is this running, what's the distro in the container for example?

@michalc (Owner) commented Aug 26, 2023

Actually if not a test, then a way to reproduce this manually somehow?

@michalc (Owner) commented Aug 26, 2023

Actually also... here the socket is already non-blocking... so no matter what, it shouldn't block? Does this speak to a bug somewhere at a lower level, as the answer at https://stackoverflow.com/a/76981006/1319998 suggests? Which also suggests setting a timeout might not do anything?
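
For what it's worth, a quick sanity check of that expectation, using a plain UDP socket (the address here is just illustrative):

import socket

# A non-blocking UDP "connect" to an IP literal only records the peer
# address: no packet is sent and nothing is waited for, so the call
# returns immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)
sock.connect(('203.0.113.1', 53))
print(sock.getpeername())
sock.close()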

@jafar-atili (Author)

Please note that in our code we have requested to resolve a hostname and not an IP address.

Thread 0x7F30AB7FE700 (active): "ThreadPoolExecutor-6_0"
    request (aiodnsresolver.py:513)
        Arguments:
            logger: <ResolverLoggerAdapter at 0x7f325c3ea830>
            addrs: (("apac.ko.com", 53))
            set_timeout_cause: <function at 0x7f3063b42a70>
        Locals:
            req: <function at 0x7f3063b40940>
            socks: (<socket at 0x7f3088288160>)
            connections: {}
            last_exception: <OSError at 0x7f306c35e7c0>
            addr_port: ("apac.ko.com", 53)
            sock: <socket at 0x7f3088288160>
    request_with_timeout (aiodnsresolver.py:481)
        Arguments:
            logger: <ResolverLoggerAdapter at 0x7f325c3ea830>
            timeout: 2
            addrs: (("apac.ko.com", 53))
            fqdn: <BytesExpiresAt at 0x7f305fd57040>
            qtype: 1
        Locals:
            cancel: <function at 0x7f3063b43760>
            handle: <TimerHandle at 0x7f305cf73840>
            set_timeout_cause: <function at 0x7f3063b42a70>
    request_until_response (aiodnsresolver.py:454)

The code is running on a large-scale production system inside a container. It does not reproduce easily, but it does happen once in a while; it has already struck more than once.

I think the real issue resides in the container (cgroup) networking stack. We have applied the patch locally and are now monitoring.

At this stage, I don't have any further information, but we'll keep monitoring.

@jafar-atili (Author) commented Aug 26, 2023

I think I have a lead:

We have a dnsmasq service running in the container. I started a new IPython session in the container and ran the following code:

import asyncio
from aiodnsresolver import Resolver, TYPES
 
resolve, clear_cache = Resolver()
ip_addresses = await resolve('apac.ko.com', TYPES.A)

As a result, this created a dnsmasq zombie process; I ran it again and got another dnsmasq zombie process.

Whenever this issue has reproduced, we have had many dnsmasq defunct processes, and I think this large number of zombie processes has something to do with the stuck connect().

We will have to check this tomorrow.

@michalc (Owner) commented Aug 26, 2023

Please note that in our code we have requested to resolve a hostname and not an IP address.

Oh! The upstream dns resolver is specified by hostname and not IP address?

So aiodnsresolver isn’t really written for this, I think. The connect call is likely to use the OS to resolve this hostname to an IP. I have a vague memory that this can maybe block even if the socket is non-blocking? And so then maybe hang forever too… Not sure…
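
To illustrate the suspicion (this is a sketch of the suspected mechanism, not a confirmed diagnosis; the hostname is just the one from the backtrace): when connect is given a hostname rather than an IP literal, Python first resolves it through the OS (effectively getaddrinfo), and that lookup blocks regardless of the socket being non-blocking or having a timeout set.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)

# IP literal: nothing to resolve, so this returns immediately.
sock.connect(('203.0.113.1', 53))

# Hostname: the OS lookup (getaddrinfo) runs first, and it is a blocking
# call that neither setblocking(False) nor settimeout() applies to, so it
# can hang for as long as the system resolver (e.g. dnsmasq) hangs.
# sock.connect(('apac.ko.com', 53))

sock.close()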

It’s a slightly unfamiliar setup to me I have to admit. But do you need two instances of Resolver? One to find the IP address of the upstream resolver using the default servers in resolv.conf. And then another for your client code that uses this discovered IP address to resolve other host names?
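
If it helps, a sketch of that two-Resolver idea, assuming the get_nameservers hook works as in the aiodnsresolver README (an async generator yielding (timeout, (ip, port)) tuples); the hostnames are just illustrative:

import asyncio
from aiodnsresolver import Resolver, TYPES

async def main():
    # First Resolver: uses the default nameservers from resolv.conf to
    # find the IP address of the upstream resolver given by hostname.
    resolve_bootstrap, clear_cache_bootstrap = Resolver()
    upstream_ips = await resolve_bootstrap('apac.ko.com', TYPES.A)

    # Second Resolver: sends its queries to the discovered IP address,
    # so no hostname ever reaches sock.connect.
    async def get_nameservers(_, __):
        for _ in range(5):
            yield (0.5, (str(upstream_ips[0]), 53))

    resolve, clear_cache = Resolver(get_nameservers=get_nameservers)
    ip_addresses = await resolve('www.example.com', TYPES.A)
    print(ip_addresses)

asyncio.run(main())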
