fix: set timeout on socket connect in order to prevent infinite block #35

Open · wants to merge 2 commits into main
Conversation

@jafar-atili commented Aug 25, 2023

Background

We had a situation where this sock.connect blocked forever inside a container. To avoid such a situation in the future, we ensured the connect call is covered by the same timeout we granted the request.
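
For context, a minimal standalone sketch of the idea (not necessarily the exact diff in this PR, and outside the library's non-blocking setup): apply the request's timeout to the socket before calling connect, so connect cannot outlive the deadline the caller granted to the request.

import socket

def connected_udp_socket(addr, timeout):
    # Hypothetical helper for illustration only: the request's timeout is
    # set on the socket before connect, so a blocking connect is bounded
    # by the same deadline as the rest of the request.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
    except Exception:
        sock.close()
        raise
    return sock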

@michalc (Owner) commented Aug 25, 2023

Hi @jafar-atili,

If this were TCP, then I would understand that a timeout is needed. But only UDP is used here, so a connect is more of an artefact of the socket API. Nothing is sent out as part of connect, and so it's not waiting for a response?

Or maybe in other words - how can a test be constructed where this timeout is hit?

@jafar-atili (Author) commented Aug 25, 2023

One thing is for sure: it is stuck on our platform. Here is the backtrace of the hung thread:

Thread 0x7F30AB7FE700 (active): "ThreadPoolExecutor-6_0"
    request (aiodnsresolver.py:513)
    request_with_timeout (aiodnsresolver.py:481)
    request_until_response (aiodnsresolver.py:454)
    request_and_cache (aiodnsresolver.py:421)
    runner (aiodnsresolver.py:648)
    request_memoized (aiodnsresolver.py:418)
    resolve (aiodnsresolver.py:389)
   ****

See the official documentation of Python's socket module:

The connect() operation is also subject to the timeout setting, and in general it is recommended to call settimeout() before calling connect() or pass a timeout parameter to create_connection(). However, the system network stack may also return a connection timeout error of its own regardless of any Python socket timeout setting.

I don't think the only reason for connect to get stuck is SYN sent but no SYN-ACK (as you described for TCP); this system call is subject to other things related to the networking stack.

And if the user already passed a timeout parameter, it is good practice to respect that timeout across the whole request operation (i.e. including setting up the socket).

@michalc (Owner) commented Aug 25, 2023

Here is the backtrace of the hung thread:

That does look like it's stuck on connect...

This System Call is subject to other things related to the networking stack.

... I would like to know though, what more specifically? And how can a test be constructed for this?

@michalc (Owner) commented Aug 26, 2023

(I asked on SO about this https://stackoverflow.com/questions/76980757/can-a-udp-socket-hang-on-connect-in-python)

blocked forever inside a container

Do you have more details of the setup here? Where is this running, what's the distro in the container for example?

@michalc (Owner) commented Aug 26, 2023

Actually if not a test, then a way to reproduce this manually somehow?

@michalc (Owner) commented Aug 26, 2023

Actually also... here the socket is already non-blocking... so no matter what, it shouldn't block? Does this speak to a bug somewhere at a lower level, as the answer at https://stackoverflow.com/a/76981006/1319998 suggests? Which also suggests setting a timeout might not do anything?
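
For what it's worth, a quick sanity check of that expectation, using a plain UDP socket (the address here is just illustrative):

import socket

# A non-blocking UDP "connect" to an IP literal only records the peer
# address: no packet is sent and nothing is waited for, so the call
# returns immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)
sock.connect(('203.0.113.1', 53))
print(sock.getpeername())
sock.close()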

@jafar-atili (Author)

Please note that in our code we have requested to resolve a hostname and not an IP address.

Thread 0x7F30AB7FE700 (active): "ThreadPoolExecutor-6_0"
    request (aiodnsresolver.py:513)
        Arguments:
            logger: <ResolverLoggerAdapter at 0x7f325c3ea830>
            addrs: (("apac.ko.com", 53))
            set_timeout_cause: <function at 0x7f3063b42a70>
        Locals:
            req: <function at 0x7f3063b40940>
            socks: (<socket at 0x7f3088288160>)
            connections: {}
            last_exception: <OSError at 0x7f306c35e7c0>
            addr_port: ("apac.ko.com", 53)
            sock: <socket at 0x7f3088288160>
    request_with_timeout (aiodnsresolver.py:481)
        Arguments:
            logger: <ResolverLoggerAdapter at 0x7f325c3ea830>
            timeout: 2
            addrs: (("apac.ko.com", 53))
            fqdn: <BytesExpiresAt at 0x7f305fd57040>
            qtype: 1
        Locals:
            cancel: <function at 0x7f3063b43760>
            handle: <TimerHandle at 0x7f305cf73840>
            set_timeout_cause: <function at 0x7f3063b42a70>
    request_until_response (aiodnsresolver.py:454)

The code is running on a large-scale production system inside a container. It does not reproduce easily, but it does happen once in a while; it has already struck more than once.

I think the real issue resides in the container (cgroup) networking stack. We have applied the patch locally and are now monitoring.

At this stage, I don't have any further information, but we'll keep monitoring.

@jafar-atili (Author) commented Aug 26, 2023

I think I have a lead:

We have a dnsmasq service running in the container. I started a new IPython session in the container and ran the following code:

import asyncio
from aiodnsresolver import Resolver, TYPES
 
resolve, clear_cache = Resolver()
ip_addresses = await resolve('apac.ko.com', TYPES.A)

As a result, this created a dnsmasq zombie process; I ran it again and got another dnsmasq zombie process.

Whenever this issue has reproduced, we have had many dnsmasq defunct processes, and I think this large number of zombie processes has something to do with the stuck connect().

We will have to check this tomorrow.

@michalc (Owner) commented Aug 26, 2023

Please note that in our code we have requested to resolve a hostname and not an IP address.

Oh! The upstream dns resolver is specified by hostname and not IP address?

So aiodnsresolver isn’t really written for this, I think. The connect call is likely to use the OS to resolve this hostname to an IP. I have a vague memory that this can maybe block even if the socket is non-blocking? And so then maybe hang forever too… Not sure…
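
To illustrate the suspicion (this is a sketch of the suspected mechanism, not a confirmed diagnosis; the hostname is just the one from the backtrace): when connect is given a hostname rather than an IP literal, Python first resolves it through the OS (effectively getaddrinfo), and that lookup blocks regardless of the socket being non-blocking or having a timeout set.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)

# IP literal: nothing to resolve, so this returns immediately.
sock.connect(('203.0.113.1', 53))

# Hostname: the OS lookup (getaddrinfo) runs first, and it is a blocking
# call that neither setblocking(False) nor settimeout() applies to, so it
# can hang for as long as the system resolver (e.g. dnsmasq) hangs.
# sock.connect(('apac.ko.com', 53))

sock.close()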

It’s a slightly unfamiliar setup to me I have to admit. But do you need two instances of Resolver? One to find the IP address of the upstream resolver using the default servers in resolv.conf. And then another for your client code that uses this discovered IP address to resolve other host names?
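
If it helps, a sketch of that two-Resolver idea, assuming the get_nameservers hook works as in the aiodnsresolver README (an async generator yielding (timeout, (ip, port)) tuples); the hostnames are just illustrative:

import asyncio
from aiodnsresolver import Resolver, TYPES

async def main():
    # First Resolver: uses the default nameservers from resolv.conf to
    # find the IP address of the upstream resolver given by hostname.
    resolve_bootstrap, clear_cache_bootstrap = Resolver()
    upstream_ips = await resolve_bootstrap('apac.ko.com', TYPES.A)

    # Second Resolver: sends its queries to the discovered IP address,
    # so no hostname ever reaches sock.connect.
    async def get_nameservers(_, __):
        for _ in range(5):
            yield (0.5, (str(upstream_ips[0]), 53))

    resolve, clear_cache = Resolver(get_nameservers=get_nameservers)
    ip_addresses = await resolve('www.example.com', TYPES.A)
    print(ip_addresses)

asyncio.run(main())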
