Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pasta, parallel: test command timeout #23471

Closed
edsantiago opened this issue Aug 1, 2024 · 29 comments · Fixed by #23561
Closed

pasta, parallel: test command timeout #23471

edsantiago opened this issue Aug 1, 2024 · 29 comments · Fixed by #23561
Assignees
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. pasta pasta(1) bugs or features

Comments

@edsantiago
Copy link
Member

Seeing this occasionally in #23275. The actual test name varies. The common aspect is:

<+319ms> # $ podman run --rm --name=c-socat-t152-imuqzpfw --net=pasta -p [10.128.0.101]:5746:5746/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5746 5746); do                              socat -u UDP4-LISTEN:${port},bind=[10.128.0.101],null-eof STDOUT &                          done; wait
<+0130s> # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
         # timeout: sending signal KILL to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+006ms> # [ rc=137 (** EXPECTED 0 **) ]
x x x x x x
sys(3) remote(2) fedora-40(2) rootless(3) host(3) sqlite(2)
podman(1) fedora-39(1) boltdb(1)
@edsantiago edsantiago added flakes Flakes from Continuous Integration pasta pasta(1) bugs or features labels Aug 1, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

One of the socat processes is not exiting causing the wait in bash to hang until we SIGKILL the entire container
I attached strace to socat and saw this

select(6, [5], [], [], NULL <unfinished ...>) = ?
+++ killed by SIGKILL +++

So it seems to wait for more io I think?

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

I added pcap captures to pasta to see what is send

✗ |505| UDP port range forwarding, IPv4, tap [130790]
   tags: ci:parallel
   (from function `bail-now' in file test/system/helpers.bash, line 192,
    from function `die' in file test/system/helpers.bash, line 940,
    from function `run_podman' in file test/system/helpers.bash, line 542,
    from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 235,
    in test file test/system/505-networking-pasta.bats, line 581)
     `pasta_test_do' failed
   
   [11:59:25.297039417] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{.Host.Pasta.Executable}}
   [11:59:25.493535966] /usr/bin/pasta
   
   [11:59:25.872642490] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman run --rm --name=c-socat-t45-eiqa3glo --net=pasta:--pcap,/tmp/pcap-t45-eiqa3glo -p [192.168.188.25]:5040-5042:5040-5042/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5040 5042); do                              socat -u UDP4-LISTEN:${port},bind=[192.168.188.25],null-eof STDOUT &                          done; wait
   [12:01:35.883359113] xxtimeout: sending signal TERM to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   timeout: sending signal KILL to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   [12:01:35.886754948] [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]

this shows a test with three senders/receivers but in the pcap file we can see that only for two of the three our packages were seen by pasta

...
    6   1.874969 192.168.188.1 → 192.168.188.25 UDP 43 59819 → 5041 Len=1
    7   1.874986 192.168.188.1 → 192.168.188.25 UDP 42 59819 → 5041 Len=0
    8   1.876526 192.168.188.1 → 192.168.188.25 UDP 43 39208 → 5042 Len=1
    9   1.876533 192.168.188.1 → 192.168.188.25 UDP 42 39208 → 5042 Len=0
...

For port 5040 there were no packages which could mean the packages were never send or dropped on the host side somehow
cc @sbrivio-rh

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

Even more weird

 ✗ |505| UDP%2FIPv4 small transfer, tap [130494]
   tags: ci:parallel
   (from function `bail-now' in file test/system/helpers.bash, line 192,
    from function `die' in file test/system/helpers.bash, line 940,
    from function `run_podman' in file test/system/helpers.bash, line 542,
    from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 235,
    in test file test/system/505-networking-pasta.bats, line 709)
     `pasta_test_do' failed
   
   [12:09:45.887431350] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{.Host.Pasta.Executable}}
   [12:09:46.020700033] /usr/bin/pasta
   1+0 records in
   1+0 records out
   2048 bytes (2.0 kB, 2.0 KiB) copied, 8.308e-05 s, 24.7 MB/s
   
   [12:09:46.211772573] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman run --rm --name=c-socat-t75-lartez00 --net=pasta:--pcap,/tmp/pcap-t75-lartez00 -p [192.168.188.25]:5520:5520/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5520 5520); do                              socat -u UDP4-LISTEN:${port},bind=[192.168.188.25],null-eof EXEC:md5sum &                          done; wait
   [12:11:56.220491495] timeout: sending signal TERM to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   timeout: sending signal KILL to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   [12:11:56.223402969] [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
 ✓ |505| UDP%2FIPv4 small transfer, loopback [2537]
 ✗ |505| UDP%2FIPv4 large transfer, tap [2653]
   tags: ci:parallel
   (from function `bail-now' in file test/system/helpers.bash, line 192,
    from function `assert' in file test/system/helpers.bash, line 1056,
    from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 241,
    in test file test/system/505-networking-pasta.bats, line 717)
     `pasta_test_do' failed
   
   [12:09:45.914795887] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{.Host.Pasta.Executable}}
   [12:09:46.040656141] /usr/bin/pasta
   1+0 records in
   1+0 records out
   53248 bytes (53 kB, 52 KiB) copied, 0.000209959 s, 254 MB/s
   
   [12:09:46.237577448] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman run --rm --name=c-socat-t77-nnzhgdut --net=pasta:--pcap,/tmp/pcap-t77-nnzhgdut -p [192.168.188.25]:5520:5520/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5520 5520); do                              socat -u UDP4-LISTEN:${port},bind=[192.168.188.25],null-eof EXEC:md5sum &                          done; wait
   [12:09:48.394902905] 9edd44bb9254c3cdc9f67f27effb5d1c  -
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Mismatch between data sent and received
   #| expected: = c651a3aed94c6a4e95d2ba3f23f6e9c7  -
   #|   actual:   9edd44bb9254c3cdc9f67f27effb5d1c  -
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]

Two tests using the same port (5520) at the same time and pasta is not failing to bind it with EADDRINUSE?

@sbrivio-rh
Copy link
Collaborator

Oh, oops, setting the REUSEADDR socket option on UDP (as we do):

$ strace -e bind -f ./pasta -u 9876 -t none -U none -T none -- sleep 60
bind(5, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(9876), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(9876), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0

and exactly the same in another terminal. If I drop REUSEADDR:

$ strace -e bind -f ./pasta -u 9876 -t none -U none -T none -- sleep 60
bind(5, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(9876), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(6, {sa_family=AF_INET6, sin6_port=htons(9876), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EADDRINUSE (Address already in use)
Failed to bind port 9876 (Address already in use) for option '-u 9876', exiting
+++ exited with 1 +++

I didn't expect that. On TCP it fails as expected.

Actually, we realised that for unrelated reasons, and started exploiting this behaviour (but not for UDP port forwarding), which is now reverse engineereddocumented at: https://passt.top/passt/tree/doc/platform-requirements/reuseaddr-priority.c

For port forwarding sockets, we should either disable REUSEADDR (@dgibson, any concern with that?) or explicitly check in another way if ports are conflicting.

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

I think that explains the hangs as well, all the ADDRINUSE errors were seen with tcp so I assume if that happens in the udp case it can hang there because another socket with the same port got the traffic causing the hang as the socat in the container never got the "null-eof" udp package to terminate.

I am not familiar with REUSEADDR but reading the docs

Indicates that the rules used in validating addresses
supplied in a bind(2) call should allow reuse of local
addresses. For AF_INET sockets this means that a socket
may bind, except when there is an active listening socket
bound to the address. When the listening socket is bound
to INADDR_ANY with a specific port then it is not possible
to bind to this port for any local address. Argument is
an integer boolean flag.

And udp sockets do not count as listening sockets I guess and thus not cause a bind failure.

@sbrivio-rh
Copy link
Collaborator

I think that explains the hangs as well, all the ADDRINUSE errors were seen with tcp

Oh, I didn't see those.

so I assume if that happens in the udp case it can hang there because another socket with the same port got the traffic causing the hang as the socat in the container never got the "null-eof" udp package to terminate.

Right. This is actually an issue in the tests I would say, even though we should fix pasta as well.

I am not familiar with REUSEADDR but reading the docs

Indicates that the rules used in validating addresses
supplied in a bind(2) call should allow reuse of local
addresses. For AF_INET sockets this means that a socket
may bind, except when there is an active listening socket
bound to the address. When the listening socket is bound
to INADDR_ANY with a specific port then it is not possible
to bind to this port for any local address. Argument is
an integer boolean flag.

And udp sockets do not count as listening sockets I guess

Not so much, in the Linux kernel the listening state for UDP is represented exactly in the same way as for TCP. In general, UDP can have "listening" sockets, in kernel terms.

and thus not cause a bind failure.

The INADDR_ANY clause is misleading, because, for TCP (and that's the reason why we use SO_REUSEADDR, so that users can forward ports differently based on different addresses), you get, without SO_REUSEADDR:

$ strace -e bind -f ./pasta -t 1.2.3.1/9876 -u none -U none -T none -- sleep 60
bind(5, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(9876), sin_addr=inet_addr("1.2.3.1")}, 16) = 0

and:

$ strace -e bind -f ./pasta -t 9876 -u none -U none -T none -- sleep 60
bind(5, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
bind(6, {sa_family=AF_INET6, sin6_port=htons(9876), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EADDRINUSE (Address already in use)
bind(6, {sa_family=AF_INET, sin_port=htons(9876), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
bind(6, {sa_family=AF_INET6, sin6_port=htons(9876), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0

but with SO_REUSEADDR, that succeeds. My interpretation of the second sentence there (with the addition of "same" before "address") is that it describes the behaviour with SO_REUSEADDR, but the third one describes the behaviour without it.

We should eventually send a patch for that, now that we have a test program it should be doable.

Back to pasta: if we disable SO_REUSEADDR for UDP, we lose the ability to do exactly what I described above: bind UDP ports using different (local) addresses so that we can handle traffic differently. I guess we need to find another way to check if there's another process binding to the same exact (local) address and port.

The /proc/net/udp and /proc/net/udp6 entries would be perfect, except that pasta can't see its own PID at runtime. And yes, we could probably check them at configuration time, when pasta is not sandboxed so strictly, but then we would run into the same issue the day we finally implement dynamic port forwarding at runtime.

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

The /proc/net/udp and /proc/net/udp6 entries would be perfect, except that pasta can't see its own PID at runtime. And yes, we could probably check them at configuration time, when pasta is not sandboxed so strictly, but then we would run into the same issue the day we finally implement dynamic port forwarding at runtime.

Looking this up somewhere just sounds racy to me and more complicated. IMO either bind fails or not.

bind UDP ports using different (local) addresses so that we can handle traffic differently

But isn't this only true if you want to bind 0.0.0.0 and a specific ip at the same there. You do not need SO_REUSEADDR to bind 127.0.0.1 and 127.0.0.2 for example. We do not use SO_REUSEADDR when binding ports as root. I don't find a mention of REUSEADDR in slirp4netns or our rootlessport forwarder either and AFAIK we never had any user ask for such behavior so I don't think this important to use REUSEADDR at all. I could see that there are cases for it but shouldn't these be opt-in then?

The fact that SO_REUSEADDR with udp behaves like that seems really odd to me. Isn't that what SO_REUSEPORT is supposed to allow?

@sbrivio-rh
Copy link
Collaborator

The /proc/net/udp and /proc/net/udp6 entries would be perfect, except that pasta can't see its own PID at runtime. And yes, we could probably check them at configuration time, when pasta is not sandboxed so strictly, but then we would run into the same issue the day we finally implement dynamic port forwarding at runtime.

Looking this up somewhere just sounds racy to me and more complicated. IMO either bind fails or not.

For sure it's more complicated, but it doesn't need to be racy, for example if we bind() and then check that there's a single entry in /proc/net/udp for that port and address matching our PID, that's not racy. But that's not something we can do, either. Still, I think there might be other ways.

bind UDP ports using different (local) addresses so that we can handle traffic differently

But isn't this only true if you want to bind 0.0.0.0 and a specific ip at the same there.

Correct, but see #14425. Sure, we also use SO_BINDTODEVICE there, but if I recall correctly we can't anyway bind twice without SO_REUSEADDR. And regardless of SO_BINDTODEVICE, one might anyway want to use that without an interface.

Perhaps some Podman users who needed #14425 are actually using that feature by binding without an interface given that we haven't added it to Podman's options, yet.

You do not need SO_REUSEADDR to bind 127.0.0.1 and 127.0.0.2 for example. We do not use SO_REUSEADDR when binding ports as root. I don't find a mention of REUSEADDR in slirp4netns or our rootlessport forwarder either and AFAIK we never had any user ask for such behavior so I don't think this important to use REUSEADDR at all. I could see that there are cases for it but shouldn't these be opt-in then?

Sure, it can be a configuration option, but for TCP, it's harmless, so that would be a rather useless configuration option. We could also decide to make it optional just for UDP. On the other hand, what happens if another process uses SO_REUSEADDR? From a quick glance at this beauty in net/ipv4/udp.c:

        sk_for_each(sk2, &hslot->head) {
                if (net_eq(sock_net(sk2), net) &&
                    sk2 != sk &&
                    (bitmap || udp_sk(sk2)->udp_port_hash == num) &&
                    (!sk2->sk_reuse || !sk->sk_reuse) &&
                    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
                     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
                    inet_rcv_saddr_equal(sk, sk2, true)) {
                        if (sk2->sk_reuseport && sk->sk_reuseport &&
                            !rcu_access_pointer(sk->sk_reuseport_cb) &&
                            uid_eq(uid, sock_i_uid(sk2))) {
                                if (!bitmap)
                                        return 0;
                        } else {
                                if (!bitmap)
                                        return 1;
                                __set_bit(udp_sk(sk2)->udp_port_hash >> log,
                                          bitmap);
                        }
                }
        }
        return 0;

I would say that it doesn't matter who uses SO_REUSEADDR. We might succeed to bind() to a UDP port, and yet get absolutely no traffic there, because another process bound to it with SO_REUSEADDR. We should test this I guess. But in that case, does it make a difference, if we fail? We're not really guaranteeing anything in either case.

The fact that SO_REUSEADDR with udp behaves like that seems really odd to me.

Yes, same to me.

Isn't that what SO_REUSEPORT is supposed to allow?

Yes, definitely. We can also try to fix that in the kernel if we can show there's no reason to use SO_REUSEADDR for that and that the risk for https://xkcd.com/1172/ is reasonably low.

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

Sure, it can be a configuration option, but for TCP, it's harmless

Fair enough

what happens if another process uses SO_REUSEADDR?
I would say that it doesn't matter who uses SO_REUSEADDR. We might succeed to bind() to a UDP port, and yet get absolutely no traffic there, because another process bound to it with SO_REUSEADDR. We should test this I guess. But in that case, does it make a difference, if we fail? We're not really guaranteeing anything in either case.

Well easy enough to test, if I bind first without SO_REUSEADDR then the other process cannot bind even when reuseaddr is set so that works like it should.

$ socat -4 UDP-RECV:5000,reuseaddr STDOUT
$ strace -e bind,setsockopt socat -4 UDP-RECV:5000,reuseaddr STDOUT
2024/08/02 16:30:18 socat[352710] W address is opened in read-write mode but only supports read-only
setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
2024/08/02 16:30:18 socat[352710] E bind(5, {AF=2 0.0.0.0:5000}, 16): Address already in use
+++ exited with 1 +++

So overall I would say you should drop SO_REUSEADDR from udp sockets IMO or well fix the kernel if that is even possible without breaking things?

@sbrivio-rh
Copy link
Collaborator

Well easy enough to test, if I bind first without SO_REUSEADDR then the other process cannot bind even when reuseaddr is set so that works like it should.

Hah, thanks, I forgot that socat can set that too.

So overall I would say you should drop SO_REUSEADDR from udp sockets IMO or well fix the kernel if that is even possible without breaking things?

Most likely one of the two, or both, yes.

Should I also take care of fixing the tests, or is this covered somewhere else/by somebody else?

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

Should I also take care of fixing the tests, or is this covered somewhere else/by somebody else?

Which part? Making the test port assignment parallel safe is certainly for us to do and not you, once that is parallel safe and we know we don't get port conflicts then we see how many pasta issues are actually left...

@sbrivio-rh
Copy link
Collaborator

Should I also take care of fixing the tests, or is this covered somewhere else/by somebody else?

Which part?

The part of not picking a port that's bound as free, I guess I added/changed that function, but I think I've seen another issue with a comment of yours about it.

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

20b3f5e which is all WIP in the parallel system test PR, I have not checked the math but clearly it isn't working so it leave that for @edsantiago to fix for now.

Before that we never had any parallel runs so the function didn't had to be parallel safe so that was totally fine.

@Luap99
Copy link
Member

Luap99 commented Aug 5, 2024

Ok with the port issue being fixed I no longer see the quick/easy udp hangs (#23488 (comment))

However there are still hangs observed but via TCP

 ✗ |505| Interface-bound TCP port forwarding, IPv4, tap [130632]
   tags: ci:parallel
   (from function `bail-now' in file test/system/helpers.bash, line 189,
    from function `die' in file test/system/helpers.bash, line 937,
    from function `run_podman' in file test/system/helpers.bash, line 539,
    from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 235,
    in test file test/system/505-networking-pasta.bats, line 513)
     `pasta_test_do' failed
   
   [14:09:12.294751292] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{.Host.Pasta.Executable}}
   [14:09:12.444220819] /usr/bin/pasta
   
   [14:09:12.673952415] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman run --rm --name=c-socat-t29-h9qbb1wq --net=pasta:--tcp-ports,192.168.188.25%enp9s0u2u1u2/5526:5526 -p  quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5526 5526); do                              socat -u TCP4-LISTEN:${port},bind=[192.168.188.25] STDOUT &                          done; wait
   [14:11:22.683250385] timeout: sending signal TERM to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   timeout: sending signal KILL to command ‘/home/pholzing/go/src/github.com/containers/podman/bin/podman’
   [14:11:22.686097848] [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]

I will try to instrument the test with package captures to see what is going on there

@sbrivio-rh
Copy link
Collaborator

So overall I would say you should drop SO_REUSEADDR from udp sockets IMO

We talked about this with @dgibson and there's a potential problem with disabling SO_REUSEADDR for UDP sockets, albeit unlikely to happen.

If a UDP port is forwarded to the container, we'll bind() it with SO_REUSEADDR at start-up, with a wildcard address or with an address given by the user. If, later, a process in the container uses the same UDP port as source, for outbound traffic, as we try to preserve source ports, we'll bind that same port with SO_REUSEADDR with the appropriate, specific address, which succeeds. Not only that: we actually connect() that socket, which is a stronger indication, used by the kernel, to deliver matching traffic to that specific socket, instead of the one configured for port forwarding.

At this point, the traffic for the flow initiated by the container is properly routed over that flow, and any other traffic going to that same port is forwarded as configured instead.

If we drop SO_REUSEADDR, we'll fail to bind() the socket for the outbound flow, and we can then choose to either use another port as source, or drop packets hoping that the process will pick another port as source. In both cases, we'll not be able to preserve the source port. This is not ideal because the port forwarding configuration will determine whether we are able to preserve source ports.

This case is unlikely to happen because the Linux kernel should anyway pick ephemeral ports as source ports, and those shouldn't be forwarded, typically. However, there's no common definition of ephemeral ports: while IANA recommends the 49152–65535 range, Linux will, by default, use the 32768–60999 range instead.

If we decide to stop trying to preserve source ports, that's not an issue anymore, but some users and protocols might rely on that behaviour.

All in all, my favourite option at this point would be to keep SO_REUSEADDR with a note in the man page explaining that we won't detect "bind" conflicts with UDP ports. Let me know if it's fine by you.

Note by the way that, unlike what happens with TCP, the conflict might even be temporary, because we can succeed binding to a port with the same address as bound by an existing process, and we'll actually start getting packets once this other process terminates. If that's the intended operation in a given use case, reporting a conflict might actually be a nuisance.

@Luap99
Copy link
Member

Luap99 commented Aug 5, 2024

All in all, my favourite option at this point would be to keep SO_REUSEADDR with a note in the man page explaining that we won't detect "bind" conflicts with UDP ports. Let me know if it's fine by you.

I don't like this. A podman users expects us to report a conflict. A user should not be allowed to bind the same port twice by default.


Note by the way that, unlike what happens with TCP, the conflict might even be temporary, because we can succeed binding to a port with the same address as bound by an existing process, and we'll actually start getting packets once this other process terminates. If that's the intended operation in a given use case, reporting a conflict might actually be a nuisance.

The port logic was fixed. There should be no conflicts in the tests anymore but we are still seeing hangs even when using unique ports. Actually I just got a hang with udp too and I really don't think the current port logic is buggy and gives use the same port twice 7c3d294

Something else is wrong either with pasta or the tests itself. I am still trying to properly setup package captures on the host.

@sbrivio-rh
Copy link
Collaborator

All in all, my favourite option at this point would be to keep SO_REUSEADDR with a note in the man page explaining that we won't detect "bind" conflicts with UDP ports. Let me know if it's fine by you.

I don't like this. A podman users expects us to report a conflict. A user should not be allowed to bind the same port twice by default.

I get this, I don't like it either, but what do you propose to do if the source port of an outbound flow clashes with a forwarded port? We can:

  1. drop that outbound flow, and hope that the originating process tries again
  2. pick a different source port, just for that flow
  3. avoid trying to preserve source ports in general

All of these look much worse than the alternative to me: 1. might break UDP connectivity altogether, 2. might break some protocols and use cases, and it's rather unpredictable and subtle, 3. might also even more protocols and use cases, while being less subtle.

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

It's slightly racy as well (even though we don't know if there's no race without SO_REUSEADDR anyway, between two processes calling bind() at just around the same time, because the socket is not really bound anyway), and it might also break use cases similar to #14425, but at least we won't have an issue with preserving source ports if they conflict with forwarded ports.

Yet another alternative could be to try and "fix" this in the kernel, first. If that fails, we'll evaluate again what to do in pasta. If we can fix this in the kernel, that would be the best outcome as we'll then be able to report true conflicts only, but it's probably going to take a significant amount of time.


Actually I just got a hang with udp too and I really don't think the current port logic is buggy and gives use the same port twice 7c3d294

Something else is wrong either with pasta or the tests itself. I am still trying to properly setup package captures on the host.

Ouch. What might be more likely to happen in a parallelised setup is that the server isn't ready as we start the client. There's no way to detect that socat is ready, except for checking procfs entries, so right now we have those delays (which I still have to replace with something else in order to speed up tests).

To see if that's the case, you can try to increase those delays even further, I guess. Eventually, I plan to replace them with a loop on /proc/net/udp{,v6} or something of that sort.

@Luap99
Copy link
Member

Luap99 commented Aug 5, 2024

What happens today if the source port is already bound on the host and you fail to bind? 1 or 2?

  1. drop that outbound flow, and hope that the originating process tries again

  2. pick a different source port, just for that flow

  3. avoid trying to preserve source ports in general

I am not sure if feasible but is there a 4) use the socket bound for the port forwarding so send outgoing package? Of course that would only make sense if the port mapping is 1 to 1 for the port numbers, mapping something else like -p 10:5/udp would also cause clashes if the container started sending from port 10 as the host side could not know if the inbound should go to port 5 or 10.

I think 3 is definitely the worst option. There are certainly protocols that need the port information preserved to work. But also I agree with your other points.

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

This sounds like a good compromise to me.

Yet another alternative could be to try and "fix" this in the kernel, first. If that fails, we'll evaluate again what to do in pasta. If we can fix this in the kernel, that would be the best outcome as we'll then be able to report true conflicts only, but it's probably going to take a significant amount of time.

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

@sbrivio-rh
Copy link
Collaborator

What happens today if the source port is already bound on the host and you fail to bind? 1 or 2?

  1. drop that outbound flow, and hope that the originating process tries again
  2. pick a different source port, just for that flow
  3. avoid trying to preserve source ports in general

I am not sure if feasible but is there a 4) use the socket bound for the port forwarding so send outgoing package?

Heh, no, we can't, because outbound sockets need to be connect()ed so that we know which flow they belong to (without an extra lookup table, at least), see the theory of operation for reply sockets.

I think 3 is definitely the worst option. There are certainly protocols that need the port information preserved to work. But also I agree with your other points.

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

This sounds like a good compromise to me.

Okay, good... we can also keep this one as fallback plan if it turns out we can't fix the kernel.

Yet another alternative could be to try and "fix" this in the kernel, first. If that fails, we'll evaluate again what to do in pasta. If we can fix this in the kernel, that would be the best outcome as we'll then be able to report true conflicts only, but it's probably going to take a significant amount of time.

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads).

@sbrivio-rh
Copy link
Collaborator

What happens today if the source port is already bound on the host and you fail to bind? 1 or 2?

Oops, I missed this question. We do 1. for "spliced" (local) sockets. For the general (non-local) case, I'm actually lost regarding the new behaviour... @dgibson?

@dgibson
Copy link
Collaborator

dgibson commented Aug 6, 2024

Sorry, finally catching up on all the updates here. Replying to assorted points.

What happens today if the source port is already bound on the host and you fail to bind? 1 or 2?

Oops, I missed this question. We do 1. for "spliced" (local) sockets. For the general (non-local) case, I'm actually lost regarding the new behaviour... @dgibson?

Still (1); in udp_flow_new() we'll cancel the flow if we're unable to create or bind the outgoing socket. This is, I think, an argument of doing (1) for the case where we conflict with one of our own listens as well: it will behave just the same as if something else on the host bound the source port we want, and that's not something we can prevent.

I am not sure if feasible but is there a 4) use the socket bound for the port forwarding so send outgoing package?

Heh, no, we can't, because outbound sockets need to be connect()ed so that we know which flow they belong to (without an extra lookup table, at least), see the theory of operation for reply sockets.

Well.. we could maybe do this (my earlier designs for UDP flow handling did something like it). It's a pain to implement though, because we need a whole extra hash table to look up if we have an existing socket bound to a suitable address for outgoing packets. Plus, it introduces a different nasty edge case: the existing socket could be bound to a wildcard address, but we have a specific source address we're supposed to be using for our outgoing packets (e.g. because of -o). In that case we need extra trickery to get the correct source address. I think it's possible using the IP_PKTINFO control message, but again it's a real pain.

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads).

Actually we could do that. If you have a port forward on a specific address, you could still get an unrelated flow trying to use the same source port and source address for an outgiong send. So with the kernel change this would again fail.

@dgibson
Copy link
Collaborator

dgibson commented Aug 6, 2024

For the short to medium term, I'm really inclined towards option (1). Yes it has an ugly edge case, but that doesn't do anything that we couldn't already hit if the guest's source port conflicts with some unrelated process on the host.

We could implement (2) or (3) on top of that in future, if it seems necessary.

Regarding (3), I'm not that sure that preserving the source port is as big a win as it seems. It's true that there are UDP protocols that require specific source ports - however, a number of those also have non-standard requirements about how the addresses are used that already won't fit into the existing flow model. Doing (3) for "normal" UDP flows, doesn't preclude us from implementing special case handling for specific protocols.

@sbrivio-rh
Copy link
Collaborator

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads).

Actually we could do that. If you have a port forward on a specific address, you could still get an unrelated flow trying to use the same source port and source address for an outgiong send. So with the kernel change this would again fail.

Well, but the reply socket is connect()ed in that case, the "listening" socket isn't, and we could keep that distinction in the kernel fix.

For the short to medium term, I'm really inclined towards option (1). Yes it has an ugly edge case, but that doesn't do anything that we couldn't already hit if the guest's source port conflicts with some unrelated process on the host.

That's not such a big difference for the typical usage of pasta with Podman, probably, but it makes a big difference for common usages of passt and KubeVirt where all non-ephemeral ports in a pod are forwarded by passt (and no other ports are bound by other processes in that network namespace).

There's another thing we break in any case if we drop SO_REUSEADDR: the possibility of forwarding the same port while binding to two different addresses.

We could implement (2) or (3) on top of that in future, if it seems necessary.

Regarding (3), I'm not that sure that preserving the source port is as big a win as it seems. It's true that there are UDP protocols that require specific source ports - however, a number of those also have non-standard requirements about how the addresses are used that already won't fit into the existing flow model. Doing (3) for "normal" UDP flows, doesn't preclude us from implementing special case handling for specific protocols.

On the other hand, we probably won't want to add special handling for specific protocols (application-level gateways or suchlike), it's quite hard to get those right and they are usually one of the main sources of security issues (just look at Slirp). If we strive to keep network transparency wherever possible, it's quite likely that we won't ever need ALGs (except for the minimal DNS forwarding implementation we already have).

Do you see any problem with this alternative by the way:

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

?

@dgibson
Copy link
Collaborator

dgibson commented Aug 6, 2024

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads).

Actually we could do that. If you have a port forward on a specific address, you could still get an unrelated flow trying to use the same source port and source address for an outgiong send. So with the kernel change this would again fail.

Well, but the reply socket is connect()ed in that case, the "listening" socket isn't, and we could keep that distinction in the kernel fix.

Nope. Because you have to bind() it before you can connect() it, and since we'll fail the bind, we won't reach the connect.

For the short to medium term, I'm really inclined towards option (1). Yes it has an ugly edge case, but that doesn't do anything that we couldn't already hit if the guest's source port conflicts with some unrelated process on the host.

That's not such a big difference for the typical usage of pasta with Podman, probably, but it makes a big difference for common usages of passt and KubeVirt where all non-ephemeral ports in a pod are forwarded by passt (and no other ports are bound by other processes in that network namespace).

Ah, hrm. Yeah, I forgot about -u all. That is a bit of a nasty case.

There's another thing we break in any case if we drop SO_REUSEADDR: the possibility of forwarding the same port while binding to two different addresses.

Well, only the case where one of the binds is a wildcard and the other is not. And arguably in that case reporting the conflict makes sense.

We could implement (2) or (3) on top of that in future, if it seems necessary.
Regarding (3), I'm not that sure that preserving the source port is as big a win as it seems. It's true that there are UDP protocols that require specific source ports - however, a number of those also have non-standard requirements about how the addresses are used that already won't fit into the existing flow model. Doing (3) for "normal" UDP flows, doesn't preclude us from implementing special case handling for specific protocols.

On the other hand, we probably won't want to add special handling for specific protocols (application-level gateways or suchlike), it's quite hard to get those right and they are usually one of the main sources of security issues (just look at Slirp). If we strive to keep network transparency wherever possible, it's quite likely that we won't ever need ALGs (except for the minimal DNS forwarding implementation we already have).

Yeah, I guess that's true. I wasn't thinking of ALGs per se, but of special flow types which have different rules about port allocation. "Spliced" connections are already kind of an example of this.

Do you see any problem with this alternative by the way:

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

?

Theoretically racy, but I guess it's probably ok. When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port.

@sbrivio-rh
Copy link
Collaborator

sbrivio-rh commented Aug 6, 2024

Ok but wouldn't fixing this in the kernel mean you get broken in the ways described above if the bind call now all of the sudden fails?

No, because all the kernel should reject, with SO_REUSEPORT, is two bind() calls on the same port, same exact address, of non-connected sockets. We never do that in pasta (within a single process), and I suppose nobody intentionally does it either (they would use SO_REUSEPORT if they wanted to distribute packets between threads).

Actually we could do that. If you have a port forward on a specific address, you could still get an unrelated flow trying to use the same source port and source address for an outgiong send. So with the kernel change this would again fail.

Well, but the reply socket is connect()ed in that case, the "listening" socket isn't, and we could keep that distinction in the kernel fix.

Nope. Because you have to bind() it before you can connect() it, and since we'll fail the bind, we won't reach the connect.

Ouch, right.

Do you see any problem with this alternative by the way:

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

?

Theoretically racy, but I guess it's probably ok.

I wouldn't be so sure that the non-SO_REUSEADDR case is not racy, by the way.

When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port.

Well, in that case we'll have an actual conflict and I guess it's good that we block that.

In any case, I filed https://bugs.passt.top/show_bug.cgi?id=92 for this, so that we don't spam this ticket (which is about the test cases) too much.

@dgibson
Copy link
Collaborator

dgibson commented Aug 7, 2024

Do you see any problem with this alternative by the way:

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

?

Theoretically racy, but I guess it's probably ok.

I wouldn't be so sure that the non-SO_REUSEADDR case is not racy, by the way.

A race between what and what are you thinking of?

When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port.

Well, in that case we'll have an actual conflict and I guess it's good that we block that.

I don't see that that conflict is any more "actual" than the ones we'd block with option (1). In this case the reply socket is already connect()ed, and so would not meaningfully conflict with the new forwarding socket.

In any case, I filed https://bugs.passt.top/show_bug.cgi?id=92 for this, so that we don't spam this ticket (which is about the test cases) too much.

Ok.

@sbrivio-rh
Copy link
Collaborator

Do you see any problem with this alternative by the way:

Another possibility I was thinking about: try to bind() without SO_REUSEADDR. If that fails, we report that and exit. If that succeeds, we close the socket, open another one, and bind with SO_REUSEADDR.

?

Theoretically racy, but I guess it's probably ok.

I wouldn't be so sure that the non-SO_REUSEADDR case is not racy, by the way.

A race between what and what are you thinking of?

Between bind() and bind() to the same port, same address, without SO_REUSEADDR, or with SO_REUSEADDR on one socket only.

If the reason why we can bind to the same port and address with SO_REUSEADDR is that the port isn't really "bound" in every sense, then we can't exclude there are other issues in that sense, I guess.

When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port.

Well, in that case we'll have an actual conflict and I guess it's good that we block that.

I don't see that that conflict is any more "actual" than the ones we'd block with option (1). In this case the reply socket is already connect()ed, and so would not meaningfully conflict with the new forwarding socket.

Hmm, right.

By the way, I missed one obvious, important fact: we pre-open the procfs entries with information about bound ports, so we could always access them. It's not practical to read out the whole file every time we create a "reply" socket, but we don't need any additional check on that path. We could simply read the file whenever we bind a UDP port for inbound forwarding. How does that sound?

@dgibson
Copy link
Collaborator

dgibson commented Aug 8, 2024

I wouldn't be so sure that the non-SO_REUSEADDR case is not racy, by the way.

A race between what and what are you thinking of?

Between bind() and bind() to the same port, same address, without SO_REUSEADDR, or with SO_REUSEADDR on one socket only.

Still not really seeing the scenario you're getting at. Without dynamic forwarding the "listening" binds always go first, so there's no self-race. There's a race between us opening a reply socket and some other process opening a conflicting one without REUSEADDR, but that's outside our control anyway.

With dynamic forwarding there is certainly a race, but no more so AFAICT than with the proposal below.

If the reason why we can bind to the same port and address with SO_REUSEADDR is that the port isn't really "bound" in every sense, then we can't exclude there are other issues in that sense, I guess.

Not really sure what you mean by that.

When we implement dynamic port forwarding we could get the reversed problem: a temporary flow could block binding a new forwarding port.

Well, in that case we'll have an actual conflict and I guess it's good that we block that.

I don't see that that conflict is any more "actual" than the ones we'd block with option (1). In this case the reply socket is already connect()ed, and so would not meaningfully conflict with the new forwarding socket.

Hmm, right.

By the way, I missed one obvious, important fact: we pre-open the procfs entries with information about bound ports, so we could always access them. It's not practical to read out the whole file every time we create a "reply" socket, but we don't need any additional check on that path. We could simply read the file whenever we bind a UDP port for inbound forwarding. How does that sound?

I guess the key difference from the open without REUSEADDR, then open with, is that we don't have a window where our initial !REUSEADDR binding could mess with other things. It does also mean that we check some extra information in the /proc file for more nuance. I think it works.

@Luap99 Luap99 self-assigned this Aug 9, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 9, 2024

I have patch locally based on #23275 to convert the test to explicit port bound checks in the container instead of the sleep delay approach currently. In the non parallel test case it is already much much faster in total test time for the file. I have the parallel case running in a loop now to see if this fixes the cause of the hangs here as well.
Before it failed after like 15 mins locally so if it is still running after 30 mins I call it a success and submit a PR.

Luap99 added a commit to Luap99/libpod that referenced this issue Aug 9, 2024
Do not rely on an arbitrary delat in order to ensure the port was bound
in the container. Insted this approach checks if the port is bound in
the netns and only then starts the client. This speeds up the entire
test file by 50% but more importanly in parallel testing it solves hangs
as the timeout there was unreliable.

Fixes containers#23471

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 added a commit to Luap99/libpod that referenced this issue Aug 12, 2024
Do not rely on an arbitrary delay in order to ensure the port was bound
in the container. Instead this approach checks if the port is bound in
the netns and only then starts the client. This speeds up the entire
test file by 50% but more importantly in parallel testing it solves
hangs as the timeout there was unreliable.

Fixes containers#23471

Signed-off-by: Paul Holzinger <[email protected]>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 12, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. pasta pasta(1) bugs or features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants