-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test/system: pasta_test_do add explicit port check #23561
Conversation
@edsantiago @sbrivio-rh PTAL |
Hah, it's similar to what I was drafting (but I was nowhere near the point you are, that is, ...completion). Using Review coming in a bit. |
sh -c 'for port in $(seq '"${xseq}"'); do '\ | ||
' socat -u '"${bind}"' '"${recv}"' & '\ | ||
' done; wait' | ||
# Wait for the client childs to finish. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"children" (those weird plurals that look like German :D)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... maybe I say "kids" then :)
Will fix that up on the next push
I used podman unshare nsenter to join only the users+netns so I can use the hosts executables. But well that doesn't seem to work because the podman unshare is not supported via remote client. But sounds like you are right I Can us ss in the image which allows me to convert that to podman exec which works via remote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github is telling me: "This diff is outdated, please refresh and try again." and it doesn't let me add any further comment, but I was in the middle of the review. So here comes a first part of it.
# With Podman up, and socat still starting, UDP clients send to nowhere | ||
delay=2 | ||
fi | ||
# start server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Start server
run_podman container inspect --format "{{.NetworkSettings.SandboxKey}}" $cname | ||
netns="$output" | ||
|
||
# Make sure all ports in the contianer are bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
container
for cport in $(seq ${xseq}); do | ||
retries=50 | ||
while [[ $retries -gt 0 ]]; do | ||
run_podman unshare nsenter -n"$netns" ss -Htuln -$ip_ver sport = $cport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is robust enough in practice: you pick both UDP sockets and TCP sockets. What happens if a UDP port, with the same number as the TCP port we're using for a TCP test, happens to be bound, while our TCP port isn't?
I guess you could do, before this, something like [ "${proto}" = "tcp" ] && ss_opts="Htln" || ss_opts="Huln"
.
Or even, above, where proto_upper
is assigned, proto_short="$(expr substr "${proto}" 1 1)"
, and then ss -H"${proto_short}"ln
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I ignored that case because there doesn't seem to be any test using both tcp and udp so this isn't a concern in practise. I find that easier than adding extra conditions to filter for tcp/udp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but this is on the "fast path", and ss
isn't always cheap, because it has to fetch the whole procfs entry:
$ unshare -rUn
# time for i in $(seq 1 1000); do ss -Htuln -4 sport = 80 >/dev/null; done
real 0m10.573s
user 0m0.248s
sys 0m10.356s
# time for i in $(seq 1 1000); do ss -Huln -4 sport = 80 >/dev/null; done
real 0m1.178s
user 0m0.869s
sys 0m0.343s
# time for i in $(seq 1 1000); do ss -Htln -4 sport = 80 >/dev/null; done
real 0m10.576s
user 0m0.318s
sys 0m10.289s
And it's not negligible either for 50 retries:
# time for i in $(seq 1 50); do ss -Htln -4 sport = 80 >/dev/null; done
real 0m0.522s
user 0m0.022s
sys 0m0.502s
# time for i in $(seq 1 50); do ss -Huln -4 sport = 80 >/dev/null; done
real 0m0.073s
user 0m0.048s
sys 0m0.027s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ss doesn't fetch anything from procfs? It uses netlink (NETLINK_SOCK_DIAG) to get the sockets.
Ok I agree if there are timing benefits, actually no need for special parsing in the test I can add --$proto instead of using the short option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah, right, it uses NETLINK_SOCK_DIAG
for everything as long as tcp_show_netlink()
succeeds. Still, the cost doesn't come from the interface, but rather from the kernel preparing the TCP socket table dump (UDP is somewhat more straightforward).
retries=50 | ||
while [[ $retries -gt 0 ]]; do | ||
run_podman unshare nsenter -n"$netns" ss -Htuln -$ip_ver sport = $cport | ||
if [[ $output =~ $cport ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, this isn't working on IPv6 "tap" tests. It works for IPv4, and for IPv6 loopback tests. I'm not really familiar with non-POSIX shell syntax, so here comes the relevant output of one failing test as example:
✗ [505] Single UDP port forwarding, IPv6, tap
(from function `run_podman' in file helpers.bash, line 527,
from function `pasta_test_do' in file 505-networking-pasta.bats, line 247,
in test file 505-networking-pasta.bats, line 634)
`pasta_test_do' failed
[16:59:47.268721829] $ podman info --format {{.Host.Pasta.Executable}}
[16:59:48.118448097] /usr/bin/pasta
[16:59:48.179590796] $ podman run -d --name=c-socat-t13-1bowvp8k --net=pasta -p [2a01:4f8:222:904::2]:5095:5095/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5095 5095); do socat -u UDP6-LISTEN:${port},null-eof STDOUT & done; wait
[16:59:48.291610727] d07505ed44b255ca84dff14d3fd0f7c527fb2c93bbcafc1f8aebb320686cb9e8
[16:59:48.296102453] $ podman container inspect --format {{.NetworkSettings.SandboxKey}} c-socat-t13-1bowvp8k
[16:59:48.325410606] /run/user/1000/netns/netns-760759d3-9a23-882f-bb67-63c86c556a40
[16:59:48.332287966] $ podman unshare nsenter -n/run/user/1000/netns/netns-760759d3-9a23-882f-bb67-63c86c556a40 ss -Htuln -6 sport = 5095
[16:59:48.398745612] udp UNCONN 0 0 *:5095 *:*
[16:59:48.411588810] $ podman logs --follow c-socat-t13-1bowvp8k
[17:01:48.439216198] timeout: sending signal TERM to command ‘podman’
[17:01:48.442469452] [ rc=124 (** EXPECTED 0 **) ]
*** TIMED OUT ***
# [teardown]
perhaps you could simply check if the output is non-empty [ -n "${output}" ] && break
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that I didn't quote the vars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if I quote them ("$output"
and "$cport"
):
✗ [505] Single TCP port forwarding, IPv6, tap
(from function `run_podman' in file helpers.bash, line 527,
from function `pasta_test_do' in file 505-networking-pasta.bats, line 247,
in test file 505-networking-pasta.bats, line 534)
`pasta_test_do' failed
[17:26:22.335618426] $ podman info --format {{.Host.Pasta.Executable}}
[17:26:23.199697073] /usr/bin/pasta
[17:26:23.290020913] $ podman run -d --name=c-socat-t1-wk8cq5z3 --net=pasta -p [2a01:4f8:222:904::2]:5782:5782/tcp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5782 5782); do socat -u TCP6-LISTEN:${port} STDOUT & done; wait
[17:26:23.398206770] 557a806bc97e4d61782e6fd20f58aef75085a7723b938ed94144d53595879cb5
[17:26:23.402789587] $ podman container inspect --format {{.NetworkSettings.SandboxKey}} c-socat-t1-wk8cq5z3
[17:26:23.473008406] /run/user/1000/netns/netns-8f9ed356-7b29-dd1e-af2a-e905164d429e
[17:26:23.479608036] $ podman unshare nsenter -n/run/user/1000/netns/netns-8f9ed356-7b29-dd1e-af2a-e905164d429e ss -Htuln -6 sport = 5782
[17:26:23.514649930] tcp LISTEN 0 5 *:5782 *:*
[17:26:23.527285450] $ podman logs --follow c-socat-t1-wk8cq5z3
[17:28:23.551330353] timeout: sending signal TERM to command ‘podman’
[17:28:23.554501527] [ rc=124 (** EXPECTED 0 **) ]
*** TIMED OUT ***
# [teardown]
by the way, I'm not sure if the assert
for the 5-second timeout is actually working: the test times out after approximately two minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[ -n "${output}" ] && break
doesn't work either, actually :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot test this on my v6 setup it seems the test still require a public routeable address even though pasta doesn't anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so that means we still need to work around here until it gets fixed on your side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's not a workaround and it wouldn't be a fix :) ...but yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Luap99 with Fedora's passt-0^20240821.g1d6142f-1.fc40
, passt-0^20240821.g1d6142f-1.fc41
, passt-0^20240821.g1d6142f-1.fc42
, and Debian's passt-0.0~git20240814.61c0b0d-1
, we can drop the additional delay here. I can also submit a pull request once the CI images are updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of yesterday pasta was 2024-08-14 on debian, f40, and rawhide; but still 06-24 on f39. I've just +1'ed its bodhi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but still 06-24 on f39
Oops, I kept pushing updates and obsoleting the previous ones. At least I avoided that this time.
Oh, I missed this "subtlety" altogether.
I'm happy my uninformed comment led to something useful. :) By the way, it was also like that in the approach I attempted (but simply because I didn't know I could easily run host stuff in the namespace -- I was going through procfs entries with |
306df3c
to
58141d0
Compare
Do not rely on an arbitrary delay in order to ensure the port was bound in the container. Instead this approach checks if the port is bound in the netns and only then starts the client. This speeds up the entire test file by 50% but more importantly in parallel testing it solves hangs as the timeout there was unreliable. Fixes containers#23471 Signed-off-by: Paul Holzinger <[email protected]>
58141d0
to
20f3e89
Compare
This is one of the flakes I've been seeing but I don't think I've filed yet. f39 EDIT: no, I did file it, #23482
|
I have not seen this locally running running over one hour but I don't think it is related to this here. |
Finally got around to conflict-resolving this onto my bats-parallel. I've seen no failures either in a long while loop. LGTM. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Luap99, sbrivio-rh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
…erface It makes no sense for a container or a guest to try and perform duplicate address detection for their link-local address, as we'll anyway not relay neighbour solicitations with an unspecified source address. While they perform duplicate address detection, the link-local address is not usable, which prevents us from bringing up especially containers and communicate with them right away via IPv6. This is not enough to prevent DAD and reach the container right away: we'll need a couple more patches. As we send NLM_F_REPLACE requests right away, while we still have to read out other addresses on the same socket, we can't use nl_do(): keep track of the last sequence we sent (last address we changed), and deal with the answers to those NLM_F_REPLACE requests in a separate loop, later. Link: containers/podman#23561 (comment) Signed-off-by: Stefano Brivio <[email protected]> Reviewed-by: David Gibson <[email protected]>
Do not rely on an arbitrary delay in order to ensure the port was bound in the container. Instead this approach checks if the port is bound in the netns and only then starts the client. This speeds up the entire test file by 50% but more importantly in parallel testing it solves hangs as the timeout there was unreliable.
Fixes #23471
Does this PR introduce a user-facing change?