Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/system: pasta_test_do add explicit port check #23561

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 9, 2024

Do not rely on an arbitrary delay in order to ensure the port was bound in the container. Instead this approach checks if the port is bound in the netns and only then starts the client. This speeds up the entire test file by 50% but more importantly in parallel testing it solves hangs as the timeout there was unreliable.

Fixes #23471

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 9, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 9, 2024

@edsantiago @sbrivio-rh PTAL

@sbrivio-rh
Copy link
Collaborator

Hah, it's similar to what I was drafting (but I was nowhere near the point you are, that is, ...completion). Using ss is great, I didn't know it was available in the images and I was trying to come up with decent helpers for /proc/net/tcp and friends.

Review coming in a bit.

sh -c 'for port in $(seq '"${xseq}"'); do '\
' socat -u '"${bind}"' '"${recv}"' & '\
' done; wait'
# Wait for the client childs to finish.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"children" (those weird plurals that look like German :D)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... maybe I say "kids" then :)
Will fix that up on the next push

@Luap99
Copy link
Member Author

Luap99 commented Aug 9, 2024

Hah, it's similar to what I was drafting (but I was nowhere near the point you are, that is, ...completion). Using ss is great, I didn't know it was available in the images and I was trying to come up with decent helpers for /proc/net/tcp and friends.

I used podman unshare nsenter to join only the users+netns so I can use the hosts executables. But well that doesn't seem to work because the podman unshare is not supported via remote client. But sounds like you are right I Can us ss in the image which allows me to convert that to podman exec which works via remote

Copy link
Collaborator

@sbrivio-rh sbrivio-rh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github is telling me: "This diff is outdated, please refresh and try again." and it doesn't let me add any further comment, but I was in the middle of the review. So here comes a first part of it.

# With Podman up, and socat still starting, UDP clients send to nowhere
delay=2
fi
# start server
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Start server

run_podman container inspect --format "{{.NetworkSettings.SandboxKey}}" $cname
netns="$output"

# Make sure all ports in the contianer are bound
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

container

for cport in $(seq ${xseq}); do
retries=50
while [[ $retries -gt 0 ]]; do
run_podman unshare nsenter -n"$netns" ss -Htuln -$ip_ver sport = $cport
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is robust enough in practice: you pick both UDP sockets and TCP sockets. What happens if a UDP port, with the same number as the TCP port we're using for a TCP test, happens to be bound, while our TCP port isn't?

I guess you could do, before this, something like [ "${proto}" = "tcp" ] && ss_opts="Htln" || ss_opts="Huln".

Or even, above, where proto_upper is assigned, proto_short="$(expr substr "${proto}" 1 1)", and then ss -H"${proto_short}"ln ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I ignored that case because there doesn't seem to be any test using both tcp and udp so this isn't a concern in practise. I find that easier than adding extra conditions to filter for tcp/udp

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but this is on the "fast path", and ss isn't always cheap, because it has to fetch the whole procfs entry:

$ unshare -rUn
# time for i in $(seq 1 1000); do ss -Htuln -4 sport = 80 >/dev/null; done

real	0m10.573s
user	0m0.248s
sys	0m10.356s
# time for i in $(seq 1 1000); do ss -Huln -4 sport = 80 >/dev/null; done

real	0m1.178s
user	0m0.869s
sys	0m0.343s
# time for i in $(seq 1 1000); do ss -Htln -4 sport = 80 >/dev/null; done

real	0m10.576s
user	0m0.318s
sys	0m10.289s

And it's not negligible either for 50 retries:

# time for i in $(seq 1 50); do ss -Htln -4 sport = 80 >/dev/null; done

real	0m0.522s
user	0m0.022s
sys	0m0.502s
# time for i in $(seq 1 50); do ss -Huln -4 sport = 80 >/dev/null; done

real	0m0.073s
user	0m0.048s
sys	0m0.027s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ss doesn't fetch anything from procfs? It uses netlink (NETLINK_SOCK_DIAG) to get the sockets.

Ok I agree if there are timing benefits, actually no need for special parsing in the test I can add --$proto instead of using the short option.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, right, it uses NETLINK_SOCK_DIAG for everything as long as tcp_show_netlink() succeeds. Still, the cost doesn't come from the interface, but rather from the kernel preparing the TCP socket table dump (UDP is somewhat more straightforward).

test/system/505-networking-pasta.bats Outdated Show resolved Hide resolved
retries=50
while [[ $retries -gt 0 ]]; do
run_podman unshare nsenter -n"$netns" ss -Htuln -$ip_ver sport = $cport
if [[ $output =~ $cport ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, this isn't working on IPv6 "tap" tests. It works for IPv4, and for IPv6 loopback tests. I'm not really familiar with non-POSIX shell syntax, so here comes the relevant output of one failing test as example:

 ✗ [505] Single UDP port forwarding, IPv6, tap
   (from function `run_podman' in file helpers.bash, line 527,
    from function `pasta_test_do' in file 505-networking-pasta.bats, line 247,
    in test file 505-networking-pasta.bats, line 634)
     `pasta_test_do' failed
   
   [16:59:47.268721829] $ podman info --format {{.Host.Pasta.Executable}}
   [16:59:48.118448097] /usr/bin/pasta
   
   [16:59:48.179590796] $ podman run -d --name=c-socat-t13-1bowvp8k --net=pasta -p [2a01:4f8:222:904::2]:5095:5095/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5095 5095); do                              socat -u UDP6-LISTEN:${port},null-eof STDOUT &                          done; wait
   [16:59:48.291610727] d07505ed44b255ca84dff14d3fd0f7c527fb2c93bbcafc1f8aebb320686cb9e8
   
   [16:59:48.296102453] $ podman container inspect --format {{.NetworkSettings.SandboxKey}} c-socat-t13-1bowvp8k
   [16:59:48.325410606] /run/user/1000/netns/netns-760759d3-9a23-882f-bb67-63c86c556a40
   
   [16:59:48.332287966] $ podman unshare nsenter -n/run/user/1000/netns/netns-760759d3-9a23-882f-bb67-63c86c556a40 ss -Htuln -6 sport = 5095
   [16:59:48.398745612] udp UNCONN 0      0      *:5095 *:*
   
   [16:59:48.411588810] $ podman logs --follow c-socat-t13-1bowvp8k
   [17:01:48.439216198] timeout: sending signal TERM to command ‘podman’
   [17:01:48.442469452] [ rc=124 (** EXPECTED 0 **) ]
   *** TIMED OUT ***
   # [teardown]

perhaps you could simply check if the output is non-empty [ -n "${output}" ] && break instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that I didn't quote the vars

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if I quote them ("$output" and "$cport"):

 ✗ [505] Single TCP port forwarding, IPv6, tap
   (from function `run_podman' in file helpers.bash, line 527,
    from function `pasta_test_do' in file 505-networking-pasta.bats, line 247,
    in test file 505-networking-pasta.bats, line 534)
     `pasta_test_do' failed
   
   [17:26:22.335618426] $ podman info --format {{.Host.Pasta.Executable}}
   [17:26:23.199697073] /usr/bin/pasta
   
   [17:26:23.290020913] $ podman run -d --name=c-socat-t1-wk8cq5z3 --net=pasta -p [2a01:4f8:222:904::2]:5782:5782/tcp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5782 5782); do                              socat -u TCP6-LISTEN:${port} STDOUT &                          done; wait
   [17:26:23.398206770] 557a806bc97e4d61782e6fd20f58aef75085a7723b938ed94144d53595879cb5
   
   [17:26:23.402789587] $ podman container inspect --format {{.NetworkSettings.SandboxKey}} c-socat-t1-wk8cq5z3
   [17:26:23.473008406] /run/user/1000/netns/netns-8f9ed356-7b29-dd1e-af2a-e905164d429e
   
   [17:26:23.479608036] $ podman unshare nsenter -n/run/user/1000/netns/netns-8f9ed356-7b29-dd1e-af2a-e905164d429e ss -Htuln -6 sport = 5782
   [17:26:23.514649930] tcp LISTEN 0      5      *:5782 *:*
   
   [17:26:23.527285450] $ podman logs --follow c-socat-t1-wk8cq5z3
   [17:28:23.551330353] timeout: sending signal TERM to command ‘podman’
   [17:28:23.554501527] [ rc=124 (** EXPECTED 0 **) ]
   *** TIMED OUT ***
   # [teardown]

by the way, I'm not sure if the assert for the 5-second timeout is actually working: the test times out after approximately two minutes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ -n "${output}" ] && break doesn't work either, actually :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot test this on my v6 setup it seems the test still require a public routeable address even though pasta doesn't anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so that means we still need to work around here until it gets fixed on your side

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not a workaround and it wouldn't be a fix :) ...but yes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Luap99 with Fedora's passt-0^20240821.g1d6142f-1.fc40, passt-0^20240821.g1d6142f-1.fc41, passt-0^20240821.g1d6142f-1.fc42, and Debian's passt-0.0~git20240814.61c0b0d-1, we can drop the additional delay here. I can also submit a pull request once the CI images are updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of yesterday pasta was 2024-08-14 on debian, f40, and rawhide; but still 06-24 on f39. I've just +1'ed its bodhi.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but still 06-24 on f39

Oops, I kept pushing updates and obsoleting the previous ones. At least I avoided that this time.

@sbrivio-rh
Copy link
Collaborator

I used podman unshare nsenter to join only the users+netns so I can use the hosts executables.

Oh, I missed this "subtlety" altogether.

But well that doesn't seem to work because the podman unshare is not supported via remote client. But sounds like you are right I Can us ss in the image which allows me to convert that to podman exec which works via remote

I'm happy my uninformed comment led to something useful. :)

By the way, it was also like that in the approach I attempted (but simply because I didn't know I could easily run host stuff in the namespace -- I was going through procfs entries with grep).

@Luap99 Luap99 force-pushed the test-pasta-port branch 2 times, most recently from 306df3c to 58141d0 Compare August 9, 2024 15:47
Do not rely on an arbitrary delay in order to ensure the port was bound
in the container. Instead this approach checks if the port is bound in
the netns and only then starts the client. This speeds up the entire
test file by 50% but more importantly in parallel testing it solves
hangs as the timeout there was unreliable.

Fixes containers#23471

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Copy link
Member

edsantiago commented Aug 12, 2024

This is one of the flakes I've been seeing but I don't think I've filed yet. f39 EDIT: no, I did file it, #23482

<+060ms> # $ podman logs --follow c-socat-t612-o9ooo3we
<+155ms> # af4739753578c80b5f9f151c3c2db3fe  -
         # 2024/08/12 12:24:36 socat[3] W waitpid(0, 0x7ffd6999885c, 0): No child process
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: Mismatch between data sent and received
         # #| expected: = af4739753578c80b5f9f151c3c2db3fe  -
         # #|   actual:   af4739753578c80b5f9f151c3c2db3fe  -
         # #|         >   2024/08/12 12:24:36 socat\[3\] W waitpid\(0\, 0x7ffd6999885c\, 0\): No child process
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@Luap99
Copy link
Member Author

Luap99 commented Aug 12, 2024

#23482

I have not seen this locally running running over one hour but I don't think it is related to this here.

@edsantiago
Copy link
Member

Finally got around to conflict-resolving this onto my bats-parallel. I've seen no failures either in a long while loop.

LGTM.

Copy link
Contributor

openshift-ci bot commented Aug 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, sbrivio-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member

mheon commented Aug 13, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 13, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 17baab0 into containers:main Aug 13, 2024
58 checks passed
@Luap99 Luap99 deleted the test-pasta-port branch August 13, 2024 18:18
hswong3i pushed a commit to alvistack/passt-top-passt that referenced this pull request Aug 22, 2024
…erface

It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we'll
anyway not relay neighbour solicitations with an unspecified source
address.

While they perform duplicate address detection, the link-local address
is not usable, which prevents us from bringing up especially
containers and communicate with them right away via IPv6.

This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.

As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.

Link: containers/podman#23561 (comment)
Signed-off-by: Stefano Brivio <[email protected]>
Reviewed-by: David Gibson <[email protected]>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 20, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pasta, parallel: test command timeout
4 participants