
test: Wait until host EP is ready (=regenerated) #18859

Merged 1 commit into master on Feb 23, 2022

Conversation

@brb brb commented Feb 18, 2022

Previously (hopefully no longer), we saw many CI flakes which were due
to the first request from outside the cluster to a k8s Service failing.
E.g.:

Can not connect to service "http://192.168.37.11:30146" from outside
cluster

After some investigation it became obvious why this happened:
cilium-agent becomes ready before the host endpoints have been
regenerated (e.g., bpf_netdev_eth0.o). This leads to old programs
handling requests, which might fail in different ways. For example, the
following request failed in the K8sServicesTest Checks_N_S_loadbalancing
Tests_with_direct_routing_and_DSR suite:

{..., "IP":{"source":"192.168.56.13","destination":"10.0.1.105", ...,
"trace_observation_point":"TO_OVERLAY","interface":{"index":40}, ...}

The previous suite ran in tunnel mode, so the old program was still
trying to send the packet over a tunnel which no longer existed. This
resulted in a silent drop.

Fix this by making the CI wait, after deploying Cilium, until the host
EP is in the "ready" state. This should ensure that the host EP programs
have been regenerated.

Fixes #12511.
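The fix boils down to polling until the host endpoint reports the "ready" state. A minimal sketch of such a readiness check in Go, assuming the JSON shape of `cilium endpoint list -o json` output (the `status.state` and `status.identity.labels` field names and the `reserved:host` label used here are assumptions for illustration, not code copied from this PR):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint mirrors the subset of the `cilium endpoint list -o json`
// output that the check needs. Field names are assumptions.
type endpoint struct {
	Status struct {
		State    string `json:"state"`
		Identity struct {
			Labels []string `json:"labels"`
		} `json:"identity"`
	} `json:"status"`
}

// hostEndpointReady reports whether the endpoint carrying the
// reserved:host label is in the "ready" state. The CI would call this
// in a retry loop after deploying Cilium.
func hostEndpointReady(raw []byte) (bool, error) {
	var eps []endpoint
	if err := json.Unmarshal(raw, &eps); err != nil {
		return false, err
	}
	for _, ep := range eps {
		for _, l := range ep.Status.Identity.Labels {
			if l == "reserved:host" {
				return ep.Status.State == "ready", nil
			}
		}
	}
	// Host endpoint not listed yet: keep waiting.
	return false, nil
}

func main() {
	sample := []byte(`[{"status":{"state":"ready","identity":{"labels":["reserved:host"]}}}]`)
	ok, err := hostEndpointReady(sample)
	fmt.Println(ok, err)
}
```

In the actual CI this would run against the live agent output rather than a sample, retrying with a timeout until the check returns true.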

@brb brb added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Feb 18, 2022
brb commented Feb 18, 2022

/test

@brb brb force-pushed the pr/brb/ci-wait-until-host-ep-regenerated branch from ce1643c to 2c6f95d on February 19, 2022 09:42
@brb brb marked this pull request as ready for review February 19, 2022 09:44
@brb brb requested a review from a team as a code owner February 19, 2022 09:44
@brb brb requested a review from nebril February 19, 2022 09:44
brb commented Feb 19, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks service on same node

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@brb brb requested a review from joestringer February 19, 2022 09:44
brb commented Feb 19, 2022

@joestringer Requested your review mainly to validate the assumption that the "ready" state means the endpoint has been regenerated after startup.

brb commented Feb 19, 2022

/test-gke

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks service on same node

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

brb commented Feb 19, 2022

/test-gke

brb commented Feb 19, 2022

/test-1.23-net-next

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:


Test Name

K8sVerifier Runs the kernel verifier against Cilium's BPF datapath

Failure Output

FAIL: terminating containers are not deleted after timeout

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

@brb brb force-pushed the pr/brb/ci-wait-until-host-ep-regenerated branch from 2c6f95d to dd32ad5 on February 21, 2022 13:26
brb commented Feb 21, 2022

/test-1.23-net-next

Signed-off-by: Martynas Pumputis <[email protected]>
brb commented Feb 21, 2022

/test

brb commented Feb 21, 2022

/test

@joestringer joestringer left a comment


Yep, endpoints first start in restoring state when they're restored from the filesystem, then they should transition through regenerating and become ready:

ep.setState(StateRestoring, "Endpoint restoring")

func (e *Endpoint) setState(toState State, reason string) bool {
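The transition the reviewer describes (restored endpoints move from restoring through regenerating to ready) can be sketched as a tiny state machine. This is a hypothetical illustration of that progression, not Cilium's actual `setState` implementation; the transition table is an assumption for the purpose of the example:

```go
package main

import "fmt"

// validNext maps each endpoint state to the states it may move to.
// This table only models the restore path discussed above and is an
// assumption, not Cilium's real state machine.
var validNext = map[string][]string{
	"restoring":    {"regenerating"},
	"regenerating": {"ready"},
	"ready":        {"regenerating"}, // e.g. a later config change triggers regeneration
}

// canTransition reports whether moving from one state to another is
// allowed by the table above.
func canTransition(from, to string) bool {
	for _, s := range validNext[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	// Walk the restore path: every step must be a valid transition,
	// so reaching "ready" implies a regeneration happened in between.
	path := []string{"restoring", "regenerating", "ready"}
	for i := 0; i+1 < len(path); i++ {
		fmt.Printf("%s -> %s: %v\n", path[i], path[i+1], canTransition(path[i], path[i+1]))
	}
}
```

The key property the CI relies on is visible here: there is no direct edge from "restoring" to "ready", so observing "ready" implies the endpoint passed through regeneration.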

@brb brb added needs-backport/1.11 ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Feb 23, 2022
@nebril nebril merged commit 3b9b098 into master Feb 23, 2022
@nebril nebril deleted the pr/brb/ci-wait-until-host-ep-regenerated branch February 23, 2022 07:58
@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Mar 15, 2022
@pchaigno
Copy link
Member

pchaigno commented Apr 4, 2022

@brb Would it be feasible to backport this to v1.10?

@brb
Copy link
Member Author

brb commented Apr 4, 2022

@pchaigno Sure. Do you want me to do that or a tophat?

@pchaigno
Copy link
Member

pchaigno commented Apr 4, 2022

I'll take care of it and ping you if I need help.

Successfully merging this pull request may close these issues.

CI: K8sServicesTest Checks fails with "Can not connect to service X from outside cluster"