test: Wait until host EP is ready (=regenerated) #18859
Conversation
/test
ce1643c to 2c6f95d Compare
/test
Job 'Cilium-PR-K8s-GKE' failed:
Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment
@joestringer Requested you to review mainly to validate the assumption that the "ready" state means the endpoint has been regenerated after startup.
/test-gke
Job 'Cilium-PR-K8s-GKE' failed:
Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment
/test-gke
/test-1.23-net-next
Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:
Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment
2c6f95d to dd32ad5 Compare
/test-1.23-net-next
Previously (hopefully), we saw many CI flakes which were due to the first request from outside the cluster to a k8s Service failing. E.g.:

Can not connect to service "http://192.168.37.11:30146" from outside cluster

After some investigation it became obvious why this happened: cilium-agent becomes ready before the host endpoints get regenerated (e.g., bpf_netdev_eth0.o). This leads to old programs handling requests, which might fail in different ways. For example, the following request failed in the K8sServicesTest Checks_N_S_loadbalancing Tests_with_direct_routing_and_DSR suite:

{..., "IP":{"source":"192.168.56.13","destination":"10.0.1.105", ..., "trace_observation_point":"TO_OVERLAY","interface":{"index":40}, ...}

The previous suite was running in tunnel mode, so the old program was still trying to send the packet over a tunnel which no longer existed. This resulted in a silent drop.

Fix this by making the CI wait, after deploying Cilium, until the host EP is in the "ready" state. This should ensure that the host EP programs have been regenerated.

Signed-off-by: Martynas Pumputis <[email protected]>
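To make the proposed wait concrete, below is a minimal, standalone Go sketch (not the helper actually added in this PR) that polls `cilium endpoint list -o json` until the endpoint labelled reserved:host reports the "ready" state. The command invocation, JSON field names, and two-minute timeout are assumptions and may need adjusting for a given Cilium version.

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// endpoint mirrors only the fields this check needs from the JSON emitted by
// `cilium endpoint list -o json`. The field names below are assumptions based
// on the Cilium endpoint API model; verify them against the version in use.
type endpoint struct {
	Status struct {
		State    string `json:"state"`
		Identity struct {
			Labels []string `json:"labels"`
		} `json:"identity"`
	} `json:"status"`
}

// hostEndpointReady reports whether the endpoint carrying the reserved:host
// label is in the "ready" state, i.e. whether its datapath programs (such as
// bpf_netdev_eth0.o) have been regenerated since the agent came up.
func hostEndpointReady() (bool, error) {
	out, err := exec.Command("cilium", "endpoint", "list", "-o", "json").Output()
	if err != nil {
		return false, err
	}
	var eps []endpoint
	if err := json.Unmarshal(out, &eps); err != nil {
		return false, err
	}
	for _, ep := range eps {
		for _, lbl := range ep.Status.Identity.Labels {
			if lbl == "reserved:host" {
				return ep.Status.State == "ready", nil
			}
		}
	}
	return false, fmt.Errorf("host endpoint not found")
}

func main() {
	// Poll for up to two minutes; the timeout is an arbitrary choice for this sketch.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if ok, err := hostEndpointReady(); err == nil && ok {
			fmt.Println("host endpoint is ready")
			return
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Println("timed out waiting for the host endpoint to become ready")
}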
/test
1 similar comment
/test
Yep, endpoints first start in restoring state when they're restored from the filesystem, then they should transition through regenerating and become ready:
cilium/pkg/endpoint/endpoint.go, line 852 in ab7ff52: ep.setState(StateRestoring, "Endpoint restoring")
cilium/pkg/endpoint/endpoint.go, line 1273 in ab7ff52: func (e *Endpoint) setState(toState State, reason string) bool {
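To make the lifecycle described above concrete, here is a small illustrative Go model of the restoring -> regenerating -> ready path. It is not Cilium's actual transition logic from pkg/endpoint (which has more states and rules); the allowed-transition map and the shape of setState below are simplified assumptions based on the snippets quoted above.

package main

import "fmt"

// State names follow the lifecycle mentioned in the discussion; everything
// else in this file is simplified for illustration.
type State string

const (
	StateRestoring    State = "restoring"
	StateRegenerating State = "regenerating"
	StateReady        State = "ready"
)

// validNext captures only the happy path for an endpoint restored from the
// filesystem: restoring -> regenerating -> ready. Cilium's real transition
// table allows more states and transitions.
var validNext = map[State][]State{
	StateRestoring:    {StateRegenerating},
	StateRegenerating: {StateReady},
}

// setState mimics the shape of (*Endpoint).setState referenced above: it
// refuses disallowed transitions and returns whether the transition happened.
func setState(current *State, to State, reason string) bool {
	for _, next := range validNext[*current] {
		if next == to {
			fmt.Printf("%s -> %s (%s)\n", *current, to, reason)
			*current = to
			return true
		}
	}
	return false
}

func main() {
	st := StateRestoring
	setState(&st, StateReady, "skipping regeneration")        // rejected: must regenerate first
	setState(&st, StateRegenerating, "Endpoint regenerating") // restoring -> regenerating
	setState(&st, StateReady, "Regeneration complete")        // regenerating -> ready
	// Only once the host endpoint reaches "ready" have its programs been
	// rebuilt, which is the condition the CI change in this PR waits for.
}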
@brb Would it be feasible to backport this to v1.10?
@pchaigno Sure. Do you want me to do that, or a tophat?
I'll take care of it and ping you if I need help. |
Previously (hopefully), we saw many CI flakes which were due to the
first request from outside the cluster to a k8s Service failing. E.g.:

Can not connect to service "http://192.168.37.11:30146" from outside cluster

After some investigation it became obvious why this happened: cilium-agent
becomes ready before the host endpoints get regenerated (e.g.,
bpf_netdev_eth0.o). This leads to old programs handling requests,
which might fail in different ways. For example, the following request
failed in the K8sServicesTest Checks_N_S_loadbalancing
Tests_with_direct_routing_and_DSR suite:

{..., "IP":{"source":"192.168.56.13","destination":"10.0.1.105", ..., "trace_observation_point":"TO_OVERLAY","interface":{"index":40}, ...}

The previous suite was running in tunnel mode, so the old program
was still trying to send the packet over a tunnel which no longer
existed. This resulted in a silent drop.

Fix this by making the CI wait, after deploying Cilium, until the host
EP is in the "ready" state. This should ensure that the host EP programs
have been regenerated.
Fix #12511.