-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the clustermesh status command #1834
Conversation
4833f0c
to
76c774c
Compare
76c774c
to
c9a5394
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you forgot to remove the commit aptly named "TO DROP: use forked version of cilium"
Yeah, the commit is still there because otherwise the CLI doesn't compile. This is the reason for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're now checking the health of the Deployment, but it may be the case that the Service is separately not able to receive traffic (e.g. an LB Service has not been assigned an IP). Is the idea here that if nodes have synced then the Service must be reachable?
There's actually a service readiness check, but it is hidden inside the |
c9a5394
to
183375f
Compare
The temporary commit has been dropped, and there are no more vendoring changes.
Currently, the wait observer prints the first log message only after that the warning interval has expired. Yet, this can be misleading, because the command seems to hang, and it is not clear which operation is being performed. Signed-off-by: Marco Iorio <[email protected]>
This parameter was a no-op, hence let's just remove it. Signed-off-by: Marco Iorio <[email protected]>
183375f
to
d9904f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one question
clustermesh/clustermesh.go
Outdated
ServiceIPs: []string{}, | ||
Tunnel: cm.Data[configNameTunnel], | ||
} | ||
|
||
switch { | ||
case len(endpoints) > 0: | ||
svc.Spec.Type = "Manual" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this "Manual"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR; I've dropped it with the last push. The idea was to propagate the type of the service to be printed by the cilium clustermesh status
command, and to set it to Manual
in case the endpoints were specified manually (there was a typo though, and it should have read ai.ServiceType = "Manual"
). Yet, the cilium clustermesh status
command does not support specifying the endpoints manually, and I agree that having a custom value might be misleading (I've also now changed that field to be of type corev1.ServiceType
to make its content more explicit). Hence, the special case is actually not needed.
The retrieval of the access information already retrieves the clustermesh-apiserver service and verifies that an IP can be inferred. Hence, let's drop the subsequent service check, which is redundant. Signed-off-by: Marco Iorio <[email protected]>
Currently, the status of the clustermesh-apiserver deployment is checked only when the --wait flag is specified. Let's always enable it, and uniform the wait style to that adopted for the other checks. Signed-off-by: Marco Iorio <[email protected]>
The wait logic implemented by the ClusterMeshConnectivity function was never enabled, as it is already handled by an outer wrapper. Hence, let's just remove it for the sake of simplicity. Signed-off-by: Marco Iorio <[email protected]>
The message reported when waiting for clusters to be connected is currently not correct, as the given number refers to nodes and not clusters. Additionally, the two conditions are equivalent, as an error is always present when a node is not ready. Let's also get rid of a redundant error wrapping. Signed-off-by: Marco Iorio <[email protected]>
Let's sort the cluster names and the possible errors to enforce a consistent output. Additionally, let's make it more explicit when the local cluster is not connected to any remote clusters. Signed-off-by: Marco Iorio <[email protected]>
Currently, the number of ready clusters is determined from the status reported by the agents only. Yet, given that the configuration is mounted from a secret, it might take some time before it propagates, in case agents are not restarted at the same time (that is when host aliases don't need to be configured). Hence, let's infer the number of expected clusters from the secret, so that we can actually report the correct status and wait until they are all effectively ready. Signed-off-by: Marco Iorio <[email protected]>
Let's leverage the additional status flags introduced in cilium/cilium#26788 to provide more detailed information concerning why a node cannot connect to a remote cluster, and simplify troubleshooting. In particular, it might be due to an etcd connection problem, to the cluster config being required but not found, or to the synchronization process being still in progress. Signed-off-by: Marco Iorio <[email protected]>
d9904f1
to
e695828
Compare
This PR includes a set of improvements for the
cilium clustermesh status
command, to simplify troubleshooting and reduce CI flakiness:Please refer to the individual commit messages for additional information.
Fixes: #1773
I'm currently marking this PR as ready for review as it makes sense that the review occurs in parallel with that of cilium/cilium#26788, so that possible modifications can be synchronized. The commit with the go.mod changes will be dropped before merging.