Add health checks to all controllers #560

Closed
andrewrynhard opened this issue Aug 30, 2021 · 3 comments · Fixed by #568

Comments

@andrewrynhard
Member

We should add readiness and liveness probes to all controllers.

@abckey

abckey commented Sep 17, 2021

Any update on this? We hit this problem again:

E0915 13:05:47.387749       1 leaderelection.go:357] Failed to update lock: Put "https://10.255.0.1:443/api/v1/namespaces/sidero-system/configmaps/controller-leader-election-sidero-controller-manager": context deadline exceeded
I0915 13:05:47.387945       1 leaderelection.go:278] failed to renew lease sidero-system/controller-leader-election-sidero-controller-manager: timed out waiting for the condition
2021-09-15T13:05:47.388Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "sidero-controller-manager-76fc9d4ddf-fx9vk_ec017c02-6eee-443f-b100-81d5560091e6 stopped leading"}
2021-09-15T13:05:47.388Z        ERROR   setup   problem running manager {"error": "leader election lost"}
golang.org/x/sync/errgroup.(*Group).Go.func1
        /.cache/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
2021-09-15T13:05:47.389Z        INFO    controller      Stopping workers        {"reconcilerGroup": "metal.sidero.dev", "reconcilerKind": "Server", "controller": "server"}
2021-09-15T13:05:47.389Z        INFO    controller      Stopping workers        {"reconcilerGroup": "metal.sidero.dev", "reconcilerKind": "Environment", "controller": "environment"}
2021-09-15T13:05:47.389Z        INFO    controller      Stopping workers        {"reconcilerGroup": "metal.sidero.dev", "reconcilerKind": "ServerClass", "controller": "serverclass"}

@smira
Member

smira commented Sep 17, 2021

We are going to take a look today. It's a bug that the controller doesn't crash.

@smira
Member

smira commented Sep 17, 2021

I don't think a liveness check is the right answer here though, as whether the controller is actually working is hard to check from outside.

smira added a commit to smira/sidero that referenced this issue Sep 17, 2021
Fixes siderolabs#560

The way it was implemented before this change, `errgroup` waits for all
goroutines to finish before it returns, so if the controller crashes due
to election issues, the container still keeps running as the HTTP API is up.

After this change, the container crashes on the first error.

Also added liveness/readiness checks; they won't help much with this issue,
but provide an additional layer of protection/visibility.

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit e52071d)