Graceful shutdown not handled correctly for long-running reconciliations #569

Closed
relu opened this issue Nov 25, 2022 · 0 comments

relu commented Nov 25, 2022

There seems to be a problem with graceful termination handling when the helm-controller workers are busy reconciling a release that takes longer than 30s (readinessProbe.failureThreshold: 3 and readinessProbe.periodSeconds: 10, as configured by default). The readiness probe starts failing immediately after SIGTERM (I see that in the pod events), and then the container receives another SIGTERM, which triggers the signal handler to exit immediately with code 1.
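
To make the two-signal behaviour concrete, here is a minimal sketch of a two-stage signal handler: the first signal cancels a context so that graceful shutdown can start, and the second one exits with code 1. It only illustrates the pattern described above and is not controller-runtime's actual code; `watchSignals` and `simulate` are hypothetical names, and the exit function is injected so the sketch can run without killing the process.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"syscall"
)

// watchSignals is a hypothetical two-stage handler: the first signal cancels
// the context (graceful shutdown begins), the second signal calls exit(1).
// It returns early if the signal channel is closed.
func watchSignals(sigs <-chan os.Signal, cancel context.CancelFunc, exit func(int)) {
	if _, ok := <-sigs; !ok {
		return
	}
	cancel() // first signal: ask everything to stop gracefully
	if _, ok := <-sigs; !ok {
		return
	}
	exit(1) // second signal: stop waiting and exit directly
}

// simulate delivers n SIGTERMs and reports whether the context was cancelled
// and which exit code was requested (-1 if exit was never called).
func simulate(n int) (cancelled bool, exitCode int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sigs := make(chan os.Signal, n)
	for i := 0; i < n; i++ {
		sigs <- syscall.SIGTERM
	}
	close(sigs)

	exitCode = -1
	watchSignals(sigs, cancel, func(code int) { exitCode = code })
	return ctx.Err() != nil, exitCode
}

func main() {
	for _, n := range []int{1, 2} {
		cancelled, code := simulate(n)
		fmt.Printf("signals=%d cancelled=%v exitCode=%d\n", n, cancelled, code)
	}
}
```

With a single SIGTERM the context is cancelled and nothing exits; a second SIGTERM arriving while a reconciliation is still running is what cuts the work short.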

However, this doesn't look like something we can fix in helm-controller itself but rather an issue in the controller-runtime logic:

  1. The internalProceduresStop channel is closed here (before the runnables are stopped): https://github.com/kubernetes-sigs/controller-runtime/blob/v0.13.1/pkg/manager/internal.go#L539
  2. The probes server is designed to shut down when the internalProceduresStop channel is closed: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.13.1/pkg/manager/internal.go#L384

Removing the readiness probe will solve only part of the problem because we also need to override the default controller manager gracefulShutdownTimeout (30s).

relu added a commit that referenced this issue Nov 25, 2022
Override the default GracefulShutdownTimeout option given to the
controller manager with a default of 0 (no timeout), since Helm
operations are sensitive to interruption, and interrupting them can
leave the HelmRelease in a bad state.

This also allows users to override the option via the CLI flag
`-graceful-shutdown-timeout`, which sets how much time to wait before
forcibly exiting.

Related to #569

Signed-off-by: Aurel Canciu <[email protected]>
relu self-assigned this Nov 28, 2022