Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid upgrade being killed by failed liveness probes #344

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

remram44
Copy link

@remram44 remram44 commented Jan 27, 2023

Pull Request

Description of the change

This runs the upgrade in an init container, before the main container starts. It sets the command arguments to "true", so the container exits immediately after the upgrade, and sets the variable NEXTCLOUD_UPDATE to 1, without which the upgrade step is skipped because there are command arguments (see entrypoint).

Benefits

Avoid the container being stopped during the upgrade because of failed liveness probes, since init containers don't get probed.

Possible drawbacks

Runs an additional container, so it's a bit slower I guess.

Applicable issues

Additional information

This also removes the nextcloud.update value. I don't see a way people could possibly have used it though, since it only does something when you pass different arguments to the image, and there is no value in this chart that will allow you to do that.

Checklist

@jessebot
Copy link
Collaborator

Thanks for submitting a PR! :)

Will this always run an upgrade anytime the pod restarts or additional are spun up? Is there a way to disable upgrades until a user is ready?

@remram44
Copy link
Author

It will update the volume to the version of the image, so it's usually triggered by the user doing helm upgrade (or kubectl set image I guess). If a Pod is restarted without changing version, nothing will happen.

@remram44
Copy link
Author

remram44 commented Feb 9, 2023

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)

@Jeroen0494
Copy link

Jeroen0494 commented Mar 4, 2023

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because not installed, and you can't reach the web installer to install)

Should you even be using the web based upgrader at all when using container images? The image has the updated files. If you use the web-based upgrader but don't update your container image version, you'll break stuff for sure.

@Jeroen0494
Copy link

You can disable the web-based updater to prevent people from borking their installation:

'upgrade.disable-web' => true,

https://docs.nextcloud.com/server/stable/admin_manual/configuration_server/config_sample_php_parameters.html
Search for the upgrade.disable-web option.

@remram44
Copy link
Author

remram44 commented Mar 4, 2023

Yeah I also think it should auto-install but that is not the default behavior of running helm install

@4censord
Copy link

Hey, what would be needed to get this merged?
Also, is it possible to backport this change to the last 3 versions, so 25, 26 and 27?

@provokateurin
Copy link
Member

@4censord I haven't looked at this PR closely, but the conflicts need to be solved and the changes need to be reviewed. With this chart there is no backporting, you just overrride nextcloud.tag with your desired version.

@4censord
Copy link

I'll take a look at resolving the conflicts.

With this chart there is no backporting, you just overrride nextcloud.tag with your desired version.

Ok

@4censord
Copy link

The conflicts are solvable by simply rebasing onto the main branch.
@remram44 can you rebase, please?
Or if you'd be fine with that, i could open a new PR with your changes, and we close this one.

@remram44
Copy link
Author

If I rebase, will somebody review? I don't like to put in work for it to be ignored

@remram44 remram44 force-pushed the upgrade-avoid-probes branch from 11f5cf8 to c65efeb Compare October 26, 2023 15:33
@budimanjojo
Copy link

@jessebot can you please take a look at this? this is a longstanding issue with this chart and currently the only way to prevent upgrade failing is to disable all probes which is not really a solution.

I'm using fluxcd gitops tool to manage my cluster, and when the helm upgrade fails because it's taking longer than the probes, fluxcd will rollback the pod to the last working state. But nextcloud doesn't want to start the last working state too because of this error:

Can't start Nextcloud because the version of the data (27.1.3.2) is higher than the docker image version (27.1.2.1) and downgrading is not supported.
Are you sure you have pulled the newest image version?

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

@4censord
Copy link

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

How exactly are you fixing this?
I am left on 25.0.1 currently, and have yet to actually complete an upgrade with the helm deployment.

@budimanjojo
Copy link

@4censord I use helm rollback nextcloud <revision> to get it back to the helm revision where the upgrade failed and get rid of the nextcloud error. Then exec into the pod and do php occ upgrade then php occ maintenance:mode --off

@4censord
Copy link

@budimanjojo Then i seem to have a different issue, that does not work for me. I have to roll back a backup after attempting an upgrade.

@4censord
Copy link

4censord commented Nov 5, 2023

@provokateurin Would you mind taking a look at this once it's convenient?
IMO its ready.

@jessebot
Copy link
Collaborator

jessebot commented Jun 9, 2024

Sorry for the delay. Slowly making my rounds 🙏

@remram44 I did a quick look over and it seems ok, but don't we still want to be able to set upgrade in the values.yaml?

Also, there still needs to be a rebase, as there's conflicts.

@jessebot jessebot requested a review from provokateurin June 9, 2024 07:46
@jessebot
Copy link
Collaborator

jessebot commented Jun 9, 2024

Also, tagged @provokateurin to get an additional review to ask about keeping the upgrade parameter in the values.yaml.

@provokateurin
Copy link
Member

I'm not 100% sure if I read https://github.com/nextcloud/docker/blob/master/README.md correctly, but it sounds like the "normal" container will still try to do the upgrade as the default command is used there. Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

@4censord
Copy link

4censord commented Jun 9, 2024

IIRK the upgrade env var is used for triggering the Nextcloud images upgrade mechanism. But because this PR completely supersedes the internal upgrade on startup stuff by running the upgrades explicitly before, the upgrade env var does not do anything any more.

Also, there still needs to be a rebase, as there's conflicts.

Yeah, back in 2023 when we said its ready there weren't.

@4censord
Copy link

4censord commented Jun 9, 2024

Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

No, the init container runs first, and the normal container only gets started once the upgrade is completes.
If it had the upgrade var set, it just would not do anything because its already on the lasted version

@provokateurin
Copy link
Member

You're right, I overlooked that it is an initContainer and not another sidecar. Then this makes sense to me, I'll have to give it some testing though to confirm it works as intended.

Comment on lines 320 to 341
{{- if .Values.nextcloud.securityContext}}
securityContext:
{{- with .Values.nextcloud.securityContext }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just be a single with instead of if+with

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initContainer is copied from the container with only required changes, I am not cleaning up the existing container at this time. It would only make this patch harder to review.

charts/nextcloud/templates/deployment.yaml Outdated Show resolved Hide resolved
@provokateurin
Copy link
Member

If you could do a rebase onto the latest main state then I'll give it a test.

@wrenix
Copy link
Collaborator

wrenix commented Jun 9, 2024

In my opinion with the hooks. It is a better solution to move it to the initContainer (that should work).

@4censord
Copy link

4censord commented Jun 14, 2024

I'd say we should have multiple init-containers run in order.
That would make them both easier, and would help with debugging.
I'd have them in this order:

  1. wait for DB to be up
  2. Upgrade Nextcloud version
  3. install and update apps
  4. disable maintenance mode, possible more checks.

While that adds complexity in more containers, and will increase startup time per pod, IMHO it's then easier that writing complex scripts for doing everything within one container.

@remram44
Copy link
Author

I'm not entirely sure what the advantage is. Multiple init containers that run the exact same image... that also means they will have to release/acquire a lock multiple time.

I don't see what you mean be "decrease startup time per pod". At least as much work has to happen, so that doesn't seem true?

@4censord
Copy link

4censord commented Jun 15, 2024

decrease startup time per pod

This was meant to say "increase", fixed now.

Every container that starts takes a few seconds until it's ready to run its scripts.
So if you run more containers sequentially, you have that delay multiple times until the container can check that it does not need to do anything.

Personally, I just would have done multiple containers, because that feels easier to do, rather than having to chain things in the same container.
But its not a problem for me, and i wont argue against doing it another way.

@remram44
Copy link
Author

Bump. Will this get merged any time soon? Will you ask me to rebase and ignore it again?

@jessebot @provokateurin please let me know if there is anything else you expect before merging, thanks

@budimanjojo
Copy link

It's unreal how long this PR is getting ignored and this is a real issue with the chart.

@provokateurin
Copy link
Member

Please keep in mind this chart is only maintained by volunteers and nobody is paid for this.

@budimanjojo
Copy link

@provokateurin I understand. But so is the PR author which happens to be very cooperative in the rebase requests and then keep getting ignored afterwards (multiple times). This change is very small and you should just take a quick look, and decide whether this is a good addition or not and proceed to test if you accept this change or reject and close the PR otherwise. This should take like one or two hours and not years.

@jessebot
Copy link
Collaborator

jessebot commented Jul 22, 2024

This change is very small and you should just take a quick look, and decide whether this is a good addition or not and proceed to test if you accept this change or reject and close the PR otherwise. This should take like one or two hours and not years.

This changes the way updates work. It's unfortunately not as small as you may think, and because this is maintained by volunteers, it's when we have time to take a look at it.

Bump. Will this get merged any time soon? Will you ask me to rebase and ignore it again?
... please let me know if there is anything else you expect before merging, thanks

@remram44 Please also bump the helm chart version a major version, as this removes an option and defaults to allowing updates. In the future, please check out the checks at the bottom of the PR and you can see if there's any default checks that we'll come back and ask you to change, as if the checks of a PR are not passing, we cannot merge it, according to the greater nextcloud org rules. You can also find the contributing guidelines here.

Screenshot of the checks at the bottom of the PR, showing that the chart linting github workflow has failed and stating that the Merging is blocked

EDIT: I just tried to check this PR in an incognito window while logged out of GitHub and it didn't show the bottom checks section, but it does still show each commit, and if you see a little ❌ beside a commit, you should be able to click it and it will show you which check has failed. Sorry about that.

@provokateurin should we also add a note in the README that with this change, this chart will not auto-update anytime the tag is updates, so if you don't want that, you should manually specify the image.tag? 🤔 If not, I'm happy to merge this when the Chart.yaml has it's version bumped. Also, I can submit that PR for the doc change. It doesn't have to be in this PR.

Copy link
Collaborator

@jessebot jessebot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes comment: Needs Version bumped in Chart.yaml.

@provokateurin
Copy link
Member

this chart will not auto-update anytime the tag is updates, so if you don't want that, you should manually specify the image.tag?

I didn't dive deep into this PR, but to my understanding it doesn't change this behavior? It only changes how the update is performed, or am I missing something?

@budimanjojo
Copy link

This PR only move the update process to an initContainer instead of in the main container that has probes. Those probes kill the container everytime when the upgrade process takes too long that will cause gitops tools like FluxCD and ArgoCD to rollback the chart because of the failed probes. And Nextcloud will refuse to start even after rolling back because of version mismatch when the container being killed mid upgrade.

@budimanjojo
Copy link

This changes the way updates work. It's unfortunately not as small as you may think, and because this is maintained by volunteers, it's when we have time to take a look at it.

No the changes doesn't change the way updates work, it just move that job to another container instead. And I apologize for being such a jerk in the complaint but I have been dealing with this problem for too long.

@remram44 remram44 force-pushed the upgrade-avoid-probes branch from 3acbf1e to c1b9fe0 Compare July 22, 2024 14:55
@remram44
Copy link
Author

Bumped to 6.0.0

@remram44
Copy link
Author

@jessebot I can see the checks at the bottom, however they do not run until the workflow is approved:

1 workflow awaiting approval
This workflow requires approval from a maintainer.
3 expected and 1 successful checks

I am not going to check this page everyday to see if maybe they got approved and there are results to see...

@jessebot
Copy link
Collaborator

jessebot commented Jul 23, 2024

I didn't dive deep into this PR, but to my understanding it doesn't change this behavior? It only changes how the update is performed, or am I missing something?

It removes the option to set the update flag and always sets it going forward, but perhaps I've misunderstood. @provokateurin could you please keep helping here?

@jessebot jessebot dismissed their stale review July 23, 2024 07:09

no longer handling this

remram44 added 3 commits July 26, 2024 12:04
This should not have been exposed. There is no way to use it, since you
can't pass a custom command to the container.

Signed-off-by: Remi Rampin <[email protected]>
@remram44 remram44 force-pushed the upgrade-avoid-probes branch from 23b861c to 3948370 Compare July 26, 2024 16:12
@remram44
Copy link
Author

The NEXTCLOUD_UPDATE is actually a little tricky. It is does not enable/disable updating. For that reason, in my view it should have never been expose by the chart.

The update always happens automatically, unless you run the container with a custom command (not the case of this chart), in which case it won't update unless NEXTCLOUD_UPDATE=1.

@SchoolGuy
Copy link
Contributor

Is there anything that one can do to progress this PR? I have to resort atm to setting an absurd timeframe for a startup probe to not kill an upgrade. This however delays starting Nextcloud all the time by multiple minutes. I would really love for the mechanism of init-containers for NC upgrades to go forward.

@davidfrickert
Copy link

davidfrickert commented Dec 6, 2024

also interested in this moving forward, in the meantime i've merged this branch with the current 6.3.0 and published it for my own use, if anyone else wants you can use it:
helm pull oci://git.djf.lol/davidfrickert/nextcloud/nextcloud --version 6.3.0

The source of the packaged chart is here https://git.djf.lol/davidfrickert/helm-charts/src/branch/main/charts/nextcloud

(still need to encounter an upgrade scenario as i've just started using it but i hope this will improve the currently terrible upgrade process)

@remram44
Copy link
Author

remram44 commented Dec 6, 2024

Maybe forking is the way forward. This is a critical issue where installation and upgrades get killed halfway through and obviously no maintainer has any interest in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Liveness probe kills pod while upgrading
9 participants