
provider will remove the leases during the k8s maintenance #14

Closed
andy108369 opened this issue Jan 9, 2023 · 6 comments

Labels: repo/provider (Akash provider-services repo issues), sev2

Comments

andy108369 (Contributor) commented Jan 9, 2023

The provider has a setting monitorMaxRetries, which is set to 40 (https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L26). When one of the worker nodes gets removed* during maintenance (say, when downsizing the k8s cluster), the pods that no longer have enough room to start will get closed.

* I haven't tested this by stopping a k8s worker node, but I presume the result is going to be the same.

The best thing to do during maintenance is to stop the akash-provider service in order to prevent the attempts counter from incrementing and reaching the value of 40, which makes the provider believe the deployment failed and close the lease (if that happens, you can spot the deployment failed. closing lease message, with an accumulating attempts value, in the provider's logs):
https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L162
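If you want to check whether any leases have already been closed this way, one way to search the provider logs for that message is shown below (a sketch, not from the original issue; it assumes the provider pod carries the app=akash-provider label used elsewhere on this page):

kubectl -n akash-services logs -l app=akash-provider --tail=10000 | grep "deployment failed. closing lease"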

To stop the akash-provider service

kubectl -n akash-services get statefulsets
kubectl -n akash-services scale statefulsets akash-provider --replicas=0
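Optionally, you can block until the provider pod has actually terminated before starting the maintenance (a sketch, not part of the original instructions; it assumes the pod carries the app=akash-provider label):

kubectl -n akash-services wait --for=delete pod -l app=akash-provider --timeout=120s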

Verify

Make sure the statefulset shows 0/0 replicas and that no akash-provider pods are running:

kubectl -n akash-services get statefulsets
kubectl -n akash-services get pods -l app=akash-provider

This means akash-provider is now stopped.
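For reference, the output of those two commands should look roughly like this (illustrative values only, not taken from the original issue):

NAME             READY   AGE
akash-provider   0/0     42d
No resources found in akash-services namespace.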

To start it back again

Do this only when the maintenance is fully complete and all k8s nodes are back up, with enough resources to host the pods.
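One way to confirm that all nodes are back and Ready before scaling up (standard kubectl commands; this check is not part of the original instructions):

kubectl get nodes
kubectl wait --for=condition=Ready node --all --timeout=300s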

kubectl -n akash-services scale statefulsets akash-provider --replicas=1

Doc https://docs.akash.network/providers/akash-provider-troubleshooting/provider-maintenance

andy108369 (Contributor, Author) commented Jan 9, 2023

Alternatively, the monitorMaxRetries logic should not be applied to deployments that have already been deployed successfully at least once.
OTOH, we don't want pods stuck in Pending forever...

I have informed providers here: https://discord.com/channels/747885925232672829/771909807359262751/1061981326246424596

andy108369 (Contributor, Author) commented Apr 8, 2023

I have just discussed this internally. I'm going to make a proposal where the provider can set a grace period ranging from a few hours to days, so that in the event of a worker node crash the apps have a higher chance of starting back up again.
This is especially important when the apps have persistent storage mounted, so the clients can regain their data once the worker node is restored.

troian (Member) commented Sep 27, 2023

superseded by #121

troian closed this as completed Sep 27, 2023
arno01 commented Oct 10, 2023

Regarding "superseded by #121": let's keep this open, since this issue is different.

#121 is about kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)), while this one is about monitorMaxRetries closing leases when K8s runs out of resources to redeploy the workloads (i.e. even when one of the worker nodes falls off for a short period of time).

andy108369 (Contributor, Author) commented:
NICs died, leases died.

(screenshot attached)

That's only OK as long as the provider isn't charging for a lease that hasn't been working for quite some time.

So I think the Alternative proposal (client-defined) would be ideal if the timeout (the amount of time the lease stays down because it cannot redeploy while the worker node is down) could be configured by the clients themselves in their SDL (say, a deployment_grace_period field in the deployment manifest).

I'll close this issue in favor of the Alternative proposal.

github-project-automation bot moved this from Up Next (prioritized) to Released (in Prod) in Core Product and Engineering Roadmap on Nov 13, 2023