-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
provider will remove the leases during the k8s maintenance #14
Comments
Or Have informed providers here https://discord.com/channels/747885925232672829/771909807359262751/1061981326246424596 |
I have just discussed this internally. I'm going to make a proposal where the provider can set the grace period from a few hours to days so that in the event of the worker node crash, the apps may have a higher chance to start back up again. |
The proposal => https://github.com/orgs/akash-network/discussions/160 |
superseded by #121 |
let's keep this open since this issue is different, #121 is |
NIC's died, leases died. That's OK when provider isn't charging for the lease that doesn't work for quite some time. So I think the Alternative proposal (client-defined) would be ideal if the timeout (the amount of time when the lease is down because it cannot redeploy as the worker node is down) could be configured by the clients themselves in their SDL (say I'll close this issue in the favor of the Alternative proposal. |
Provider has a setting
monitorMaxRetries
which is set to40
, https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L26 , so when one of the worker nodes gets removed* during the maintenance (say for downsizing the k8s cluster), the pods which do not have enough room to start will be getting closed.The best thing to do during the maintenance is to stop
akash-provider
service in order to prevent theattempts
counter from incrementing, reaching the value of40
which makes provider believe the deployment failed, closing the lease (if that's the case then you can spot this message in the provider's logs:deployment failed. closing lease
with accumulatingattempts
value.)https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L162
To stop the
akash-provider
serviceVerify
Make sure you see
0/0
or you see noakash-provider
pods running:means
akash-provider
is stopped now.To start it back again
Doc https://docs.akash.network/providers/akash-provider-troubleshooting/provider-maintenance
The text was updated successfully, but these errors were encountered: