
provider will remove the leases during the k8s maintenance #14

Closed
andy108369 opened this issue Jan 9, 2023 · 6 comments

Labels: repo/provider (Akash provider-services repo issues), sev2

Comments

andy108369 (Contributor) commented Jan 9, 2023

The provider has a setting monitorMaxRetries, which is set to 40 (https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L26). When one of the worker nodes gets removed* during maintenance (say, when downsizing the k8s cluster), the pods that no longer have enough room to start will get closed.

* I haven't tested this by stopping a k8s worker node, but I presume the result is going to be the same.

The best thing to do during maintenance is to stop the akash-provider service in order to prevent the attempts counter from incrementing and reaching the value of 40, which makes the provider believe the deployment failed and close the lease (if that happens, you can spot the deployment failed. closing lease message, with an accumulating attempts value, in the provider's logs):
https://github.com/akash-network/provider/blob/v0.1.0/cluster/monitor.go#L162
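If you want to check whether any leases have already been closed this way, one way to search the provider logs for that message is shown below (a sketch, not from the original issue; it assumes the provider pod carries the app=akash-provider label used elsewhere on this page):

kubectl -n akash-services logs -l app=akash-provider --tail=10000 | grep "deployment failed. closing lease"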

To stop the akash-provider service

kubectl -n akash-services get statefulsets
kubectl -n akash-services scale statefulsets akash-provider --replicas=0
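Optionally, you can block until the provider pod has actually terminated before starting the maintenance (a sketch, not part of the original instructions; it assumes the pod carries the app=akash-provider label):

kubectl -n akash-services wait --for=delete pod -l app=akash-provider --timeout=120s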

Verify

Make sure the statefulset shows 0/0 replicas and that no akash-provider pods are running:

kubectl -n akash-services get statefulsets
kubectl -n akash-services get pods -l app=akash-provider

This means akash-provider is now stopped.
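For reference, the output of those two commands should look roughly like this (illustrative values only, not taken from the original issue):

NAME             READY   AGE
akash-provider   0/0     42d
No resources found in akash-services namespace.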

To start it back again

Do this only when the maintenance is fully complete and all k8s nodes are back up, with enough resources to host the pods.
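One way to confirm that all nodes are back and Ready before scaling up (standard kubectl commands; this check is not part of the original instructions):

kubectl get nodes
kubectl wait --for=condition=Ready node --all --timeout=300s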

kubectl -n akash-services scale statefulsets akash-provider --replicas=1

Doc https://docs.akash.network/providers/akash-provider-troubleshooting/provider-maintenance

andy108369 (Contributor, Author) commented Jan 9, 2023

Alternatively, the monitorMaxRetries logic should not be applied to deployments that have already been deployed successfully at least once.
OTOH, we don't want pods stuck in Pending forever...

I have informed providers here: https://discord.com/channels/747885925232672829/771909807359262751/1061981326246424596

andy108369 (Contributor, Author) commented Apr 8, 2023

I have just discussed this internally. I'm going to make a proposal where the provider can set a grace period ranging from a few hours to days, so that in the event of a worker node crash the apps have a higher chance of starting back up again.
This is especially important when the apps have persistent storage mounted, so the clients can regain their data once the worker node is restored.

troian (Member) commented Sep 27, 2023

superseded by #121

troian closed this as completed Sep 27, 2023
arno01 commented Oct 10, 2023

Regarding "superseded by #121": let's keep this open, since this issue is different.

#121 is about kube-builder: ClusterParams() returned result of unexpected type (%!s(<nil>)), while this one is about monitorMaxRetries closing leases when K8s runs out of resources to redeploy the workloads (i.e. even when one of the worker nodes falls off for a short period of time).

andy108369 (Contributor, Author) commented:
NICs died, leases died.

(screenshot attached)

That's only OK as long as the provider isn't charging for a lease that hasn't been working for quite some time.

So I think the Alternative proposal (client-defined) would be ideal if the timeout (the amount of time the lease stays down because it cannot redeploy while the worker node is down) could be configured by the clients themselves in their SDL (say, a deployment_grace_period field in the deployment manifest).

I'll close this issue in favor of the Alternative proposal.

github-project-automation bot moved this from Up Next (prioritized) to Released (in Prod) in Core Product and Engineering Roadmap on Nov 13, 2023