akash tx update response & behavior when provider is out of resources #82

Open
arno01 opened this issue Jan 4, 2022 · 11 comments
@arno01

arno01 commented Jan 4, 2022

When a user updates a deployment, they may get the following confusing message.

In this example the user was updating the deployment through Akashlytics:

web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1

This happens because K8s won't destroy the old pod instance until it has ensured the new one has been created.
Since no node is available for the new pod, the deployment gets stuck with the old pod in "Running" and the new pod in "Pending" state.
Things will move on as soon as one of the nodes has enough CPU, RAM & disk for what the deployment requests.
This is how K8s works in order to prevent a service outage; however, the user should get a clearer message. Alternatively, the user could be given an option such as --force that destroys the previously running pod, which would probably be similar to a destroy & recreate method.

root@foxtrot:~# kubectl -n $NS get pods
NAME                   READY   STATUS    RESTARTS   AGE
web-69989588c7-2w5c4   1/1     Running   0          17h
web-6db9665ccb-92p4v   0/1     Pending   0          18m
root@foxtrot:~# kubectl -n $NS describe pods | grep -Ew "^Name:|cpu:"
Name:         web-69989588c7-2w5c4
      cpu:                10
      cpu:                10
Name:           web-6db9665ccb-92p4v
      cpu:                10
      cpu:                10
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.444] inventory fetched                            module=provider-cluster cmp=service cmp=inventory-service nodes=7
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=foxtrot.provider available-cpu="units:<val:\"6875\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"16263143424\" > " available-storage="quantity:<val:\"225335708095\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=golf.provider available-cpu="units:<val:\"125\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"30669094912\" > " available-storage="quantity:<val:\"880184186644\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id= available-cpu="units:<val:\"0\" > attributes:<key:\"arch\" > " available-memory="quantity:<val:\"0\" > " available-storage="quantity:<val:\"0\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.445] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=alpha.ingress available-cpu="units:<val:\"3625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"13253083136\" > " available-storage="quantity:<val:\"849673462802\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=bravo.ingress available-cpu="units:<val:\"8025\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"22012821504\" > " available-storage="quantity:<val:\"313339421714\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=charley.ingress available-cpu="units:<val:\"5625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"27698855936\" > " available-storage="quantity:<val:\"96585534905\" > "
Jan 04 12:08:23 foxtrot.provider start-provider.sh[349407]: D[2022-01-04|12:08:23.446] node resources                               module=provider-cluster cmp=service cmp=inventory-service node-id=delta.ingress available-cpu="units:<val:\"3625\" > attributes:<key:\"arch\" value:\"amd64\" > " available-memory="quantity:<val:\"31188185088\" > " available-storage="quantity:<val:\"880234408520\" > "
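
The numbers above already explain the Pending state, and the check can be sketched in a few lines of bash. This is only an illustration under two assumptions that the thread does not confirm: that the available-cpu values in the provider log are millicores, and that "cpu: 10" in the pod description means 10 cores (10000 millicores). The helper name fits_on_any_node is hypothetical and is not part of akash or kubectl.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if at least one node has the requested
# amount of CPU (in millicores) still available.
fits_on_any_node() {
  local requested=$1; shift
  local avail
  for avail in "$@"; do
    if (( avail >= requested )); then
      return 0
    fi
  done
  return 1
}

# "cpu: 10" from kubectl describe, ASSUMED to mean 10 cores = 10000 millicores.
requested=10000
# available-cpu values copied from the provider log above, ASSUMED to be millicores.
available=(6875 125 0 3625 8025 5625 3625)

if fits_on_any_node "$requested" "${available[@]}"; then
  result="schedulable"
else
  result="pending"
fi
echo "$result"
```

Under those assumptions no single node has enough free CPU for a second copy of the pod, which matches the "Insufficient cpu" scheduling warning while the old pod is still running.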
@arno01 arno01 changed the title from "akash tx update response when provider is out of resources" to "akash tx update response & behavior when provider is out of resources" Jan 4, 2022
@arno01
Author

arno01 commented Jan 4, 2022

cc @boz @dmikey

@88plug

88plug commented Jan 4, 2022

I am that user.

I filled up a provider's machines with some xmrig deployments to over a 90% fill rate. Two of them crashed, and when I went to redeploy a new image using the update button in Akashlytics, I was unable to:

web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1

@Strasser-Pablo

As a provider, I have seen the same situation with the same user: two deployments were created and could not be updated. A new replica set is created, but because of a lack of resources the new replica set never comes online. Note that there is no disruption of service, as the old replica set stays up.

@boz

boz commented Feb 21, 2022

Thanks all for the report. Interesting case. A few thoughts:

  • Some kind of optional --force with a clear message around it is a good suggestion.
  • Inventory "overcommit" can be reduced.
  • Inventory can always reserve double the largest deployed resources.

All of them have drawbacks, of course. In the meantime, for mining and other stateless workloads, I suggest closing the deployment and creating a new one if you hit the described scenario.

@Strasser-Pablo

I think making --force the default behavior is a good solution. I think the majority of deployments can be forced. For the rare cases where high availability is needed, a special option could be used, with the understanding that updates may be more difficult. Currently, as a provider, when I see a deployment stuck because of this (generally a miner), I just force delete the old replica set and make sure the deployment works again.

@dmikey

dmikey commented Aug 30, 2022

I've hit this today, bug still in place.

@andy108369
Contributor

andy108369 commented Jan 6, 2023

This is still happening with provider-services 0.1.0 and akash 0.20.0,
especially when the provider is packed.

[Warning] [FailedScheduling] [Pod] 0/5 nodes are available: 5 Insufficient cpu. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.

@andy108369 andy108369 transferred this issue from akash-network/node Jan 6, 2023
@andy108369 andy108369 transferred this issue from akash-network/provider Mar 9, 2023
@troian troian added the sev2 label Mar 15, 2023
@andy108369 andy108369 added P2 and removed P2 labels Mar 15, 2023
@andy108369
Contributor

andy108369 commented Mar 22, 2023

akash force new replicasets workaround

  1. Create /usr/local/bin/akash-force-new-replicasets.sh file
cat > /usr/local/bin/akash-force-new-replicasets.sh <<'EOF'
#!/bin/bash
#
# Version: 0.2 - 25 March 2023
# Files:
# - /usr/local/bin/akash-force-new-replicasets.sh
# - /etc/cron.d/akash-force-new-replicasets
#
# Description:
# This workaround goes through the newest deployments/replicasets, pods of which can't get deployed due to "insufficient resources" errors and it then removes the older replicasets leaving the newest (latest) one.
# This is only a workaround until a better solution to https://github.com/akash-network/support/issues/82 is found.
#

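# List every Akash-managed deployment as "<namespace> <name>", one per line.
kubectl get deployment -l akash.network/manifest-service -A -o=jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' |
  while read -r ns app; do
    # A rollout that does not complete within 10 seconds is treated as stuck.
    kubectl -n "$ns" rollout status --timeout=10s "deployment/${app}" >/dev/null 2>&1
    rc=$?
    if [[ $rc -ne 0 ]]; then
      # Only act when a pod in this namespace reports "Insufficient" resources.
      if kubectl -n "$ns" describe pods | grep -q "Insufficient"; then
        # Sort replicasets oldest-first, reverse the list, and skip the newest entry.
        OLD="$(kubectl -n "$ns" get replicaset -o json -l akash.network/manifest-service --sort-by='{.metadata.creationTimestamp}' | jq -r '(.items | reverse)[1:][] | .metadata.name')"
        for i in $OLD; do kubectl -n "$ns" delete replicaset "$i"; done
      fi
    fi
  done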
EOF
  2. Mark it as an executable file
chmod +x /usr/local/bin/akash-force-new-replicasets.sh
  3. Create the cron job /etc/cron.d/akash-force-new-replicasets to run the workaround every 5 minutes
cat > /etc/cron.d/akash-force-new-replicasets << 'EOF'
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SHELL=/bin/bash

*/5 * * * * root /usr/local/bin/akash-force-new-replicasets.sh
EOF
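
The selection step in the script ("every replicaset except the newest") can be tried without a cluster. The sketch below is not the script's own jq expression. It swaps in a plain sed filter over a list that is assumed to be sorted oldest-first, and the helper name older_replicasets is hypothetical. The sample names are the two replicasets inferred from the pod names earlier in this issue.

```shell
#!/bin/bash
# Hypothetical helper: reads replicaset names from stdin, ASSUMED to be sorted
# oldest-first, and prints every name except the last (newest) one.
older_replicasets() {
  sed '$d'
}

# Replicaset names inferred from the pods shown earlier:
# web-69989588c7 (17h old) and web-6db9665ccb (18m old).
old="$(printf '%s\n' web-69989588c7 web-6db9665ccb | older_replicasets)"
echo "$old"
```

Only the older replicaset is selected for deletion. With a single replicaset in the list, nothing is selected.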

@andy108369
Contributor

andy108369 commented Mar 22, 2023

@sterburg

Let users define what kind of SLA they want for their deployments.
If a short downtime during redeployment is acceptable, set the strategy to Recreate instead of RollingUpdate.
That way the old pod is taken down before the new pod is spun up.
If a high-availability SLA is needed, set the strategy to RollingUpdate (and replicas to 2+).
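
A minimal sketch of the Recreate option, assuming direct kubectl access to the provider cluster. The helper name set_recreate_strategy is hypothetical, and whether the Akash provider would keep such a manual change or overwrite it on the next manifest update is not established in this thread.

```shell
#!/bin/bash
# JSON merge patch: switch the strategy type to Recreate and null out the
# rollingUpdate settings, which Kubernetes does not allow together with Recreate.
recreate_patch='{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'

# Hypothetical helper; needs kubectl access to the provider cluster.
set_recreate_strategy() {
  local ns=$1 app=$2
  kubectl -n "$ns" patch deployment "$app" --type merge -p "$recreate_patch"
}

# Usage (not executed here): set_recreate_strategy "$NS" web
```

With Recreate, the old pod is terminated before the new one is created, so the update no longer needs room for both pods at the same time, at the cost of a short downtime.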

Another thing is EvictionPolicy:
You can have pods with a lower priorityClass evicted whenever a higher-priority pod gets scheduled or whenever the node is under eviction pressure (is full).
After pods are evicted (force deleted) and the node is freed up, new pods can be scheduled.
This way the disruption hits the lower-priority pods, not the highest-priority ones. And as soon as the temporary extra pod created during redeployment is gone, the evicted low-priority pod would fit back onto the node too.

And the last thing:
If a user buys a certain amount of resources, those resources should be reserved using a ResourceQuota.
This way we make sure there is always room left.

I am mentioning all this because a shell script run as a cron job doesn't really feel like the right "cloud-native" way of working.
Kubernetes has so many powerful features, especially for these types of scheduling use cases, that it is a waste not to do it the correct way.

The other thing is:
A professional provider would not let its cluster fill up completely and would always keep a margin. I would suggest making some margin mandatory,
and making capacity monitoring and management mandatory (guidelines?).

This issue should be an edge case, as node fill-ups should not happen.
The focus of this issue should be to prevent, not to remediate in hindsight.

@andy108369
Contributor

Thank you @sterburg for your observations; they all make total sense to me.
I've turned this into a discussion post here
