
k8s_drain "Failed to delete pod" "Too many requests" #474

Closed
impsik opened this issue Jun 8, 2022 · 5 comments · Fixed by #606
Labels
type/bug (Something isn't working), jira

Comments


impsik commented Jun 8, 2022

SUMMARY

I try to drain Kubernetes nodes, do some patching, and uncordon those nodes, but sometimes it fails with
"msg": "Failed to delete pod POD NAME HERE due to: Too Many Requests"
It's usually a Longhorn pod.
And it's an empty cluster (1 etcd/CP node, 3 worker nodes), for testing only.
When I drain a node manually, it takes up to 2 minutes:

node/192.168.122.11 evicted

real	1m50.449s
user	0m1.128s
sys	0m0.645s
ISSUE TYPE
  • Bug Report
COMPONENT NAME

k8s_drain

ANSIBLE VERSION
ansible [core 2.12.6]

COLLECTION VERSION

# /usr/lib/python3/dist-packages/ansible_collections
Collection      Version
--------------- -------
kubernetes.core 2.3.1

CONFIGURATION
DEFAULT_HOST_LIST(/etc/ansible/ansible.cfg) = ['/home/username/hosts']
DEPRECATION_WARNINGS(/etc/ansible/ansible.cfg) = False

OS / ENVIRONMENT

NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"

STEPS TO REPRODUCE
- name: "Drain node {{ inventory_hostname|lower }}, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it."
      kubernetes.core.k8s_drain:
        state: drain
        name: "{{ inventory_hostname|lower }}"
        kubeconfig: ~/.kube/config
        delete_options:
          ignore_daemonsets: yes
          delete_emptydir_data: yes
          force: yes
          terminate_grace_period: 5
          wait_sleep: 20
      delegate_to: localhost

EXPECTED RESULTS

TASK [Drain node 192.168.122.11, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.] *****************************************************************************************
changed: [192.168.122.11 -> localhost]

ACTUAL RESULTS

The full traceback is:
File "/tmp/ansible_kubernetes.core.k8s_drain_payload_favq191w/ansible_kubernetes.core.k8s_drain_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_drain.py", line 324, in evict_pods
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7652, in create_namespaced_pod_eviction
return self.create_namespaced_pod_eviction_with_http_info(name, namespace, body, **kwargs) # noqa: E501
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7759, in create_namespaced_pod_eviction_with_http_info
return self.api_client.call_api(
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 275, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 234, in request
raise ApiException(http_resp=r)
fatal: [192.168.122.11 -> localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "api_key": null,
            "ca_cert": null,
            "client_cert": null,
            "client_key": null,
            "context": null,
            "delete_options": {
                "delete_emptydir_data": true,
                "disable_eviction": false,
                "force": true,
                "ignore_daemonsets": true,
                "terminate_grace_period": 5,
                "wait_sleep": 20,
                "wait_timeout": null
            },
            "host": null,
            "impersonate_groups": null,
            "impersonate_user": null,
            "kubeconfig": "/home/imre/.kube/config",
            "name": "192.168.122.11",
            "no_proxy": null,
            "password": null,
            "persist_config": null,
            "proxy": null,
            "proxy_headers": null,
            "state": "drain",
            "username": null,
            "validate_certs": null
        }
    },
    "msg": "Failed to delete pod longhorn-system/instance-manager-e-f5feaabb due to: Too Many Requests"
}



gravesm commented Jun 27, 2022

@impsik Thanks for filing the issue. The eviction API can sometimes return a 429 Too Many Requests status, especially if the eviction would violate the pod disruption budget. It's unclear if that's what is going on in your case or not. We should be retrying 429 responses here, and we aren't. As a workaround until this is fixed, you could try using Ansible's built-in retry logic.
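
A minimal sketch of that retry workaround, applied to the drain task from the report above (the register variable name and the retries/delay values are illustrative assumptions, not recommendations):

- name: "Drain node {{ inventory_hostname|lower }}"
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname|lower }}"
    kubeconfig: ~/.kube/config
    delete_options:
      ignore_daemonsets: yes
      delete_emptydir_data: yes
      force: yes
      terminate_grace_period: 5
      wait_sleep: 20
  delegate_to: localhost
  register: drain_result              # illustrative variable name
  until: drain_result is not failed   # keep retrying while eviction is rejected with 429
  retries: 5                          # illustrative value
  delay: 30                           # illustrative value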

gravesm added the type/bug and jira labels on Jun 27, 2022

impsik commented Jun 28, 2022

@gravesm My setup was (I blew it up):
3 etcd/cp nodes, 3 worker nodes. SSD disks.

Node config:
OS type and version: Ubuntu 20.04.4 LTS
CPU per node: 4
Memory per node: 8GB

I installed Longhorn, 3 replicas, through Rancher Apps.
I installed WordPress for testing LINK, and this created 2 PVCs: one for SQL and one for WordPress.
From the Longhorn docs I found that I also need to use --pod-selector='app!=csi-attacher,app!=csi-provisioner'.
However, the pod-selector option is not supported by kubernetes.core.k8s_drain.

$ kubectl get poddisruptionbudgets -n longhorn-system
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-5369c190   1               N/A               0                     19m
instance-manager-e-662fd89d   1               N/A               0                     13m
instance-manager-e-b944b669   1               N/A               0                     16m
instance-manager-r-26f65c39   1               N/A               0                     16m
instance-manager-r-6030e0e8   1               N/A               0                     13m
instance-manager-r-6314b679   1               N/A               0                     19m

Another workaround for me was to use the shell module, which works fine:
shell: kubectl drain {{ inventory_hostname|lower }} --ignore-daemonsets --delete-emptydir-data --force --pod-selector='app!=csi-attacher,app!=csi-provisioner' --kubeconfig ~/.kube/config
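
For reference, a minimal sketch of that workaround written out as a full task (the task name, the folded-scalar layout, and delegate_to are assumptions added here; the kubectl command itself is unchanged):

- name: Drain node with kubectl, skipping Longhorn CSI attacher/provisioner pods
  ansible.builtin.shell: >
    kubectl drain {{ inventory_hostname|lower }}
    --ignore-daemonsets --delete-emptydir-data --force
    --pod-selector='app!=csi-attacher,app!=csi-provisioner'
    --kubeconfig ~/.kube/config
  delegate_to: localhost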

@0Styless

I can also confirm this behavior. It would be nice if the kubernetes.core module could support the pod-selector configuration. Using the shell module is just a "dirty workaround".

@stephan2012

Same here: a few Pods on a Kubernetes control-plane node, no PDB on the Pods. Sometimes it does not evict any Pods.

The only reliable workaround for the moment is to revert to kubectl.

@pierreozoux
Contributor

Hi!

I use kubernetes.core 3.2.0 and I still have this issue:

- name: Mark node as unschedulable.
  delegate_to: localhost
  become: no
  kubernetes.core.k8s_drain:
    state: cordon
    name: "{{ inventory_hostname }}"

- name: Remove pg label see https://github.com/zalando/postgres-operator/issues/547#issuecomment-486308679
  delegate_to: localhost
  become: no
  kubernetes.core.k8s:
    kind: Node
    name: "{{ inventory_hostname }}"
    state: patched
    definition:
      metadata:
        labels:
          node.libre.sh/postgres: "false"

- name: Drain node even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.
  delegate_to: localhost
  become: no
  kubernetes.core.k8s_drain:
    state: drain
    delete_options:
      force: yes
      delete_emptydir_data: yes
      ignore_daemonsets: yes
    name: "{{ inventory_hostname }}"
TASK [drain : Drain node even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.] *************************************************************************************************************************************************************************************
fatal: [scw-prod-elastic-metal-paris2-xxx -> localhost]: FAILED! => {"changed": false, "msg": "Failed to delete pod xxx/prod--de-re-0 due to: Too Many Requests"}

I'd prefer to use this module, but maybe I have to fall back to shell?

thanks for your help!
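
As a possible middle ground, sketched here from the retry suggestion made earlier in this thread rather than as a confirmed fix, the drain task above could be retried instead of falling back to shell (the register name and retries/delay values are illustrative):

- name: Drain node even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.
  delegate_to: localhost
  become: no
  kubernetes.core.k8s_drain:
    state: drain
    delete_options:
      force: yes
      delete_emptydir_data: yes
      ignore_daemonsets: yes
    name: "{{ inventory_hostname }}"
  register: drain_result              # illustrative variable name
  until: drain_result is not failed   # retry while eviction fails with Too Many Requests
  retries: 10                         # illustrative value
  delay: 30                           # illustrative value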
