Gracefully drain EKS Worker Nodes whenever a node is terminated by an Auto Scaling Group or a Spot termination.
Deployable Lambda function with CloudWatch Event Rules and an IAM Role, enabled by adding a Lifecycle hook to any Auto Scaling Group in the same AWS Region.
Deployment of the Lambda, IAM Role and Cloudwatch Event Rules can be simplified with the SAM CLI and Docker.
SAM CLI can be installed differently depending on your OS, in general for Linux and MacOS however..
brew upgrade
brew update
brew tap aws/tap
brew install aws-sam-cli
sam --version
- Clone this repository
git clone https://github.com/dkeightley/eks-auto-drain.git
cd eks-auto-drain
- Optional: set your AWS region and create an S3 bucket
export AWS_DEFAULT_REGION=<region name>
aws s3 mb s3://<bucket name>
- Build, package and deploy the project with SAM
sam build --use-container
sam package --output-template-file packaged.yaml --s3-bucket <bucket name>
sam deploy --template-file packaged.yaml --stack-name eks-auto-drain --capabilities CAPABILITY_IAM
To provide RBAC permissions for the drain, an RBAC group that provides the specific permissions is needed. Once added the Lambda execution role can be mapped to the group in the aws-auth
ConfigMap for each EKS Cluster
kubectl apply -f rbac/
aws cloudformation describe-stacks --stack-name eks-auto-drain --query 'Stacks[0].Outputs[0].OutputValue'
Use an imperative action, like edit, to add to the ConfigMap to avoid merge conflicts
kubectl edit -n kube-system configmap aws-auth
Example:
mapRoles: |
- groups:
- eks-auto-drain-lambda
rolearn: <Lambda execution Role>
username: eks-auto-drain-lambda
Note: A heart beat timeout of 300s is provided here, adjust as needed, it will serve as an overall grace period before continuing with terminating the Node
Define a variable to loop through these, otherwise the below command can be used for each ASG
aws autoscaling put-lifecycle-hook --lifecycle-hook-name eks-auto-drain --lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING" --heartbeat-timeout 300 --default-result CONTINUE --auto-scaling-group-name <auto scaling group name>
OR
./put-lifecycle-hook.sh asg1 asg2 asg3
Obtain a list of instances in a Cluster:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,INSTANCE:.spec.providerID
Test by terminating an instance in an ASG
aws autoscaling terminate-instance-in-auto-scaling-group --no-should-decrement-desired-capacity --instance-id <instance id>
The Node should be cordoned, and drained of all Pods before termination. The Lambda function logs can provide output.
sam logs --name LambdaFunction --stack-name eks-auto-drain --tail
The provided event.json contains an invalid instance id so will fail, however, replace with a valid instance from your cluster to ensure the drain occurs
sam local invoke -e misc/event.json
kubectl delete -f rbac/
aws cloudformation delete-stack --stack-name eks-auto-drain
./delete-lifecycle-hook.sh asg1 asg2 asg3
- VPC support for private access