Taint nodes on spot and scheduled events #162

Merged 12 commits on May 18, 2020

diversario
Contributor

Issue #, if available: #160

Description of changes:

Upon receiving an event, taint the affected node with the appropriate taint. Remove the taint if the event is cancelled or completed.
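
The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `EventKind`, `EventState`, and `TaintAction` types, the `decide` function, and the `example.com/...` taint keys are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// EventKind and EventState are hypothetical stand-ins for the handler's
// interruption event types; they are not taken from the NTH source.
type EventKind string
type EventState string

const (
	SpotITN              EventKind = "spot-itn"
	ScheduledMaintenance EventKind = "scheduled-maintenance"

	Active    EventState = "active"
	Canceled  EventState = "canceled"
	Completed EventState = "completed"
)

// TaintAction describes what should happen to the node's taint for an event.
type TaintAction struct {
	Add bool   // true: add the taint; false: remove it
	Key string // illustrative taint key, see taintKeyFor
}

// taintKeyFor builds an illustrative (assumed) taint key per event kind.
func taintKeyFor(kind EventKind) string {
	return "example.com/" + string(kind)
}

// decide maps an event to a taint action: taint while the event is active,
// remove the taint once the event is canceled or completed.
func decide(kind EventKind, state EventState) TaintAction {
	switch state {
	case Active:
		return TaintAction{Add: true, Key: taintKeyFor(kind)}
	default: // Canceled, Completed, or anything unknown: do not keep the taint
		return TaintAction{Add: false, Key: taintKeyFor(kind)}
	}
}

func main() {
	fmt.Println(decide(SpotITN, Active))
	fmt.Println(decide(ScheduledMaintenance, Canceled))
}
```

Using one taint key per event kind keeps removal simple in this sketch: cancelling a scheduled event does not touch a taint left by a spot interruption notice.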

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Fixes aws#160.

Signed-off-by: Ilya Shaisultanov <[email protected]>
@diversario
Contributor Author

@bwagner5 I see the build failed, though not all of the tests did. I'm unsure whether that is related to my changes here. Do you know how I could debug those?

@bwagner5
Contributor

@diversario I believe it's because the cluster role does not have taint permissions. https://github.com/aws/aws-node-termination-handler/blob/master/config/helm/aws-node-termination-handler/templates/clusterrole.yaml

You can run the e2e tests locally:

make e2e-test

or all tests:

make test


Also, looks like some small formatting issues:

❌ goimports found a problem in go source files. See above for the files with problems.
Grade: A+ (92.6%)
Files: 24
Issues: 5
gofmt: 83%
	pkg/interruptionevent/spot-itn-event_internal_test.go
		Line 1: warning: file is not gofmted with -s (gofmt)
	pkg/node/node.go
		Line 1: warning: file is not gofmted with -s (gofmt)
	pkg/interruptionevent/scheduled-event_internal_test.go
		Line 1: warning: file is not gofmted with -s (gofmt)
	pkg/interruptionevent/spot-itn-event.go
		Line 1: warning: file is not gofmted with -s (gofmt)
go_vet: 100%
gocyclo: 95%
	test/ec2-metadata-test-proxy/cmd/ec2-metadata-test-proxy.go
		Line 100: warning: cyclomatic complexity 25 of function handleRequest() is high (> 15) (gocyclo)
golint: 95%
	pkg/node/node.go
		Line 272: warning: exported method Node.TaintSpotItn should have comment or be unexported (golint)
		Line 281: warning: exported method Node.TaintScheduledMaintenance should have comment or be unexported (golint)
		Line 290: warning: exported method Node.CleanAllTaints should have comment or be unexported (golint)
ineffassign: 91%
	pkg/interruptionevent/scheduled-event_internal_test.go
		Line 82: warning: ineffectual assignment to err (ineffassign)
	pkg/interruptionevent/spot-itn-event_internal_test.go
		Line 59: warning: ineffectual assignment to err (ineffassign)
license: 100%
misspell: 100%

@bwagner5 bwagner5 self-requested a review May 15, 2020 14:39
Contributor

@bwagner5 bwagner5 left a comment

Add taint permissions to the cluster role, then run goimports -w ./ && gofmt -s -w ./

If all the tests except helm-sync pass, I can take the action to fix that. It's just a tedious task of syncing the helm chart to the aws/eks-charts repo.

@diversario
Contributor Author

Is it possible for me to see the build logs? I don't think it's a permissions issue because NTH can already add labels to nodes. I checked cluster-autoscaler clusterrole and it looks like this:

- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - watch
  - list
  - get
  - update

which looks pretty much like NTH's:

- apiGroups:
    - ""
  resources:
    - nodes
  verbs:
    - get
    - patch
    - update

I'm currently trying to get some view into what's failing from the e2e run.
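
The comparison of the two rules quoted above can be made concrete with a small check. This is only a sketch: `PolicyRule` here is a simplified local stand-in rather than the Kubernetes RBAC type, `allows` is a hypothetical helper, and wildcard rules are not modelled. Since taints are stored in the node's spec, changing them goes through `update` or `patch` on `nodes`.

```go
package main

import "fmt"

// PolicyRule is a simplified, hypothetical stand-in for an RBAC rule.
// It models only the three fields quoted in the comment above.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether want is present in list.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

// allows reports whether the rule grants verb on resource in the given
// API group. Wildcards ("*") are deliberately not handled in this sketch.
func (r PolicyRule) allows(group, resource, verb string) bool {
	return contains(r.APIGroups, group) &&
		contains(r.Resources, resource) &&
		contains(r.Verbs, verb)
}

func main() {
	autoscaler := PolicyRule{
		APIGroups: []string{""},
		Resources: []string{"nodes"},
		Verbs:     []string{"watch", "list", "get", "update"},
	}
	nth := PolicyRule{
		APIGroups: []string{""},
		Resources: []string{"nodes"},
		Verbs:     []string{"get", "patch", "update"},
	}
	for _, verb := range []string{"get", "update", "patch"} {
		fmt.Printf("%-6s autoscaler=%v nth=%v\n", verb,
			autoscaler.allows("", "nodes", verb),
			nth.allows("", "nodes", verb))
	}
}
```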

@codecov-io

Codecov Report

Merging #162 into master will decrease coverage by 10.42%.
The diff coverage is 31.14%.

@@             Coverage Diff             @@
##           master     #162       +/-   ##
===========================================
- Coverage   94.64%   84.22%   -10.43%     
===========================================
  Files           8        8               
  Lines         635      748      +113     
===========================================
+ Hits          601      630       +29     
- Misses         22       99       +77     
- Partials       12       19        +7     
Impacted Files                              Coverage Δ
pkg/node/node.go                            66.77% <23.07%> (-22.45%) ⬇️
pkg/interruptionevent/scheduled-event.go    90.47% <71.42%> (-2.86%) ⬇️
pkg/interruptionevent/spot-itn-event.go     93.75% <81.81%> (-6.25%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b647783...1011bd2.

@diversario diversario marked this pull request as ready for review May 17, 2020 10:08
@diversario diversario requested a review from bwagner5 May 17, 2020 10:08
@@ -363,3 +434,112 @@ func jsonPatchEscape(value string) string {
 	value = strings.Replace(value, "~", "~0", -1)
 	return strings.Replace(value, "/", "~1", -1)
 }
+
+func addTaint(node *corev1.Node, nth Node, taintKey string, taintValue string, effect corev1.TaintEffect) error {
Contributor Author

@diversario diversario May 17, 2020

This, addTaintToSpec and cleanTaint are lifted from cluster-autoscaler with minor modifications.
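
For readers unfamiliar with those helpers, the in-memory part of adding and cleaning a taint might look roughly like the sketch below. It is not the code from this PR or from cluster-autoscaler: `Taint` and `Node` are minimal local stand-ins for the `corev1` types so that the example compiles on its own, and the signatures, taint key, and node name are assumptions. Writing the modified node back through the API server is omitted.

```go
package main

import "fmt"

// Taint and Node are minimal local stand-ins for corev1.Taint and
// corev1.Node, so the sketch compiles without the Kubernetes modules.
type Taint struct {
	Key, Value, Effect string
}

type Node struct {
	Name   string
	Taints []Taint
}

// addTaintToSpec appends the taint unless one with the same key is
// already present. It reports whether the node was changed.
func addTaintToSpec(node *Node, key, value, effect string) bool {
	for _, t := range node.Taints {
		if t.Key == key {
			return false // already tainted, nothing to do
		}
	}
	node.Taints = append(node.Taints, Taint{Key: key, Value: value, Effect: effect})
	return true
}

// cleanTaint removes every taint with the given key and reports whether
// anything was removed.
func cleanTaint(node *Node, key string) bool {
	kept := make([]Taint, 0, len(node.Taints))
	for _, t := range node.Taints {
		if t.Key != key {
			kept = append(kept, t)
		}
	}
	changed := len(kept) != len(node.Taints)
	node.Taints = kept
	return changed
}

func main() {
	n := &Node{Name: "worker-1"}
	fmt.Println(addTaintToSpec(n, "example.com/spot-itn", "event-1", "NoSchedule"), n.Taints)
	fmt.Println(addTaintToSpec(n, "example.com/spot-itn", "event-1", "NoSchedule"), n.Taints)
	fmt.Println(cleanTaint(n, "example.com/spot-itn"), n.Taints)
}
```

Returning a "changed" flag lets a caller skip the API write when the node is already in the desired state, which keeps repeated event deliveries cheap.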

@@ -80,7 +94,9 @@ helm upgrade --install $CLUSTER_NAME-emtp $SCRIPTPATH/../../config/helm/ec2-meta
   --set ec2MetadataTestProxy.enableScheduledMaintenanceEvents="true" \
   --set ec2MetadataTestProxy.enableSpotITN="false" \
   --set ec2MetadataTestProxy.scheduledEventStatus="canceled" \
-  --set ec2MetadataTestProxy.port="$IMDS_PORT"
+  --set ec2MetadataTestProxy.port="$IMDS_PORT" \
Contributor Author

Since the worker node is now tainted, the proxy won't schedule there unless it tolerates the taint.
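
The scheduling rule behind this can be sketched as a simplified toleration check. The types below are local stand-ins rather than the `corev1` ones, `tolerates` is a hypothetical helper that follows the documented Kubernetes matching semantics in reduced form, and the taint key is an assumed example value.

```go
package main

import "fmt"

// Taint and Toleration are simplified local stand-ins for the corev1 types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string // empty key with operator "Exists" matches every key
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string // empty effect matches every effect
}

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true // value is ignored
	case "", "Equal":
		return tol.Value == taint.Value
	default:
		return false
	}
}

// schedulable reports whether every taint on a node is matched by at
// least one of the pod's tolerations.
func schedulable(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		matched := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "example.com/scheduled-maintenance", Value: "event-1", Effect: "NoSchedule"}}

	fmt.Println(schedulable(nil, taints))                                  // pod without tolerations
	fmt.Println(schedulable([]Toleration{{Operator: "Exists"}}, taints)) // pod tolerating everything
}
```

A catch-all `Exists` toleration is the bluntest option; a test proxy could instead tolerate only the specific key the handler applies.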

Contributor

@bwagner5 bwagner5 left a comment

nice work on this! I just had a few comments in-line.

@diversario diversario requested a review from bwagner5 May 18, 2020 12:03
Contributor

@bwagner5 bwagner5 left a comment

LGTM, thanks @diversario ! 🚀

@bwagner5 bwagner5 merged commit 0d528d6 into aws:master May 18, 2020
@diversario diversario deleted the diversario-160-taint-nodes-on-events branch May 19, 2020 06:55