
Add wait for num pods #2074

Merged

godber merged 15 commits into master from add-waitForNumPods on Jul 31, 2020

Conversation

@godber (Member) commented Jul 23, 2020

This changes the teraslice job shutdown process to be more structured. Previously we just "simultaneously" deleted the worker deployment and execution controller job, which might leave us with the scenario described here:

#942 (comment)

To mitigate this we now do:

  • Delete the worker k8s deployment first
  • Wait for all worker pods to be gone
  • Delete the execution controller k8s job

Note that this happens in parallel with the other messaging-based shutdown activities; those processes will already have been messaged to shut down. This change is a matter of managing the k8s resources (and, by virtue of that, the Teraslice processes).
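A minimal sketch of the ordered shutdown described above, assuming a hypothetical k8s client wrapper; the method names, resource names, and label selector are illustrative, not the actual teraslice backend:

```js
// Illustrative sketch only: `k8s` is a hypothetical client wrapper; the method
// names, resource names, and label selector below are assumptions, not the
// real teraslice kubernetes backend API.
async function shutdownExecution(k8s, exId, timeoutMs = 60000) {
    // 1. Delete the worker k8s deployment first.
    await k8s.deleteDeployment(`ts-wkr-${exId}`);

    // 2. Wait for all worker pods belonging to this execution to be gone.
    const selector = `nodeType=worker,exId=${exId}`;
    const deadline = Date.now() + timeoutMs;
    let pods = await k8s.listPods(selector);
    while (pods.items.length > 0) {
        if (Date.now() > deadline) {
            throw new Error(`Timed out waiting for worker pods of ${exId} to exit`);
        }
        await new Promise((resolve) => setTimeout(resolve, 1000));
        pods = await k8s.listPods(selector);
    }

    // 3. Only then delete the execution controller k8s job.
    await k8s.deleteJob(`ts-exc-${exId}`);
}
```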

@codecov bot commented Jul 23, 2020

Codecov Report

Merging #2074 into master will decrease coverage by 0.19%.
The diff coverage is 36.17%.

@@            Coverage Diff             @@
##           master    #2074      +/-   ##
==========================================
- Coverage   76.00%   75.80%   -0.20%     
==========================================
  Files         397      397              
  Lines       15311    15345      +34     
  Branches     2553     2553              
==========================================
- Hits        11637    11633       -4     
- Misses       2970     3008      +38     
  Partials      704      704              
Impacted Files                                        | Coverage Δ
...ervices/cluster/backends/kubernetes/k8sResource.js | 99.02% <ø> (ø)
...luster/services/cluster/backends/kubernetes/k8s.js | 38.82% <28.57%> (-14.04%) ⬇️
...ster/services/cluster/backends/kubernetes/utils.js | 100.00% <100.00%> (ø)

godber and others added 4 commits July 27, 2020 16:48
godber marked this pull request as ready for review July 28, 2020 17:33
godber requested a review from macgyver603 July 28, 2020 17:33
@godber (Member, Author) commented Jul 29, 2020

I have some new changes I intend to make after talking things through with Zach:

  • add a try/catch around the wait for pod number
  • make the catches around both the worker delete and the wait for pod number NOT error (log only), so the execution controller job still gets deleted
  • add pRetry to the delete API call (see the sketch after this list)
  • make the waitForNumPods timeout be 15s + the pod stop timeout instead of hard-coded

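A hedged sketch of what wrapping the delete call with pRetry might look like; `k8sApi.deleteJob` and the retry count are illustrative assumptions, not the actual teraslice code:

```js
const pRetry = require('p-retry');

// Illustrative only: `k8sApi.deleteJob` is a hypothetical call; the real
// teraslice k8s backend may name and shape the delete differently.
async function deleteWithRetry(k8sApi, name) {
    return pRetry(() => k8sApi.deleteJob(name), {
        retries: 3,
        onFailedAttempt: (err) => {
            // Log each failed attempt and keep retrying until retries are exhausted.
            console.warn(`delete of ${name} failed (attempt ${err.attemptNumber}): ${err.message}`);
        },
    });
}
```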
godber and others added 4 commits July 30, 2020 15:51
I've added pRetry everywhere but POST, which should be safe.

I now drive the worker pod polling with the default shutdown timeout. I also link the worker deployment to the execution controller job. This allows k8s to garbage collect the worker deployment (and then its pods) when the execution controller is deleted. This is good in general but also critical to the error handling in deleteExecution.

closes #1612
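A minimal sketch of linking the worker deployment to the execution controller job via an ownerReference so Kubernetes can garbage collect it; the ownerReference fields are standard Kubernetes, but the function and object shapes here are assumptions, not the actual teraslice resource builder:

```js
// Illustrative only: `execControllerJob` is assumed to be the k8s Job object
// created for the execution controller. Making it the owner of the worker
// deployment lets Kubernetes garbage collect the deployment (and then its
// pods) when the execution controller job is deleted.
function linkWorkerToExecutionController(workerDeployment, execControllerJob) {
    workerDeployment.metadata.ownerReferences = [{
        apiVersion: 'batch/v1',
        kind: 'Job',
        name: execControllerJob.metadata.name,
        uid: execControllerJob.metadata.uid,
        controller: false,
        blockOwnerDeletion: false,
    }];
    return workerDeployment;
}
```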
@godber (Member, Author) commented Jul 31, 2020

This is ready for review again.

godber mentioned this pull request Jul 31, 2020
godber requested a review from peterdemartini July 31, 2020 19:44
@peterdemartini (Contributor) left a comment

LGTM. The error handling could use some improvements, but I don't think it will cause a significant issue, so we can defer it.

godber merged commit 52f42df into master on Jul 31, 2020
godber deleted the add-waitForNumPods branch on July 31, 2020 20:17