
Add wait for num pods #2074

Merged

godber merged 15 commits into master from add-waitForNumPods on Jul 31, 2020

Conversation

@godber (Member) commented Jul 23, 2020

This changes the teraslice job shutdown process to be more structured. Previously we just "simultaneously" deleted the worker deployment and execution controller job, which might leave us with the scenario described here:

#942 (comment)

To mitigate this we now do:

  • Delete the worker k8s deployment first
  • Wait for all worker pods to be gone
  • Delete the execution controller k8s job

Note that this happens in parallel with the other messaging-based shutdown activities; those processes will already have been messaged to shut down. This change is a matter of managing the k8s resources (and, by virtue of that, the Teraslice processes).
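A minimal sketch of the ordered shutdown described above, assuming a hypothetical k8s client wrapper; the method names, resource names, and label selector are illustrative, not the actual teraslice backend:

```js
// Illustrative sketch only: `k8s` is a hypothetical client wrapper; the method
// names, resource names, and label selector below are assumptions, not the
// real teraslice kubernetes backend API.
async function shutdownExecution(k8s, exId, timeoutMs = 60000) {
    // 1. Delete the worker k8s deployment first.
    await k8s.deleteDeployment(`ts-wkr-${exId}`);

    // 2. Wait for all worker pods belonging to this execution to be gone.
    const selector = `nodeType=worker,exId=${exId}`;
    const deadline = Date.now() + timeoutMs;
    let pods = await k8s.listPods(selector);
    while (pods.items.length > 0) {
        if (Date.now() > deadline) {
            throw new Error(`Timed out waiting for worker pods of ${exId} to exit`);
        }
        await new Promise((resolve) => setTimeout(resolve, 1000));
        pods = await k8s.listPods(selector);
    }

    // 3. Only then delete the execution controller k8s job.
    await k8s.deleteJob(`ts-exc-${exId}`);
}
```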

@codecov bot commented Jul 23, 2020

Codecov Report

Merging #2074 into master will decrease coverage by 0.19%.
The diff coverage is 36.17%.

@@            Coverage Diff             @@
##           master    #2074      +/-   ##
==========================================
- Coverage   76.00%   75.80%   -0.20%     
==========================================
  Files         397      397              
  Lines       15311    15345      +34     
  Branches     2553     2553              
==========================================
- Hits        11637    11633       -4     
- Misses       2970     3008      +38     
  Partials      704      704              
Impacted Files                                        | Coverage Δ
...ervices/cluster/backends/kubernetes/k8sResource.js | 99.02% <ø> (ø)
...luster/services/cluster/backends/kubernetes/k8s.js | 38.82% <28.57%> (-14.04%) ⬇️
...ster/services/cluster/backends/kubernetes/utils.js | 100.00% <100.00%> (ø)

godber and others added 4 commits July 27, 2020 16:48
godber marked this pull request as ready for review July 28, 2020 17:33
godber requested a review from macgyver603 July 28, 2020 17:33
@godber (Member, Author) commented Jul 29, 2020

I have some new changes I intend to make after talking things through with Zach:

  • add a try/catch around the wait for pod number
  • make the catches around both the worker delete and the wait for pod number NOT error (log only), so the execution controller job still gets deleted
  • add pRetry to the delete API call (see the sketch after this list)
  • make the waitForNumPods timeout be 15s + the pod stop timeout instead of hard-coded

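A hedged sketch of what wrapping the delete call with pRetry might look like; `k8sApi.deleteJob` and the retry count are illustrative assumptions, not the actual teraslice code:

```js
const pRetry = require('p-retry');

// Illustrative only: `k8sApi.deleteJob` is a hypothetical call; the real
// teraslice k8s backend may name and shape the delete differently.
async function deleteWithRetry(k8sApi, name) {
    return pRetry(() => k8sApi.deleteJob(name), {
        retries: 3,
        onFailedAttempt: (err) => {
            // Log each failed attempt and keep retrying until retries are exhausted.
            console.warn(`delete of ${name} failed (attempt ${err.attemptNumber}): ${err.message}`);
        },
    });
}
```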
godber and others added 4 commits July 30, 2020 15:51
I've added pRetry everywhere but POST, which should be safe.

I now drive the worker pod polling with the default shutdown timeout. I also link the worker deployment to the execution controller job. This allows k8s to garbage collect the worker deployment (and then its pods) when the execution controller is deleted. This is good in general but also critical to the error handling in deleteExecution.

closes #1612
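A minimal sketch of linking the worker deployment to the execution controller job via an ownerReference so Kubernetes can garbage collect it; the ownerReference fields are standard Kubernetes, but the function and object shapes here are assumptions, not the actual teraslice resource builder:

```js
// Illustrative only: `execControllerJob` is assumed to be the k8s Job object
// created for the execution controller. Making it the owner of the worker
// deployment lets Kubernetes garbage collect the deployment (and then its
// pods) when the execution controller job is deleted.
function linkWorkerToExecutionController(workerDeployment, execControllerJob) {
    workerDeployment.metadata.ownerReferences = [{
        apiVersion: 'batch/v1',
        kind: 'Job',
        name: execControllerJob.metadata.name,
        uid: execControllerJob.metadata.uid,
        controller: false,
        blockOwnerDeletion: false,
    }];
    return workerDeployment;
}
```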
@godber (Member, Author) commented Jul 31, 2020

This is ready for review again.

godber mentioned this pull request Jul 31, 2020
godber requested a review from peterdemartini July 31, 2020 19:44
@peterdemartini (Contributor) left a comment

LGTM. The error handling could use some improvements, but I don't think it will cause a significant issue, so we can defer it.

godber merged commit 52f42df into master on Jul 31, 2020
godber deleted the add-waitForNumPods branch on July 31, 2020 20:17