feat: convert k8s submissions from pods to jobs (#9296) #9438
Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:
- Jobs play nice with resource quotas and other Kubernetes features out of the box.
- Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
- They allow us to integrate with kueue immediately.
- If we want to support VolcanoJobs we are much closer (and it is easier to maintain Job+VolcanoJob than Pods+VolcanoJob).

This change also contains some unrelated CI QoL changes I found useful while working.
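For readers unfamiliar with the shape of this change, here is a minimal sketch of what "one job per task" could look like as a `batch/v1` Job manifest, built as a plain Python dict. It is not the code in this PR: the function name `build_task_job`, the label key `example.com/task-id`, the `task-` name prefix, and the choice of `backoffLimit: 0` are all assumptions made for illustration. The fields `parallelism`, `completions`, `backoffLimit`, and `suspend` are standard Job spec fields, and `kueue.x-k8s.io/queue-name` is the label Kueue reads to pick a queue.

```python
import json


def build_task_job(task_id, image, num_pods, queue_name=None, suspend=False):
    """Build a batch/v1 Job manifest (plain dict) for one hypothetical task.

    Illustrative only: names and label keys here are assumptions, not the
    ones used by the Determined resource manager.
    """
    # Hypothetical label key used to tie the Job and its pods to a task.
    task_labels = {"example.com/task-id": task_id}

    job_labels = dict(task_labels)
    if queue_name is not None:
        # Label that Kueue uses to associate a Job with a LocalQueue.
        job_labels["kueue.x-k8s.io/queue-name"] = queue_name

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task_id}", "labels": job_labels},
        "spec": {
            # One Job fans out to all pods the task needs,
            # instead of the submitter creating each pod itself.
            "parallelism": num_pods,
            "completions": num_pods,
            # Assumption: no Job-level retries in this sketch.
            "backoffLimit": 0,
            # suspend=True asks the Job controller not to create pods yet.
            "suspend": suspend,
            "template": {
                "metadata": {"labels": task_labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "task", "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_task_job("abc123", "example/image:latest", 4, queue_name="team-a")
    print(json.dumps(job, indent=2))
```

The point of the sketch is that the pod count becomes a property of a single object (`parallelism`/`completions`), so quota accounting, suspension, and queueing systems can act on the task as a whole rather than on individual pods.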
As of 66f9d1e, CI is very happy, so this should be a good starting point for any last-minute refactors.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9438 +/- ##
==========================================
+ Coverage 48.60% 49.03% +0.43%
==========================================
Files 1233 1233
Lines 158981 159202 +221
Branches 2778 2778
==========================================
+ Hits 77271 78063 +792
+ Misses 81536 80965 -571
Partials 174 174
LGTM; I've already approved all the commits individually
python file lgtm
LGTM
This change updates our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:
This commit is the result of several PRs, enumerated here for easier discovery.
Ticket
Description
This is a feature branch. It already contains the core of the change, but we'll be merging release notes, docs, extra tests, and maybe some more adjustments (nothing major) before it lands later this week.
Test Plan
Checklist
docs/release-notes/. See Release Note for details.