feat: Framework Splitting and Bumpenvs #9457
✅ Deploy Preview for determined-ui canceled.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9457   +/-   ##
=======================================
  Coverage   48.96%   48.97%
=======================================
  Files        1234     1234
  Lines      159823   159827    +4
  Branches     2780     2781    +1
=======================================
+ Hits        78257    78271   +14
+ Misses      81391    81381   -10
  Partials      175      175
LGTM
This reverts commit 4af9bfc.
9f44b60 to 12bc478
lgtm, assuming all the tests pass 🙃
changes under master/
lgtm
\o/
Why are we using a daemonset to pull a Docker image onto each node?
The NGC+ images that we are moving to take a significant time to pull, which causes some unit tests to hang because they end up pulling these images at runtime. Using a daemonset to pull them first seems easier and more reliable.
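For readers unfamiliar with the pre-pull pattern described above, here is a minimal sketch of what such a DaemonSet manifest could look like, built as a Python dictionary. It is not the manifest from this PR: the function name, the DaemonSet name, and the image reference `example.invalid/ngc-plus-image:tag` are placeholders, and the `sh -c true` command assumes the image ships a shell.

```python
import json


def prepull_daemonset(name: str, image: str) -> dict:
    """Build a DaemonSet manifest whose only job is to pull `image` on every node.

    Hypothetical helper for illustration; names and image are placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # The init container makes the kubelet pull the large image,
                    # then exits immediately (assumes the image contains `sh`).
                    "initContainers": [
                        {
                            "name": "prepull",
                            "image": image,
                            "imagePullPolicy": "IfNotPresent",
                            "command": ["sh", "-c", "true"],
                        }
                    ],
                    # A tiny long-running container keeps the pod alive afterwards.
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


manifest = prepull_daemonset("prepull-task-image", "example.invalid/ngc-plus-image:tag")
print(json.dumps(manifest, indent=2))
```

Because a DaemonSet schedules one pod per node, every node ends up with the image cached before any test workload lands on it, which is the side effect the thread is discussing.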
Are there Docker credentials in the build process (in CircleCI) somewhere that could be used to pull the image more explicitly instead? The daemonset technically works, but it's being used more for its side effect of pulling images than for running specific workloads on all Kubernetes nodes.
Well, these images are public to begin with, so credentials wouldn't be necessary. We just need some mechanism to make sure that all pods pull this image.
If you specify it on whatever pod needs it, Kubernetes will fetch the image a first time. Subsequent workloads that use `imagePullPolicy: IfNotPresent` will then use the node-local version. Does that not work? I.e. without using a `DaemonSet`, and just using the image as usual. Also, are the subsequent unit tests using Kubernetes workloads specifically? Or just interacting directly with Docker?
The subsequent unit tests are just interacting with docker. I'm not sure if it's easier to enforce using specific pods rather than just making sure all pods have this image installed.
I'm still a bit confused. How are the unit tests running docker? Are the tests themselves running from inside pods and also interacting with docker there? (I.e. a "docker in docker" situation.) Or is something running `docker <blah>` from outside of Kubernetes, but still on the node VMs themselves?
Because if the tests are running inside K8s pods, then you shouldn't need the daemonset at all. The first pod to run the test that requires this image will pull it, then subsequent pods, if using `imagePullPolicy: IfNotPresent`, should use the existing image that K8s already has. That specific image pull policy is designed to do what the daemonset is doing, but only if everything is happening within K8s itself.
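The pull-policy behaviour referred to here can be sketched as a small simulation. This is a simplified model, not kubelet code: `attempts_pull` and `simulate` are hypothetical names, and the model ignores details such as digest checks and layer reuse under `Always`.

```python
def attempts_pull(policy: str, image_on_node: bool) -> bool:
    """Simplified model of whether the kubelet goes to the registry for a container."""
    if policy == "Always":
        # The registry is consulted every time (cached layers may still be reused).
        return True
    if policy == "IfNotPresent":
        # Pull only when the node has no local copy of the image.
        return not image_on_node
    if policy == "Never":
        # Only a local copy is ever used.
        return False
    raise ValueError(f"unknown imagePullPolicy: {policy}")


def simulate(node_per_pod: list[str], policy: str) -> list[bool]:
    """Return, for each pod in scheduling order, whether it triggers a pull."""
    cached_nodes: set[str] = set()
    pulls = []
    for node in node_per_pod:
        pulled = attempts_pull(policy, node in cached_nodes)
        if pulled:
            cached_nodes.add(node)
        pulls.append(pulled)
    return pulls


# Three pods: two land on node-a, one on node-b.
print(simulate(["node-a", "node-a", "node-b"], "IfNotPresent"))
# → [True, False, True]
```

The image cache is per node, so with `IfNotPresent` the first pod on each node still pays the full pull time; only later pods on that same node benefit.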
We use a `run-e2e-tests` command in CircleCI. That doesn't use k8s at all, but rather `pytest` with a remote master target. All it does is schedule jobs on a remote cluster, but it doesn't know whether it's a k8s cluster or a normal determined cluster.
When the tests are requested, the remote cluster schedules its own pods.
The particular tests that time out due to docker pulls do so because of a combination of two factors:
- The default image in the master config is not updated because this master is not being created/restarted.
- The test itself schedules another experiment, waits for that experiment to finish, then exits. Since this launches a new task, a new pod gets allocated and has to do a pull. The test itself has to wait for that pull to finish.
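To make the timeout mechanism concrete, here is a toy model of a test that waits on an experiment whose pod must first pull its image. Everything in it is hypothetical: `FakeCluster`, `wait_for_experiment`, the state names, and the durations are illustrative and are not Determined's API or measured values.

```python
class FakeCluster:
    """Toy cluster with a simulated clock; a task pulls its image, runs, then completes."""

    def __init__(self, pull_seconds: float, run_seconds: float) -> None:
        self.now = 0.0
        self.pull_done_at = pull_seconds
        self.finished_at = pull_seconds + run_seconds

    def sleep(self, seconds: float) -> None:
        # Advance simulated time instead of really sleeping.
        self.now += seconds

    def state(self) -> str:
        if self.now < self.pull_done_at:
            return "PULLING"
        if self.now < self.finished_at:
            return "RUNNING"
        return "COMPLETED"


def wait_for_experiment(cluster: FakeCluster, timeout_s: float, poll_s: float) -> str:
    """Poll until the experiment completes or the test's own deadline passes."""
    deadline = cluster.now + timeout_s
    while cluster.now < deadline:
        if cluster.state() == "COMPLETED":
            return "COMPLETED"
        cluster.sleep(poll_s)
    raise TimeoutError(f"experiment still {cluster.state()} after {timeout_s}s")


# Image already on the node: the test finishes well inside its deadline.
print(wait_for_experiment(FakeCluster(pull_seconds=0, run_seconds=60), 300, 10))

# Cold node: the pull alone outlasts the test's deadline.
try:
    wait_for_experiment(FakeCluster(pull_seconds=600, run_seconds=60), 300, 10)
except TimeoutError as err:
    print("timed out:", err)
```

In this model the test's deadline covers the pull as well as the run, so a slow pull on a cold node fails the test even though the experiment itself would have succeeded.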
Thanks for the context! I'm still a bit nervous, but I don't want to hold this up more without being able to devote more time to thinking about it, so I'll approve this.
Ticket
MD-410
Description
- Revert #9405
- Bumpenvs for NGC+ images
- Various fixes for slurm builds and runs
- Docs table changes
- Drop `--cuda` and `--cpu` from hpc launcher in CI
- Adds afw efs test update by @azhou-determined
Test Plan
CI passes
Checklist
docs/release-notes/. See Release Note for details.