
feat: Framework Splitting and Bumpenvs #9457

Merged — 10 commits merged into main from revert_and_fix on Jun 12, 2024

Conversation

@MikhailKardash (Contributor) commented May 30, 2024

Ticket

MD-410

Description

  • Revert #9405
  • Bumpenvs for NGC+ images
  • Various fixes for Slurm builds and runs
  • Docs table changes
  • Drop --cuda and --cpu from the HPC launcher in CI
  • Add afw efs test update by @azhou-determined

Test Plan

CI passes

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot added the cla-signed label May 30, 2024
@determined-ci requested a review from a team May 30, 2024 21:21
@determined-ci added the documentation (Improvements or additions to documentation) label May 30, 2024

netlify bot commented May 30, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: 93b454e
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66687aa252eaf70008efd5d0


codecov bot commented May 30, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Project coverage is 48.97%. Comparing base (84299a6) to head (93b454e).
Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9457   +/-   ##
=======================================
  Coverage   48.96%   48.97%           
=======================================
  Files        1234     1234           
  Lines      159823   159827    +4     
  Branches     2780     2781    +1     
=======================================
+ Hits        78257    78271   +14     
+ Misses      81391    81381   -10     
  Partials      175      175           
Flag Coverage Δ
backend 43.70% <100.00%> (+0.01%) ⬆️
harness 64.00% <93.93%> (+0.01%) ⬆️
web 44.12% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
harness/determined/deploy/gcp/constants.py 100.00% <100.00%> (ø)
harness/determined/exec/launch.py 78.00% <ø> (ø)
...ness/tests/experiment/keras/test_tf_keras_trial.py 98.83% <100.00%> (+0.08%) ⬆️
harness/tests/launch/test_launch.py 100.00% <100.00%> (ø)
master/internal/config/provconfig/aws_config.go 11.20% <ø> (ø)
master/internal/config/provconfig/gcp_config.go 26.76% <100.00%> (ø)
harness/determined/core/_profiler.py 57.56% <0.00%> (ø)
harness/tests/experiment/pytorch/test_local.py 91.66% <91.66%> (ø)

... and 9 files with indirect coverage changes

@MikhailKardash marked this pull request as ready for review May 30, 2024 21:56
@MikhailKardash requested review from a team as code owners May 30, 2024 21:57
@tara-det-ai (Contributor) left a comment

LGTM

@azhou-determined (Contributor) left a comment

lgtm, assuming all the tests pass 🙃

@hamidzr (Contributor) left a comment

changes under master/ lgtm

@keita-determined (Contributor) left a comment

\o/

Contributor left a comment

Why are we using a daemonset to pull a Docker image onto each node?

@MikhailKardash (Contributor, Author) left a comment

The NGC+ images that we are moving to take a significant time to pull, which causes some unit tests to hang because they end up pulling these images at runtime. Using a daemonset to pull them first seems easier and more reliable.
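
For context, an image pre-pull DaemonSet of the kind described here is used purely for its side effect of caching a large image on every node. A minimal sketch is below; the names, namespace, and image tag are placeholders, not the actual manifest added in this PR.

```yaml
# Sketch of an image pre-pull DaemonSet (placeholder names and tags).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ngc-image-prepuller
  namespace: ci
spec:
  selector:
    matchLabels:
      app: ngc-image-prepuller
  template:
    metadata:
      labels:
        app: ngc-image-prepuller
    spec:
      initContainers:
        # The init container pulls the large NGC-based image onto the node, then exits.
        - name: prepull
          image: determinedai/environments:example-ngc-tag  # placeholder tag
          command: ["sh", "-c", "true"]
      containers:
        # A tiny long-running container keeps the pod alive so the DaemonSet stays healthy.
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
```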

Contributor left a comment

Are there Docker credentials in the build process (in CircleCI) somewhere that could be used to pull the image more explicitly instead? The daemonset technically works, but it's being used more for its side effect of pulling images than for running specific workloads on all Kubernetes nodes.

@MikhailKardash (Contributor, Author) left a comment

Well, these images are public to begin with, so credentials wouldn't be necessary. We just need some mechanism to make sure that all pods pull this image.

@davidfluck-hpe (Contributor) commented Jun 12, 2024

If you specify it on whatever pod needs it, Kubernetes will fetch the image a first time. Subsequent workloads that use imagePullPolicy: IfNotPresent will then use the node-local version. Does that not work? I.e. without using a DaemonSet, and just using the image as usual. Also, are the subsequent unit tests using Kubernetes workloads specifically? Or just interacting directly with Docker?
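
For reference, this is the kind of pod spec being described: a hypothetical workload pod that reuses the node-local copy of the image. The names and tag below are illustrative only.

```yaml
# Hypothetical workload pod: once the image has been pulled on a node,
# imagePullPolicy: IfNotPresent lets later pods on that node reuse the cached copy.
apiVersion: v1
kind: Pod
metadata:
  name: example-trial-pod
spec:
  restartPolicy: Never
  containers:
    - name: trial
      image: determinedai/environments:example-ngc-tag  # placeholder tag
      imagePullPolicy: IfNotPresent
      command: ["python", "-c", "print('ready')"]
```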

@MikhailKardash (Contributor, Author) left a comment

The subsequent unit tests are just interacting with docker. I'm not sure if it's easier to enforce using specific pods rather than just making sure all pods have this image installed.

@davidfluck-hpe (Contributor) commented Jun 12, 2024

I'm still a bit confused. How are the unit tests running docker? Are the tests themselves running from inside pods and also interacting with docker there? (I.e. a "docker in docker" situation.) Or is something running docker <blah> from outside of Kubernetes, but still on the node VMs themselves?

@davidfluck-hpe (Contributor) commented Jun 12, 2024

Because if the tests are running inside K8s pods, then you shouldn't need the daemonset at all. The first pod to run the test that requires this image will pull it, then subsequent pods, if using imagePullPolicy: IfNotPresent, should use the existing image that K8s already has. That specific image pull policy is designed to do what the daemonset is doing, but only if everything is happening within K8s itself.

@MikhailKardash (Contributor, Author) left a comment

We use a run-e2e-tests command in CircleCI. That doesn't use k8s at all, but rather pytest with a remote master target. All it does is schedule jobs on a remote cluster; it doesn't know whether it's a k8s cluster or a normal Determined cluster.
When the tests are requested, the remote cluster schedules its own pods.

The particular tests that time out due to Docker pulls do so because of a combination of two factors:

  1. The default image in the master config is not updated, because this master is not being created/restarted (the relevant config section is sketched below).
  2. The test itself schedules another experiment, waits for that experiment to finish, then exits. Since this launches a new task, a new pod gets allocated and has to do a pull. The test itself has to wait for that pull to finish.
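
For context on point 1: the default task image is set under task_container_defaults in the Determined master config, so a master that keeps running keeps serving whatever defaults it was started with. A minimal sketch of that section, with placeholder tags rather than the actual NGC+ images bumped in this PR:

```yaml
# Sketch of the relevant part of master.yaml; tags are placeholders,
# not the NGC+ images this PR bumps to.
task_container_defaults:
  image:
    cpu: determinedai/environments:py-3.9-example-cpu-tag
    cuda: determinedai/environments:cuda-11.8-example-gpu-tag
```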

Contributor left a comment

Thanks for the context! I'm still a bit nervous, but I don't want to hold this up more without being able to devote more time to thinking about it, so I'll approve this.

@MikhailKardash merged commit 8e9067b into main Jun 12, 2024
114 of 119 checks passed
@MikhailKardash deleted the revert_and_fix branch June 12, 2024 21:02

Labels: cla-signed, documentation (Improvements or additions to documentation)
9 participants