
feat: Framework Splitting and Bumpenvs #9457

Merged — 10 commits merged into main from revert_and_fix on Jun 12, 2024

Conversation

@MikhailKardash (Contributor) commented May 30, 2024

Ticket

MD-410

Description

  • Revert #9405
  • Bumpenvs for NGC+ images
  • Various fixes for Slurm builds and runs
  • Docs table changes
  • Drop --cuda and --cpu from the HPC launcher in CI
  • Add afw efs test update by @azhou-determined

Test Plan

CI passes

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot added the cla-signed label May 30, 2024
@determined-ci requested a review from a team May 30, 2024 21:21
@determined-ci added the documentation (Improvements or additions to documentation) label May 30, 2024

netlify bot commented May 30, 2024

Deploy Preview for determined-ui canceled.

🔨 Latest commit: 93b454e
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/66687aa252eaf70008efd5d0


codecov bot commented May 30, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Project coverage is 48.97%. Comparing base (84299a6) to head (93b454e).
Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9457   +/-   ##
=======================================
  Coverage   48.96%   48.97%           
=======================================
  Files        1234     1234           
  Lines      159823   159827    +4     
  Branches     2780     2781    +1     
=======================================
+ Hits        78257    78271   +14     
+ Misses      81391    81381   -10     
  Partials      175      175           
Flag Coverage Δ
backend 43.70% <100.00%> (+0.01%) ⬆️
harness 64.00% <93.93%> (+0.01%) ⬆️
web 44.12% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
harness/determined/deploy/gcp/constants.py 100.00% <100.00%> (ø)
harness/determined/exec/launch.py 78.00% <ø> (ø)
...ness/tests/experiment/keras/test_tf_keras_trial.py 98.83% <100.00%> (+0.08%) ⬆️
harness/tests/launch/test_launch.py 100.00% <100.00%> (ø)
master/internal/config/provconfig/aws_config.go 11.20% <ø> (ø)
master/internal/config/provconfig/gcp_config.go 26.76% <100.00%> (ø)
harness/determined/core/_profiler.py 57.56% <0.00%> (ø)
harness/tests/experiment/pytorch/test_local.py 91.66% <91.66%> (ø)

... and 9 files with indirect coverage changes

@MikhailKardash marked this pull request as ready for review May 30, 2024 21:56
@MikhailKardash requested review from a team as code owners May 30, 2024 21:57
@tara-det-ai (Contributor) left a comment

LGTM

@azhou-determined (Contributor) left a comment

lgtm, assuming all the tests pass 🙃

@hamidzr (Contributor) left a comment

changes under master/ lgtm

@keita-determined (Contributor) left a comment

\o/

Contributor left a comment

Why are we using a daemonset to pull a Docker image onto each node?

@MikhailKardash (Contributor, Author) left a comment

The NGC+ images that we are moving to take a significant time to pull, which causes some unit tests to hang because they end up pulling these images at runtime. Using a daemonset to pull them first seems easier and more reliable.
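
For context, an image pre-pull DaemonSet of the kind described here is used purely for its side effect of caching a large image on every node. A minimal sketch is below; the names, namespace, and image tag are placeholders, not the actual manifest added in this PR.

```yaml
# Sketch of an image pre-pull DaemonSet (placeholder names and tags).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ngc-image-prepuller
  namespace: ci
spec:
  selector:
    matchLabels:
      app: ngc-image-prepuller
  template:
    metadata:
      labels:
        app: ngc-image-prepuller
    spec:
      initContainers:
        # The init container pulls the large NGC-based image onto the node, then exits.
        - name: prepull
          image: determinedai/environments:example-ngc-tag  # placeholder tag
          command: ["sh", "-c", "true"]
      containers:
        # A tiny long-running container keeps the pod alive so the DaemonSet stays healthy.
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
```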

Contributor left a comment

Are there Docker credentials in the build process (in CircleCI) somewhere that could be used to pull the image more explicitly instead? The daemonset technically works, but it's being used more for its side effect of pulling images than for running specific workloads on all Kubernetes nodes.

@MikhailKardash (Contributor, Author) left a comment

Well, these images are public to begin with, so credentials wouldn't be necessary. We just need some mechanism to make sure that all pods pull this image.

@davidfluck-hpe (Contributor) commented Jun 12, 2024

If you specify it on whatever pod needs it, Kubernetes will fetch the image a first time. Subsequent workloads that use imagePullPolicy: IfNotPresent will then use the node-local version. Does that not work? I.e. without using a DaemonSet, and just using the image as usual. Also, are the subsequent unit tests using Kubernetes workloads specifically? Or just interacting directly with Docker?
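
For reference, this is the kind of pod spec being described: a hypothetical workload pod that reuses the node-local copy of the image. The names and tag below are illustrative only.

```yaml
# Hypothetical workload pod: once the image has been pulled on a node,
# imagePullPolicy: IfNotPresent lets later pods on that node reuse the cached copy.
apiVersion: v1
kind: Pod
metadata:
  name: example-trial-pod
spec:
  restartPolicy: Never
  containers:
    - name: trial
      image: determinedai/environments:example-ngc-tag  # placeholder tag
      imagePullPolicy: IfNotPresent
      command: ["python", "-c", "print('ready')"]
```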

@MikhailKardash (Contributor, Author) left a comment

The subsequent unit tests are just interacting with docker. I'm not sure if it's easier to enforce using specific pods rather than just making sure all pods have this image installed.

@davidfluck-hpe (Contributor) commented Jun 12, 2024

I'm still a bit confused. How are the unit tests running docker? Are the tests themselves running from inside pods and also interacting with docker there? (I.e. a "docker in docker" situation.) Or is something running docker <blah> from outside of Kubernetes, but still on the node VMs themselves?

@davidfluck-hpe (Contributor) commented Jun 12, 2024

Because if the tests are running inside K8s pods, then you shouldn't need the daemonset at all. The first pod to run the test that requires this image will pull it, then subsequent pods, if using imagePullPolicy: IfNotPresent, should use the existing image that K8s already has. That specific image pull policy is designed to do what the daemonset is doing, but only if everything is happening within K8s itself.

@MikhailKardash (Contributor, Author) left a comment

We use a run-e2e-tests command in CircleCI. That doesn't use k8s at all, but rather pytest with a remote master target. All it does is schedule jobs on a remote cluster; it doesn't know whether it's a k8s cluster or a normal Determined cluster.
When the tests are requested, the remote cluster schedules its own pods.

The particular tests that time out due to Docker pulls do so because of a combination of two factors:

  1. The default image in the master config is not updated, because this master is not being created/restarted (the relevant config section is sketched below).
  2. The test itself schedules another experiment, waits for that experiment to finish, then exits. Since this launches a new task, a new pod gets allocated and has to do a pull. The test itself has to wait for that pull to finish.
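
For context on point 1: the default task image is set under task_container_defaults in the Determined master config, so a master that keeps running keeps serving whatever defaults it was started with. A minimal sketch of that section, with placeholder tags rather than the actual NGC+ images bumped in this PR:

```yaml
# Sketch of the relevant part of master.yaml; tags are placeholders,
# not the NGC+ images this PR bumps to.
task_container_defaults:
  image:
    cpu: determinedai/environments:py-3.9-example-cpu-tag
    cuda: determinedai/environments:cuda-11.8-example-gpu-tag
```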

Contributor left a comment

Thanks for the context! I'm still a bit nervous, but I don't want to hold this up more without being able to devote more time to thinking about it, so I'll approve this.

@MikhailKardash merged commit 8e9067b into main Jun 12, 2024
114 of 119 checks passed
@MikhailKardash deleted the revert_and_fix branch June 12, 2024 21:02

Labels: cla-signed, documentation (Improvements or additions to documentation)
9 participants