fix: resolve indefinitely queued (STOPPING_COMPLETED) trials #9605

Merged
merged 6 commits into main from carolinac/rm-257
Jul 18, 2024

Conversation

carolinaecalderon
Contributor

@carolinaecalderon carolinaecalderon commented Jul 3, 2024

Ticket

RM-368

Description

When a cluster restarts, it restarts running trials. For large experiments with several trials (such as a hyperparameter experiment), some of these restored trials remain in the STOPPING_COMPLETED state indefinitely instead of moving to COMPLETED. This PR fixes that bug.

I was able to reproduce this on AWS by killing the agent service, stopping the master service, and then restarting both. With my fix, the experiments resolved on their own.
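A check for this failure mode can be sketched as a polling helper that waits for every trial to reach a terminal state and fails loudly if any stay stuck. This is only an illustrative sketch, not the code in this PR: `wait_for_terminal_states`, the `get_states` callback, and the contents of `TERMINAL_STATES` are hypothetical names and assumptions, and only the STOPPING_COMPLETED and COMPLETED state names come from the description above.

```python
import time
from typing import Callable, Iterable, List

# Assumption: the states treated as terminal here are illustrative only.
TERMINAL_STATES = {"COMPLETED", "CANCELED", "ERROR"}


def wait_for_terminal_states(
    get_states: Callable[[], Iterable[str]],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll get_states() until every trial state is terminal.

    get_states is a hypothetical callback returning the current state of
    each trial. Raises TimeoutError if any trial is still non-terminal
    (for example, stuck in STOPPING_COMPLETED) once the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        states = list(get_states())
        if all(state in TERMINAL_STATES for state in states):
            return states
        if clock() >= deadline:
            stuck = [s for s in states if s not in TERMINAL_STATES]
            raise TimeoutError(f"trials still not terminal: {stuck}")
        sleep(interval)
```

In a restart test, `get_states` would query the trials of the experiment after the agent and master come back up; without the fix, a helper like this would time out on trials left in STOPPING_COMPLETED.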

Test Plan

See new e2e test. No additional testing needed.

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Jul 3, 2024

netlify bot commented Jul 3, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 6d109be
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66996394bd2d4f0008490e6f


codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 53.42%. Comparing base (e4a9ae3) to head (6d109be).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9605      +/-   ##
==========================================
- Coverage   53.44%   53.42%   -0.02%     
==========================================
  Files        1254     1254              
  Lines      152636   152633       -3     
  Branches     3268     3267       -1     
==========================================
- Hits        81572    81548      -24     
- Misses      70913    70934      +21     
  Partials      151      151              
Flag      Coverage Δ
backend   44.69% <83.33%> (-0.05%) ⬇️
harness   72.84% <ø> (ø)
web       51.81% <ø> (ø)

Files                      Coverage Δ
master/internal/trial.go   42.10% <83.33%> (+0.24%) ⬆️

... and 5 files with indirect coverage changes

@carolinaecalderon carolinaecalderon changed the title from "Carolinac/rm 257" to "fix: resolve indefinitely queued (STOPPING_COMPLETED) trials" Jul 16, 2024
@carolinaecalderon carolinaecalderon marked this pull request as ready for review July 16, 2024 22:23
@carolinaecalderon carolinaecalderon requested a review from a team as a code owner July 16, 2024 22:23
@carolinaecalderon carolinaecalderon requested review from ShreyaLnuHpe, ioga, a team and hamidzr and removed request for ShreyaLnuHpe and a team July 16, 2024 22:23
Contributor

@hamidzr hamidzr left a comment

When a cluster restarts, it restarts running trials

always? why? is this because we're not handling missed metrics, and progress reports?

master/internal/trial.go Outdated Show resolved Hide resolved
e2e_tests/tests/cluster/test_master_restart.py Outdated Show resolved Hide resolved
@carolinaecalderon
Contributor Author

When a cluster restarts, it restarts running trials

always? why? is this because we're not handling missed metrics, and progress reports?

Not sure what you mean by missed metrics/progress reports. In the case of a long-running HP experiment, when the cluster goes down or restarts partway through, uncompleted trials are re-allocated and restarted. I can see the trials restart in the master service logs and also in the WebUI.

)

# Kill the agent & master
restartable_managed_cluster.kill_agent()
Contributor

Does the order matter here, i.e., which one dies first?

Contributor Author

Yes. If you try to kill the agent after the master, it gives you an error (I can't remember what it is off the top of my head, though).
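The ordering constraint can be made explicit with a small helper. This is a rough sketch under stated assumptions, not the test code in this PR: only `kill_agent()` appears in the snippet under review, while `kill_master`, `restart_master`, `restart_agent`, `bounce_cluster`, and `RecordingCluster` are hypothetical names, and bringing the master back before the agent is also an assumption.

```python
from typing import List


class RecordingCluster:
    """Hypothetical stand-in for a managed cluster; it only records call order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def kill_agent(self) -> None:
        self.calls.append("kill_agent")

    def kill_master(self) -> None:
        self.calls.append("kill_master")

    def restart_master(self) -> None:
        self.calls.append("restart_master")

    def restart_agent(self) -> None:
        self.calls.append("restart_agent")


def bounce_cluster(cluster) -> None:
    """Take the cluster down and bring it back up in a fixed order."""
    # The agent is killed first: per the discussion above, killing it
    # after the master produces an error.
    cluster.kill_agent()
    cluster.kill_master()
    # Assumption: the master is restarted first so the agent can reconnect.
    cluster.restart_master()
    cluster.restart_agent()
```

Keeping the sequence in one function means a test cannot accidentally kill the master first.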

@carolinaecalderon carolinaecalderon merged commit c70dd8c into main Jul 18, 2024
114 of 120 checks passed
@carolinaecalderon carolinaecalderon deleted the carolinac/rm-257 branch July 18, 2024 21:14