fix: resolve indefinitely queued (STOPPING_COMPLETED) trials #9605
Codecov Report
Coverage Diff, main vs. #9605:
            main      #9605     +/-
Coverage    53.44%    53.42%    -0.02%
Files       1254      1254
Lines       152636    152633    -3
Branches    3268      3267      -1
Hits        81572     81548     -24
Misses      70913     70934     +21
Partials    151       151
e881c70 to b575482
"When a cluster restarts, it restarts running trials"
Always? Why? Is this because we're not handling missed metrics and progress reports?
I'm not sure what you mean by missed metrics and progress reports. In a long-running HP experiment, when the cluster goes down or restarts partway through, uncompleted trials are re-allocated and restarted. I can see the trials restart in the master service logs and in the WebUI.
)

# Kill the agent & master
restartable_managed_cluster.kill_agent()
Does the order matter here, i.e. which one dies first?
Yes. If you try to kill the agent after the master, you get an error (I can't remember what it is off the top of my head, though).
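The ordering constraint from this thread can be sketched as below. This is only an illustration, not the actual e2e test: `FakeCluster` is a hypothetical stand-in for the `restartable_managed_cluster` fixture, only `kill_agent()` appears in the diff, and `kill_master`, `restart_master`, `restart_agent`, the restart order, and the placeholder error are all assumptions made for the example.

```python
class FakeCluster:
    """Hypothetical stand-in for the restartable cluster fixture."""

    def __init__(self):
        self.master_up = True
        self.agent_up = True
        self.events = []  # records the order of operations

    def kill_agent(self):
        # Assumption: models the reply above, where killing the agent
        # after the master produces an error. The real error is not
        # recorded in this thread, so this one is a placeholder.
        if not self.master_up:
            raise RuntimeError("agent killed after master (placeholder error)")
        self.agent_up = False
        self.events.append("kill_agent")

    def kill_master(self):
        self.master_up = False
        self.events.append("kill_master")

    def restart_master(self):
        self.master_up = True
        self.events.append("restart_master")

    def restart_agent(self):
        self.agent_up = True
        self.events.append("restart_agent")


def bounce_cluster(cluster):
    """Kill the agent first, then the master, then bring both back."""
    cluster.kill_agent()
    cluster.kill_master()
    # Assumption: the master comes back before the agent reconnects.
    cluster.restart_master()
    cluster.restart_agent()
```

Wrapping the sequence in one helper keeps the agent-before-master order in a single place, so a test cannot accidentally invert it.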
Ticket
RM-368
Description
When a cluster restarts, it restarts running trials. For large experiments with several trials (like a hyperparameter experiment), some of these restored trials end up stuck in the STOPPING_COMPLETED state indefinitely instead of reaching COMPLETED. This PR fixes that bug.
I was able to reproduce this on AWS by killing the agent service, stopping the master service, and then restarting them both. With my fix, the experiments resolved themselves without intervention.
Test Plan
See new e2e test. No additional testing needed.
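As a rough sketch of the kind of check such a test needs after the restart, the helper below polls until every trial reports COMPLETED and fails if any stay in another state such as STOPPING_COMPLETED. It is not the test added in this PR: `wait_for_trials_completed` and the `get_states` callable (returning a mapping of trial ID to state string) are hypothetical names, and a real test would fetch states from the master instead.

```python
import time
from typing import Callable, Dict


def wait_for_trials_completed(
    get_states: Callable[[], Dict[int, str]],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> Dict[int, str]:
    """Poll until every trial reports COMPLETED, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        states = get_states()
        # Require at least one trial so an empty result is not a pass.
        if states and all(s == "COMPLETED" for s in states.values()):
            return states
        if time.monotonic() >= deadline:
            stuck = {t: s for t, s in states.items() if s != "COMPLETED"}
            raise TimeoutError(f"trials not completed: {stuck}")
        time.sleep(interval)
```

With the bug present, a trial left in STOPPING_COMPLETED would make this helper time out rather than return.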
Checklist
docs/release-notes/
See Release Note for details.