fix: add missing k8s job submission times to allocations #9028
c301769 to ea836d5
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9028 +/- ##
==========================================
+ Coverage 47.07% 47.10% +0.02%
==========================================
Files 1155 1156 +1
Lines 142400 142378 -22
Branches 2421 2423 +2
==========================================
+ Hits 67034 67064 +30
+ Misses 75176 75124 -52
Partials 190 190
)

// FetchAvgQueuedTime fetches the average queued time for a resource pool.
func FetchAvgQueuedTime(pool string) (
are there any Go practices for keeping the exposure limited to this package and its subpackages, aka rm/*?
any suggestions on the best way to add automated tests here?
master/internal/task/allocation.go
Outdated
// allocation.startTime?
StartTime: &msg.JobSubmissionTime,
EndTime: &now,
StartTime: &msg.JobSubmissionTime,
hmm. i think JobSubmissionTime is wrong. if anything the language here is wrong. "task stats" implies this is talking about the task level, but job submission time is when the experiment was submitted.
yeah, I think there's more to look at here but it isn't immediately obvious to me and that'd be a different level of wrong than the current state of "we put time.zero as the start of queued period for k8s". Does it make sense to work on that on a separate ticket w/ a different priority?
yeah, they're equally totally wrong, so i'd lean toward just fixing it. i wouldn't call this commit "fix: k8s rp queued stats" if you do it in two PRs.
that's fair. #9028 (comment)
I updated the title and the associated ticket.
I'm approving because there is nothing in the PR I disagree with.
I don't think it makes average queued time much more correct. We still don't have the right end time. We end the task stats when we get resources allocated, which in Kubernetes is going to be roughly right after we submit the job. We really want to move this to after the pod gets GPUs assigned. Maybe a better change is just hiding the chart on Kubernetes / Slurm clusters until Grafana dashboards can include queued time information?
I'm cool if you want to land this since I think it is a slight improvement.
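To make the end-time concern above concrete, here is a minimal sketch of how the choice of end time changes the average queued time. The `queuedPeriod` type, its fields, and the sample values are hypothetical and only for illustration; they are not taken from the PR or from the Determined codebase.

```go
package main

import (
	"fmt"
	"time"
)

// queuedPeriod is a hypothetical record of one allocation's wait in the queue.
type queuedPeriod struct {
	Submitted          time.Time // when the job was submitted
	ResourcesAllocated time.Time // when the resource manager reported an allocation
	PodScheduled       time.Time // when the pod was actually assigned GPUs
}

// avgQueuedSeconds averages end(p) - p.Submitted over all periods, in seconds.
// It returns 0 for an empty input.
func avgQueuedSeconds(periods []queuedPeriod, end func(queuedPeriod) time.Time) float64 {
	if len(periods) == 0 {
		return 0
	}
	var total time.Duration
	for _, p := range periods {
		total += end(p).Sub(p.Submitted)
	}
	return total.Seconds() / float64(len(periods))
}

// avgUntilAllocated ends the queued period when resources are reported allocated.
func avgUntilAllocated(periods []queuedPeriod) float64 {
	return avgQueuedSeconds(periods, func(p queuedPeriod) time.Time { return p.ResourcesAllocated })
}

// avgUntilScheduled ends the queued period when the pod is assigned GPUs.
func avgUntilScheduled(periods []queuedPeriod) float64 {
	return avgQueuedSeconds(periods, func(p queuedPeriod) time.Time { return p.PodScheduled })
}

// samplePeriods returns made-up data: allocation is reported within seconds,
// while the pods wait minutes for GPUs.
func samplePeriods() []queuedPeriod {
	base := time.Date(2024, 3, 1, 12, 0, 0, 0, time.UTC)
	return []queuedPeriod{
		{Submitted: base, ResourcesAllocated: base.Add(1 * time.Second), PodScheduled: base.Add(5 * time.Minute)},
		{Submitted: base, ResourcesAllocated: base.Add(3 * time.Second), PodScheduled: base.Add(15 * time.Minute)},
	}
}

func main() {
	periods := samplePeriods()
	fmt.Println("avg until allocated (s):", avgUntilAllocated(periods))
	fmt.Println("avg until scheduled (s):", avgUntilScheduled(periods))
}
```

With this made-up data, the two averages differ by orders of magnitude, which is the gap the comment describes between "resources allocated" and "pod gets GPUs assigned".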
@@ -572,39 +572,7 @@ func (k *ResourceManager) createResourcePoolSummary(
func (k *ResourceManager) fetchAvgQueuedTime(pool string) (
	[]*jobv1.AggregateQueueStats, error,
) {
	aggregates := []model.ResourceAggregates{}
the database query doesn't change at all in this PR right?
no, the PR should not have any effective change other than:
- add missing job submission time on k8s assigned resources
- reuse average stats calculation between RMs
e60e87e to 72a1375
Description
TODO:
Test Plan
Have a k8s cluster.
Submit jobs and check the queued stats for the corresponding resource pool. The queued values should not be in the range of years (more likely seconds or minutes).
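As a rough illustration of why an unset start time shows up as "years" of queued time, the sketch below compares a zero-valued `time.Time` start with a recorded one. The helper names and the one-year threshold are assumptions made for this example, not code from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// queuedSeconds returns how long an allocation was queued, in seconds.
// Hypothetical helper for illustration only.
func queuedSeconds(start, end time.Time) float64 {
	return end.Sub(start).Seconds()
}

// looksPlausible reports whether a queued duration is non-negative and
// under one year, a rough sanity threshold in the spirit of the test plan.
func looksPlausible(seconds float64) bool {
	const oneYear = 365 * 24 * 60 * 60
	return seconds >= 0 && seconds < oneYear
}

// sampleQueuedSeconds returns the queued time for an unset (zero) start
// and for a start recorded 30 seconds before the end of the queued period.
func sampleQueuedSeconds() (unsetStart, recordedStart float64) {
	end := time.Date(2024, 3, 1, 12, 0, 30, 0, time.UTC)

	var unset time.Time // zero value: no submission time was recorded
	submitted := end.Add(-30 * time.Second)

	return queuedSeconds(unset, end), queuedSeconds(submitted, end)
}

func main() {
	unsetStart, recordedStart := sampleQueuedSeconds()
	fmt.Println("unset start plausible:   ", looksPlausible(unsetStart))
	fmt.Println("recorded start plausible:", looksPlausible(recordedStart))
}
```

A check along these lines could back the manual test plan: any queued value that fails the plausibility test points at a missing start time.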
Commentary (optional)
We might also want to run a migration or add a measure to ignore erroneous previous records,
or rectify them using tasks.start_time or, as an approximation, the allocation start time,
perhaps via a migration.
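One possible shape for the "ignore or rectify" idea, sketched in Go rather than as a migration: fall back to a task start time when the recorded start is zero, and skip records with no usable start. The `queuedRecord` type and its fields are hypothetical stand-ins; the real schema and the choice of fallback are open questions here.

```go
package main

import (
	"fmt"
	"time"
)

// queuedRecord is a hypothetical, simplified view of a stored queued-stats row.
type queuedRecord struct {
	Start     time.Time // recorded start of the queued period; zero in erroneous rows
	End       time.Time // end of the queued period
	TaskStart time.Time // fallback, standing in for tasks.start_time
}

// effectiveStart returns the recorded start, or the fallback when the
// recorded start is zero. ok is false when neither is usable.
func effectiveStart(r queuedRecord) (start time.Time, ok bool) {
	if !r.Start.IsZero() {
		return r.Start, true
	}
	if !r.TaskStart.IsZero() {
		return r.TaskStart, true
	}
	return time.Time{}, false
}

// avgQueuedSeconds averages queued time over usable records and reports
// how many records were skipped as erroneous.
func avgQueuedSeconds(records []queuedRecord) (avg float64, skipped int) {
	var total time.Duration
	used := 0
	for _, r := range records {
		start, ok := effectiveStart(r)
		if !ok || r.End.Before(start) {
			skipped++
			continue
		}
		total += r.End.Sub(start)
		used++
	}
	if used == 0 {
		return 0, skipped
	}
	return total.Seconds() / float64(used), skipped
}

// sampleRecords returns made-up rows: one healthy, one rectified through
// the fallback, and one with no usable start time.
func sampleRecords() []queuedRecord {
	end := time.Date(2024, 3, 1, 12, 0, 0, 0, time.UTC)
	return []queuedRecord{
		{Start: end.Add(-10 * time.Second), End: end},
		{End: end, TaskStart: end.Add(-20 * time.Second)},
		{End: end},
	}
}

func main() {
	avg, skipped := avgQueuedSeconds(sampleRecords())
	fmt.Println("avg queued (s):", avg, "skipped:", skipped)
}
```

Doing this at read time avoids touching stored data; a migration would apply the same rule once and let the averaging code stay simple.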
https://hpe.sharepoint.com/:w:/r/teams/detai/Shared%20Documents/Engineering/Resource%20Management/Random%20Things/Web%20Cluster%20Requirements.docx?d=waae7403bf7b04e0fb9832ac010108574&csf=1&web=1&e=vqsH06
Checklist
docs/release-notes/. See Release Note for details.
Ticket
https://hpe-aiatscale.atlassian.net/browse/RM-135