Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add missing k8s job submission times to allocations #9028

Merged
merged 10 commits into from
Apr 4, 2024

Conversation

hamidzr
Copy link
Contributor

@hamidzr hamidzr commented Mar 20, 2024

Description

  • add missing job submission time on k8s assigned resources
  • reuse average stats calculation between rms

TODO:

Test Plan

have a k8s cluster,
submit jobs and check the queued stats for the corresponding resource pool the queued values should not be in the range of years (more likely seconds or minutes)

Commentary (optional)

We might also want to run a migration or add a measure to ignore erroneous previous records
or rectify them through tasks.start_time or as an approximation or allocation start time
perhaps via a migration

https://hpe.sharepoint.com/:w:/r/teams/detai/Shared%20Documents/Engineering/Resource%20Management/Random%20Things/Web%20Cluster%20Requirements.docx?d=waae7403bf7b04e0fb9832ac010108574&csf=1&web=1&e=vqsH06

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

https://hpe-aiatscale.atlassian.net/browse/RM-135

@cla-bot cla-bot bot added the cla-signed label Mar 20, 2024
@hamidzr hamidzr changed the title Hz rm34 queued stats fix rm queued stats Mar 20, 2024
Copy link

netlify bot commented Mar 20, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 72a1375
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/660d7ddc9774ba0007d1639a
😎 Deploy Preview https://deploy-preview-9028--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@hamidzr hamidzr changed the title fix rm queued stats fix: k8s rp queued stats Mar 20, 2024
Copy link

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 72.72727% with 12 lines in your changes are missing coverage. Please review.

Project coverage is 47.10%. Comparing base (f78b9aa) to head (72a1375).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9028      +/-   ##
==========================================
+ Coverage   47.07%   47.10%   +0.02%     
==========================================
  Files        1155     1156       +1     
  Lines      142400   142378      -22     
  Branches     2421     2423       +2     
==========================================
+ Hits        67034    67064      +30     
+ Misses      75176    75124      -52     
  Partials      190      190              
Flag Coverage Δ
backend 43.03% <72.72%> (+0.09%) ⬆️
harness 64.07% <ø> (ø)
web 38.95% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ster/internal/rm/agentrm/agent_resource_manager.go 45.04% <100.00%> (-2.95%) ⬇️
master/internal/rm/kubernetesrm/resource_pool.go 17.80% <100.00%> (+12.39%) ⬆️
master/pkg/model/task.go 15.26% <ø> (ø)
master/internal/db/postgres_tasks.go 69.23% <66.66%> (+0.36%) ⬆️
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 23.45% <0.00%> (+2.02%) ⬆️
master/internal/rm/agentrm/agent.go 25.42% <0.00%> (ø)
master/internal/rm/tasklist/task_list.go 9.63% <0.00%> (-0.24%) ⬇️
master/internal/rm/db.go 80.64% <80.64%> (ø)

... and 6 files with indirect coverage changes

)

// FetchAvgQueuedTime fetches the average queued time for a resource pool.
func FetchAvgQueuedTime(pool string) (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any Go practices for keeping the exposure limited this and subpackages aka rm/* ?

@hamidzr hamidzr marked this pull request as ready for review March 25, 2024 15:54
@hamidzr hamidzr requested review from a team as code owners March 25, 2024 15:54
@hamidzr hamidzr requested a review from stoksc March 25, 2024 15:54
@hamidzr
Copy link
Contributor Author

hamidzr commented Mar 25, 2024

any suggestions on the best way to add automated tests here?

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Mar 25, 2024
@determined-ci determined-ci requested a review from a team March 25, 2024 16:02
// allocation.startTime?
StartTime: &msg.JobSubmissionTime,
EndTime: &now,
StartTime: &msg.JobSubmissionTime,
Copy link
Contributor

@stoksc stoksc Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. i think JobSubmissionTime is wrong. if anything the language here is wrong. "task stats" implies this is talking about the task level, but job submission time is when the experiment was submitted.

Copy link
Contributor Author

@hamidzr hamidzr Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think there's more to look at here but it isn't immediately obvious to me and that'd be a different level of wrong than the current state of "we put time.zero as the start of queued period for k8s". Does it make sense to work on that on a separate ticket w/ a different priority?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, they're equally totally wrong, so i'd lean to just fixing it. i wouldnt call this commit "fix: k8s rp queued stats" if you do it in two PRs.

Copy link
Contributor Author

@hamidzr hamidzr Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair. #9028 (comment)

@hamidzr hamidzr changed the title fix: k8s rp queued stats fix: add missing k8s job submission times to allocations Apr 2, 2024
@hamidzr
Copy link
Contributor Author

hamidzr commented Apr 2, 2024

I updated the title and the associated ticket.
I created a branch off of this to work on and explore bigger change to how we compute these stats for various RMs
#9088

Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving because there is nothing in the PR I disagree with

I don't think it makes average queued time much more correct. We still are not having the right end time. We end the tasks stats when we get resources allocated, which in kubernetes is going to be roughly right after we submit the job. We really want to change this till after the pod gets gpus assigned. Maybe a better change is just hiding the chart on Kubernetes / Slurm clusters until grafana dashboards can include queued time information?

I'm cool if you want to land this since I think it is a slight improvement.

master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
master/internal/rm/db.go Outdated Show resolved Hide resolved
master/internal/rm/db.go Outdated Show resolved Hide resolved
master/internal/rm/kubernetesrm/resource_pool.go Outdated Show resolved Hide resolved
@@ -572,39 +572,7 @@ func (k *ResourceManager) createResourcePoolSummary(
func (k *ResourceManager) fetchAvgQueuedTime(pool string) (
[]*jobv1.AggregateQueueStats, error,
) {
aggregates := []model.ResourceAggregates{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the database query doesn't change at all in this PR right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the pr should not have any effectvie change other than

add missing job submission time on k8s assigned resources
reuse average stats calculation between rms

master/internal/task/allocation.go Outdated Show resolved Hide resolved
docs/release-notes/job-queued-stats.rst Outdated Show resolved Hide resolved
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Apr 3, 2024
@hamidzr hamidzr merged commit f03a8a8 into main Apr 4, 2024
68 of 81 checks passed
@hamidzr hamidzr deleted the hz-rm34-queued-stats branch April 4, 2024 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants