Multi-job watchable allocation endpoint #19339
Comments
I've been working on this in the bff-accurate-jobs-summaries branch, where the endpoint currently lives at the path used in the commands below. All of this is subject to change, but I wanted to document the current state before leaving for a week+ to try and buy a house across the country. 😋

It supports index blocking, pagination, filtering, etc., and the (unusual) ability to POST a specific list of job IDs (+ namespace), to prevent "jostling" the UI table with jobs possibly coming in and out of existence.

Either GET-ing a page, or POST-ing a "subset" (as we've been calling it) of jobs, returns a response body like this:

nomad operator api /v1/jobs/statuses | jq .
# or
echo '{"Jobs": [{"namespace": "default", "id": "cool-job"}]}' \
| nomad operator api -X POST /v1/jobs/statuses | jq .

[
{
"Allocs": [
{
"ClientStatus": "running",
"DeploymentStatus": {
"Canary": false,
"Healthy": true
},
"Group": "cool-group",
"ID": "87986a8f-f222-0fbc-f566-eb274e179695",
"JobVersion": 11,
"NodeID": "e156b04d-2c81-a02f-f880-2b8cfccddd95"
}
],
"ChildStatuses": null,
"Datacenters": [
"*"
],
"DeploymentID": "",
"GroupCountSum": 1,
"ID": "cool-job",
"Name": "cool-job",
"Namespace": "default",
"NodePool": "default",
"Priority": 50,
"SmartAlloc": {
"total": 1,
"running": 1
},
"Type": "service",
"Version": 11
}
]
Notably, each allocation has a whole separate entry, rather than relying on count integers like JobSummary (though I did put status counts under "SmartAlloc" as an experiment). This allows for some nice flexibility in the frontend, but comes with a performance concern: "What about jobs with <very many> allocs?"

Performance matters here because this blocking query unblocks on any update to jobs or allocations on the page/subset of jobs (but, importantly, not off-page), so it could be requested repeatedly in quick succession. I should compile my approach/findings properly, but suffice to say that even with a job having 100,000 allocs, the endpoint's HTTP response takes about 0.075 seconds on average (down to 0.008 seconds for 10,000 allocs), and appears to consume additional memory on the scale of MB (i.e. not very much) to serialize the response.

I don't have proper usage data from real-world scenarios of very large and busy clusters, and it would be great to ask some users to test this for us, but it seems to me that the addition of pagination will decrease the API cost compared to the current behavior.

There are various other ins and outs to be properly documented in due time, but I wanted to link the WIP branch, and at least jot down the performance concern while it's fresh on my mind.
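Because each allocation arrives as its own entry, a client could derive per-status counts itself instead of depending on the experimental "SmartAlloc" field. The following is a minimal Python sketch, assuming the response shape shown above; `count_alloc_statuses` is a hypothetical helper, not part of Nomad or its UI.

```python
from collections import Counter


def count_alloc_statuses(job):
    """Tally one job entry's allocations by ClientStatus.

    A null or missing "Allocs" value is treated as "no allocations".
    """
    allocs = job.get("Allocs") or []
    counts = Counter(alloc["ClientStatus"] for alloc in allocs)
    # Mirror the experimental "SmartAlloc" layout: a total plus one key per status.
    return {"total": len(allocs), **counts}


# Trimmed copy of the sample entry above.
sample_job = {
    "ID": "cool-job",
    "Allocs": [
        {"ClientStatus": "running", "Group": "cool-group", "JobVersion": 11},
    ],
}

print(count_alloc_statuses(sample_job))
```

Counting on the client keeps the payload free of derived fields, at the cost of a pass over every allocation each time the query unblocks.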
Latest update to the shape of the API, at this point likely what we will ship! This cluster has 3 jobs:
[
{
"Allocs": [
{
"ClientStatus": "pending",
"DeploymentStatus": {
"Canary": false,
"Healthy": null
},
"FollowupEvalID": "",
"Group": "group",
"ID": "d49914fb-885d-4040-3afb-0fca7621c2bf",
"JobVersion": 0,
"NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
}
],
"ChildStatuses": null,
"Datacenters": [
"*"
],
"GroupCountSum": 1,
"ID": "param/dispatch-1714750358-1f77d9b0",
"LatestDeployment": null,
"ModifyIndex": 140,
"Name": "param/dispatch-1714750358-1f77d9b0",
"Namespace": "default",
"NodePool": "default",
"ParentID": "param",
"Priority": 50,
"SubmitTime": 1714750358850407000,
"Type": "batch",
"Version": 0
},
{
"Allocs": [
{
"ClientStatus": "running",
"DeploymentStatus": {
"Canary": false,
"Healthy": true
},
"FollowupEvalID": "",
"Group": "g",
"ID": "14f78215-e8b8-f653-4a52-bc638c19e8d5",
"JobVersion": 0,
"NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
}
],
"ChildStatuses": null,
"Datacenters": [
"*"
],
"GroupCountSum": 1,
"ID": "job",
"LatestDeployment": {
"AllAutoPromote": false,
"ID": "447053d3-c5cf-1127-396a-7bff8e85ce60",
"IsActive": false,
"JobVersion": 0,
"RequiresPromotion": false,
"Status": "successful",
"StatusDescription": "Deployment completed successfully"
},
"ModifyIndex": 19,
"Name": "job",
"Namespace": "default",
"NodePool": "default",
"ParentID": "",
"Priority": 50,
"SubmitTime": 1714744032603533800,
"Type": "service",
"Version": 0
},
{
"Allocs": null,
"ChildStatuses": [
"running"
],
"Datacenters": [
"*"
],
"GroupCountSum": 1,
"ID": "param",
"LatestDeployment": null,
"ModifyIndex": 11,
"Name": "param",
"Namespace": "default",
"NodePool": "default",
"ParentID": "",
"Priority": 50,
"SubmitTime": 1714743946723324000,
"Type": "batch",
"Version": 0
}
]
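One way a client might tell these three entries apart is by their ParentID and ChildStatuses fields. This is only a sketch inferred from the sample above: the rule that a non-null ChildStatuses marks a parent job is an assumption drawn from this one response, and the labels "parent", "child", and "standalone" are illustrative names rather than Nomad terminology.

```python
def classify_job(job):
    """Label a job entry using only fields present in the sample response."""
    if job.get("ChildStatuses") is not None:
        return "parent"      # e.g. a parameterized job that has dispatched children
    if job.get("ParentID"):
        return "child"       # e.g. a dispatched instance of a parent job
    return "standalone"


# Trimmed copies of the three entries above.
jobs = [
    {"ID": "param/dispatch-1714750358-1f77d9b0", "ParentID": "param", "ChildStatuses": None},
    {"ID": "job", "ParentID": "", "ChildStatuses": None},
    {"ID": "param", "ParentID": "", "ChildStatuses": ["running"]},
]

labels = {job["ID"]: classify_job(job) for job in jobs}
print(labels)
```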
@gulducat It would be helpful if you put a screenshot here. Or did you make changes on the API side only?
Hi @EugenKon! I enjoy visuals too, but I only made the backend API. ❤️
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
The Jobs index page in the UI long-polls a blocking query for /jobs in order to show the data it currently shows.

We presently use the jobs[].JobSummary.Summary.$groupName object to populate the historical state of allocations in the jobs, and display it in the small chart on the right of the job rows.

However, in #16128, a major driver of the work was to stop using the historical state-stored alloc data and instead do a live look-up. On a job-by-job basis, looking up /allocations and watching for changes makes sense, and the browser can support holding open a blocking query. But on the jobs index page, this asks the browser to keep an eye on a few too many things, and in testing this has proven to cause network queue backups.
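For readers unfamiliar with the pattern, the long-polling loop described above can be sketched as follows. This is an illustration, not the UI's actual code: `watch` and `make_fake_fetch` are hypothetical names, and the fetcher is faked so the loop can run without a cluster. A real fetcher would pass the last seen index as the `index` query parameter and read the new one from the `X-Nomad-Index` response header.

```python
def watch(fetch, on_change, max_rounds):
    """Long-poll loop around a blocking query.

    `fetch(index)` is expected to block until the server's index exceeds
    `index` (or a timeout passes) and then return `(new_index, body)`.
    """
    index = 0
    for _ in range(max_rounds):
        new_index, body = fetch(index)
        if new_index > index:
            on_change(body)  # data changed since the last response
        # An unchanged index means the query timed out; simply ask again.
        index = new_index


def make_fake_fetch(responses):
    """Stand-in for an HTTP call: replays canned (index, body) pairs."""
    replies = iter(responses)
    return lambda index: next(replies)


seen = []
fake_fetch = make_fake_fetch([
    (10, ["job"]),            # first response
    (10, ["job"]),            # timeout, nothing changed
    (12, ["job", "param"]),   # a job was added
])
watch(fake_fetch, seen.append, max_rounds=3)
print(seen)
```

Each watched resource needs its own loop like this and its own open connection, which is why watching every job's allocations separately from an index page backs up the browser's network queue.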
I'd like to have a way to watch the allocations of a subset of jobs, and for that endpoint to provide the following properties + update whenever they change:
Related, this table should also be able to keep track of whenever a Job has an active deployment taking place. The Nomad front-end currently handles this by watching Job.Deployments[].{latest}.status === running which, like allocation watching, works fine when looking at a single job but is problematic for the browser when looking at an index.

This could be part of the same query described above for allocations, and simply expanded to something like /realtimejobs; or, it could be a second blocking query.
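If deployment state travels in the same response, as the LatestDeployment object in the sample responses earlier in this thread suggests, a client could find active deployments in a single pass. This is a sketch under that assumption; `jobs_with_active_deployment` is a hypothetical helper and the "deploying" job is made-up sample data.

```python
def jobs_with_active_deployment(jobs):
    """Return IDs of jobs whose LatestDeployment is marked active.

    A null "LatestDeployment" (e.g. for batch jobs) counts as inactive.
    """
    return [
        job["ID"]
        for job in jobs
        if (job.get("LatestDeployment") or {}).get("IsActive")
    ]


jobs = [
    {"ID": "job", "LatestDeployment": {"IsActive": False, "Status": "successful"}},
    {"ID": "param", "LatestDeployment": None},
    # Hypothetical entry, not taken from the thread:
    {"ID": "deploying", "LatestDeployment": {"IsActive": True, "Status": "running"}},
]

active = jobs_with_active_deployment(jobs)
print(active)
```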
Why though?
This would solve 3 notable problems on the job index page: