Multi-job watchable allocation endpoint #19339

Closed
philrenaud opened this issue Dec 6, 2023 · 6 comments · Fixed by #20130
Labels
hcc/bff Internal: server-side stuff in a client-side world

Comments

@philrenaud
Contributor

The Jobs index page in the UI long-polls a blocking query for /jobs in order to show the data it currently shows:
[image]

We presently use the jobs[].JobSummary.Summary.$groupName object to populate the historical state of allocations in the jobs, and display it in the small chart on the right of the job rows.
[image]

However, in #16128, a major driver of the work was to stop using the historical state-stored alloc data and instead do a live look-up. On a job-by-job basis, looking up /allocations and watching for changes makes sense, and the browser can support holding open a blocking query. But on the jobs index page, this asks the browser to keep an eye on a few too many things, and in testing it has caused network queue backups.
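For readers less familiar with the watching pattern mentioned here: a Nomad blocking query passes the last seen index as the `index` query parameter and reads the new one from the `X-Nomad-Index` response header. Below is a minimal sketch of that loop, not the UI's actual implementation. The `fetch` callable, `watch`, and `max_rounds` are hypothetical names for illustration; `fetch` stands in for whatever HTTP client performs the request.

```python
from typing import Any, Callable, Tuple

# Hypothetical transport: performs one blocking request with ?index=<index>
# and returns (value of the X-Nomad-Index header, parsed response body).
Fetch = Callable[[int], Tuple[int, Any]]


def watch(fetch: Fetch, on_change: Callable[[Any], None], max_rounds: int) -> int:
    """Long-poll a blocking query; return the last index seen."""
    index = 0  # an index of 0 does not block, so the first call returns current state
    for _ in range(max_rounds):
        new_index, body = fetch(index)
        if new_index > index:
            # The index moved forward, so the watched data changed.
            on_change(body)
        # An unchanged index means the wait timed out; ask again with the same index.
        index = new_index
    return index
```

Each watched resource needs its own held-open request like this one, which is why watching many jobs at once from the browser adds up quickly.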

I'd like to have a way to watch the allocations of a subset of jobs, and for that endpoint to provide the following properties + update whenever they change:

  • ClientStatus
  • DeploymentStatus.Canary
  • DeploymentStatus.Healthy

Relatedly, this table should also be able to keep track of whenever a Job has an active deployment taking place. The Nomad front-end currently handles this by watching Job.Deployments[].{latest}.status === running which, like allocation watching, works fine when looking at a single job but is problematic for the browser when looking at an index.

This could be part of the same query described above for allocations, and simply expanded to something like /realtimejobs; or, it could be a second blocking query.


Why though?

This would solve 3 notable problems on the job index page:

  • The index page doesn't tell us the present status of a job's allocations, only their historical status
  • A truer picture of things: accumulation of alloc statuses in the state store is sometimes incorrect (statuses can be incremented but not decremented under certain scenarios, like quickly failed deployments or client shut-downs)
  • Alerting an observing user when a deployment or job update is taking place
@gulducat
Member

gulducat commented Feb 2, 2024

I've been working on this in the bff-accurate-jobs-summaries branch, where the endpoint is currently called /v1/jobs/statuses.

All of this is subject to change, but I wanted to document the current state before leaving for a week+ to try and buy a house across the country. 😋

It supports index blocking, pagination, filtering, etc, and the (unusual) ability to POST a specific list of job IDs (+namespace), to prevent "jostling" the UI table with jobs possibly coming in and out of existence.
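A minimal sketch of building that POST body from a list of jobs, using the key names from the example request in this comment (`Jobs`, `namespace`, `id`). The helper name `subset_request_body` is hypothetical, and since the branch is still WIP the body shape may change.

```python
import json
from typing import Iterable, Tuple


def subset_request_body(jobs: Iterable[Tuple[str, str]]) -> str:
    """Build a JSON body for POST /v1/jobs/statuses from (namespace, job ID) pairs."""
    # Key names follow the WIP example request; they are not a stable contract.
    return json.dumps(
        {"Jobs": [{"namespace": namespace, "id": job_id} for namespace, job_id in jobs]}
    )
```

Sending the same explicit list on every poll is what keeps the table rows stable: jobs created or removed elsewhere do not shift what the page is watching.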

Either GET-ing a page, or POST-ing a "subset" (as we've been calling it) of jobs returns a response body like this:

nomad operator api /v1/jobs/statuses | jq .
# or
echo '{"Jobs": [{"namespace": "default", "id": "cool-job"}]}' \
  | nomad operator api -X POST /v1/jobs/statuses | jq .
[
  {
    "Allocs": [
      {
        "ClientStatus": "running",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": true
        },
        "Group": "cool-group",
        "ID": "87986a8f-f222-0fbc-f566-eb274e179695",
        "JobVersion": 11,
        "NodeID": "e156b04d-2c81-a02f-f880-2b8cfccddd95"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "DeploymentID": "",
    "GroupCountSum": 1,
    "ID": "cool-job",
    "Name": "cool-job",
    "Namespace": "default",
    "NodePool": "default",
    "Priority": 50,
    "SmartAlloc": {
      "total": 1,
      "running": 1
    },
    "Type": "service",
    "Version": 11
  }
]

Notably, each allocation has a whole separate entry, rather than relying on count integers like JobSummary (though I did put status counts under "SmartAlloc" as an experiment). This allows for some nice flexibility in the frontend, but comes with a performance concern: "What about jobs with <very many> allocs?" Especially since this blocking query unblocks with any update to jobs or allocations on the page/subset of jobs (but importantly, not off-page), it could be requested repeatedly in quick succession, so performance is pretty important.
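To illustrate the flexibility of per-allocation entries, here is a minimal sketch of how a consumer could derive count-style numbers itself from the "Allocs" list. The function name `alloc_status_counts` is hypothetical, and the output shape merely mirrors the experimental "SmartAlloc" object shown above.

```python
from collections import Counter
from typing import Any, Dict


def alloc_status_counts(job_status: Dict[str, Any]) -> Dict[str, int]:
    """Count allocations per ClientStatus for one entry of the response."""
    # Treat a missing or null "Allocs" as an empty list.
    allocs = job_status.get("Allocs") or []
    counts = Counter(alloc["ClientStatus"] for alloc in allocs)
    return {"total": len(allocs), **counts}
```

Because the raw entries are available, the same list can also be grouped by "Group", "JobVersion", or "DeploymentStatus" without any change to the backend.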

I should compile my approach/findings properly, but suffice to say that even with a job having 100,000 allocs, the endpoint's HTTP response takes about 0.075 seconds on average (down to 0.008 seconds for 10,000 allocs), and appears to consume additional memory on the scale of MB (i.e. not very much) to serialize the response.

I don't have proper usage data from real-world scenarios of very large and busy clusters, and it would be great to ask some users to test this for us. It seems to me that adding pagination will decrease the API cost compared to the current /jobs call (the frontend does not ask for pages right now, it gets allll the jobs), while the possibly-lots-of allocs in the payload, and the repeated requests when updates occur, will increase it.

There are various other ins and outs to be properly documented in due time, but I wanted to link the WIP branch, and at least jot down the performance concern while it's fresh on my mind.

@gulducat
Member

gulducat commented May 3, 2024

Latest update to the shape of the API, at this point likely what we will ship!

This cluster has 3 jobs:

  • ID "job" is a service job
  • ID "param" is a parameterized batch job
  • ID "param/dispatch-..." is a child job
nomad operator api '/v1/jobs/statuses?include_children=true' | jq .
[
  {
    "Allocs": [
      {
        "ClientStatus": "pending",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": null
        },
        "FollowupEvalID": "",
        "Group": "group",
        "ID": "d49914fb-885d-4040-3afb-0fca7621c2bf",
        "JobVersion": 0,
        "NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "param/dispatch-1714750358-1f77d9b0",
    "LatestDeployment": null,
    "ModifyIndex": 140,
    "Name": "param/dispatch-1714750358-1f77d9b0",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "param",
    "Priority": 50,
    "SubmitTime": 1714750358850407000,
    "Type": "batch",
    "Version": 0
  },
  {
    "Allocs": [
      {
        "ClientStatus": "running",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": true
        },
        "FollowupEvalID": "",
        "Group": "g",
        "ID": "14f78215-e8b8-f653-4a52-bc638c19e8d5",
        "JobVersion": 0,
        "NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "job",
    "LatestDeployment": {
      "AllAutoPromote": false,
      "ID": "447053d3-c5cf-1127-396a-7bff8e85ce60",
      "IsActive": false,
      "JobVersion": 0,
      "RequiresPromotion": false,
      "Status": "successful",
      "StatusDescription": "Deployment completed successfully"
    },
    "ModifyIndex": 19,
    "Name": "job",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "",
    "Priority": 50,
    "SubmitTime": 1714744032603533800,
    "Type": "service",
    "Version": 0
  },
  {
    "Allocs": null,
    "ChildStatuses": [
      "running"
    ],
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "param",
    "LatestDeployment": null,
    "ModifyIndex": 11,
    "Name": "param",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "",
    "Priority": 50,
    "SubmitTime": 1714743946723324000,
    "Type": "batch",
    "Version": 0
  }
]
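As a rough illustration of consuming this shape, the sketch below reduces one entry to the handful of values a jobs table row might need. The function name `summarize` and its output keys are hypothetical. Reading "LatestDeployment.IsActive" as "a deployment is in progress" and a non-empty "ParentID" as "this is a child job" are assumptions based on the sample above.

```python
from typing import Any, Dict


def summarize(job: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one /v1/jobs/statuses entry to row-level display values."""
    # "Allocs", "LatestDeployment" and "ChildStatuses" can all be null.
    allocs = job.get("Allocs") or []
    deployment = job.get("LatestDeployment") or {}
    return {
        "id": job["ID"],
        "is_child": bool(job.get("ParentID")),
        "deploying": bool(deployment.get("IsActive")),
        "running": sum(a["ClientStatus"] == "running" for a in allocs),
        # "Healthy" may be null while health is undetermined, so compare against True.
        "healthy": sum(
            (a.get("DeploymentStatus") or {}).get("Healthy") is True for a in allocs
        ),
        "child_statuses": job.get("ChildStatuses") or [],
    }
```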

@EugenKon

@gulducat It would be helpful if you put the screenshot here. Or did you make changes on the API side only?

@gulducat
Member

Hi @EugenKon! I enjoy visuals too, but I only made the backend API. ❤️

@philrenaud
Contributor Author

@EugenKon Hi! Screenshots of what is being powered by this endpoint can be found over at #20452


I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024