Multi-job watchable allocation endpoint #19339

philrenaud · 2023-12-06T22:00:36Z

The Jobs index page in the UI long-polls a blocking query on /jobs to populate the table it currently shows.

We presently use the jobs[].JobSummary.Summary.$groupName object to populate the historical state of allocations in the jobs, and display it in the small chart on the right of the job rows.

However, in #16128, a major driver of the work was to stop using the historical state-stored alloc data and instead do a live look-up. On a job-by-job basis, looking up /allocations and watching for changes makes sense, and the browser can support holding open a blocking query. But on the jobs index page, this asks the browser to keep an eye on a few too many things, and in testing it has proven to cause network queue backups.

I'd like to have a way to watch the allocations of a subset of jobs, and for that endpoint to provide the following properties + update whenever they change:

ClientStatus
DeploymentStatus.Canary
DeploymentStatus.Healthy

Relatedly, this table should also be able to track whenever a job has an active deployment taking place. The Nomad front-end currently handles this by watching Job.Deployments[].{latest}.status === running which, like allocation watching, works fine when looking at a single job but is problematic for the browser when looking at an index.

This could be part of the same query described above for allocations, and simply expanded to something like /realtimejobs; or, it could be a second blocking query.
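For readers less familiar with the watching pattern referred to here: in a Nomad blocking query, the client passes the last index it saw as the `index` query parameter, and the server holds the request open until something newer exists (or the `wait` period elapses), reporting the new index in the `X-Nomad-Index` response header. Below is a minimal Python sketch of that loop. The `fetch` callable is a hypothetical transport invented for illustration, `max_polls` exists only to keep the demo bounded, and `/v1/realtimejobs` is just the placeholder name floated above, not a real endpoint.

```python
from typing import Any, Callable, Iterator, Mapping, Tuple

# Hypothetical transport: takes a path and query params,
# returns (response headers, parsed JSON body).
Fetch = Callable[[str, Mapping[str, str]], Tuple[Mapping[str, str], Any]]


def watch(fetch: Fetch, path: str, wait: str = "30s", max_polls: int = 3) -> Iterator[Any]:
    """Yield the response body each time the server reports a newer index."""
    index = 0
    for _ in range(max_polls):
        headers, body = fetch(path, {"index": str(index), "wait": wait})
        new_index = int(headers.get("X-Nomad-Index", index))
        if new_index < index:
            # The index went backwards, so start watching from scratch.
            index = 0
            continue
        if new_index > index:
            index = new_index
            yield body
        # An equal index means the wait timed out with no change: poll again.
```

One such loop per job is what the single-job pages can afford; the request in this issue is for one loop that covers a whole page of jobs.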

Why though?

This would solve 3 notable problems on the job index page:

The index page doesn't tell us the present status of a job's allocations, only their historical status
A truer picture of things: accumulation of alloc statuses in the state store is sometimes incorrect (statuses can be incremented but not decremented under certain scenarios, like quickly failed deployments or client shut-downs)
The index page does not alert an observing user when a deployment or job update is taking place


gulducat · 2024-02-02T18:59:00Z

I've been working on this in the bff-accurate-jobs-summaries branch, where the endpoint is currently called /v1/jobs/statuses.

All of this is subject to change, but I wanted to document the current state before leaving for a week+ to try and buy a house across the country. 😋

It supports index blocking, pagination, filtering, etc, and the (unusual) ability to POST a specific list of job IDs (+namespace), to prevent "jostling" the UI table with jobs possibly coming in and out of existence.
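As a rough illustration of the "subset" request, here is a small Python helper that serializes (namespace, job ID) pairs into the POST body. It assumes the body shape the WIP branch accepted at the time of this comment, `{"Jobs": [{"namespace": ..., "id": ...}]}`; the helper name `subset_body` is made up for this sketch, and the shape may have changed before release.

```python
import json
from typing import Iterable, Tuple


def subset_body(jobs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (namespace, job ID) pairs into the POST body for a subset query.

    Assumption: the body shape is the one used by the WIP branch in this comment.
    """
    return json.dumps(
        {"Jobs": [{"namespace": namespace, "id": job_id} for namespace, job_id in jobs]}
    )
```

Pinning the request to an explicit list is what keeps the table rows stable: a job created elsewhere does not appear in the response and push existing rows around.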

Either GET-ing a page, or POST-ing a "subset" (as we've been calling it) of jobs returns a response body like this:


nomad operator api /v1/jobs/statuses | jq .
# or
echo '{"Jobs": [{"namespace": "default", "id": "cool-job"}]}' \
  | nomad operator api -X POST /v1/jobs/statuses | jq .

[
  {
    "Allocs": [
      {
        "ClientStatus": "running",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": true
        },
        "Group": "cool-group",
        "ID": "87986a8f-f222-0fbc-f566-eb274e179695",
        "JobVersion": 11,
        "NodeID": "e156b04d-2c81-a02f-f880-2b8cfccddd95"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "DeploymentID": "",
    "GroupCountSum": 1,
    "ID": "cool-job",
    "Name": "cool-job",
    "Namespace": "default",
    "NodePool": "default",
    "Priority": 50,
    "SmartAlloc": {
      "total": 1,
      "running": 1
    },
    "Type": "service",
    "Version": 11
  }
]

Notably, each allocation has a whole separate entry, rather than relying on count integers like JobSummary (though I did put status counts under "SmartAlloc" as an experiment). This allows for some nice flexibility in the frontend, but comes with a performance concern: "What about jobs with <very many> allocs?" Especially since this blocking query unblocks with any update to jobs or allocations on the page/subset of jobs (but importantly, not off-page), it could be requested repeatedly in quick succession, so performance is pretty important.
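To make the flexibility point concrete: with one entry per allocation, the frontend can derive whatever counts it wants instead of relying on pre-aggregated integers. A minimal Python sketch that tallies `ClientStatus` values from the `Allocs` list follows. Whether the server computes "SmartAlloc" this way is an assumption; the function name `status_counts` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def status_counts(job: Dict[str, Any]) -> Dict[str, int]:
    """Count a job's allocations by ClientStatus, plus a "total" key."""
    allocs = job.get("Allocs") or []  # treat a missing or null list as empty
    counts = Counter(alloc["ClientStatus"] for alloc in allocs)
    return {"total": len(allocs), **counts}
```

For the sample job above, this yields a total of 1 with 1 running, the same numbers shown under "SmartAlloc".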

I should compile my approach/findings properly, but suffice it to say that even with a job having 100,000 allocs, the endpoint's HTTP response takes about 0.075 seconds on average (down to 0.008 seconds for 10,000 allocs), and serializing the response appears to consume additional memory on the scale of MB (i.e. not very much).

I don't have proper usage data from real-world scenarios on very large and busy clusters, and it would be great to ask some users to test this for us. It seems to me that the addition of pagination will decrease the API cost compared to the current /jobs call (the frontend does not ask for pages right now, it gets allll the jobs), while the possibly-lots-of allocs in the payload, and the repeated requests when updates occur, will increase it.
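A sketch of what page-at-a-time fetching could look like from a client, in Python. It assumes the endpoint follows the pagination convention of Nomad's other list endpoints (`per_page` and `next_token` query parameters, and an `X-Nomad-NextToken` response header); the `fetch` callable is a hypothetical transport, not part of any real client library.

```python
from typing import Any, Callable, Iterator, List, Mapping, Tuple

# Hypothetical transport: takes a path and query params,
# returns (response headers, list of job entries).
Fetch = Callable[[str, Mapping[str, str]], Tuple[Mapping[str, str], List[Any]]]


def pages(fetch: Fetch, path: str, per_page: int = 25) -> Iterator[List[Any]]:
    """Lazily yield one page of jobs at a time, following the next-page token."""
    token = ""
    while True:
        params = {"per_page": str(per_page)}
        if token:
            params["next_token"] = token
        headers, jobs = fetch(path, params)
        yield jobs
        token = headers.get("X-Nomad-NextToken", "")
        if not token:
            break  # no token means this was the last page
```

Because the generator is lazy, a caller that only renders the first page makes a single request, which is where the saving over fetching every job comes from.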

There are various other ins and outs to be properly documented in due time, but I wanted to link the WIP branch, and at least jot down the performance concern while it's fresh on my mind.

gulducat · 2024-05-03T15:36:11Z

Latest update to the shape of the API, at this point likely what we will ship!

This cluster has 3 jobs:

ID "job" is a service job
ID "param" is a parameterized batch job
ID "param/dispatch-..." is a child job

# Quote the path so the shell does not treat "?" as a glob character.
nomad operator api '/v1/jobs/statuses?include_children=true' | jq .

[
  {
    "Allocs": [
      {
        "ClientStatus": "pending",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": null
        },
        "FollowupEvalID": "",
        "Group": "group",
        "ID": "d49914fb-885d-4040-3afb-0fca7621c2bf",
        "JobVersion": 0,
        "NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "param/dispatch-1714750358-1f77d9b0",
    "LatestDeployment": null,
    "ModifyIndex": 140,
    "Name": "param/dispatch-1714750358-1f77d9b0",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "param",
    "Priority": 50,
    "SubmitTime": 1714750358850407000,
    "Type": "batch",
    "Version": 0
  },
  {
    "Allocs": [
      {
        "ClientStatus": "running",
        "DeploymentStatus": {
          "Canary": false,
          "Healthy": true
        },
        "FollowupEvalID": "",
        "Group": "g",
        "ID": "14f78215-e8b8-f653-4a52-bc638c19e8d5",
        "JobVersion": 0,
        "NodeID": "85f1209c-fee4-65f2-6b85-b3d8dce922c8"
      }
    ],
    "ChildStatuses": null,
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "job",
    "LatestDeployment": {
      "AllAutoPromote": false,
      "ID": "447053d3-c5cf-1127-396a-7bff8e85ce60",
      "IsActive": false,
      "JobVersion": 0,
      "RequiresPromotion": false,
      "Status": "successful",
      "StatusDescription": "Deployment completed successfully"
    },
    "ModifyIndex": 19,
    "Name": "job",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "",
    "Priority": 50,
    "SubmitTime": 1714744032603533800,
    "Type": "service",
    "Version": 0
  },
  {
    "Allocs": null,
    "ChildStatuses": [
      "running"
    ],
    "Datacenters": [
      "*"
    ],
    "GroupCountSum": 1,
    "ID": "param",
    "LatestDeployment": null,
    "ModifyIndex": 11,
    "Name": "param",
    "Namespace": "default",
    "NodePool": "default",
    "ParentID": "",
    "Priority": 50,
    "SubmitTime": 1714743946723324000,
    "Type": "batch",
    "Version": 0
  }
]
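To illustrate how a consumer might read this shape, here is a small Python sketch that reduces one job entry to the facts a table row needs: whether it is a child job, whether a deployment is active, and how many allocations are healthy. The field names come from the response above; the `row_summary` function and its output keys are hypothetical, and this is not how the UI necessarily does it.

```python
from typing import Any, Dict


def row_summary(job: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one /v1/jobs/statuses entry to a few row-level facts."""
    deployment = job.get("LatestDeployment") or {}  # null for batch jobs above
    allocs = job.get("Allocs") or []  # null for the parameterized parent above
    healthy = sum(
        1
        for alloc in allocs
        # "Healthy" can be null while an allocation is still pending.
        if (alloc.get("DeploymentStatus") or {}).get("Healthy") is True
    )
    return {
        "id": job["ID"],
        "is_child": bool(job.get("ParentID")),
        "deploying": bool(deployment.get("IsActive")),
        "healthy_allocs": healthy,
        "child_statuses": job.get("ChildStatuses") or [],
    }
```

Applied to the three entries above, only "param/dispatch-..." is flagged as a child, none is deploying, and only "job" has a healthy allocation.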

EugenKon · 2024-05-28T14:38:24Z

@gulducat It would be helpful if you put a screenshot here. Or did you make changes on the API side only?

gulducat · 2024-05-28T14:50:27Z

Hi @EugenKon! I enjoy visuals too, but I only made the backend API. ❤️

philrenaud · 2024-05-28T14:50:44Z

@EugenKon Hi! Screenshots of what is being powered by this endpoint can be found over at #20452

github-actions · 2024-12-28T02:14:49Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

philrenaud added the hcc/bff Internal: server-side stuff in a client-side world label Dec 6, 2023

gulducat self-assigned this Dec 8, 2023

philrenaud mentioned this issue Jan 5, 2024

Jobs index stub: provide an integer that sums up the counts of all a job's groups #19641

Closed

This was referenced Jan 23, 2024

[ui] Handle jobs pagination #19806

Closed

Nomad UI painfully slow when job counts goes from hundreds to thousands #14787

Closed

mikenomitch mentioned this issue Feb 23, 2024

Wrong report about the status of cluster. Show actual status instead of allocation history #20032

Closed

gulducat mentioned this issue Mar 13, 2024

New API endpoint to back UI jobs index page uplift #20130

Merged

gulducat mentioned this issue Apr 4, 2024

Enable numeric pagination tokens #20299

Merged

gulducat closed this as completed in #20130 May 3, 2024

github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024
