UI: When allocations for a job are split across versions, all allocations use the latest job version resource metrics #7842
This one is a doozy, so I want to start by explaining the data model a little bit and then sharing a combination stream of consciousness and remediation options.

**How does Nomad save resource requirements?**

Resource requirements are members of the Job document, and a Job can be identified by (id, version, namespace).

**How do you get the resource requirements for an allocation?**

As an API consumer you have a few options, but this is what the UI does:
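The join the UI performs can be sketched as follows. These are simplified, hypothetical shapes for illustration (the real Nomad API payloads are much richer), but they capture the essential point: the allocation is linked to the job by ID alone.

```typescript
// Simplified shapes for illustration; real Nomad API payloads are richer.
interface TaskGroup {
  Name: string;
  Resources: { MemoryMB: number; CPU: number };
}

interface Job {
  ID: string;
  Version: number;
  TaskGroups: TaskGroup[];
}

interface AllocationStub {
  ID: string;
  JobID: string;
  JobVersion: number;
  TaskGroup: string;
}

// The UI links an allocation to its job by ID alone, so task group
// resources always come from whichever job version was last fetched;
// the allocation's own JobVersion is never consulted in the lookup.
function resourcesFor(alloc: AllocationStub, latestJob: Job) {
  const group = latestJob.TaskGroups.find((tg) => tg.Name === alloc.TaskGroup);
  return group?.Resources;
}
```

Note that `resourcesFor` returns the latest version's limits even when `alloc.JobVersion` is older, which is the root of the bug described in this issue.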
**Why is this a problem?**

So that's a lot to follow, but the key gotcha is that when linking allocations and jobs, job version is not considered. This is almost always acceptable because allocations of older versions are rarely running, but in this case it's important. Not only is it important, but a deployment going sideways is exactly when you would want as much accurate information as possible.

**How do we fix it?**

There are a few options to explore, but the number one thing that definitely has to happen is the

**Option 1: Do nothing, call it an edge case**

This would certainly be the easiest path forward, but as mentioned, even if this happens rarely, it happens exactly when you want accurate information. Doing nothing is not good enough.

**Option 2: Don't show metrics for allocations for older job versions**

No information is better than wrong information, but this is still not very good. Ideally we continue to show metrics for all running allocations and just address this edge case.

**Option 3: Load every allocation individually to get the resources information from the job definition the allocation was constructed from**

This would fix the issue, but at the cost of many more HTTP requests. This is what the CLI does, but the CLI will only present stats when a single allocation is referenced (e.g.,

**Option 4: Add the correct task group resources block to the stub allocations returned in the allocation list requests**

This would also fix the issue, and it would be an improvement over Option 3, but it means severely bloating the JSON for the allocations endpoint. This is especially prohibitive if using the top-level

**Option 5: Prefer the existing solution of linking an allocation to the job, but if the allocation's job version doesn't match the most recent job version, reload to get the correct resources**

This is more complex to build and maintain, but it is a "best of both worlds" approach to Option 2 and Option 1.
Since this isn't likely to happen very often, this shouldn't increase the number of requests by a large margin, but it means the UI has to verify JobVersion, conditionally make the additional request, and handle the asynchronicity of that request. Maybe still the best option? Another downside here is redundant requests for allocations that all share the same older job version. If the newest job version is 2 and there are 4 running allocations of job version 1, why make 4 additional requests?

**Option 6.a: Prefer the existing solution of linking an allocation to the job, but if the allocation's job version doesn't match the most recent job version, use the job versions API to get the old definition**

This avoids the problem with Option 5, where redundant requests for resource information of the same job version are made for each allocation. The downside is that we don't have an API for fetching a single job version; it's all or nothing. This trades redundant requests for one wasteful request. Plus, job versions can potentially be GC'd, orphaning the older allocation.

**Option 6.b: Do Option 6.a but also introduce a new API for fetching a single historical job definition**

This is the most work: a change to the allocation list, a new API, remodeling the allocation<>job relationship in the UI, handling the nuance of making conditional requests, and also making sure that historical jobs are never shown elsewhere in the UI (e.g., the jobs list view or job detail view). But this is also maybe the best solution.

**Next steps**

I'm going to proceed with Option 5 while we discuss the pros and cons of Option 5 and Option 6. A lot of the UI work between the two overlaps.
Updates!

First, turns out

Second, the UI was already reloading the allocation for the allocation row component to get complete preemption and rescheduling information, so my fretting over excessive requests was unjustified.

Third, and maybe most insightful: the UI codebase is officially too old and too large for me to hold the whole thing in my head at once.

Hopefully this has all been entertaining at the very least. PR open: #7855
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If a job has running allocations for multiple job versions, all allocations, regardless of job version, will reference the resource requirements from the latest job version.
Reproduction
Expected Result
Each allocation should show current utilization out of the memory limit for the job version the allocation corresponds to.
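The expected behavior can be illustrated with a toy example. Assume (hypothetically) that job version 1 requested 256 MB of memory and version 2 raised it to 512 MB: each allocation's utilization percentage should be computed against the limit of its own job version, not the latest one.

```typescript
// Hypothetical per-version memory limits for illustration:
// version 1 requested 256 MB, version 2 raised it to 512 MB.
const memoryLimitByVersion: Record<number, number> = { 1: 256, 2: 512 };

// Utilization should be reported against the limit of the job version the
// allocation was created from, not the latest job version.
function memoryUtilization(usedMB: number, jobVersion: number): number {
  return usedMB / memoryLimitByVersion[jobVersion];
}
```

With this behavior, an allocation using 128 MB reports 50% utilization if it runs job version 1, and 25% if it runs version 2; the bug in this issue makes both report 25%.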
Affected Nomad Versions
All of them (since 0.7 when the UI was introduced)