Nomad UI painfully slow when job counts goes from hundreds to thousands #14787

Closed
djenriquez opened this issue Oct 3, 2022 · 13 comments

Comments

@djenriquez

djenriquez commented Oct 3, 2022

Nomad version

Output from nomad version
Nomad v1.3.3 (428b2cd8014c48ee9eae23f02712b7219da16d30)

Operating system and Environment details

Amazon Linux release 2 (Karoo)

Issue

We have a particular use case where Nomad is used to orchestrate full sandboxes for our developers in our development environment. These sandboxes represent our complete stack of services, which means ~100 jobs, including periodic batch jobs.

The higher the total number of jobs, the slower the Nomad UI becomes. Initially, we thought the Nomad servers themselves might be struggling with the sheer amount of work, but that is not the case. At one point, Nomad's core handled over 10,000 jobs with roughly 50,000 allocations just fine. RPC calls through its API were responsive, and the metrics we track showed no struggle whatsoever.

However, the UI was a different story: it would sit on the Nomad loader graphic for a period of time that seemed to grow linearly with the number of jobs being run. Interestingly, the API requests the UI made to the Nomad servers were responsive according to Chrome DevTools, which supports the view that the backend is not the issue.

Also, when looking at the waterfall chart in Chrome DevTools, we see a call to /v1/namespaces?index=1 that is eventually canceled by the browser. This request may be misleading, but the page renders once it shows up in the network panel, so it seems there is some blocking call at that point in the flow.
[Screenshot: Chrome DevTools network waterfall, 2022-10-03]

Reproduction steps

Spin up at least 1,000 jobs with ~3,000 allocations, then navigate to the UI.

Expected Result

UI load time grows proportionately with the API response time for requests made to the Nomad server.

Actual Result

UI load time degrades as more jobs and allocations are running on the Nomad cluster while the API responds performantly.

We're open to scheduling a remote session if that makes it easier to see the issue.

@tgross tgross added the theme/ui label Oct 3, 2022
@tgross tgross added this to Nomad UI Oct 3, 2022
@tgross tgross moved this to Backlog in Nomad UI Oct 3, 2022
@philrenaud
Contributor

Hi @djenriquez, thanks for raising this — we'll take a look and update this once we have more info.

@philrenaud philrenaud moved this from Backlog to Todo in Nomad UI Oct 3, 2022
@philrenaud philrenaud self-assigned this Oct 4, 2022
@ChaiWithJai ChaiWithJai moved this from Todo to In Progress in Nomad UI Oct 17, 2022
@ChaiWithJai ChaiWithJai self-assigned this Oct 17, 2022
@ChaiWithJai ChaiWithJai moved this from In Progress to In Review in Nomad UI Oct 20, 2022
@ChaiWithJai
Contributor

Hey @djenriquez! Nice to meet you. We're super grateful that you raised this issue and it looks like the Nomad Community at large is also noticing this problem.

We're noticing that the issue may be the result of JavaScript Promises on the /jobs and /jobs/:jobId views starving the event loop. We investigated the issue along with possible solutions, and we have 2 commits that you can pull down:

For the /jobs/:jobId (The Job Detail Overview page) we're very confident that this commit will resolve that problem.

But for the /jobs (The Main Jobs List page) we tried to implement our pagination logic. There will be some regressions because we're mixing server and client-side filtering and sorting now. You can try out this commit.
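
As a rough illustration of what "starving the event loop" can mean, here is a minimal TypeScript sketch. It is not the Nomad UI's code or the fix in the commits above; `chunk` and `processInChunks` are hypothetical names invented for this example. Promise callbacks run as microtasks, and the browser drains the microtask queue before it renders, so a long run of promise-driven work over thousands of records can keep the page from painting. One common mitigation, assumed here, is to process records in batches and yield with a timer between batches.

```typescript
// Hypothetical helper: split a list into fixed-size batches.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical helper: process one batch at a time and yield to the
// event loop between batches so rendering and input handling can run.
async function processInChunks<T, R>(
  items: readonly T[],
  size: number,
  work: (item: T) => R
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, size)) {
    for (const item of batch) {
      results.push(work(item));
    }
    // setTimeout schedules a macrotask. Awaiting an already-resolved
    // promise would only queue another microtask and would not yield.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return results;
}
```

With a batch size of, say, 200, a list of 10,000 jobs would be handled in 50 short bursts instead of one long one. The right batch size depends on how expensive `work` is and is something to measure rather than assume.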

We're very excited to work with you to find the right solution, and we welcome any and all feedback about how you're searching and filtering for jobs (along with any feedback about the Nomad UI). We're in the process of planning a lot of great new features for the UI, and we're eager to solve any big challenges or even small "papercuts" that you're experiencing.

I'll be heading out on vacation soon, but I'll try my best to be responsive today and tomorrow on this issue and revisit this when I return. Looking forward to hearing from you!

Life is so rich,
Jai

@djenriquez
Author

Hi @ChaiWithJai, thanks so much for providing these commits. I'll go ahead and see how I might be able to plug this into our current system and verify the results. It will likely be next week before I can report back, however.

@ChaiWithJai
Contributor

Hey @djenriquez! I'm back in the office and wanted to circle back up with you. Were you able to try these commits out?

@djenriquez
Author

Hi @ChaiWithJai, I realize I dropped the ball on checking back on this issue. Are we able to reconvene?

@jhyx2022

Greetings! Is there any update on the fix? The UI slows to a halt whenever there are more than a thousand jobs (including dead jobs) in the cluster.

@djenriquez
Author

Looks like there's a PR, #14989. I'm looking to test it against 1.5.3 and just need quick confirmation on compatibility, as asked in #14989 (comment).

@philrenaud
Contributor

Dropping a note to say that this is something we intend to prioritize soon; see #14989 (comment) for a little more context.

@philrenaud philrenaud moved this from In Review to Todo in Nomad UI Jul 13, 2023
@jhyx2022

> Dropping a note to say that this is something we intend to prioritize soon; see #14989 (comment) for a little more context.

Hi there, is there an update on the fix yet, or an expected version for it? Thanks!

@philrenaud
Contributor

@jhyx2022 Serendipitous timing! We've been developing a new endpoint to complement /jobs that should make things a lot snappier. You can follow along with a few of the issues:

These should have the effect of a more limited initial pull of jobs on the main index in the UI. There'll still be the ability to paginate, search, and filter your list down, but those functions will no longer be front-end dependent.
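
To make the idea of a "more limited initial pull" concrete, here is a small TypeScript sketch of token-based pagination done on the server side. It is an in-memory stand-in, not the new endpoint: `Job`, `Page`, and `listJobs` are hypothetical names, and the `perPage` and `nextToken` parameters are only loosely modeled on the `per_page` and `next_token` query parameters that Nomad's HTTP API documents for paginated list endpoints.

```typescript
interface Job {
  id: string;
  status: string;
}

interface Page {
  jobs: Job[];
  // ID at which the next page starts, or null when there are no more pages.
  nextToken: string | null;
}

// Hypothetical stand-in for a server handler: filter and sort on the
// "server", then return a single page plus a token for the next one.
function listJobs(
  all: readonly Job[],
  perPage: number,
  nextToken: string | null = null,
  filter: (job: Job) => boolean = () => true
): Page {
  if (!Number.isInteger(perPage) || perPage <= 0) {
    throw new RangeError("perPage must be a positive integer");
  }
  const sorted = all
    .filter(filter)
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

  let start = 0;
  if (nextToken !== null) {
    const token: string = nextToken;
    const idx = sorted.findIndex((job) => job.id >= token);
    start = idx === -1 ? sorted.length : idx;
  }

  const jobs = sorted.slice(start, start + perPage);
  const following = sorted[start + perPage];
  return { jobs, nextToken: following !== undefined ? following.id : null };
}
```

A client built this way only ever holds `perPage` jobs at a time, so its memory use and render work are bounded by the page size instead of the total job count. Because filtering and sorting happen before the page is cut, results stay consistent across pages, which avoids the mixed server-side and client-side behavior mentioned earlier in the thread.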

@philrenaud philrenaud moved this from Todo to In Progress in Nomad UI Jan 24, 2024
@jhyx2022

Great news, appreciate the update!

@philrenaud
Contributor

Thanks to everyone for your patience on this issue. Pleased to say that #20452 is now merged and will be released in the upcoming Nomad 1.8. Among other things, it handles pagination for the jobs index and no longer loads child jobs at the index level, where they were eating up memory. I hope that this makes the overall experience of using the web UI much smoother!


I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2024