[0.8.4] Nomad UI Hanging on Job detail Viewing #5946

Closed
holtwilkins opened this issue Jul 10, 2019 · 16 comments
@holtwilkins
Contributor

Re-opening previously closed issue. This issue is not resolved and definitely seems like a Nomad bug.

Nomad version

0.8.4

Operating system and Environment details

Ubuntu 16.04.4 LTS

Issue

In the Nomad UI, clicking a job or client to view its allocations hangs: the UI keeps trying to load the allocations and never displays them.

I was still able to use the CLI: nomad status lists the jobs and also shows their underlying allocations.

Reproduction steps

This happens when a periodic job launch is still in the allocations list but its parent job has aged out (in our case, the periodic job is just stuck in pending).

Other logs

The web browser console shows a 404; see https://user-images.githubusercontent.com/6162849/46262074-e7078f80-c52e-11e8-9102-44d112cf3e9e.png, as in #4464. Note that this screenshot is from another user, but it looks like a similar problem: the console records a single job returning a 404 when loading the /jobs splash page.

Browsing directly to a valid job (manually changing the URL to /jobs/<JOBNAME>) seems to work, but the job that returns the 404 on the main /jobs page seems to prevent any job from loading when you click into it.
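For anyone who wants to check whether their cluster is in this state, here is a rough sketch, not an official tool. It assumes the default namespace, no ACLs, an agent reachable at http://127.0.0.1:4646, and that the entries returned by GET /v1/jobs carry ID and ParentID fields. The function names are made up for illustration.

```python
import json
import urllib.request


def jobs_with_missing_parent(jobs):
    """Return the sorted IDs of jobs whose ParentID is not itself in the job list.

    `jobs` is the decoded JSON list from GET /v1/jobs (assumed shape:
    dicts with "ID" and an optional "ParentID").
    """
    known_ids = {job["ID"] for job in jobs}
    return sorted(
        job["ID"]
        for job in jobs
        if job.get("ParentID") and job["ParentID"] not in known_ids
    )


def fetch_jobs(base_url="http://127.0.0.1:4646"):
    """Fetch the job list from a Nomad agent (address is an assumption)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/v1/jobs") as resp:
        return json.load(resp)
```

Calling jobs_with_missing_parent(fetch_jobs()) should list the child launches whose parent is gone, which seem to be the ones that trigger the 404 described above.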

@backspace
Contributor

Hello, thanks for the report. Are you able to try this out with Nomad 0.9.2 or later? It looks to me that this problem was fixed with UI updates in that version.

@holtwilkins
Contributor Author

Will try it out - I'm waiting on 0.9.4 to start upgrading our fleet.

@hands-on-masters

I reproduced this issue using Nomad 0.11.0.

[Screenshot: NOMAD-UI-STUCK-WHEN-PERIODIC-JOB-EXPIRED]
[Screenshot: NOMAD-UI-STUCK-WHEN-PERIODIC-JOB-EXPIRED-JOBS]

2020-04-20T12:26:35.596Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\Kryon\nomad\server\plugins
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2020-04-20T12:26:36.050Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2020-04-20T12:26:36.051Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2020-04-20T12:26:39.164Z [INFO] nomad.raft: initial configuration: index=9 servers="[{Suffrage:Voter ID:192.168.15.123:4647 Address:192.168.15.123:4647} {Suffrage:Voter ID:192.168.12.250:4647 Address:192.168.12.250:4647}]"
2020-04-20T12:26:39.165Z [INFO] nomad.raft: entering follower state: follower="Node at 192.168.15.123:4647 [Follower]" leader=
2020-04-20T12:26:39.309Z [INFO] nomad: serf: EventMemberJoin: Kryon15-123.global 192.168.15.123
2020-04-20T12:26:39.309Z [INFO] nomad: serf: Attempting re-join to previously known node: Kryon12-250.global: 192.168.12.250:4648
2020-04-20T12:26:39.314Z [INFO] nomad: starting scheduling worker(s): num_workers=4 schedulers=[batch, system, service, _core]
2020-04-20T12:26:39.462Z [INFO] nomad: adding server: server="Kryon15-123.global (Addr: 192.168.15.123:4647) (DC: kryon-dev)"
2020-04-20T12:26:39.678Z [INFO] nomad: serf: EventMemberJoin: Kryon12-250.global 192.168.12.250
2020-04-20T12:26:39.678Z [WARN] nomad: memberlist: Refuting a suspect message (from: Kryon15-123.global)
2020-04-20T12:26:39.679Z [INFO] nomad: adding server: server="Kryon12-250.global (Addr: 192.168.12.250:4647) (DC: kryon-dev)"
2020-04-20T12:26:39.679Z [INFO] nomad: serf: Re-joined to previously known node: Kryon12-250.global: 192.168.12.250:4648
2020-04-20T12:26:39.819Z [WARN] nomad.raft: failed to get previous log: previous-index=3016 last-index=3015 error="log not found"
2020-04-20T12:50:25.630Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:50:25.658Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:50:26.177Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
2020-04-20T12:51:24.844Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:51:24.874Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:51:25.293Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
2020-04-20T12:54:48.182Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL support disabled" code=400
2020-04-20T12:54:48.219Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpoint" code=501
2020-04-20T12:54:48.656Z [ERROR] http: request failed: method=GET path=/v1/job/redis-config-updater error="job not found" code=404
[Screenshot: NOMAD-PERIODIC-JOBS-ISSUE-VERSION]
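If you have a long agent log and want to see which jobs the UI keeps asking for, something like the following could help. It is only a sketch: the regular expression is written against the log format shown above, and missing_jobs is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches agent log lines such as:
#   ... method=GET path=/v1/job/redis-config-updater error="job not found" code=404
NOT_FOUND = re.compile(
    r'path=/v1/job/(?P<job>\S+) error="job not found" code=404'
)


def missing_jobs(log_text):
    """Count "job not found" 404s per job name in a Nomad agent log."""
    return Counter(m.group("job") for m in NOT_FOUND.finditer(log_text))
```

Job IDs may appear URL-encoded in the path (for example for periodic children), so the names are reported exactly as logged.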

@hands-on-masters

hands-on-masters commented Apr 20, 2020

I managed to work around the problem by stopping and purging a periodic job by name (redis-config-updater) with stop --purge.
It seems the jobs were still there: dead, but with the periodic suffix. These were all the periodic batch jobs shown by nomad job status.

Below is the script I ran to remove all dead periodic jobs. The UI is now responsive.

I am not sure whether the UI gets stuck because of the number of jobs, though it certainly looks that way. In any case, after the purge the UI became responsive again.

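@echo off
setlocal EnableExtensions EnableDelayedExpansion
cls

rem Tokens 1, 2 and 4 of the status output are the job ID, type and status.
for /f "tokens=1,2,4" %%a in ('..\nomad.exe status --address http://192.168.15.123:4646') do (
    rem Only consider launches of the redis-config-updater periodic job.
    echo %%a|find "redis-config-updater" >nul
    if errorlevel 1 (
        echo notfound
    ) else (
        rem Purge the launch only if it is a dead batch job.
        if "%%b"=="batch" if "%%c"=="dead" ..\nomad.exe stop -purge --address http://192.168.15.123:4646 %%a
    )
)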

@jdebbink

I can confirm we are seeing this issue on 0.10.4. Also seems related to stopped periodic jobs once they are garbage collected.

@scalp42
Contributor

scalp42 commented Jun 10, 2020

I can confirm we just hit this issue when we upgraded Nomad to 0.11.3.

UI was hanging, /gc endpoint didn't fix it.

Purging pending batch jobs fixed it for us. We ran the following (⚠️ be careful and make sure you understand the impact beforehand ⚠️):

nomad job status | grep -i pending | grep -i batch | awk '{print $1}' | xargs -I% -P2 sh -c '{ nomad stop -purge %; }'

UI works again.
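For those who would rather preview what gets purged first, here is a hedged sketch of the same idea with a dry-run switch. It assumes the first four whitespace-separated columns of nomad job status are ID, Type, Priority and Status. select_jobs and purge are names made up for this example, and unlike the grep pipeline it matches the Type and Status columns exactly.

```python
import subprocess


def select_jobs(status_output, job_type="batch", statuses=("pending",)):
    """Pick job IDs from `nomad job status` output by type and status.

    Assumes the column order ID, Type, Priority, Status.
    """
    selected = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines and the header row.
        if len(fields) < 4 or fields[:2] == ["ID", "Type"]:
            continue
        job_id, current_type, _priority, status = fields[:4]
        if current_type == job_type and status in statuses:
            selected.append(job_id)
    return selected


def purge(job_ids, dry_run=True):
    """Stop and purge each job; with dry_run=True only print the commands."""
    for job_id in job_ids:
        cmd = ["nomad", "job", "stop", "-purge", job_id]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

A typical use would be to capture the output of nomad job status, pass it to select_jobs, review the result of purge(ids) in dry-run mode, and only then call purge(ids, dry_run=False).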

@DingoEatingFuzz
Contributor

I looked into this a while back and ran into deep issues within Ember itself that have since been fixed in newer versions.

I'll be revisiting this after we finish our UI tech debt work that includes an Ember upgrade: #7834

@rkettelerij
Contributor

rkettelerij commented Jun 11, 2020

We've recently run into the same issue. It happened on one of our clusters that's still running an ancient version of Nomad: v0.8.3. That cluster has been running for some years now, and it's the first time we've encountered this issue (it happened after a significant network outage). We're currently migrating this cluster to the latest Nomad release (actually, we're rebuilding it), but apparently that doesn't matter, since the issue is also present in current releases.

Luckily, the workaround posted by @scalp42 works perfectly. So for anyone else bumping into this issue: execute the bash one-liner posted by @scalp42 and the Nomad UI should be working again.

@evandam

evandam commented Jun 29, 2020

Hey folks, I'm seeing this issue several times a day. The one-liner from @scalp42 works sometimes, but the issue also happens with dead batch jobs. Forcing a nomad system gc solved the issue, though.
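If you want to trigger the same garbage collection over HTTP instead of the CLI, a minimal sketch might look like this. It assumes the agent address shown and the PUT /v1/system/gc endpoint; the token parameter is only needed when ACLs are enabled, and the function names are hypothetical.

```python
import urllib.request


def build_gc_request(base_url="http://127.0.0.1:4646", token=None):
    """Build (but do not send) a request that asks Nomad to force a GC."""
    headers = {"X-Nomad-Token": token} if token else {}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/system/gc",
        method="PUT",
        headers=headers,
    )


def force_gc(base_url="http://127.0.0.1:4646", token=None):
    """Send the GC request and return the HTTP status code."""
    with urllib.request.urlopen(build_gc_request(base_url, token)) as resp:
        return resp.status
```

Keep in mind that others in this thread found a forced GC did not help in their case, so this is not a guaranteed fix.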

@joshuawscott

I also ran into this today; it seems to be related to some pending parameterized jobs. @scalp42's fix worked for me.

@berkant

berkant commented Jul 11, 2020

Yup, I have been having the same issue lately with parameterized jobs. Opening multiple tabs to jobs reproduces it: the UI hangs, and then the web server becomes completely clogged (Chrome reports ERR_EMPTY_RESPONSE) until it eventually recovers on its own.

@DingoEatingFuzz
Contributor

Hi everyone!

Thank you for your patience with this bug, and especially thank you @scalp42 for the one-liner workaround. I believe this is now fixed in v0.12.1. See the explanation of the solution here.

Given the number of reports this bug has gotten, I don't want to close this issue until there has been some community confirmation. Please try out 0.12.1 and see if this fixes the issue for you!

@scalp42
Contributor

scalp42 commented Jul 23, 2020

I'm going to deploy Nomad 0.12.1 across the infrastructure soon and I'll report back.

Thanks @DingoEatingFuzz 😅

@rkettelerij
Contributor

I guess this issue can be closed now (I haven't seen it anymore).

@scalp42
Contributor

scalp42 commented Oct 14, 2020

Confirming, can be closed.

@tgross tgross closed this as completed Oct 14, 2020
@github-actions

github-actions bot commented Nov 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2022