Script health checks fail in Ambari cluster #1968
Comments
Can you paste the output of: |
Not exactly
|
@kaskavalci Is the issue that the check never switches to passing, or that it starts critical? It is hard to debug with the logs, process list, and check output you posted, since I think they were captured at different times. |
@dadgar It starts critical and never changes. I also tested with a command that creates a file, and the file is created. This means Nomad successfully executes the binary but does not send the TTL update to Consul. Consul has a log entry complaining that:
Job creates file
Test if OK
|
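One way to separate "Nomad never sends the TTL update" from "Consul rejects it" is to send the update by hand and watch whether the check leaves critical. Below is a minimal Python sketch that builds such a request against Consul's agent endpoint `PUT /v1/agent/check/pass/<check_id>` (with its optional `note` query parameter). It assumes the agent listens on the default `127.0.0.1:8500`; `my-check` is a hypothetical placeholder for the real check ID, which you would read from `/v1/agent/checks`. Current Consul docs specify PUT for this endpoint; older agents such as 0.7 may also have accepted GET.

```python
import urllib.parse
import urllib.request

# Assumption: the local Consul agent's HTTP API on its default address/port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def build_ttl_pass_request(check_id, note="", addr=CONSUL_ADDR):
    """Build (but do not send) a request that marks a TTL check as passing."""
    # Escape the ID fully, since generated check IDs may contain reserved characters.
    url = addr + "/v1/agent/check/pass/" + urllib.parse.quote(check_id, safe="")
    if note:
        # Consul copies `note` into the check's Output field.
        url += "?" + urllib.parse.urlencode({"note": note})
    return urllib.request.Request(url, method="PUT")


def send(req, timeout=5.0):
    """Send the request and return the HTTP status code (200 on success)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

If `send(build_ttl_pass_request("my-check"))` returns 200 and the check turns passing until its TTL expires, Consul is accepting updates, which points back at Nomad's script-check runner.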
@kaskavalci We would like to see whether Nomad is sending any requests to Consul. Can you capture some packets for us? For example, on my host Consul is running on port … Can you capture 3-4 minutes of data and share it with us, please? |
I'm experiencing a similar issue. Checks seem to deregister after a while, and sometimes only partially register in Consul. Example:
group "redis" {
  count = 3
  task "server" {
    driver = "docker"
    user   = "root"
    service {
      name = "redis-discovery"
      port = "redis"
      tags = ["index-${NOMAD_ALLOC_INDEX}"]
    }
  }
}
In a lot of cases only some of the tags are registered, e.g. in my last run … Sometimes the service registers, but without any check at all. My Nomad cluster (0.5.0-rc2) is running at DEBUG log level, but I can't find anything related to Consul in the log output. Consul is 0.7.1, running at log level INFO, and there is nothing worth noting there either. The POST payload seems correct, and I can't see any errors in Consul from it. As suggested to @schmichael, I can give HashiCorp SSH access to the infrastructure if needed for deeper analysis. For completeness' sake, this is my full job spec: https://gist.github.com/jippi/7b182548eb4f8ef0b0799327dc02d5af |
This is roughly the flow I'm seeing from Nomad:
Submitting the job again
(it's not registered any longer, but not re-registered either)
Submitting the job again
Yep, nothing in Nomad regarding service checks at all.
Submitting the job again
on one node, but not on the other 2
Submitting the job again
and nothing else :( |
Even removing all the services and checks from the job and resubmitting it doesn't fully remove them from Consul, so there seems to be some inconsistent state between Nomad and Consul :( |
Force-removing the services and checks and submitting the original job again successfully registered all 3 checks and the service for … It may be worth noting that the resubmitted job is unmodified, so no new allocations are created. Tried to change the … I've noticed a few times that everything is registered correctly, but 12-48h later, without a new job being submitted, the service and checks become unregistered again, meaning Nomad or Consul must decide to remove them for some reason. Since I don't see this behavior with the normal Consul checks in our apps (also registered through the HTTP API), my bet is that Nomad is doing something silly :( |
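To pin down partial registration (checks on one node but not the other two) and checks that vanish 12-48h later, it can help to snapshot what Consul reports and compare snapshots over time. The Python sketch below assumes you have saved the JSON body of Consul's `GET /v1/health/checks/<service>` (for example for `redis-discovery`), whose entries carry `Node`, `CheckID`, and `Status` fields. The helper names and the node/check names used with them are hypothetical; fetching and scheduling the snapshots is left out.

```python
import json
from collections import defaultdict


def checks_by_node(health_checks_json):
    """Group a /v1/health/checks/<service> response as {node: {check_id: status}}."""
    grouped = defaultdict(dict)
    for check in json.loads(health_checks_json):
        grouped[check["Node"]][check["CheckID"]] = check["Status"]
    return dict(grouped)


def nodes_missing_checks(grouped, expected_nodes):
    """Expected nodes that report no check at all for the service."""
    return sorted(n for n in expected_nodes if not grouped.get(n))


def disappeared(before, after):
    """(node, check_id) pairs present in an earlier snapshot but gone from a later one."""
    old = {(node, cid) for node, checks in before.items() for cid in checks}
    new = {(node, cid) for node, checks in after.items() for cid in checks}
    return sorted(old - new)
```

Running `nodes_missing_checks` right after each submission, and `disappeared` between an early and a late snapshot, would show whether checks are never registered or are registered and later removed. A node that has the service registered but no checks will also show up as missing here, because this endpoint lists only checks.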
@kaskavalci Is this still a problem on 0.5? There were some Consul bug fixes. |
I will test ASAP and let you know. |
Confirmed: the issue no longer exists. Thanks 👍 |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.4.1
Operating system and Environment details
CentOS Linux release 7.2.1511 (Core)
Issue
The health check of this simple job file fails in the Ambari cluster that we use. I tested on CentOS, RHEL 7, and Ubuntu. All clusters run Spark, YARN, ZooKeeper, HDFS, and MapReduce2. I attached the output of the running programs as well.
Nomad Server logs (if appropriate)
Consul logs
Job file (if appropriate)
Running programs