
Script health checks fail in Ambari cluster #1968

Closed
kaskavalci opened this issue Nov 9, 2016 · 14 comments

@kaskavalci
Contributor

Nomad version

Nomad v0.4.1

Operating system and Environment details

CentOS Linux release 7.2.1511 (Core)

Issue

The health check of this simple job file fails in the Ambari cluster that we use. I tested on CentOS, RHEL 7, and Ubuntu. All clusters run Spark, YARN, ZooKeeper, HDFS, and MapReduce2. I have attached the output of the running programs as well.

Nomad Server logs (if appropriate)

2016/11/09 08:17:49.301146 [DEBUG] client: state updated to ready
2016/11/09 08:17:49.765639 [DEBUG] http: Request /v1/status/peers (395.047µs)
2016/11/09 08:17:52.947273 [DEBUG] http: Request /v1/agent/servers (799.038µs)
2016/11/09 08:17:54.294241 [DEBUG] worker: dequeued evaluation c81fbdfa-49ac-617b-35f6-666ca5e2c87c
2016/11/09 08:17:54.294303 [DEBUG] sched: <Eval 'c81fbdfa-49ac-617b-35f6-666ca5e2c87c' JobID: 'test'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
2016/11/09 08:17:54.294477 [DEBUG] http: Request /v1/jobs?region=global (1.023716ms)
2016/11/09 08:17:54.294712 [DEBUG] worker: submitted plan for evaluation c81fbdfa-49ac-617b-35f6-666ca5e2c87c
2016/11/09 08:17:54.294727 [DEBUG] sched: <Eval 'c81fbdfa-49ac-617b-35f6-666ca5e2c87c' JobID: 'test'>: setting status to complete
2016/11/09 08:17:54.294742 [DEBUG] client: updated allocations at index 8 (pulled 1) (filtered 0)
2016/11/09 08:17:54.294801 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
2016/11/09 08:17:54.294811 [DEBUG] worker: updated evaluation <Eval 'c81fbdfa-49ac-617b-35f6-666ca5e2c87c' JobID: 'test'>
2016/11/09 08:17:54.294831 [DEBUG] worker: ack for evaluation c81fbdfa-49ac-617b-35f6-666ca5e2c87c
2016/11/09 08:17:54.295011 [DEBUG] http: Request /v1/evaluation/c81fbdfa-49ac-617b-35f6-666ca5e2c87c?region=global (95.942µs)
2016/11/09 08:17:54.295498 [DEBUG] client: starting task runners for alloc '21816ad1-9393-1df4-87a3-5db5a84a8a83'
2016/11/09 08:17:54.295522 [DEBUG] http: Request /v1/evaluation/c81fbdfa-49ac-617b-35f6-666ca5e2c87c/allocations?region=global (113.989µs)
2016/11/09 08:17:54.295581 [DEBUG] client: starting task context for 'test' (alloc '21816ad1-9393-1df4-87a3-5db5a84a8a83')
2016/11/09 08:17:54 [DEBUG] plugin: starting plugin: /var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/scheduler/bin/nomad []string{"/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/scheduler/bin/nomad", "executor", "/tmp/NomadClient280580579/21816ad1-9393-1df4-87a3-5db5a84a8a83/test/test-executor.out"}
2016/11/09 08:17:54 [DEBUG] plugin: waiting for RPC address for: /var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/scheduler/bin/nomad
2016/11/09 08:17:54 [DEBUG] plugin: nomad: 2016/11/09 08:17:54 [DEBUG] plugin: plugin address: unix /tmp/plugin015473986
2016/11/09 08:17:54.305346 [DEBUG] driver.raw_exec: started process with pid: 9846
2016/11/09 08:17:54.351122 [DEBUG] client: updated allocations at index 10 (pulled 0) (filtered 1)
2016/11/09 08:17:54.351179 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
2016/11/09 08:17:54.551163 [DEBUG] client: updated allocations at index 11 (pulled 0) (filtered 1)
2016/11/09 08:17:54.551211 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
2016/11/09 08:17:56 [DEBUG] memberlist: TCP connection from=127.0.0.1:46374
2016/11/09 08:17:59.766701 [DEBUG] http: Request /v1/status/peers (324.154µs)
2016/11/09 08:18:02.949395 [DEBUG] http: Request /v1/agent/servers (1.206018ms)
2016/11/09 08:18:06 [DEBUG] memberlist: TCP connection from=127.0.0.1:46494
2016/11/09 08:18:09.767801 [DEBUG] http: Request /v1/status/peers (328.442µs)
2016/11/09 08:18:12.950213 [DEBUG] http: Request /v1/agent/servers (88.531µs)
2016/11/09 08:18:16 [DEBUG] memberlist: TCP connection from=127.0.0.1:46714
2016/11/09 08:18:19.768752 [DEBUG] http: Request /v1/status/peers (225.241µs)
2016/11/09 08:18:22.952705 [DEBUG] http: Request /v1/agent/servers (872.34µs)
2016/11/09 08:18:26 [DEBUG] memberlist: TCP connection from=127.0.0.1:46840
2016/11/09 08:18:29.769693 [DEBUG] http: Request /v1/status/peers (220.079µs)
2016/11/09 08:18:32.954395 [DEBUG] http: Request /v1/agent/servers (697.815µs)
2016/11/09 08:18:36 [DEBUG] memberlist: TCP connection from=127.0.0.1:46950
2016/11/09 08:18:39.770633 [DEBUG] http: Request /v1/status/peers (262.107µs)
2016/11/09 08:18:42.955499 [DEBUG] http: Request /v1/agent/servers (77.743µs)
2016/11/09 08:18:46 [DEBUG] memberlist: TCP connection from=127.0.0.1:47070
2016/11/09 08:18:49.771597 [DEBUG] http: Request /v1/status/peers (226.424µs)
2016/11/09 08:18:52.956866 [DEBUG] http: Request /v1/agent/servers (242.954µs)
2016/11/09 08:18:56 [DEBUG] memberlist: TCP connection from=127.0.0.1:47190
2016/11/09 08:18:59.772564 [DEBUG] http: Request /v1/status/peers (226.267µs)
2016/11/09 08:19:02.957674 [DEBUG] http: Request /v1/agent/servers (64.906µs)
2016/11/09 08:19:06 [DEBUG] memberlist: TCP connection from=127.0.0.1:47308
2016/11/09 08:19:09.774386 [DEBUG] http: Request /v1/status/peers (597.092µs)
2016/11/09 08:19:12.958707 [DEBUG] http: Request /v1/agent/servers (260.932µs)
2016/11/09 08:19:16 [DEBUG] memberlist: TCP connection from=127.0.0.1:47548
2016/11/09 08:19:19.775470 [DEBUG] http: Request /v1/status/peers (286.254µs)
2016/11/09 08:19:22.961538 [DEBUG] http: Request /v1/agent/servers (978.09µs)
2016/11/09 08:19:26 [DEBUG] memberlist: TCP connection from=127.0.0.1:47678
2016/11/09 08:19:29.776389 [DEBUG] http: Request /v1/status/peers (237.115µs)
2016/11/09 08:19:32.963092 [DEBUG] http: Request /v1/agent/servers (634.139µs)
2016/11/09 08:19:36 [DEBUG] memberlist: TCP connection from=127.0.0.1:47808

Consul logs

2016/11/09 08:19:00 [DEBUG] agent: Service 'consul' in sync
2016/11/09 08:19:00 [DEBUG] agent: Service '_nomad-server-nomad-http' in sync
2016/11/09 08:19:00 [DEBUG] agent: Service '_nomad-server-nomad-rpc' in sync
2016/11/09 08:19:00 [DEBUG] agent: Service '_nomad-server-nomad-serf' in sync
2016/11/09 08:19:00 [DEBUG] agent: Service '_nomad-client-nomad-client-http' in sync
2016/11/09 08:19:00 [DEBUG] agent: Check '0dded7d45fe76226a414e259c6dad014cbd40979' in sync
2016/11/09 08:19:00 [DEBUG] agent: Check '69e906ae7aeae2e237a89aa4decde678f3fa104c' in sync
2016/11/09 08:19:00 [DEBUG] agent: Check '751f8f44d02f0c440d2251472f02303dfdc56f80' in sync
2016/11/09 08:19:00 [DEBUG] agent: Check 'bcbd506a35931cca712b453a9dd8e1a463f01d00' in sync
2016/11/09 08:19:00 [DEBUG] agent: Check '841860023284daa7c428406c87aa81ede8068c0e' in sync
2016/11/09 08:19:02 [DEBUG] agent: check '841860023284daa7c428406c87aa81ede8068c0e' is passing
2016/11/09 08:19:03 [DEBUG] http: Request GET /v1/agent/services (45.236µs) from=127.0.0.1:57204
2016/11/09 08:19:03 [DEBUG] http: Request GET /v1/agent/checks (37.735µs) from=127.0.0.1:57204
2016/11/09 08:19:03 [DEBUG] http: Request GET /v1/agent/services (21.996µs) from=127.0.0.1:57204
2016/11/09 08:19:04 [DEBUG] http: Request GET /v1/agent/services (45.498µs) from=127.0.0.1:57286
2016/11/09 08:19:04 [DEBUG] http: Request GET /v1/agent/checks (43.809µs) from=127.0.0.1:57286
2016/11/09 08:19:06 [DEBUG] agent: check '751f8f44d02f0c440d2251472f02303dfdc56f80' is passing
2016/11/09 08:19:07 [DEBUG] agent: check '69e906ae7aeae2e237a89aa4decde678f3fa104c' is passing
2016/11/09 08:19:07 [DEBUG] http: Request GET /v1/agent/services (42.398µs) from=127.0.0.1:57204
2016/11/09 08:19:07 [DEBUG] http: Request GET /v1/agent/checks (31.669µs) from=127.0.0.1:57204
2016/11/09 08:19:07 [DEBUG] http: Request GET /v1/agent/services (20.967µs) from=127.0.0.1:57204
2016/11/09 08:19:08 [DEBUG] http: Request GET /v1/agent/services (44.461µs) from=127.0.0.1:57286
2016/11/09 08:19:08 [DEBUG] http: Request GET /v1/agent/checks (40.048µs) from=127.0.0.1:57286
2016/11/09 08:19:09 [DEBUG] agent: check 'bcbd506a35931cca712b453a9dd8e1a463f01d00' is passing
2016/11/09 08:19:12 [DEBUG] http: Request GET /v1/agent/services (42.977µs) from=127.0.0.1:57204
2016/11/09 08:19:12 [DEBUG] http: Request GET /v1/agent/checks (35.211µs) from=127.0.0.1:57204
2016/11/09 08:19:12 [DEBUG] http: Request GET /v1/agent/services (21.907µs) from=127.0.0.1:57204
2016/11/09 08:19:12 [DEBUG] agent: check '841860023284daa7c428406c87aa81ede8068c0e' is passing
2016/11/09 08:19:13 [DEBUG] http: Request GET /v1/agent/services (40.495µs) from=127.0.0.1:57286
2016/11/09 08:19:13 [DEBUG] http: Request GET /v1/agent/checks (35.782µs) from=127.0.0.1:57286
2016/11/09 08:19:16 [DEBUG] http: Request GET /v1/agent/services (41.934µs) from=127.0.0.1:57204
2016/11/09 08:19:16 [DEBUG] http: Request GET /v1/agent/checks (28.417µs) from=127.0.0.1:57204
2016/11/09 08:19:16 [DEBUG] http: Request GET /v1/agent/services (19.531µs) from=127.0.0.1:57204
2016/11/09 08:19:16 [DEBUG] agent: check '751f8f44d02f0c440d2251472f02303dfdc56f80' is passing
2016/11/09 08:19:17 [DEBUG] agent: check '69e906ae7aeae2e237a89aa4decde678f3fa104c' is passing
2016/11/09 08:19:18 [DEBUG] http: Request GET /v1/agent/services (42.915µs) from=127.0.0.1:57286
2016/11/09 08:19:18 [DEBUG] http: Request GET /v1/agent/checks (29.614µs) from=127.0.0.1:57286
2016/11/09 08:19:19 [DEBUG] agent: check 'bcbd506a35931cca712b453a9dd8e1a463f01d00' is passing
2016/11/09 08:19:21 [DEBUG] http: Request GET /v1/agent/services (44.062µs) from=127.0.0.1:57204
2016/11/09 08:19:21 [DEBUG] http: Request GET /v1/agent/checks (30.429µs) from=127.0.0.1:57204
2016/11/09 08:19:21 [DEBUG] http: Request GET /v1/agent/services (19.714µs) from=127.0.0.1:57204
2016/11/09 08:19:22 [DEBUG] agent: check '841860023284daa7c428406c87aa81ede8068c0e' is passing
2016/11/09 08:19:22 [DEBUG] http: Request GET /v1/agent/services (51.519µs) from=127.0.0.1:57286
2016/11/09 08:19:22 [DEBUG] http: Request GET /v1/agent/checks (41.649µs) from=127.0.0.1:57286
2016/11/09 08:19:25 [WARN] agent: Check '0dded7d45fe76226a414e259c6dad014cbd40979' missed TTL, is now critical

Job file (if appropriate)

job "test" {
    datacenters = ["dc1"]

    type = "service"

    group "test" {
        count = 1
        restart {
            interval = "5m"
            attempts = 20
            delay = "10s"
            mode = "delay"
        }
        task "test" {
            driver = "raw_exec"
            config {
                command= "/bin/sleep"
                args = ["1000"]
            }

            #Discovery Service Registration
            service {
                name = "test-check"
                tags = ["test-tag"]
                check {
                  type     = "script"
                  name     = "check_table"
                  command  = "/bin/sh"
                  args     = ["-c", "whoami"]
                  interval = "60s"
                  timeout  = "5s"
                }
            }

            resources {
                cpu = 100
                memory = 100
                network {
                    mbits = 1
                }
            }
        }
    }
}

Running programs

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 06:19 ?        00:00:00 /usr/sbin/init
root        18     1  0 06:19 ?        00:00:02 /usr/lib/systemd/systemd-journald
root       699     0  0 07:55 ?        00:00:00 sh
root      2844   699  0 08:01 ?        00:00:00 /var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/discovery/bin/consul agent -dev
root      2904     0  0 08:01 ?        00:00:00 sh
root      2988  2904  0 08:02 ?        00:00:01 /var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/scheduler/bin/nomad agent -dev
root      3124     0  0 08:02 ?        00:00:00 sh
rpc       3296     1  0 07:01 ?        00:00:00 /sbin/rpcbind -w
root      3950  2988  0 08:04 ?        00:00:00 /var/lib/ambari-agent/cache/stacks/HDP/2.4/services/manager/package/lib/scheduler/bin/nomad executor /tmp/NomadClient142315875/d1dc5941-2c66-a757-4bd5-44eb690e92c4/test/test-executor.out
root      3961  3950  0 08:04 ?        00:00:00 /bin/sleep 1000
dbus      4460     1  0 07:14 ?        00:00:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root      4461     1  0 07:14 ?        00:00:00 /usr/lib/systemd/systemd-logind
ams       4508     1  0 07:14 ?        00:00:09 /usr/bin/python2.7 /usr/lib/python2.6/site-packages/resource_monitoring/main.py start
hdfs      4707     1  1 07:14 ?        00:00:35 /usr/java/default/bin/java -Dproc_namenode -Xmx1024m -Dhdp.version=2.4.2.0-258 -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhdp.version=2.4.2.0-258 -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-master.ansible.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090714 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090714 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090714 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode
root      5212  3124  0 08:07 ?        00:00:00 ps -ef
zookeep+  5247     1  0 07:14 ?        00:00:02 /usr/java/default/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.log.file=zookeeper-zookeeper-server-master.ansible.log -Dzookeeper.root.logger=INFO,ROLLINGFILE -cp /usr/hdp/current/zookeeper-server/bin/../build/classes:/usr/hdp/current/zookeeper-server/bin/../build/lib/*.jar:/usr/hdp/current/zookeeper-server/bin/../lib/xercesMinimal-1.9.6.2.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-provider-api-2.4.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-http-shared4-2.4.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-http-shared-1.0-beta-6.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-http-lightweight-1.0-beta-6.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-http-2.4.jar:/usr/hdp/current/zookeeper-server/bin/../lib/wagon-file-1.0-beta-6.jar:/usr/hdp/current/zookeeper-server/bin/../lib/slf4j-log4j12-1.6.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/slf4j-api-1.6.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/plexus-utils-3.0.8.jar:/usr/hdp/current/zookeeper-server/bin/../lib/plexus-interpolation-1.11.jar:/usr/hdp/current/zookeeper-server/bin/../lib/plexus-container-default-1.0-alpha-9-stable-1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/netty-3.7.0.Final.jar:/usr/hdp/current/zookeeper-server/bin/../lib/nekohtml-1.9.6.2.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-settings-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-repository-metadata-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-project-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-profile-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-plugin-registry-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-model-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-error-diagnostics-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-artifact-manager-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-artifact-2.2.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/maven-ant-tasks-2.1.3.jar:/usr/hdp/current/zookeeper-server/bin/../lib/log4j-1.2.16.jar:/usr/hdp/current/zookeeper-server/bin/../lib/jsoup-1.7.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/jline-0.9.94.jar:/usr/hdp/current/zookeeper-server/bin/../lib/httpcore-4.2.3.jar:/usr/hdp/current/zookeeper-server/bin/../lib/httpclient-4.2.3.jar:/usr/hdp/current/zookeeper-server/bin/../lib/commons-logging-1.1.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/commons-io-2.2.jar:/usr/hdp/current/zookeeper-server/bin/../lib/commons-codec-1.6.jar:/usr/hdp/current/zookeeper-server/bin/../lib/classworlds-1.1-alpha-2.jar:/usr/hdp/current/zookeeper-server/bin/../lib/backport-util-concurrent-3.1.jar:/usr/hdp/current/zookeeper-server/bin/../lib/ant-launcher-1.8.0.jar:/usr/hdp/current/zookeeper-server/bin/../lib/ant-1.8.0.jar:/usr/hdp/current/zookeeper-server/bin/../zookeeper-3.4.6.2.4.2.0-258.jar:/usr/hdp/current/zookeeper-server/bin/../src/java/lib/*.jar:/usr/hdp/current/zookeeper-server/conf::/usr/share/zookeeper/*:/usr/share/zookeeper/* -Xmx1024m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/hdp/current/zookeeper-server/conf/zoo.cfg
yarn      5899     1  0 07:15 ?        00:00:15 /usr/java/default/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.4.2.0-258 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-master.ansible.log -Dyarn.log.file=yarn-yarn-timelineserver-master.ansible.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -Dyarn.policy.file=hadoop-policy.xml -Djava.io.tmpdir=/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-master.ansible.log -Dyarn.log.file=yarn-yarn-timelineserver-master.ansible.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.4.2.0-258/hadoop/lib/*:/usr/hdp/2.4.2.0-258/hadoop/.//*:/usr/hdp/2.4.2.0-258/hadoop-hdfs/./:/usr/hdp/2.4.2.0-258/hadoop-hdfs/lib/*:/usr/hdp/2.4.2.0-258/hadoop-hdfs/.//*:/usr/hdp/2.4.2.0-258/hadoop-yarn/lib/*:/usr/hdp/2.4.2.0-258/hadoop-yarn/.//*:/usr/hdp/2.4.2.0-258/hadoop-mapreduce/lib/*:/usr/hdp/2.4.2.0-258/hadoop-mapreduce/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
mapred    6884     1  0 07:15 ?        00:00:18 /usr/java/default/bin/java -Dproc_historyserver -Xmx900m -Dhdp.version=2.4.2.0-258 -Djava.net.preferIPv4Stack=true -Djava.io.tmpdir=/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -Dhdp.version= -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/mapred -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=mapred -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce/mapred -Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console -Dhadoop.id.str=mapred -Dhdp.version=2.4.2.0-258 -Dhadoop.log.dir=/var/log/hadoop/mapred -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=mapred -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce/mapred -Dhadoop.log.file=mapred-mapred-historyserver-master.ansible.log -Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
yarn      7124     1  1 07:15 ?        00:00:42 /usr/java/default/bin/java -Dproc_resourcemanager -Xmx1024m -Dhdp.version=2.4.2.0-258 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-resourcemanager-master.ansible.log -Dyarn.log.file=yarn-yarn-resourcemanager-master.ansible.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -Dyarn.policy.file=hadoop-policy.xml -Djava.io.tmpdir=/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-resourcemanager-master.ansible.log -Dyarn.log.file=yarn-yarn-resourcemanager-master.ansible.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-resourcemanager -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.4.2.0-258/hadoop/lib/*:/usr/hdp/2.4.2.0-258/hadoop/.//*:/usr/hdp/2.4.2.0-258/hadoop-hdfs/./:/usr/hdp/2.4.2.0-258/hadoop-hdfs/lib/*:/usr/hdp/2.4.2.0-258/hadoop-hdfs/.//*:/usr/hdp/2.4.2.0-258/hadoop-yarn/lib/*:/usr/hdp/2.4.2.0-258/hadoop-yarn/.//*:/usr/hdp/2.4.2.0-258/hadoop-mapreduce/lib/*:/usr/hdp/2.4.2.0-258/hadoop-mapreduce/.//*:/usr/hdp/current/hadoop-yarn-resourcemanager/.//*:/usr/hdp/current/hadoop-yarn-resourcemanager/lib/*:/usr/hdp/current/hadoop-client/conf/rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
hdfs      7459     1  0 07:15 ?        00:00:08 /usr/java/default/bin/java -Dproc_secondarynamenode -Xmx1024m -Dhdp.version=2.4.2.0-258 -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhdp.version=2.4.2.0-258 -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-master.ansible.log -Dhadoop.home.dir=/usr/hdp/2.4.2.0-258/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native:/usr/hdp/2.4.2.0-258/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.2.0-258/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090715 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node" -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090715 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node" -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/hdfs/gc.log-201611090715 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node" -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
spark     8104     1  0 07:15 ?        00:00:08 /usr/java/default/bin/java -Dhdp.version=2.4.2.0-258 -cp /usr/hdp/2.4.2.0-258/spark/sbin/../conf/:/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar:/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar:/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/hadoop-client/conf/ -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
ams       8335     1  0 07:15 ?        00:00:00 bash /usr/lib/ams-hbase/bin/hbase-daemon.sh --config /etc/ams-hbase/conf foreground_start master
ams       8349  8335  1 07:15 ?        00:00:57 /usr/java/default/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/ambari-metrics-collector/hs_err_pid%p.log -Djava.io.tmpdir=/var/lib/ambari-metrics-collector/hbase-tmp -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native/ -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/gc.log-201611090715 -Xms1024m -Xmx1024m -Xmn256m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Dhbase.log.dir=/var/log/ambari-metrics-collector -Dhbase.log.file=hbase-ams-master-master.ansible.log -Dhbase.home.dir=/usr/lib/ams-hbase/bin/.. -Dhbase.id.str=ams -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
ams       8412     1  1 07:15 ?        00:00:56 /usr/java/default/bin/java -Xms512m -Xmx512m -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-201611090715 -cp /usr/lib/ambari-metrics-collector/*:/etc/ambari-metrics-collector/conf -Djava.net.preferIPv4Stack=true -Dams.log.dir=/var/log/ambari-metrics-collector -Dproc_timelineserver org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
root     12254     1  0 07:24 ?        00:00:00 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/AmbariAgent.py start
root     12262 12254  1 07:24 ?        00:00:34 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start
root     20212     1  0 07:28 ?        00:00:01 /usr/sbin/rsyslogd -n
@dadgar
Contributor

dadgar commented Nov 9, 2016

Can you paste the output of curl http://127.0.0.1:8500/v1/agent/checks? I just ran your job and the check passes for me.
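If the full output is noisy, something along these lines should narrow it down to the relevant entry (a rough sketch assuming jq is available on the host; the ServiceName value is just whatever the service is called in the job file):

curl -s http://127.0.0.1:8500/v1/agent/checks | jq 'to_entries[] | select(.value.ServiceName == "test-check") | .value'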

@kaskavalci
Contributor Author

Not exactly agent/checks, but here is the health/service/test response:

[
    {
        "Node": {
            "Node": "mynode",
            "Address": "10.97.19.161",
            "CreateIndex": 17358,
            "ModifyIndex": 23033
        },
        "Service": {
            "ID": "_nomad-executor-956049c9-51f6-e7f3-12c5-a11a9ee84879-test-test-test",
            "Service": "test",
            "Tags": [
                "test-tag"
            ],
            "Address": "",
            "Port": 0,
            "EnableTagOverride": false,
            "CreateIndex": 23032,
            "ModifyIndex": 23033
        },
        "Checks": [
            {
                "Node": "mynode",
                "CheckID": "110d4ea9f8863c4a10efdc609211f70f1ee11515",
                "Name": "sleep-check",
                "Status": "critical",
                "Notes": "",
                "Output": "",
                "ServiceID": "_nomad-executor-956049c9-51f6-e7f3-12c5-a11a9ee84879-test-test-test",
                "ServiceName": "test",
                "CreateIndex": 23033,
                "ModifyIndex": 23033
            },
            {
                "Node": "mynode",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": "",
                "CreateIndex": 17358,
                "ModifyIndex": 17358
            }
        ]
    }
]

@dadgar
Contributor

dadgar commented Nov 9, 2016

@kaskavalci Is the issue that the check never switches to passing, or that it starts critical? It is a bit hard to debug from the logs/process/check output you posted, since I think they were captured at different times.

@kaskavalci
Contributor Author

@dadgar It starts critical and never changes. I also tested with a command that creates a file, and the file is created, which means Nomad successfully executes the binary but does not send the TTL update to Consul. Consul has a log entry complaining:

2016/11/09 08:19:25 [WARN] agent: Check '0dded7d45fe76226a414e259c6dad014cbd40979' missed TTL, is now critical

Job config that creates a file:

config {
  command= "/bin/sh"
  args = ["-c", "echo test > /tmp/healthtest"]
}

Verifying the file was created:

$ cat /tmp/healthtest 
test
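Since Nomad implements script checks by registering a TTL check with Consul and having the executor heartbeat it (which matches the "missed TTL" warning above), one way to isolate the Consul side is to push a TTL update by hand for the check ID from that warning. This is only a rough sketch assuming the default local agent address and a Consul version that has the check update endpoint (0.6.4+):

curl -s -X PUT -d '{"Status": "passing", "Output": "manual TTL update"}' \
    http://127.0.0.1:8500/v1/agent/check/update/0dded7d45fe76226a414e259c6dad014cbd40979

If the check flips to passing after this, registration and the Consul agent are fine, and the missing piece is the executor's periodic update.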

@diptanu
Contributor

diptanu commented Nov 10, 2016

@kaskavalci We would like to see if Nomad is sending any requests to Consul.

Can you capture some packets for us?

For example, on my host Consul is running on port 8500 and bound to the lo interface. The following command captures the packets going in and out: tcpdump -i lo -l -w - port 8500

Can you capture 3-4 minutes of data and share with us, please?
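Something like the following should produce a capture file that can be attached here; the duration and file name are just examples:

timeout 240 tcpdump -i lo -s0 -w /tmp/consul-8500.pcap port 8500

The capture can be sanity-checked locally with tcpdump -r /tmp/consul-8500.pcap -A | grep -i check, or opened in Wireshark.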

@kaskavalci
Contributor Author

@jippi
Contributor

jippi commented Nov 14, 2016

I'm experiencing a similar issue. Checks seem to deregister after a while, and sometimes only partially register in Consul. Example:

group "redis" {
        count = 3

        task "server" {
            driver = "docker"
            user   = "root"

            service {
                name = "redis-discovery"
                port = "redis"
                tags = ["index-${NOMAD_ALLOC_INDEX}"]
            }
      }
}

In a lot of cases only some of the tags register. For example, in my last run index-1.redis-discovery.service.consul and index-2.redis-discovery.service.consul were registered correctly, but index-0.redis-discovery.service.consul was not registered at all (neither as a check nor as a service).
Resubmitting the job made index-0.redis-discovery.service.consul show up again, though two other checks in the job file went away completely.
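A quick way to see which instances actually made it into the catalog is to query Consul directly; the commands below assume default Consul DNS (port 8600) and HTTP (port 8500) settings and use standard tag-based lookups:

dig @127.0.0.1 -p 8600 index-0.redis-discovery.service.consul SRV
curl -s 'http://127.0.0.1:8500/v1/catalog/service/redis-discovery?tag=index-0'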

Sometimes the service registers, but without any check. My Nomad cluster is running at DEBUG log level, but I can't find anything related to Consul in the log output. (0.5.0-rc2)

Consul is 0.7.1, running at log level INFO, and there is nothing interesting worth noting there either. The POST payload seems correct, and I can't see any errors in Consul from it.

As suggested to @schmichael, I can give HashiCorp SSH access to the infrastructure for deeper analysis if need be.

For completeness' sake, this is my full job spec: https://gist.github.com/jippi/7b182548eb4f8ef0b0799327dc02d5af

@jippi
Contributor

jippi commented Nov 14, 2016

This is roughly the flow I'm seeing from Nomad:

2016/11/14 20:02:04 [INFO] agent: Synced check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'
2016/11/14 20:02:12 [INFO] agent: Synced check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'

Submitting the job again

2016/11/14 20:02:37 [INFO] agent: Deregistered check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'

(it is no longer registered, but not re-registered either)

Submitting the job again

(...)

Yep, nothing in Nomad regarding service checks at all

Submitting the job again

2016/11/14 20:07:10 [INFO] agent: Synced check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'
2016/11/14 20:07:17 [INFO] agent: Synced check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'

On one node, but not on the other two

Submitting the job again

2016/11/14 20:08:20 [INFO] agent: Deregistered check '32cdd97e43f4cab82899ec76c73a151cca5e0ef4'

and nothing else :(

@jippi
Contributor

jippi commented Nov 14, 2016

Even removing all the services + checks and resubmitting the job doesn't fully remove them from Consul, so there seems to be some inconsistent state between Nomad and Consul going on :(
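For anyone hitting the same stale state, the leftovers can usually be force-removed straight through the Consul agent API on the node that owns them; <service-id> and <check-id> here are placeholders for whatever /v1/agent/services and /v1/agent/checks report (newer Consul versions require a PUT; older ones also accepted GET):

curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/<service-id>
curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/deregister/<check-id>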

@jippi
Contributor

jippi commented Nov 14, 2016

Force removing the services + checks and submitting the original job again successfully registered all 3 checks + the service for redis-sentinel, only index-1 for the redis-discovery service, and 1 of the 3 checks + services for redis (nothing in the logs from Nomad at all for the 2 other allocations or checks).

Maybe worth noting that the resubmitted job is unmodified, so no new allocations are created.

I tried changing the check name, and on the first try everything registered correctly for that service (redis), but registration remained partial for redis-discovery and redis-sentinel.

I've noticed a few times that everything registers correctly, but then, without submitting a new job, the service and checks become unregistered again 12-48h later, meaning Nomad or Consul must decide to remove them for some reason. Since I don't see this behavior with normal Consul checks in our apps (also registered via the HTTP API), my bet is Nomad doing something silly :(

@dadgar
Contributor

dadgar commented Nov 28, 2016

@kaskavalci Is this still a problem on 0.5? There were some Consul bug fixes.

@kaskavalci
Contributor Author

I will test ASAP and let you know.

@kaskavalci
Contributor Author

Confirmed issue no longer exists. Thanks 👍

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2022