Nomad should provide a way to know if a task failed due to OOM killer #2203

Closed
multani opened this issue Jan 16, 2017 · 6 comments

multani commented Jan 16, 2017

Reference: https://groups.google.com/forum/#!msg/nomad-tool/h7qNXEsavFw/s5HEnyPWEQAJ

Nomad version

0.5.2

Operating system and Environment details

Debian Stable, running Docker 1.12

Issue

When a task uses more memory than what has been declared in the job, it gets killed and Nomad marks it as failed, but Nomad doesn't give any more information about what the underlying problem is. Looking at the kernel logs clearly shows the OOM killer in action.

This makes it difficult to track down the error from Nomad alone (I initially thought there was a problem in our application) and to provide feedback to Nomad operators on how to properly fix their jobs.

Reproduction steps

$ nomad run oom-killed.nomad
==> Monitoring evaluation "f0daaffe"
    Evaluation triggered by job "oom-killed"
    Allocation "73b69aa3" created: node "36e9d87f", group "oom-killed"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f0daaffe" finished with status "complete"
$ nomad alloc-status -verbose 363ba53f
ID                 = 363ba53f-1184-d95d-214d-12fe226a5bc3
Eval ID            = e4d0b842-29b1-c5f1-f0e0-b9b389377a59
Name               = oom-killed.oom-killed[0]
Node ID            = b76973d6-d18f-bfbf-aab9-c596320bf891
Job ID             = oom-killed
Client Status      = pending
Client Description = <none>
Created At         = 01/16/17 19:27:49 CET
Evaluated Nodes    = 1
Filtered Nodes     = 0
Exhausted Nodes    = 0
Allocation Time    = 27.649µs
Failures           = 0

Task "oom-killed" is "pending"
Task Resources
CPU      Memory  Disk  IOPS  Addresses
100 MHz  15 MiB  0 B   0

Recent Events:
Time                   Type        Description
01/16/17 19:28:02 CET  Restarting  Task restarting in 18.613972498s
01/16/17 19:28:02 CET  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
01/16/17 19:27:52 CET  Started     Task started by client
01/16/17 19:27:49 CET  Received    Task received by client

Placement Metrics
  * Score "b76973d6-d18f-bfbf-aab9-c596320bf891.binpack" = 0.225770

Related dmesg message:

[62781.485709] stress invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), order=0, oom_score_adj=0
[62781.485712] stress cpuset=fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d mems_allowed=0
[62781.485720] CPU: 1 PID: 930 Comm: stress Tainted: G           OE   4.8.0-2-amd64 #1 Debian 4.8.11-1
[62781.485722] Hardware name: LENOVO 20FN003LMZ/20FN003LMZ, BIOS R06ET33W (1.07 ) 01/05/2016
[62781.485724]  0000000000000286 00000000dfeb1f8f ffffffff9d5269f5 ffff91f740993e10
[62781.485729]  ffff91f740519080 ffffffff9d3fe6d1 ffffffff9d2a4774 ffff91f881415000
[62781.485734]  0000000000000003 ffffffff9d3808c6 ffff91f740519080 ffff91f740519080
[62781.485738] Call Trace:
[62781.485747]  [<ffffffff9d5269f5>] ? dump_stack+0x5c/0x77
[62781.485752]  [<ffffffff9d3fe6d1>] ? dump_header+0x59/0x1dc
[62781.485757]  [<ffffffff9d2a4774>] ? ttwu_do_wakeup+0x14/0xe0
[62781.485761]  [<ffffffff9d3808c6>] ? find_lock_task_mm+0x36/0x80
[62781.485765]  [<ffffffff9d381452>] ? oom_kill_process+0x222/0x3e0
[62781.485768]  [<ffffffff9d3f58fc>] ? mem_cgroup_iter+0x1dc/0x300
[62781.485771]  [<ffffffff9d3f7b59>] ? mem_cgroup_out_of_memory+0x299/0x2d0
[62781.485775]  [<ffffffff9d3f865b>] ? mem_cgroup_oom_synchronize+0x31b/0x330
[62781.485779]  [<ffffffff9d3f3940>] ? memory_high_write+0xd0/0xd0
[62781.485783]  [<ffffffff9d381a9d>] ? pagefault_out_of_memory+0x4d/0xc0
[62781.485787]  [<ffffffff9d7f0e58>] ? page_fault+0x28/0x30
[62781.485789] Task in /docker/fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d killed as a result of limit of /docker/fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d
[62781.485797] memory: usage 15360kB, limit 15360kB, failcnt 232
[62781.485799] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[62781.485800] kmem: usage 456kB, limit 9007199254740988kB, failcnt 0
[62781.485801] Memory cgroup stats for /docker/fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d: cache:52KB rss:14852KB rss_huge:0KB mapped_file:4KB dirty:48KB writeback:0KB inactive_anon:0KB active_anon:14852KB inactive_file:48KB active_file:4KB unevictable:0KB
[62781.485819] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[62781.486028] [  822]     0   822      379        1       7       3        0             0 sh
[62781.486032] [  929]     0   929      186        1       5       3        0             0 stress
[62781.486036] [  930]     0   930    12987     3632      13       3        0             0 stress
[62781.486039] Memory cgroup out of memory: Kill process 930 (stress) score 950 or sacrifice child
[62781.486047] Killed process 930 (stress) total-vm:51948kB, anon-rss:14524kB, file-rss:4kB, shmem-rss:0kB
[62781.578335] veth229fd35: renamed from eth0
[62781.618999] docker0: port 3(veth0f159ef) entered disabled state
[62781.635293] docker0: port 3(veth0f159ef) entered disabled state
[62781.637665] device veth0f159ef left promiscuous mode
[62781.637667] docker0: port 3(veth0f159ef) entered disabled state

Nomad Client logs (if appropriate)

$ sudo nomad agent -dev -log-level debug
    No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: debug
                Region: global (DC: dc1)
                Server: true
               Version: 0.5.2

==> Nomad agent started! Log data will stream in below:

    2017/01/16 19:27:30 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state (Leader: "")
    2017/01/16 19:27:30 [INFO] serf: EventMemberJoin: cory 127.0.0.1
    2017/01/16 19:27:30.223969 [INFO] nomad: starting 4 scheduling worker(s) for [service batch system _core]
    2017/01/16 19:27:30.224234 [INFO] client: using state directory /tmp/NomadClient056663668
    2017/01/16 19:27:30.224301 [INFO] client: using alloc directory /tmp/NomadClient998095683
    2017/01/16 19:27:30.224363 [INFO] nomad: adding server cory (Addr: 127.0.0.1:4647) (DC: dc1)
    2017/01/16 19:27:33.656916 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["dc1"]
    2017/01/16 19:27:30.224372 [DEBUG] client: built-in fingerprints: [arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gce]
    2017/01/16 19:27:30.224583 [INFO] fingerprint.cgroups: cgroups are available
    2017/01/16 19:27:30.224872 [DEBUG] client: fingerprinting cgroup every 15s

    2017/01/16 19:27:30.226970 [INFO] fingerprint.consul: consul agent is available
    2017/01/16 19:27:30.227110 [DEBUG] client: fingerprinting consul every 15s
    2017/01/16 19:27:30.227534 [DEBUG] fingerprint.cpu: frequency: 2800 MHz
    2017/01/16 19:27:30.227540 [DEBUG] fingerprint.cpu: core count: 4
    2017/01/16 19:27:30.409559 [DEBUG] fingerprint.network: Detected interface lo with IP 127.0.0.1 during fingerprinting
    2017/01/16 19:27:30.409614 [DEBUG] fingerprint.network: Unable to read link speed from /sys/class/net/lo/speed
    2017/01/16 19:27:30.409617 [DEBUG] fingerprint.network: link speed could not be detected and no speed specified by user. Defaulting to 1000
    2017/01/16 19:27:30.411156 [DEBUG] client: fingerprinting vault every 15s
    2017/01/16 19:27:31 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/01/16 19:27:31 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2017/01/16 19:27:31 [DEBUG] raft: Votes needed: 1
    2017/01/16 19:27:31 [DEBUG] raft: Vote granted from 127.0.0.1:4647. Tally: 1
    2017/01/16 19:27:31 [INFO] raft: Election won. Tally: 1
    2017/01/16 19:27:31 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2017/01/16 19:27:31 [INFO] raft: Disabling EnableSingleNode (bootstrap)
    2017/01/16 19:27:31 [DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
    2017/01/16 19:27:31.694676 [INFO] nomad: cluster leadership acquired
    2017/01/16 19:27:31.694810 [DEBUG] leader: reconciling job summaries at index: 0
    2017/01/16 19:27:32.411294 [DEBUG] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
    2017/01/16 19:27:33.483874 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2017/01/16 19:27:33.483895 [DEBUG] fingerprint.env_gce: Error querying GCE Metadata URL, skipping
    2017/01/16 19:27:33.483927 [DEBUG] client: applied fingerprints [arch cgroup consul cpu host memory network nomad signal storage]
    2017/01/16 19:27:33.542413 [DEBUG] driver.qemu: enabling driver
    2017/01/16 19:27:33.542562 [DEBUG] driver.docker: using client connection initialized from environment
    2017/01/16 19:27:33.542672 [DEBUG] client: fingerprinting rkt every 15s
    2017/01/16 19:27:33.543269 [DEBUG] driver.exec: exec driver is enabled
    2017/01/16 19:27:33.543357 [DEBUG] client: fingerprinting docker every 15s
    2017/01/16 19:27:33.543410 [DEBUG] client: fingerprinting exec every 15s
    2017/01/16 19:27:33.655421 [DEBUG] client: available drivers [qemu docker exec raw_exec java]
    2017/01/16 19:27:33.655540 [INFO] client: Node ID "b76973d6-d18f-bfbf-aab9-c596320bf891"
    2017/01/16 19:27:33.656229 [DEBUG] client: updated allocations at index 1 (pulled 0) (filtered 0)
    2017/01/16 19:27:33.656308 [INFO] client: node registration complete
    2017/01/16 19:27:33.656316 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2017/01/16 19:27:33.660410 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["dc1"]
    2017/01/16 19:27:33.660483 [DEBUG] client: periodically checking for node changes at duration 5s
    2017/01/16 19:27:33.660710 [DEBUG] client: state updated to ready

    2017/01/16 19:27:49.174054 [DEBUG] worker: dequeued evaluation e4d0b842-29b1-c5f1-f0e0-b9b389377a59
    2017/01/16 19:27:49.174177 [DEBUG] sched: <Eval 'e4d0b842-29b1-c5f1-f0e0-b9b389377a59' JobID: 'oom-killed'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
    2017/01/16 19:27:49.174747 [DEBUG] http: Request /v1/jobs?region=global (2.232029ms)
    2017/01/16 19:27:49.174867 [DEBUG] worker: submitted plan for evaluation e4d0b842-29b1-c5f1-f0e0-b9b389377a59
    2017/01/16 19:27:49.174894 [DEBUG] sched: <Eval 'e4d0b842-29b1-c5f1-f0e0-b9b389377a59' JobID: 'oom-killed'>: setting status to complete
    2017/01/16 19:27:49.174951 [DEBUG] client: updated allocations at index 8 (pulled 1) (filtered 0)
    2017/01/16 19:27:49.175039 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
    2017/01/16 19:27:49.175084 [DEBUG] worker: updated evaluation <Eval 'e4d0b842-29b1-c5f1-f0e0-b9b389377a59' JobID: 'oom-killed'>
    2017/01/16 19:27:49.175133 [DEBUG] worker: ack for evaluation e4d0b842-29b1-c5f1-f0e0-b9b389377a59
    2017/01/16 19:27:49.177434 [DEBUG] http: Request /v1/evaluation/e4d0b842-29b1-c5f1-f0e0-b9b389377a59?region=global (1.194867ms)
    2017/01/16 19:27:49.177746 [DEBUG] client: starting task runners for alloc '363ba53f-1184-d95d-214d-12fe226a5bc3'
    2017/01/16 19:27:49.178561 [DEBUG] client: starting task context for 'oom-killed' (alloc '363ba53f-1184-d95d-214d-12fe226a5bc3')
    2017/01/16 19:27:49.180747 [DEBUG] http: Request /v1/evaluation/e4d0b842-29b1-c5f1-f0e0-b9b389377a59/allocations?region=global (380.745µs)
    2017/01/16 19:27:49.307801 [DEBUG] client: updated allocations at index 10 (pulled 0) (filtered 1)
    2017/01/16 19:27:49.308046 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
    2017/01/16 19:27:52.018643 [DEBUG] driver.docker: docker pull zyfdedh/stress:latest succeeded
    2017/01/16 19:27:52.022281 [DEBUG] driver.docker: identified image zyfdedh/stress:latest as sha256:adba7b24235b168ae9afb86fb19c88c75ff9c257c8a7d17f62a266993ed0ddf6
    2017/01/16 19:27:52 [DEBUG] plugin: starting plugin: /home/jballet/local/nomad/0.5.2/nomad []string{"/home/jballet/local/nomad/0.5.2/nomad", "executor", "/tmp/NomadClient998095683/363ba53f-1184-d95d-214d-12fe226a5bc3/oom-killed/oom-killed-executor.out"}
    2017/01/16 19:27:52 [DEBUG] plugin: waiting for RPC address for: /home/jballet/local/nomad/0.5.2/nomad
    2017/01/16 19:27:52 [DEBUG] plugin: nomad: 2017/01/16 19:27:52 [DEBUG] plugin: plugin address: unix /tmp/plugin761602023
    2017/01/16 19:27:52.041525 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin987718170
    2017/01/16 19:27:52.041553 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin987718170]}
    2017/01/16 19:27:52.041562 [DEBUG] driver.docker: using 15728640 bytes memory for oom-killed
    2017/01/16 19:27:52.041574 [DEBUG] driver.docker: using 100 cpu shares for oom-killed
    2017/01/16 19:27:52.041591 [DEBUG] driver.docker: binding directories []string{"/tmp/NomadClient998095683/363ba53f-1184-d95d-214d-12fe226a5bc3/alloc:/alloc", "/tmp/NomadClient998095683/363ba53f-1184-d95d-214d-12fe226a5bc3/oom-killed/local:/local", "/tmp/NomadClient998095683/363ba53f-1184-d95d-214d-12fe226a5bc3/oom-killed/secrets:/secrets"} for oom-killed
    2017/01/16 19:27:52.041599 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
    2017/01/16 19:27:52.041606 [DEBUG] driver.docker: No network interfaces are available
    2017/01/16 19:27:52.041659 [DEBUG] driver.docker: setting container startup command to: sh -c sleep 10; stress --vm 1 --vm-bytes 50M
    2017/01/16 19:27:52.041674 [DEBUG] driver.docker: setting container name to: oom-killed-363ba53f-1184-d95d-214d-12fe226a5bc3
    2017/01/16 19:27:52.074228 [INFO] driver.docker: created container fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d
    2017/01/16 19:27:52.310544 [INFO] driver.docker: started container fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d
    2017/01/16 19:27:52.311650 [WARN] client: error fetching stats of task oom-killed: stats collection hasn't started yet
    2017/01/16 19:27:52.508134 [DEBUG] client: updated allocations at index 11 (pulled 0) (filtered 1)
    2017/01/16 19:27:52.508400 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
    2017/01/16 19:28:00.416359 [DEBUG] http: Request /v1/allocations?prefix=363ba53f (419.41µs)
    2017/01/16 19:28:00.417188 [DEBUG] http: Request /v1/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3 (183.484µs)
    2017/01/16 19:28:00.418521 [DEBUG] http: Request /v1/node/b76973d6-d18f-bfbf-aab9-c596320bf891 (174.042µs)
    2017/01/16 19:28:00.419687 [DEBUG] http: Request /v1/client/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3/stats (428.307µs)
    2017/01/16 19:28:02.623407 [DEBUG] driver.docker: error collecting stats from container fd635ede503fdf45897939516af7e60865570389521d5a2f9e3a9f8d4c88b67d: io: read/write on closed pipe
    2017/01/16 19:28:02 [DEBUG] plugin: /home/jballet/local/nomad/0.5.2/nomad: plugin process exited
    2017/01/16 19:28:02.686215 [INFO] client: task "oom-killed" for alloc "363ba53f-1184-d95d-214d-12fe226a5bc3" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
    2017/01/16 19:28:02.686246 [INFO] client: Restarting task "oom-killed" for alloc "363ba53f-1184-d95d-214d-12fe226a5bc3" in 18.613972498s
    2017/01/16 19:28:02.907824 [DEBUG] client: updated allocations at index 12 (pulled 0) (filtered 1)
    2017/01/16 19:28:02.907964 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
    2017/01/16 19:28:03.861055 [DEBUG] http: Request /v1/allocations?prefix=363ba53f (403.165µs)
    2017/01/16 19:28:03.868238 [DEBUG] http: Request /v1/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3 (467.155µs)
    2017/01/16 19:28:03.870565 [DEBUG] http: Request /v1/node/b76973d6-d18f-bfbf-aab9-c596320bf891 (405.781µs)
    2017/01/16 19:28:03.871775 [DEBUG] http: Request /v1/client/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3/stats (135.562µs)

    2017/01/16 19:28:15.293251 [DEBUG] http: Request /v1/allocations?prefix=363ba53f (394.093µs)
    2017/01/16 19:28:15.295485 [DEBUG] http: Request /v1/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3 (1.232518ms)
    2017/01/16 19:28:15.298162 [DEBUG] http: Request /v1/node/b76973d6-d18f-bfbf-aab9-c596320bf891 (523.706µs)
    2017/01/16 19:28:15.299386 [DEBUG] http: Request /v1/client/allocation/363ba53f-1184-d95d-214d-12fe226a5bc3/stats (408.943µs)
    2017/01/16 19:28:19.144766 [DEBUG] http: Request /v1/jobs?prefix=oom-killed (426.23µs)
    2017/01/16 19:28:19.146549 [DEBUG] http: Request /v1/job/oom-killed (377.692µs)
    2017/01/16 19:28:19.149023 [DEBUG] worker: dequeued evaluation 5e3398a3-5552-4ce9-e2ce-c251830362c1
    2017/01/16 19:28:19.149148 [DEBUG] http: Request /v1/job/oom-killed (874.864µs)
    2017/01/16 19:28:19.149191 [DEBUG] sched: <Eval '5e3398a3-5552-4ce9-e2ce-c251830362c1' JobID: 'oom-killed'>: allocs: (place 0) (update 0) (migrate 0) (stop 1) (ignore 0) (lost 0)
    2017/01/16 19:28:19.150016 [DEBUG] worker: submitted plan for evaluation 5e3398a3-5552-4ce9-e2ce-c251830362c1
    2017/01/16 19:28:19.150057 [DEBUG] client: updated allocations at index 15 (pulled 1) (filtered 0)
    2017/01/16 19:28:19.150082 [DEBUG] sched: <Eval '5e3398a3-5552-4ce9-e2ce-c251830362c1' JobID: 'oom-killed'>: setting status to complete
    2017/01/16 19:28:19.150231 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 0)
    2017/01/16 19:28:19.150279 [DEBUG] client: Not restarting task: oom-killed because it has been destroyed
    2017/01/16 19:28:19.150435 [DEBUG] worker: updated evaluation <Eval '5e3398a3-5552-4ce9-e2ce-c251830362c1' JobID: 'oom-killed'>
    2017/01/16 19:28:19.150521 [DEBUG] worker: ack for evaluation 5e3398a3-5552-4ce9-e2ce-c251830362c1
    2017/01/16 19:28:19.155326 [DEBUG] http: Request /v1/evaluation/5e3398a3-5552-4ce9-e2ce-c251830362c1 (2.578416ms)
    2017/01/16 19:28:19.156955 [DEBUG] http: Request /v1/evaluation/5e3398a3-5552-4ce9-e2ce-c251830362c1/allocations (198.237µs)
    2017/01/16 19:28:19.308274 [DEBUG] client: updated allocations at index 17 (pulled 0) (filtered 1)
    2017/01/16 19:28:19.308543 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)

^C==> Caught signal: interrupt
    2017/01/16 19:28:21.582012 [DEBUG] http: Shutting down http server
    2017/01/16 19:28:21.582193 [INFO] agent: requesting shutdown
    2017/01/16 19:28:21.582245 [INFO] client: shutting down
    2017/01/16 19:28:21.610262 [DEBUG] client: terminating runner for alloc '363ba53f-1184-d95d-214d-12fe226a5bc3'
    2017/01/16 19:28:21.610339 [INFO] nomad: shutting down server
    2017/01/16 19:28:21 [WARN] serf: Shutdown without a Leave
    2017/01/16 19:28:21.621483 [INFO] agent: shutdown complete

Job file (if appropriate)

job "oom-killed" {
    datacenters = ["dc1"]
    type = "service"

    group "oom-killed" {
        task "oom-killed" {
            driver = "docker"

            config {
                image = "zyfdedh/stress:latest"
                command = "sh"
                args = [ "-c", "sleep 10; stress --vm 1 --vm-bytes 50M" ]
            }

            resources {
                memory = 15 # MB
            }
        }
    }
}

multani commented Jan 18, 2017

As mentioned by @diptanu, the cgroups notification API can do the job in this case, although that would be Linux-specific.

@diptanu: I'd be glad to propose a patch for this feature. Could you give me some pointers on where in Nomad I should start hacking?
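
For reference, this is roughly how the cgroup (v1) OOM notification API works: open the cgroup's memory.oom_control, register an eventfd against it through cgroup.event_control, then block on the eventfd until the kernel signals an OOM event. The following is a hand-written, untested sketch, not code taken from Nomad; paths and structure are illustrative only.

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// watchOOM blocks until the memory cgroup at cgroupPath reports an OOM kill.
func watchOOM(cgroupPath string) error {
	// Open memory.oom_control; its fd is what gets registered below.
	oomControl, err := os.Open(filepath.Join(cgroupPath, "memory.oom_control"))
	if err != nil {
		return err
	}
	defer oomControl.Close()

	// Create an eventfd that the kernel will signal when the cgroup hits an OOM.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return err
	}
	defer unix.Close(efd)

	// Register the "<eventfd> <oom_control fd>" pair with the cgroup.
	data := fmt.Sprintf("%d %d", efd, oomControl.Fd())
	eventControl := filepath.Join(cgroupPath, "cgroup.event_control")
	if err := os.WriteFile(eventControl, []byte(data), 0o600); err != nil {
		return err
	}

	// Block until the kernel writes its 8-byte event counter to the eventfd.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		return err
	}
	log.Printf("OOM event in cgroup %s", cgroupPath)
	return nil
}

func main() {
	// Usage: oomwatch /sys/fs/cgroup/memory/docker/<container-id>
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <memory-cgroup-path>", os.Args[0])
	}
	if err := watchOOM(os.Args[1]); err != nil {
		log.Fatal(err)
	}
}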

@mlushpenko

@dadgar Yes please. We spent several days trying to launch an owasp-zap container via Nomad, while it was working totally fine via Docker, before we fixed it by assigning 1 GB of memory to the Nomad job.

dadgar added this to the unscheduled milestone Jan 31, 2017

multani commented Feb 5, 2017

I started putting together a fix here: https://github.com/multani/nomad/commits/fix-oom-notification

I was happy with this initial patch and was going to submit a PR, but obviously it only covers Docker, so I'm not sure whether it can be proposed as-is. I need to look into how to properly support the other drivers as well.
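
For illustration, the kind of check the Docker driver can rely on looks roughly like this (a hand-written sketch, not the actual patch): the Docker daemon records whether the kernel OOM killer terminated a container in the container's state, so the driver can surface that in the task event instead of a bare exit code. The manual equivalent is docker inspect -f '{{.State.OOMKilled}}' <container-id>.

package main

import (
	"fmt"
	"log"
	"os"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	// Usage: oomcheck <container-id>
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <container-id>", os.Args[0])
	}

	// Connect to the local Docker daemon using the usual DOCKER_* environment.
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}

	container, err := client.InspectContainer(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}

	// State.OOMKilled is set by the daemon when the cgroup OOM killer fired.
	if container.State.OOMKilled {
		fmt.Printf("container was OOM-killed (exit code %d)\n", container.State.ExitCode)
	} else {
		fmt.Printf("container exited with code %d, not OOM-killed\n", container.State.ExitCode)
	}
}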


burdandrei commented Feb 14, 2017

This would be a great thing to have.

@tmichaud314

@dadgar What needs to happen for this to be releasable? We sometimes run into this issue when a task's memory requirements are not completely understood or managed in advance. Setting a restart policy of failed is not respected; the task exits with a 137 status code and is then restarted ad infinitum. We have to manually stop jobs that end up like this.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022