Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High number of log streaming go-routines: Panic: out of memory #2840

Closed
mlehner616 opened this issue Jul 16, 2017 · 12 comments
Closed

High number of log streaming go-routines: Panic: out of memory #2840

mlehner616 opened this issue Jul 16, 2017 · 12 comments

Comments

@mlehner616
Copy link

mlehner616 commented Jul 16, 2017

A nomad client node crashed with an out of memory error despite having reserved resources configured. The instances have just shy of 4GB of memory available and the memory utilization at the time of the crash (based on relocated allocations) is estimated at about 650MB. I can say, based on how few allocations there were at the time, the client nodes were barely utilized which leads me to think the nomad client on this particular node hit some kind of memory leak until it crashed. I suppose it's also possible my ulimit settings are misconfigured, I will look into this.

{
    "log_level": "INFO",
    "data_dir": "/var/lib/nomad",
    "datacenter": "qa",
    "client": {
        "enabled": true,
        "options": {
            "driver.raw_exec.enable": "1",
            "docker.auth.config": "/root/.docker/config.json"
        },
        "reserved": {
            "cpu": "200",
            "memory": "128",
            "disk": "1024",
            "reserved_ports": "22,80,443,8500-8600"
        }
    }
}

Nomad version

Nomad v0.6.0-rc1 (5187fcf+CHANGES)

Operating system and Environment details

Amazon Linux kernel: 4.9.32-15.41.amzn1.x86_64

Issue

nomad client crashed with the following error:
fatal error: runtime: out of memory

Reproduction steps

I was deploying updates to a job that included additions of invalid template stanzas.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

https://gist.github.com/mlehner616/20d82b2df37054e62fc9041a96498ce7

First example allocation attempt (Invalid Template stanza)

failed to build task's template manager: failed to read template: open /var/lib/nomad/alloc/25b5c177-cfe9-1e90-8458-85fb24fb0fd4/example1/config/default.json.tpl: no such file or directory

Second example allocation attempt (Invalid App config)

Docker container exited with non-zero exit code: 1

Job file (the consul keys were not escaped correctly)

job "example" {
  datacenters = ["qa"]
  type = "service"
  constraint {
    attribute = "${node.class}"
    value     = "qa-infrastructure-cluster"
  }
  update {
    stagger = "5s"
    max_parallel = 1
  }
  meta {
    CI_COMMIT_REF_SLUG = "branch-name"
  }
  group "example-group" {
    count = 6
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "example" {
      driver = "docker"
      env {
        NODE_ENV = "qa"
      }
      template {
        data = <<EOH
        {
          "server": {
            "port": 3000
          },
          "api": {
            "signing": {
              "enabled": true,
              "secret": "\{\{ key "some/otherkey" \}\}",
              "timeout": 300000,
            },
            "endpoints": {
              "external": "http://external-service.com/"
            },
            "postBodySizeLimit": "20mb"
          },
          "provider": {
            "apiKey": "\{\{ key "some/key" \}\}"
          },
          "defaults": {
            "from": "[email protected]",
            "fromname": "Not Real"
          }
        }
        EOH

        destination = "config/default.json"
      }
      config {
        image = "123456789012.dkr.ecr.us-west-2.amazonaws.com/example:latest"
        # command = ""
        port_map {
          web = 3000
        }
      }
      resources {
        cpu    = 100 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 1
          port "web" {}
        }
      }
      service {
        name = "example"
        tags = ["qa", "example-group"]
        port = "web"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
        check {
            type     = "http"
            port     = "web"
            path     = "/health"
            interval = "5s"
            timeout  = "3s"
        }
      }
    }
  }
}
@dadgar
Copy link
Contributor

dadgar commented Jul 17, 2017

@mlehner616 Thanks for the great bug reports!

@dadgar
Copy link
Contributor

dadgar commented Jul 17, 2017

@mlehner616 Do you by chance have the full panic?

@mlehner616
Copy link
Author

mlehner616 commented Jul 17, 2017

@dadgar The entire log for this crash is ~327MB, do you have a preferred method for me to get this to you?

@mlehner616
Copy link
Author

@dadgar This is the only panic block I see (same as the gist). The rest of the log (>4 million lines) contains repeats of goroutine stack traces. I saved these logs off the server and zipped them to an impressive 10MB so if you want the whole thing I'd be happy to get it over to you somehow.

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x1317a26, 0x16)
	/opt/go/src/runtime/panic.go:596 +0x95
runtime.sysMap(0xc4ffed0000, 0x100000, 0xc4f7e8a800, 0x1b87bd8)
	/opt/go/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0x1b6e540, 0x100000, 0x2)
	/opt/go/src/runtime/malloc.go:428 +0x374
runtime.(*mheap).grow(0x1b6e540, 0x8, 0x0)
	/opt/go/src/runtime/mheap.go:774 +0x62
runtime.(*mheap).allocSpanLocked(0x1b6e540, 0x8, 0x42f5e4)
	/opt/go/src/runtime/mheap.go:678 +0x44f
runtime.(*mheap).alloc_m(0x1b6e540, 0x8, 0x7f0100000000, 0x424958)
	/opt/go/src/runtime/mheap.go:562 +0xe2
runtime.(*mheap).alloc.func1()
	/opt/go/src/runtime/mheap.go:627 +0x4b
runtime.systemstack(0x7fce61ffadd8)
	/opt/go/src/runtime/asm_amd64.s:343 +0xab
runtime.(*mheap).alloc(0x1b6e540, 0x8, 0x10100000000, 0x7fce70066870)
	/opt/go/src/runtime/mheap.go:628 +0xa0
runtime.largeAlloc(0x10000, 0x113a101, 0x113a1c0)
	/opt/go/src/runtime/malloc.go:795 +0x93
runtime.mallocgc.func1()
	/opt/go/src/runtime/malloc.go:690 +0x3e
runtime.systemstack(0xc42001a000)
	/opt/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
	/opt/go/src/runtime/proc.go:1132

@dadgar
Copy link
Contributor

dadgar commented Jul 18, 2017

@mlehner616 Yeah if you can upload it here that would be great, if not send it to [email protected]

@mlehner616
Copy link
Author

mlehner616 commented Jul 18, 2017

@dadgar nomad.err.zip

@schmichael
Copy link
Member

@mlehner616 The panic contains ~600k log streaming related goroutines, so I'm guessing we're leaking them somewhere. Are you using the log API (either via HTTP or the nomad logs command)? I doubt you're opening 600k connections at once, but more details may help debug where the leak is.

Thanks!

@mlehner616
Copy link
Author

@dadgar @schmichael Yes, I may have been experimenting with the nomad logs cli command at least a few minutes before it crashed.

I was definitely looking at logs through the hashi-ui and i imagine that uses the API.

I was deploying job updates with changes mostly to the template stanza trying to figure out how to get my particular desired consul integration right. While I was doing this, I was checking both logs and the /config directory to verify my template rendered correctly.

Is it odd though that only one node spun out of control here though? I suppose i could have gotten lucky and was checking logs for a number of allocs that were all on the same server, although I'm not familiar with how nomad bin packs.

@dadgar
Copy link
Contributor

dadgar commented Jul 19, 2017

Going to change title as it is (very likely) unrelated to the release candidate changes.

@dadgar dadgar changed the title 0.6.0-rc1: Panic: out of memory High number of log streaming go-routines: Panic: out of memory Jul 19, 2017
@schmichael
Copy link
Member

@mlehner616 Any chance you've tried to reproduce this on Nomad 0.7.1? It contained what seems like a relevant fix except that it really should have only affected Windows client nodes.

@schmichael
Copy link
Member

I've been unable to reproduce this and log streaming has changed substantially since 0.6, so I'm going to close this. Sorry we couldn't find a direct resolution, but please don't hesitate to reopen if you run into this issue again!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants