Nomad segfaults when trying to preempt a docker-based job with lower priority #11342

Closed
aneutron opened this issue Oct 18, 2021 · 3 comments · Fixed by #11346

Nomad version

Nomad v1.1.6 (b83d623fb5ff475d5e40df21e9e7a61834071078)

Operating system and Environment details

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
# cat /proc/cpuinfo | grep EPYC | uniq
model name      : AMD EPYC 7763 64-Core Processor
# cat /proc/meminfo | grep -i Memtot
MemTotal:       527815668 kB
# nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-X-Y-Z-T-F)

Issue

Hi,

First of all, thanks for the amazing product that's Nomad. I'm currently in the process of PoC-ing Nomad for a use case at our company, and it involves running jobs that use GPUs.

As it is a PoC, I'm only running Nomad in dev mode. I'm trying to use the default scheduler w/ the pre-empting feature enabled for all types of jobs.

My test scenario was the following:

  • Create a job (with a single docker task) that requires 4 GPUs (4 allocations of 1 GPU)
  • Create a job (with a single docker task) that requires 2 GPUs (1 allocation of 2 GPUs) and has higher priority (delta=20)
  • Observe that Nomad vacates 2 of the 4 allocations in Job 1 and schedules an allocation of Job 2.

Instead, as soon as I tried to run Job 2, the server/client crashed with a segfault (due to a panic).

I successfully reproduced the error at least 5 times, using different configurations of GPU requirements but with the same global idea (multiple single GPU jobs, one multi-GPU job).

The jobs schedule fine on their own, but once I schedule the higher priority job where the lower prio job is already deployed, it crashes.

Reproduction steps

  • The server / client configuration:
datacenter = "dev"

log_file = "nomad.log"

client {
    enabled = true
    options {
        docker.cleanup.image = false
    }
}

server {
  default_scheduler_config {
    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true # New in Nomad 1.2
    }
  }
}


plugin "nvidia-gpu" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
  • The command line to run Nomad in dev mode:
    nomad agent -dev -bind 0.0.0.0 -plugin-dir=./plugins -config=./server-config.hcl -log-level=WARN

Then the steps to reproduce are as follows:

  • Enable preemption on all types
  • Deploy Job 1
  • Deploy Job 2

Expected Result

  • Job 1 (or some of its allocations) is vacated
  • Job 2 is deployed

Actual Result

Server / Client segfaults.

Job file (if appropriate)

This is the file for Job 1:

job "jupyterlab" {
  datacenters = ["dev"]
  group "jupyter" {
    count = 4
    network {
      port "jupyter" {
        to = 8091
      }
    }
    task "jupyter-docker" {
      driver = "docker"
      config {
        # A custom cuda+jupyter image but anything will do
        image = "cuda-centos8-jupyter-pytorch"
        ports = ["jupyter"]
      }
      resources {
        cpu    = 500
        memory = 2048
         device "nvidia/gpu" {
          count = 1
         }
      }
    }
  }
}

The second job file is identical except for the job name and the counts (both the group count and the GPU count).

Nomad Server logs (if appropriate)

The agent crashes with this stack trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x70 pc=0x1b09597]

goroutine 374 [running]:
github.com/hashicorp/nomad/scheduler.(*JobAntiAffinityIterator).Next(0xc003c1b5e0, 0x0)
        github.com/hashicorp/nomad/scheduler/rank.go:581 +0x1f7
github.com/hashicorp/nomad/scheduler.(*NodeReschedulingPenaltyIterator).Next(0xc001d1a870, 0x0)
        github.com/hashicorp/nomad/scheduler/rank.go:627 +0x38
github.com/hashicorp/nomad/scheduler.(*NodeAffinityIterator).Next(0xc003c1b630, 0x203000)
        github.com/hashicorp/nomad/scheduler/rank.go:699 +0x49
github.com/hashicorp/nomad/scheduler.(*SpreadIterator).Next(0xc0012a6180, 0xc0036eb240)
        github.com/hashicorp/nomad/scheduler/spread.go:112 +0x49
github.com/hashicorp/nomad/scheduler.(*PreemptionScoringIterator).Next(0xc00392b7c0, 0x60)
        github.com/hashicorp/nomad/scheduler/rank.go:794 +0x38
github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc00392b7e0, 0x265f6a0)
        github.com/hashicorp/nomad/scheduler/rank.go:758 +0x38
github.com/hashicorp/nomad/scheduler.(*LimitIterator).nextOption(0xc0012a6360, 0x265c940)
        github.com/hashicorp/nomad/scheduler/select.go:60 +0x34
github.com/hashicorp/nomad/scheduler.(*LimitIterator).Next(0xc0012a6360, 0xc0012a6e01)
        github.com/hashicorp/nomad/scheduler/select.go:39 +0x3d
github.com/hashicorp/nomad/scheduler.(*MaxScoreIterator).Next(0xc001d1ac30, 0xc0036e2600)
        github.com/hashicorp/nomad/scheduler/select.go:102 +0x4d
github.com/hashicorp/nomad/scheduler.(*GenericStack).Select(0xc0027a4000, 0xc0036e2600, 0xc0027b2ec0, 0x0)
        github.com/hashicorp/nomad/scheduler/stack.go:174 +0x7c6
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).selectNextOption(0xc000ad25a0, 0xc0036e2600, 0xc0027b2ec0, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:789 +0xe5
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc000ad25a0, 0x451dea8, 0x0, 0x0, 0xc002ff67c0, 0x1, 0x1, 0x2, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:552 +0x4fa
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc000ad25a0, 0xc000ac0000, 0xc003c1b4f0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:430 +0x1239
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc000ad25a0, 0x0, 0x0, 0x0)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:257 +0x36f
github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc0036ebe00, 0xc0036ebdf0, 0x6, 0x30ec198)
        github.com/hashicorp/nomad/scheduler/util.go:275 +0x42
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc000ad25a0, 0xc0010fc480, 0x30ec198, 0xc001021b30)
        github.com/hashicorp/nomad/scheduler/generic_sched.go:156 +0x2b7
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc000931180, 0xc00120fb90, 0xc0010fc480, 0xc00211a6c0, 0x24, 0x0, 0x0)
        github.com/hashicorp/nomad/nomad/worker.go:268 +0x42c
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc000931180)
        github.com/hashicorp/nomad/nomad/worker.go:129 +0x286
created by github.com/hashicorp/nomad/nomad.NewWorker
        github.com/hashicorp/nomad/nomad/worker.go:81 +0x152

Nomad Client logs (if appropriate)

(See above)

@notnoop notnoop self-assigned this Oct 18, 2021
@notnoop
Contributor

notnoop commented Oct 19, 2021

Hi @aneutron ! Thanks for letting us know. I was able to reproduce and have a fix. Will PR the fix soon.

@aneutron
Author

Hey @notnoop ! Thanks a lot for the swift action on your part. Looking forward to building it and continuing to test Nomad. Cheers!

notnoop pushed a commit that referenced this issue Oct 20, 2021
Fix a bug where the scheduler may panic when preemption is enabled. The conditions are a bit complicated:
A job with higher priority schedules multiple allocations that preempt multiple other allocations on the same node, due to port/network/device assignments.

The cause of the bug is incidental mutation of internal cached data. `RankedNode` computes and caches proposed allocations in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L42-L53 . But the scheduler then mutates the list to remove preemptible allocs in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L293-L294, and `RemoveAllocs` mutates the cached slice and sets its tail to `nil`s, triggering a nil-pointer dereference.

I fixed the issue by avoiding the mutation in `RemoveAllocs` - the micro-optimization there doesn't seem necessary.

Fixes #11342
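
For readers following along, here is a minimal Go sketch of the failure mode the commit message describes. It is not Nomad's code: `Alloc`, `removeInPlace`, `removeCopy` and `countNil` are hypothetical names made up for illustration. The in-place variant filters by reusing the caller's backing array and writing `nil` into the freed tail, so any other holder of the original slice (such as a cache) later sees `nil` entries. The copying variant leaves the input untouched, which is the general idea behind avoiding the mutation.

```go
package main

import "fmt"

// Alloc is a hypothetical stand-in for an allocation; it is not Nomad's type.
type Alloc struct {
	ID string
}

// removeInPlace filters allocs by swapping removed entries with the last live
// element and writing nil into the freed slot. It reuses the caller's backing
// array, so the caller's original slice is modified as a side effect.
func removeInPlace(allocs []*Alloc, remove map[string]bool) []*Alloc {
	n := len(allocs)
	for i := 0; i < n; i++ {
		if remove[allocs[i].ID] {
			allocs[i], allocs[n-1] = allocs[n-1], nil
			i-- // re-check the element that was swapped into position i
			n--
		}
	}
	return allocs[:n]
}

// removeCopy builds a fresh slice and never writes to the input.
func removeCopy(allocs []*Alloc, remove map[string]bool) []*Alloc {
	kept := make([]*Alloc, 0, len(allocs))
	for _, a := range allocs {
		if !remove[a.ID] {
			kept = append(kept, a)
		}
	}
	return kept
}

// countNil reports how many entries of allocs are nil.
func countNil(allocs []*Alloc) int {
	c := 0
	for _, a := range allocs {
		if a == nil {
			c++
		}
	}
	return c
}

func main() {
	// cached plays the role of a list that some other component still holds.
	cached := []*Alloc{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}}
	remove := map[string]bool{"a": true, "b": true}

	kept := removeCopy(cached, remove)
	fmt.Println(len(kept), countNil(cached)) // prints "2 0"

	kept = removeInPlace(cached, remove)
	fmt.Println(len(kept), countNil(cached)) // prints "2 2"
}
```

After the in-place call, a later loop over `cached` that reads `a.ID` without a nil check would panic with the same kind of nil pointer dereference as in the stack trace above.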
lgfa29 pushed a commit that referenced this issue Nov 15, 2021
lgfa29 pushed a commit that referenced this issue Nov 15, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022