Nomad version
Nomad v0.12.5 (514b0d6)
Operating system and Environment details
Linux builder0 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Issue
After starting a system job that provides CSI storage for DigitalOcean Block Volumes, I find I cannot actually stop it. The containers disappear from Docker (they no longer show in docker ps), but the allocations still appear to be running on the machine and never signal that they have died.
After stopping the job (even when I do a purge), it looks like this:
ID = csi_digitalocean
Name = csi_digitalocean
Submit Date = 2020-09-23T02:15:30Z
Type = system
Priority = 50
Datacenters = dc1
Namespace = default
Status = dead (stopped)
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
monolith 0 0 5 0 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
f289b00a 35750b4b monolith 15 stop running 10h19m ago 12m12s ago
e689cbd1 35750b4b monolith 7 stop running 10h22m ago 10h21m ago
80431cab cc765454 monolith 1 stop running 19h49m ago 19h2m ago
c3363552 f82affc3 monolith 1 stop running 19h53m ago 19h2m ago
308960b3 f82affc3 monolith 0 stop running 19h56m ago 19h53m ago
These allocations remain forever, even after a nomad system gc. A call to nomad status shows the following:
ID Type Priority Status Submit Date
csi_digitalocean system 50 dead (stopped) 2020-09-23T02:15:30Z
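The stuck state above can also be spotted programmatically: Nomad's HTTP API reports each allocation's DesiredStatus and ClientStatus (for example via GET /v1/job/:job_id/allocations), and an alloc is wedged exactly when the scheduler wants it stopped while the client still reports it running. A minimal sketch, assuming that API shape; the find_stuck_allocs helper itself is hypothetical, and fetching from the API is left out so the snippet stays self-contained:

```python
# Hypothetical helper: flag allocations the scheduler has marked "stop"
# while the client still reports them as "running". The field names
# (ID, DesiredStatus, ClientStatus) match the JSON returned by Nomad's
# job allocations API endpoint.
def find_stuck_allocs(allocs):
    return [
        a["ID"]
        for a in allocs
        if a.get("DesiredStatus") == "stop" and a.get("ClientStatus") == "running"
    ]


# Example mirroring two rows of the allocation table above
# (IDs truncated, as in the CLI output):
allocs = [
    {"ID": "f289b00a", "DesiredStatus": "stop", "ClientStatus": "running"},
    {"ID": "e689cbd1", "DesiredStatus": "stop", "ClientStatus": "running"},
]
print(find_stuck_allocs(allocs))  # both allocs are flagged as stuck
```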
Reproduction steps
Run the following job spec, then try nomad stop csi_digitalocean or nomad stop -purge csi_digitalocean.
Job file (if appropriate)
job "csi_digitalocean" {
region = "global"
datacenters = ["dc1"]
type = "system"
group "monolith" {
constraint {
operator = "distinct_hosts"
value = "true"
}
constraint {
attribute = "${attr.cpu.arch}"
operator = "="
value = "amd64"
}
constraint {
attribute = "${attr.kernel.name}"
operator = "="
value = "linux"
}
# Only run this on digitalocean ocean droplets
# e.g. droplets with a droplet_id
constraint {
attribute = "${meta.droplet_id}"
operator = "is_set"
}
# Use nomad_storage_drivers list to control which servers these are applied to
constraint {
attribute = "${meta.nomad_storage_drivers}"
operator = "is_set"
}
constraint {
attribute = "${meta.nomad_storage_drivers}"
operator = "set_contains"
value = "digitalocean"
}
restart {
attempts = 10
interval = "5m"
delay = "25s"
mode = "delay"
}
task "plugin" {
driver = "docker"
config {
image = "digitalocean/do-csi-plugin:v2.0.0"
privileged = true
args = [
"--endpoint=unix:///var/run/csi.sock",
"--token=<MY_DO_TOKEN>",
"--url=https://api.digitalocean.com/"
]
}
csi_plugin {
id = "digitalocean"
type = "monolith"
mount_dir = "/var/run"
}
resources {
cpu = 500
memory = 256
}
}
}
}
Further investigation of a specific allocation with nomad status f289b00a shows the following:
ID = f289b00a-b97c-4e5d-5a2c-abec4380ad5c
Eval ID = 992c239c
Name = csi_digitalocean.monolith[0]
Node ID = 35750b4b
Node Name = <node_name>
Job ID = csi_digitalocean
Job Version = 15
Client Status = running
Client Description = Tasks are running
Desired Status = stop
Desired Description = alloc not needed due to job update
Created = 10h23m ago
Modified = 17m7s ago
Task "plugin" is "running"
Task Resources
CPU Memory Disk Addresses
0/500 MHz 7.7 MiB/256 MiB 300 MiB
Task Events:
Started At = 2020-09-22T16:48:47Z
Finished At = N/A
Total Restarts = 1
Last Restart = 2020-09-22T16:48:18Z
Recent Events:
Time Type Description
2020-09-23T02:23:04Z Killing Sent interrupt. Waiting 5s before force killing
2020-09-22T16:48:47Z Started Task started by client
2020-09-22T16:48:18Z Restarting Task restarting in 27.06382292s
2020-09-22T16:48:18Z Terminated Exit Code: 0
2020-09-22T16:16:18Z Started Task started by client
2020-09-22T16:16:15Z Driver Downloading image
2020-09-22T16:16:15Z Task Setup Building Task Directory
2020-09-22T16:16:15Z Received Task received by client
Stopping an allocation directly with a command like nomad alloc stop f289b00a shows the same behaviour: even after this, the alloc still remains, shows as running (it is not actually running), and never disappears.
hongkongkiwi changed the title from "CSI Storage Monolith Providers will not stop" to "CSI Storage Monolith Providers show incorrect running status (after stopping)" on Sep 23, 2020
If the Docker container is stopped, but the allocation is left running, that suggests that something is preventing the allocation from being cleaned up on the host. And given that we're talking about CSI, it's probably a mount. Some information that would help debug this:
Can you get the client logs from the time the allocation was initially stopped?
Can you get the allocation logs for f289b00a? (Especially from when the Docker container was stopped, which should probably be the last logs we saw.)
Can you check the mount output on the client for mount points belonging to the allocation?
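On that last point, a leftover CSI mount under the allocation directory would be one way the client could get wedged. A small sketch of checking for that; the alloc_mounts helper is hypothetical, and it simply filters mount (or /proc/mounts) output for lines mentioning the alloc ID, relying on the fact that Nomad alloc directories contain the alloc ID in their path:

```python
# Hypothetical helper: filter the client's mount table for entries that
# reference the allocation's directory. A plain substring match on the
# alloc ID is enough, since the ID appears in the alloc dir path.
def alloc_mounts(mount_output, alloc_id):
    return [line for line in mount_output.splitlines() if alloc_id in line]


# Example with made-up mount output (the /opt/nomad path is illustrative):
sample = (
    "tmpfs on /run type tmpfs (rw,nosuid)\n"
    "tmpfs on /opt/nomad/alloc/f289b00a-b97c/plugin/secrets type tmpfs (rw)\n"
)
print(alloc_mounts(sample, "f289b00a"))  # any surviving mount is suspect
```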
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.