CSI volume per_alloc availability zone placement #11778
Hi @ygersie! Yeah, unfortunately you'll probably want to namespace the volume names by AZ as well if you're using the per_alloc option.
Hi @tgross thanks for the quick response! Forgive my ignorance, could you elaborate a bit on what you mean by namespacing the volumes per AZ?
The jobspec you have crosses AZs, so the allocations are being placed based on binpacking across those AZs. So as a very hacky workaround, you can name the volumes per AZ (as in the jobspec in the next comment).
@tgross thanks, yeah it's not pretty. In case someone runs into the same issue, I came up with the following for now:

variable "datacenters" {
  type    = list(string)
  default = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

variable "instance_count" {
  type    = number
  default = 3
}

locals {
  nr_of_dcs   = length(var.datacenters)
  alloc_to_az = { for i, dc in sort(var.datacenters) : i => dc }
}

job "example" {
  type        = "service"
  region      = "us-west-2"
  datacenters = var.datacenters

  dynamic "group" {
    for_each = range(var.instance_count)
    iterator = alloc
    labels   = ["example-${alloc.key}"]

    content {
      volume "ebs" {
        type = "csi"
        # creates volume source ids like:
        #   us-west-2a-myvolume[0]
        #   us-west-2b-myvolume[1]
        source          = "${local.alloc_to_az[alloc.key % local.nr_of_dcs]}-myvolume[${alloc.key}]"
        read_only       = false
        attachment_mode = "file-system"
        access_mode     = "single-node-writer"

        mount_options {
          fs_type = "ext4"
        }
      }

      task "example" {
        driver = "docker"

        config {
          image = "alpine"
          args  = ["tail", "-f", "/dev/null"]
        }

        volume_mount {
          volume      = "ebs"
          destination = "/data"
          read_only   = false
        }

        resources {
          cpu    = 100
          memory = 64
        }
      }
    }
  }
}

At least this makes it easier to schedule a different number of instances in a single job.
Just a heads up that I'm actively working on #7669, which will resolve this issue.
Resolved by #12129, which will ship in Nomad 1.3.0.
Awesome stuff @tgross!
Hey @tgross, I just got time to test this out. Volume specs:

id = "ygersie[0]"
name = "ygersie[0]"
namespace = "mynamespace"
type = "csi"
plugin_id = "aws-ebs-us-west-2a"
external_id = "vol-asdfasdfasdfasdfasdf"
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2a"
      }
    }
  }
}

and the second:

id = "ygersie[1]"
name = "ygersie[1]"
namespace = "mynamespace"
type = "csi"
plugin_id = "aws-ebs-us-west-2b"
external_id = "vol-asdfasdfasdfasdfasdf"
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2b"
      }
    }
  }
}

and the job:

job "ygersie" {
region = "us-west-2"
datacenters = ["us-west-2a", "us-west-2b", "us-west-2c", "us-west-2d"]
namespace = "mynamespace"
group "example" {
count = 2
volume "ebs" {
type = "csi"
source = "ygersie"
read_only = false
per_alloc = true
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
}
}
task "example" {
driver = "docker"
config {
image = "alpine"
args = ["tail", "-f", "/dev/null"]
}
volume_mount {
volume = "ebs"
destination = "/data"
}
resources {
cpu = 100
memory = 64
}
}
}
}

The plan is sometimes completely fine and sometimes still shows:

$ nomad job plan example-job-volume.hcl
+ Job: "ygersie"
+ Task Group: "example" (2 create)
+ Task: "example" (forces create)
Scheduler dry-run:
- WARNING: Failed to place all allocations.
Task Group "example" (failed to place 1 allocation):
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2b is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 example-job-volume.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

If I do a job run, it'll fail initially, and on the second attempt it seems to schedule correctly. However, I wonder if it is just luck that the job schedules on the second attempt due to the low number of (test) nodes I have, or if there's a difference between the plan and the evaluation generated on job creation. As long as Nomad guarantees it will be placed once the job is created, I can get away with calling this a "cosmetic" bug. I'm still using the plugin_id with the AZ in there, as this is how we currently have our plugins deployed; I wouldn't expect this to be related to the issue I'm seeing here.
Hi @ygersie! That message bubbles up from the per-node feasibility check (ref feasible.go#L305-L309), which looks like this:

plugin, ok := n.CSINodePlugins[vol.PluginID]
if !ok {
	return false, fmt.Sprintf(FilterConstraintCSIPluginTemplate, vol.PluginID, n.ID)
}

The reason you're seeing different behaviors on different runs is most likely because the scheduler shuffles the list of nodes. For each allocation, the list is feasibility-checked (filtered) and scored until we either have at least 1 (and at most 2) viable node scores or we run out of nodes. So suppose we have (nodeA, nodeB, nodeC, nodeD), one for each DC. Our example[0] has volume[0] in zone A, and example[1] has volume[1] in zone B. One possible iteration through the scheduler might look like this:
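A heavily simplified, standalone Go sketch of that iteration (illustrative only, not Nomad's actual scheduler code; the stop-after-two-candidates behavior is as described above, and the node and plugin names are placeholders):

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	id      string
	plugins map[string]bool // CSI node plugins registered on this client
}

// hasPlugin mirrors the feasibility check quoted above: a node is only
// feasible for the volume if the volume's plugin is registered on it.
func hasPlugin(n node, pluginID string) bool {
	return n.plugins[pluginID]
}

func main() {
	// One client per DC, each running only its own zone's EBS plugin.
	nodes := []node{
		{"nodeA", map[string]bool{"aws-ebs-us-west-2a": true}},
		{"nodeB", map[string]bool{"aws-ebs-us-west-2b": true}},
		{"nodeC", map[string]bool{"aws-ebs-us-west-2c": true}},
		{"nodeD", map[string]bool{"aws-ebs-us-west-2d": true}},
	}

	// example[0] claims volume[0] (zone A plugin), example[1] claims
	// volume[1] (zone B plugin).
	placements := []struct{ alloc, plugin string }{
		{"example[0]", "aws-ebs-us-west-2a"},
		{"example[1]", "aws-ebs-us-west-2b"},
	}

	for _, p := range placements {
		// The scheduler shuffles the node list for each placement...
		shuffled := append([]node(nil), nodes...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})

		// ...then filters and scores nodes until it has 1-2 viable
		// candidates or runs out of nodes entirely.
		var viable []string
		for _, n := range shuffled {
			if !hasPlugin(n, p.plugin) {
				continue // "CSI plugin ... is missing from client ..."
			}
			viable = append(viable, n.id)
			if len(viable) == 2 {
				break
			}
		}
		fmt.Printf("%s needs %s -> viable: %v\n", p.alloc, p.plugin, viable)
	}
}
```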
The failure you're getting suggests that one of the zones doesn't have a plugin (much less a healthy one). So when you say:

"I'm still using the plugin_id with the AZ in there as this is how we currently have our plugins deployed. I wouldn't expect this to be related to the issue I'm seeing here."

I think this may unexpectedly still be a factor. If it were topology-related, I'd expect to see "did not meet topology requirement" instead. I'll re-open this issue while we debug this. Some questions on your test environment that might help narrow down the behavior:

- Can you verify that the volumes were created in the correct AWS AZ as their topology says?
- Do you have more than one node in each DC with the running plugins?
- Do any nodes in the DC not have a plugin?
- Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all.
Hey @tgross

"Can you verify that the volumes were created in the correct AWS AZ as their topology says?"

They are, otherwise it would never work as planned, but here is the confirmation:

$ aws ec2 describe-volumes --volume-ids vol-0f53db9b7f68bca2c vol-07aae447a2bec7826 --query 'Volumes[].[VolumeId,AvailabilityZone]'
[
[
"vol-07aae447a2bec7826",
"us-west-2b"
],
[
"vol-0f53db9b7f68bca2c",
"us-west-2a"
]
]

and the volumes in Nomad:

$ nomad volume status -json ygersie[0] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-0f53db9b7f68bca2c
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2a"}}]
$ nomad volume status -json ygersie[1] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-07aae447a2bec7826
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2b"}}] Do you have more than one node in each DC with the running pluginsI do have more than one node running with the plugins, here's the nomad plugin output: $ nomad plugin status -verbose
Container Storage Interface
ID Provider Controllers Healthy/Expected Nodes Healthy/Expected
aws-ebs-us-west-2a ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2b ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2c ebs.csi.aws.com 2/2 3/3
aws-ebs-us-west-2d ebs.csi.aws.com 2/2 3/3
aws-efs efs.csi.aws.com 2/2 12/12

I have 2 controllers running per AZ (== Nomad datacenter) and each node then runs the node plugin as a system job. All plugins are reported healthy and there is sufficient capacity for placement.

"Do any nodes in the DC not have a plugin?"

No, each node has an EBS node plugin running.

"Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all."

That node is running in

$ nomad plugin status -verbose aws-ebs-us-west-2a
ID = aws-ebs-us-west-2a
Provider = ebs.csi.aws.com
Version = v1.4.0
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Controller Capabilities
ATTACH_READONLY
CLONE_VOLUME
CONTROLLER_ATTACH_DETACH
CREATE_DELETE_SNAPSHOT
CREATE_DELETE_VOLUME
EXPAND_VOLUME
GET_CAPACITY
GET_VOLUME
LIST_SNAPSHOTS
LIST_VOLUMES
LIST_VOLUMES_PUBLISHED_NODES
VOLUME_CONDITION
Node Capabilities
EXPAND_VOLUME
GET_VOLUME_STATS
STAGE_UNSTAGE_VOLUME
VOLUME_ACCESSIBILITY_CONSTRAINTS
VOLUME_CONDITION
Accessible Topologies
Node ID Accessible Topology
6095ae77 topology.ebs.csi.aws.com/zone=us-west-2a
a2fca83c topology.ebs.csi.aws.com/zone=us-west-2a
4a3e8c7c topology.ebs.csi.aws.com/zone=us-west-2a
Allocations
No allocations placed

and

$ nomad plugin status -verbose aws-ebs-us-west-2b
ID = aws-ebs-us-west-2b
Provider = ebs.csi.aws.com
Version = v1.4.0
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Controller Capabilities
ATTACH_READONLY
CLONE_VOLUME
CONTROLLER_ATTACH_DETACH
CREATE_DELETE_SNAPSHOT
CREATE_DELETE_VOLUME
EXPAND_VOLUME
GET_CAPACITY
GET_VOLUME
LIST_SNAPSHOTS
LIST_VOLUMES
LIST_VOLUMES_PUBLISHED_NODES
VOLUME_CONDITION
Node Capabilities
EXPAND_VOLUME
GET_VOLUME_STATS
STAGE_UNSTAGE_VOLUME
VOLUME_ACCESSIBILITY_CONSTRAINTS
VOLUME_CONDITION
Accessible Topologies
Node ID Accessible Topology
e2044d7c topology.ebs.csi.aws.com/zone=us-west-2b
d45499f7 topology.ebs.csi.aws.com/zone=us-west-2b
2b5f4317 topology.ebs.csi.aws.com/zone=us-west-2b
Allocations
No allocations placed
@ygersie that's all super helpful.
Very interesting! OK, you provided a ton of info here, so I can probably write a standalone test that lets me exercise the whole scheduler with roughly the same state. Let me have a go at that; hopefully it'll come up with a reproduction and clues as to what's going wrong. Thanks again!
@tgross you're welcome, and thanks for taking a quick look; the sooner this gets resolved the better :)

$ nomad volume status ygersie[0]
ID = ygersie[0]
Name = ygersie[0]
External ID = vol-0f53db9b7f68bca2c
Plugin ID = aws-ebs-us-west-2a
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Topologies
Topology Segments
00 topology.ebs.csi.aws.com/zone=us-west-2a
Allocations
ID Node ID Task Group Version Desired Status Created Modified
bdb1e655 4a3e8c7c example 0 run running 8h32m ago 8h32m ago

and

$ nomad volume status ygersie[1]
ID = ygersie[1]
Name = ygersie[1]
External ID = vol-07aae447a2bec7826
Plugin ID = aws-ebs-us-west-2b
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Topologies
Topology Segments
00 topology.ebs.csi.aws.com/zone=us-west-2b
Allocations
ID Node ID Task Group Version Desired Status Created Modified
eb6d7301 2b5f4317 example 0 run running 8h34m ago 8h34m ago
It seems this issue even occurs with the manual AZ mapping of source volumes; this worked at least before the 1.3.0 upgrade. I have a job that is failing to be scheduled with the same constraint issue. The job has 3 task groups, each with a volume claim looking like:

"Volumes": {
"data": {
"AccessMode": "single-node-writer",
"AttachmentMode": "file-system",
"MountOptions": {
"FSType": "ext4",
"MountFlags": null
},
"Name": "data",
"PerAlloc": false,
"ReadOnly": false,
"Source": "us-west-2c-redis[2]",
"Type": "csi"
}
}

The volume status:

$ nomad volume status us-west-2c-redis[2]
ID = us-west-2c-redis[2]
Name = redis[2]
External ID = vol-asdfasdfasdfasdf
Plugin ID = aws-ebs-us-west-2c
Provider = ebs.csi.aws.com
Version = v1.4.0
Schedulable = true
Controllers Healthy = 2
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 3
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: ext4
Namespace = mynamespace
Allocations
No allocations placed

And the job status:

Placement Failure
Task Group "redis-0":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2a is missing from client f13406f2-1cb1-008b-ebf5-74c27f418a46": 1 nodes excluded by filter
Task Group "redis-1":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2b is missing from client f562d820-a589-6c02-807a-d653ebfb7b14": 1 nodes excluded by filter
Task Group "redis-2":
* Class "default": 1 nodes excluded by filter
* Constraint "CSI plugin aws-ebs-us-west-2c is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter The node ids are indeed in completely different datacenters, so they are rightfully missing the plugin. |
Ok, I haven't had a chance to come back to write that test. But:

Just FYI on this the
I'm now also seeing this every now and then:

Placement Failure
Task Group "example":
* Class "default": 1 nodes excluded by filter
* Constraint "did not meet topology requirement": 1 nodes excluded by filter There really seems to be a problem with selecting feasible nodes. |
Ok, I've got a failing test that demonstrates the issue:
That test can be found in the
Edit: it occurred to me that this plugin ID feasibility check happens before topology entirely. So I removed topology from the plugins and volume requests, and it still fails the same way. That probably eliminates topology as the source of the issue.

Edit 2: I've run out of my timebox for this today, but in some detailed exercising of this test (with a lot of printf debugging), I'm finding that the CSI feasibility checker simply isn't getting the full set of nodes to process, which suggests there's some deeper bug lurking in the scheduler that we've been missing for a while. I saw your comment on #12748 @ygersie, so I'll pick this up again a bit later this week with that in mind.
I've just opened a PR in #13274, which should ship in Nomad 1.3.2 with backports to Nomad 1.2.x and Nomad 1.1.x. Something interesting we discovered here is that this bug has existed since we first implemented CSI, but topology constraints make the feasibility check "sparser" and therefore more likely to hit this bug. Anyone running CSI on a cluster with a lot of heterogeneity could have easily hit it as well.
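To get a rough feel for why a sparser feasibility set makes this more likely to surface, here is a toy, standalone Go simulation. The node counts and the assumption that the buggy pass only examines part of the shuffled node list are illustrative stand-ins based on the debugging notes above, not Nomad's actual scheduler logic:

```go
package main

import (
	"fmt"
	"math/rand"
)

// missRate models the failure mode described above: out of nodeCount
// clients, only feasibleCount can take the volume (e.g. one AZ out of
// four), and the buggy feasibility pass only ever examines `seen` nodes
// of the shuffled list instead of all of them.
func missRate(nodeCount, feasibleCount, seen, trials int) float64 {
	misses := 0
	for t := 0; t < trials; t++ {
		perm := rand.Perm(nodeCount) // shuffled node order for this evaluation
		found := false
		for _, idx := range perm[:seen] {
			if idx < feasibleCount { // treat nodes 0..feasibleCount-1 as feasible
				found = true
				break
			}
		}
		if !found {
			misses++
		}
	}
	return float64(misses) / float64(trials)
}

func main() {
	// Sparse feasibility (3 of 12 nodes, i.e. one AZ) misses a noticeable
	// fraction of the time.
	fmt.Printf("sparse (3/12 feasible): ~%.1f%% of placements miss\n", 100*missRate(12, 3, 6, 100000))
	// Dense feasibility (6 of 12 nodes) almost never surfaces the bug.
	fmt.Printf("dense  (6/12 feasible): ~%.1f%% of placements miss\n", 100*missRate(12, 6, 6, 100000))
}
```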
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
v1.2.3
Issue
According to the recommendation in #10793 (comment) and in the docs, the proper way to deal with scheduling volumes that are required to be mounted in the same AZ is to run a plugin controller per AZ. When using the per_alloc volume option, this doesn't work as expected. I assume this has to do with the fact that an alloc ID isn't known at scheduling time, so Nomad tries to randomly assign an alloc to a node that can't satisfy the volume requirement; this, however, conflicts with the purpose of the per_alloc directive. It also makes me wonder whether the alloc index is even supposed to be a runtime variable, as it is, as far as I can tell, determined at scheduling time.

With the following volumes and job:
The result is that placement intermittently fails: