[feature request] Ability to use variable interpolation in volume {} stanza #7877
Comments
Hi @benvanstaveren! Thanks for opening this issue and giving a solid description of a use case. Will get this onto the road map.
Yup! Looks like we've had a few requests for this feature: #7110 #6536
Cross-referencing #8262
A huge +1, as this is blocking our progress with CSI. In particular, we'd need the Nomad job environment variables, e.g.:
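(The commenter's snippet was lost in extraction. What follows is a hypothetical sketch of the kind of interpolation being requested — `NOMAD_ALLOC_INDEX` is a real Nomad runtime variable, but interpolating it inside a `volume` stanza was not supported at the time:)

```hcl
group "cache" {
  count = 3

  # Hypothetical: interpolate a job environment variable into the
  # volume source so each allocation claims its own volume.
  volume "data" {
    type   = "csi"
    source = "my-volume-${NOMAD_ALLOC_INDEX}"
  }
}
```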
Waiting for this badly 👍
It would be super cool if we could simply do this with our Nomad jobs:
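(Again, the original example was elided. Hypothetically, something along these lines — invented, unsupported syntax, interpolating a node attribute into the volume source:)

```hcl
volume "data" {
  type   = "csi"
  source = "data-${attr.unique.hostname}"  # hypothetical node-attribute interpolation
}
```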
Wanted to follow up on this just to make sure folks don't think we're ignoring this important issue. This feature request seems like a single feature on the surface, but there are actually two parts to it:
Here's an example of a job that consumes CSI volumes. The commented-out bits are how the job originally worked with a static CSI volume.

```hcl
variables {
  volume_ids = ["volume0"]
}

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    dynamic "volume" {
      for_each = var.volume_ids
      labels   = [volume.value]
      content {
        type      = "csi"
        source    = "test-${volume.value}"
        read_only = true
      }
    }

    # volume "volume0" {
    #   type      = "csi"
    #   source    = "test-volume0"
    #   read_only = true
    #   mount_options {
    #     fs_type = "ext4"
    #   }
    # }

    count = 2

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      dynamic "volume_mount" {
        for_each = var.volume_ids
        content {
          volume      = "${volume_mount.value}"
          destination = "${NOMAD_ALLOC_DIR}/${volume_mount.value}"
        }
      }

      # volume_mount {
      #   volume      = "volume0"
      #   destination = "${NOMAD_ALLOC_DIR}/test"
      # }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```

cc @notnoop to point out a working case of `dynamic` blocks.
An example of using HCL2 to accomplish this is in #9449.
Thanks a lot for the update @tgross. This issue is still blocking adoption of CSI in our stack. A few comments:
I certainly see that 'allocation index' is insufficient to use for this purpose. You actually just want the container to be able to interpolate 'CSI volume index' - because really, the indexing should be of the persistent objects, not at the allocation level. So perhaps we could specify that the 'count' of the group can be used to reference volumes, through something like a 'volume group',
and then we'd do something like:
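(The proposed snippets were elided. A sketch of what the proposal might look like — `volume_group` and `volume_group.index` are invented syntax for illustration; this was never implemented, and the shipped feature was `per_alloc`, discussed below:)

```hcl
group "cache" {
  count = 3

  # Hypothetical 'volume_group': one volume per count index,
  # so container N is always paired with volume N.
  volume_group "data" {
    type   = "csi"
    source = "my-ebs-volume-${volume_group.index}"
  }

  task "redis" {
    volume_mount {
      volume      = "data"  # resolves to this allocation's member of the group
      destination = "/data"
    }
  }
}
```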
How does that sound?
Sorry to hear that. Could you dig into that a bit more? I want to make sure this isn't a matter of having written an incomplete HCL2 example, rather than something we can't work around for you. Effectively there's a gap in the job specification:
The HCL2 example in #9449 effectively tries to turn case (3) into case (1).
I think you're on the right track here in terms of the job spec. Arguably, volumes aren't really group-level resources once you have CSI in the mix -- they're cluster-level resources that are being consumed by the job.

I think this all comes down to wanting to create the count of volumes in the scheduler where the plan is being created. Adding the count into the planning stage opens up a bunch of questions around updates: what happens when we reschedule an allocation, or have canaries, or have rolling updates? Probably not intractable, but there are a lot of small details we'll need to get right. Also, we haven't yet implemented Volume Creation #8212, which I'm just starting design work on. If we want to be able to support some kind of count/index attribute, we'll need to handle counts at volume creation time too (and I haven't figured out what stage that happens in yet).

Our process around design work like this is usually to write up an RFC which gets reviewed within the team and then engineering-wide. I'm going to push for making this something we can share with the community for this issue and #8212 so that we can get some feedback.
Thanks for that @tgross. I'm a bit short of time, so I'll just ramble a bit. First, let's address the use case:
e.g. a 3-node Redis cluster of AWS ECS containers (1 master, 2 replicas), each with a persistent EBS volume. If one of the containers goes down, we'll need to start up a new container with the same volume mounted (so it must be in the same AZ as the volume of the downed instance - a CSI topology constraint). Let's assume for now that we have the CSI volume create functionality. Just to abstractly use the spec I laid out before - with more detail. We need to indicate that we want 3 (volume-container) pairs that intersect a set of constraints. First, at the group level, we set up the (volume- half of the joint constraints:
Then, also at the group level, we set up the -container) half of the constraints:
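(Both snippets were elided. Filling in the two halves with a hypothetical sketch — again, `volume_group` is invented syntax for illustration only:)

```hcl
# The (volume- half: three persistent volumes, one per pair, each
# carrying the topology (AZ) constraint of the EBS volume behind it.
volume_group "redis-data" {
  count  = 3
  type   = "csi"
  source = "redis-ebs-${volume_group.index}"
}

# The -container) half: mounting the group couples the container's
# placement to the matching volume's AZ constraint.
task "redis" {
  volume_mount {
    volume      = "redis-data"
    destination = "/data"
  }
}
```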
The intersection of the two scheduling constraints is done by the 'volume_mount' stanza in the task. I.e., it's saying that when we schedule a container from this task, we also need to schedule (or satisfy) the placement constraints of the volume_group simultaneously. If there was another task in this group besides "redis", we could also attach it to the same 'volume group' and have both containers (scheduled on the same host because they are in the same 'group') access the shared volume.

My general understanding of the syntax elements inside the 'group' stanza is that 'you get one of these per count' for all of the stanzas subordinate to 'group'. Which is why I feel that the way 'volume' is used now is confusing. I would call 'volume' as it is used now 'common_volume', meaning that, despite what 'count' says, you are only going to get one of these. I almost feel that the default behavior should be 'volume_group' as above. Other use cases:
This is relatively straightforward. There needs to be some care about scale-down-then-scale-up, i.e. do you want to reuse the previously initialized volumes, or do they get cleaned up?
There should be a way to specify whether you want the canary container to use the existing persistent volume (i.e. get scheduled alongside the existing deployment), or be scheduled onto new persistent volumes (or temporary volumes?).
We've done some design work on this and here's how we're going to solve this problem. The existing workflow for mounting a CSI volume is as follows:
Note that in this workflow, the volume ID is used for feasibility checking and then later during the claim workflow on the client. When the schedulers select the next client node, they pass the placement along with a claim on the volume.

We'll add a `per_alloc` flag to the `volume` block:

```hcl
volume "data" {
  type      = "csi"
  source    = "my-ebs-volume"
  per_alloc = true
}
```

When a job with `per_alloc = true` is scheduled, the volume source is interpolated with the allocation index, so each allocation claims its own volume (for example, allocation 0 claims `my-ebs-volume[0]`).
Because the index is being generated when the scheduler does placements, an allocation with more than one CSI volume receives the same group of volumes after an update, and any state from the previous version of the allocation is preserved.
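For illustration, a minimal sketch of the resulting job spec, assuming two volumes were registered with IDs `my-ebs-volume[0]` and `my-ebs-volume[1]` (the `attachment_mode`/`access_mode` fields shown are from the later Nomad 1.1+ job spec):

```hcl
job "cache" {
  datacenters = ["dc1"]

  group "cache" {
    count = 2  # allocation indexes 0 and 1

    volume "data" {
      type            = "csi"
      source          = "my-ebs-volume"  # alloc 0 claims my-ebs-volume[0], alloc 1 claims my-ebs-volume[1]
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:6"
      }

      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}
```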
Just an update on this long-awaited feature: I'll be working on this over the next few weeks.
Thanks so much for this @tgross, we're patiently watching the progress!
So the final name will be `per_alloc`?
Do you have an example of that? Keep in mind this isn't the same as the External ID used by the storage provider.
Oh right, yes the limitations (if at all) will be on the External ID. Sorry!
#10136 has been merged and will ship with Nomad 1.1.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Per a topic on Discuss - I've run into a few situations in our setup where it would be absolutely beneficial to use variable interpolation in the volume and host_volume stanzas. As an example, we currently set up various cloud instances with their own mounted volumes that belong to a particular instance. On these instances we like to run system jobs, so that every instance has one - but the problem is that there is currently no easy way to automatically mount the volume that "belongs" to a particular node.
In essence, being able to do something like the sketch below would be most excellent. At that point each system job picks up its own volume automatically, which makes my life easier (and saves on job definitions - or, currently, the one job we use with 40+ task groups to get it to work properly).
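(The author's example was lost in extraction. A hypothetical reconstruction of the ask — `node.unique.name` is a real Nomad node attribute, but this interpolation was not supported in volume stanzas at the time:)

```hcl
job "node-storage" {
  type = "system"  # one allocation per eligible client node

  group "app" {
    # Hypothetical: interpolate the node name so each instance of the
    # system job mounts the host volume that belongs to its own node.
    volume "instance-data" {
      type   = "host"
      source = "data-${node.unique.name}"
    }
  }
}
```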
Outside the scope of this feature request (but maybe related) would be a change to host volumes - currently we have a few servers that use host volumes, but the same volume is shared between multiple jobs because we can't mount a subdirectory out of a host volume - perhaps a change to allow something like:
So that each job gets itself a named volume to be used with volume_mount that points to a subdirectory of the actual host volume. Currently, if we want to achieve this, we need to add new host volumes to the client configuration and restart the Nomad process on the node, which isn't something we're very keen on. (Alternatively, having new host volumes picked up with a reload, if that isn't already the case, would be nice too.)
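(The proposed snippet was elided. A hypothetical reconstruction — no subdirectory option existed on job-spec volumes at the time, and the `path` attribute on the `volume` block below is invented:)

```hcl
# Client config: one real host volume on disk (actual Nomad syntax).
client {
  host_volume "shared" {
    path = "/srv/shared"
  }
}

# Job spec (hypothetical): carve a per-job subdirectory out of that
# host volume; 'path' here is an invented attribute for illustration.
volume "mydata" {
  type   = "host"
  source = "shared"
  path   = "jobs/example"
}
```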
This would probably entail some tweaks to the order in which things are resolved and set up, but would (again) make my life easier.
Can't speak for anyone else though so maybe I'm just the only one with this particular use case :D