
Facilitate the creation of CSI Volumes within a Jobspec #11195

Open
RickyGrassmuck opened this issue Sep 16, 2021 · 11 comments

Comments

@RickyGrassmuck
Contributor

Proposal

Now that CSI volumes can be created directly from Nomad via the nomad volume create command, it would be helpful to be able to define a volume inline in the jobspec so that the volume is created on job submission, without a separate, explicit volume-creation step.

The basic building blocks needed to facilitate this are already in place with the job -> group -> volume block; it would just need to be extended to accept the full volume struct used by the nomad volume create command.
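
As a rough sketch of how that might look (hypothetical only: the create, plugin_id, capacity, and capability attributes inside a group volume block are not valid in Nomad today and simply mirror what nomad volume create accepts):

job "app" {
  group "app" {
    volume "metrics_prometheus" {
      type   = "csi"
      source = "metrics_prometheus"

      # Hypothetical: create the volume on job submission if it doesn't exist.
      create       = true
      plugin_id    = "cinder-csi"
      capacity_min = "10GiB"
      capacity_max = "10GiB"

      capability {
        access_mode     = "single-node-writer"
        attachment_mode = "file-system"
      }
    }

    # ... tasks mounting the volume via volume_mount ...
  }
}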

Use-cases

Our main use case for this feature is simplifying the job spec by keeping everything in a single file, and simplifying CI/CD pipelines by removing one step.

This would also allow the nomad job plan process to account for the volume-creation step and prevent the warning below from being generated.

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "app" (failed to place 1 allocation):
    * Class "node-class": 1 nodes excluded by filter
    * Constraint "missing CSI Volume metrics_prometheus": 1 nodes excluded by filter

Attempted Solutions

None yet. This is a process improvement that combines two existing pieces of functionality, and it can't be done without changes to Nomad itself.

@DerekStrickland
Contributor

Hi @rigrassm,

Thanks for the suggestion. Do I understand correctly that your request is to be able to fully embed a volume spec, with all its options, inside a job spec?

My first reaction to this is couldn't the volume provisioning take a bit of time? I feel like the purpose of the job spec is to register the desired state with Nomad as quickly as possible, and waiting for a volume provisioning step to finish works against that goal.

I am wondering if a lifecycle task would meet your needs today. If nomad is available in your CI image, I suspect you could exec out and issue the volume create command right there. Would that be a good solution for you?

Thanks,

Derek and the Nomad Team

@RickyGrassmuck
Contributor Author

RickyGrassmuck commented Sep 23, 2021

Edit: apologies for any spelling/grammar mistakes, wrote this out on mobile and hadn't originally intended to get long winded lol.

@DerekStrickland

My first reaction to this is couldn't the volume provisioning take a bit of time? I feel like the purpose of the job spec is to register the desired state with Nomad as quickly as possible, and waiting for a volume provisioning step to finish works against that goal.

It definitely could take some time for the volume to be provisioned.

The way I look at it is that, in most cases, the volume's existence (and by extension its creation) is a requirement of the job, so it makes sense for that requirement to be fully declarable within the job spec.

I think it would be acceptable for the volumes to be created asynchronously from the scheduler, so that the job is registered but not placed until the volume finishes being created. Ideally the scheduler would be aware of this, reevaluate the job more frequently, and be less aggressive with the eval backoff than it would be for normal jobs.

I am wondering if a lifecycle task would meet your needs today. If nomad is available in your CI image, I suspect you could exec out and issue the volume create command right there. Would that be a good solution for you?

Currently we are already just adding a step to our CI that does the volume registration. This works just fine but it is an additional pipeline step.

In our case, we're using Cinder CSI without multiwrite capability, so all of our volumes will always map 1:1 with their corresponding jobs.
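
For concreteness, that CI step amounts to something like the following (file names are placeholders):

$ nomad volume create metrics_prometheus.volume.hcl
$ nomad job run app.nomad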

@DerekStrickland
Contributor

@rigrassm

I understand. I'll forward this to the team for backlog consideration. In the meantime, I'm glad you've got an acceptable workaround.

Thanks again for using Nomad!

@DerekStrickland and the Nomad Team

@alexiri
Contributor

alexiri commented Sep 28, 2021

I too would like this feature, but I'd also want the volume to be destroyed once the job is finished. I'm not sure if that was also @rigrassm's intention, but it's not clearly stated.

@RickyGrassmuck
Contributor Author

@alexiri, I didn't state it, but now that you mention it, having a destroy-on-job-stop option would be handy.

@DerekStrickland
Contributor

cc @jrasell

@gregory112

@rigrassm I second this too, especially for per_alloc volumes. It would be very handy to create volumes this way instead of creating each volume one by one, especially when the only difference is the volume name (volume[0], volume[1], volume[2]).
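
For reference, a per_alloc volume block looks roughly like this (names are placeholders); with count = 3, each allocation expects its own pre-created volume:

group "db" {
  count = 3

  # Allocation 0 mounts "volume[0]", allocation 1 mounts "volume[1]", and so on.
  volume "data" {
    type      = "csi"
    source    = "volume"
    per_alloc = true
  }
}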

@tgross
Member

tgross commented Nov 18, 2021

I'm doing some scoping of outstanding CSI issues and I wanted to drop a note about this item. In our design for the volume create feature we left this out intentionally because the impact on scheduler workflow was going to be a fairly large lift. Which isn't to say we're not going to do it, but just that it's not as trivial as it seems at first. Here's an excerpt from our design doc (NMD-086, for internal folks):

Non-Goal

Several users have requested the ability to automatically Create volumes on job registration (aka Dynamic Volume Provisioning). This has implications for scheduling, in that we would not have the volume available when we schedule the job. The naive Nomad-internals workflow for this would be:

  • Register the job
  • Create the volume
  • Schedule the job

Unfortunately, the characteristic time for creating a volume from the Storage Providers is on the scale of minutes. This will make for a long window of time between the time we register the job and schedule it, which means users will not get feedback on whether they have a schedulable job until after the volume has been created. This warrants further design and discussion outside the scope of this RFC.

Fortunately, no matter how we implement dynamic volume provisioning (or whether we do at all), we will need all the plumbing work done for RPCs and commands as described in this RFC. Dynamic Volume Creation will be the subject of a future RFC.

@alexiri
Contributor

alexiri commented Nov 19, 2021

This might be a stupid question, but why does the volume have to be available before the job is scheduled? The way I would imagine this working is something like this:

  • Register the job
  • Schedule the job
  • Nomad client picks up the job
  • Nomad client starts provisioning of the volume (job enters "queued" status? "provisioning"?)
  • Nomad client runs job
  • Nomad client deletes volume

If provisioning fails, the job fails and it would be retried according to the job specification.

What's wrong with this approach? My worry with the "create volume before scheduling" approach is that if I submit a lot of jobs that aren't all going to run at once due to their scheduling restrictions (#11197), all the volumes will nevertheless be created at once, even though they might not be needed for several hours until their respective jobs actually start.

@tgross
Copy link
Member

tgross commented Nov 19, 2021

This might be a stupid question, but why does the volume have to be available before the job is scheduled?

It's not a stupid question, it's a great question!

But it gets into some scheduler internals. From a high-level view, when an evaluation is processed in the scheduler, we check that the job is "feasible" and we "compute placements" to generate the plan that pairs up allocations with client nodes. Importantly, the scheduler cannot write changes to state without handing the plan to the "plan applier" on the leader (this serializes the plans and it's how we guarantee consistency).

For CSI volumes, we currently check in the CSIVolumeChecker.isFeasible method the following:

  • The proposed client node has a healthy Node plugin for the plugin_id (but we're missing a Controller plugin check!).
  • The volume exists.
  • The proposed client node has not maxed-out the number of volumes that Node plugin allows.
  • The volume has available write/read claims.

The last two checks are the troublesome ones to change because the check we can do at the scheduler becomes only eventually consistent. But as it turns out they're already eventually consistent because we drive the claim workflow from the client. And this is the source of integrity issues with cleaning up claims when we're done with the volumes (see #10833 for a discussion of that).

So if we could fix the issue in #10833 we could probably drop the "the volume exists" check and turn it into a "if the volume exists, check for maxed-out / number of claims" check. And then we'd drive the entire volume create/mount flow from the client where the alloc is placed.

There would be a few other ripple-effects to consider:

  • Normally a job update will be unchanged from the current workflow, but if a job updates the count, we need to create volume(s) while a deployment is going on max_parallel-at-a-time. This could make a deployment that changes the count need a different progress timeout than one that doesn't change the count.
  • What do we expect happens to volumes if a job update reduces the count? Should it be detached but kept around for scale-up? Should it be thrown away? The answer to this is going to depend on who you ask and the use case, which means it's a configuration knob we have to figure out.
  • Should a reschedule be treated the same as a scale-down/scale-up or is it somehow different?
  • How does the answer to the above impact lost clients or node drains?
  • Some operators may not want job submitters to have the ability to create volumes, only mount ones that exist. We'll have to thread the ACL through Job Register and related RPCs.

I don't think any of this is insurmountable, but it requires some design of the details. Hope this helps provide some context.
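
On the last point, Nomad's namespace ACLs already separate volume capabilities, so a mount-but-not-create policy is expressible today; the sketch below is only illustrative, and the open design question is how a job-embedded volume create would be checked against such a policy at Job Register time:

namespace "default" {
  # Job submitters: run jobs and mount/list existing volumes only.
  capabilities = ["submit-job", "csi-mount-volume", "csi-list-volume"]
}

# A separate, more privileged policy would add "csi-write-volume"
# to allow creating and registering volumes.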

@scaleoutsean

My first reaction to this is couldn't the volume provisioning take a bit of time?

Depending on the type of storage, it can be fast enough to justify the convenience and simplicity.

$ nomad -v
Nomad v1.3.3-dev (aaae2734c483019b1c15c0dc0267698a57a44f09)

$ time nomad volume create test-vol.hcl 
<snip>

real	0m0.120s
user	0m0.025s
sys	0m0.009s
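
(For reference, the test-vol.hcl above is just a minimal volume spec along these lines; the plugin_id and sizes shown are illustrative, not the actual file:)

id        = "test-vol"
name      = "test-vol"
type      = "csi"
plugin_id = "some-csi-plugin"

capacity_min = "1GiB"
capacity_max = "1GiB"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}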

@tgross

What do we expect happens to volumes if a job update reduces the count?

K8s has ephemeral volumes. Some Nomad users would probably like the ability to have the volume deleted once the job count hits zero. Others may want to leave it in place (for example, if the volume is used for a cache that would have to be regenerated the next time someone runs the same job).

Some operators may not want job submitters to have the ability to create volumes, only mount ones that exist.

Good point. There's a similar cautionary note in the K8s docs.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

7 participants