KEP 4381: add structured parameters for dynamic resource allocation #4384
Conversation
pohly commented on Jan 5, 2024
- One-line PR description: initial draft
- Issue link: DRA: structured parameters #4381
76262cb to d5ceb77
@pohly I think this and the related KEPs are going in the right direction. It will be interesting to see how many use cases we can cover with these. I am wondering though if we can build a more structured and comprehensive model up front, rather than the model being a sort of flat list of parameters. The drivers would publish those to a resource just like you have here, but would also allow some topological considerations baked into the structure. The goal is to get to a place where the scheduler and autoscalers (workload or cluster) have enough information to get the right answer the vast, vast majority of the time. The drivers would likely also need to publish information on the valid "transformations" of the model when reservations are done. I am also not sure about the way this "peers into" driver-specific data structures for parameters. I see why you are doing it that way, but it creates some fragility between resources when we have those sorts of field paths. I think we can explore some other approaches. For example, a schema of some sort for describing the resource model, which results in those parameter CRDs for the user to use, but attaches the resource management metadata to those types as well, for the scheduler and autoscalers to use. We can brainstorm some of this when we talk next.
The parameters don't need to be a flat list. Anything that is allowed in a CRD could be used. How complex the in-tree model needs to (and should) be is something that we can discuss in #4384, with https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit#heading=h.ljj9kaa144nr as a starting point.
So basically a configurable mapping of fields in a vendor CRD to the in-tree numeric model type? That adds more flexibility, but also complexity. I don't see how it addresses the "fragility between resources" - instead of one configurable field path, we would have many. Perhaps being explicit about all paths would help. 🤷
respond by returning the JSON encoding of this capacity type. A streaming gRPC
API allows drivers to update the capacity information.

For example, a simple counter could be described as:
I'll add here that this example is what is being defined in #4385.
metav1.TypeMeta `json:",inline"`

	Count int64 `json:"count"`
}
At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.
One common additional criterion is some kind of selector. Let's discuss here what that might look like and then support it consistently with all builtin models.
What I came up with so far is to support a `metav1.LabelSelector`:
```go
// Parameters defines how much and which resources are needed for a claim.
type Parameters struct {
	metav1.ObjectMeta `json:"metadata,omitempty"`
	metav1.TypeMeta   `json:",inline"`

	// Count is the required number of items.
	Count int64 `json:"count"`

	// In addition to the "In", "NotIn", "Exists", "DoesNotExist"
	// operators from [k8s.io/apimachinery/pkg/apis/meta/v1.LabelSelectorOperator],
	// additional operators from [k8s.io/dynamic-resource-allocation/apis/meta/v1.LabelSelectorOperator]
	// are also supported:
	//
	// - "Version>=", "Version>", "Version<=", "Version<", "Version=", "Version!=":
	//   the Selector.MatchExpressions values must
	//   contain a single semantic version string. The expression
	//   matches if the version in the label satisfies the condition
	//   as defined by semantic versioning 2.0.0.
	// - ">=", ">", "<=", "<", "=", "!=": the Selector.MatchExpressions values must
	//   contain a single string which can be parsed as [resource.Quantity],
	//   with comparison against the label's value using the numeric operation.
	Selector metav1.LabelSelector `json:"selector"`
}
```
The implementation will be a fork of https://github.com/kubernetes/apimachinery/blob/3c8c1f22dc332a8a6b51033f6a7c0566b885d290/pkg/apis/meta/v1/helpers.go#L31-L72 and https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/labels/selector.go#L212-L267, with semantic versioning handled by github.com/blang/semver/v4.
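For illustration only (not part of the KEP text), evaluating one of the proposed version operators with blang/semver could look roughly like this; the `requirement` type is a made-up stand-in for a single `MatchExpressions` entry:

```go
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// requirement is a hypothetical stand-in for one MatchExpressions entry that
// uses the extended version operators.
type requirement struct {
	key      string
	operator string // e.g. "Version>="
	value    string // a single semantic version string
}

// matchesVersion checks one label value against a version requirement.
func matchesVersion(labelValue string, req requirement) (bool, error) {
	have, err := semver.Parse(labelValue)
	if err != nil {
		return false, fmt.Errorf("label %q: %w", req.key, err)
	}
	want, err := semver.Parse(req.value)
	if err != nil {
		return false, fmt.Errorf("requirement %q: %w", req.key, err)
	}
	switch req.operator {
	case "Version>=":
		return have.GTE(want), nil
	case "Version<":
		return have.LT(want), nil
	// ... the remaining version operators follow the same pattern.
	default:
		return false, fmt.Errorf("unsupported operator %q", req.operator)
	}
}

func main() {
	ok, err := matchesVersion("11.1.42", requirement{key: "runtimeVersion", operator: "Version>=", value: "11.0.0"})
	fmt.Println(ok, err) // true <nil>
}
```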
Using a `metav1.LabelSelector` as the type in the API is IMHO simpler for users than allowing a full expression (`foo in (x y z), bar != 1`, see https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement).
Another benefit is that validating the parameters becomes easier. The CRD could limit the keys to known, supported labels. Enforcing the correct operation for each label (i.e. only allowing semver ops for a label that has a version as value) probably needs an admission webhook, but could be implemented that way.
Does this make sense?
Have you considered expanding `metav1.LabelSelector` with these features vs. making another label selector?
For NVIDIA GPU selection we will also require the following operations:

- Globbing, e.g. if `model == "A100_PCIE_40GB"`, being able to select on `model == "*a100*"`
- Comparison of booleans with possible values of only true/false

We also need to be able to combine these operations in (somewhat) complicated ways, e.g. a common selector people want to express is:
`model == "*v100*" || (cudaComputeCapability >= 7.0 && memory >= 16GB)`
We could maybe get away with not allowing arbitrary boolean expressions for now, but we do need globbing and boolean comparisons.
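As a purely illustrative sketch of the globbing behavior described above (not something the KEP defines), case-insensitive matching could be done with Go's standard `path.Match`; the `modelMatches` helper is hypothetical:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// modelMatches reports whether a GPU model name matches a glob pattern,
// ignoring case. path.Match supports '*' and '?' wildcards; model names
// contain no '/', so the path separator semantics do not get in the way.
func modelMatches(pattern, model string) (bool, error) {
	return path.Match(strings.ToLower(pattern), strings.ToLower(model))
}

func main() {
	ok, _ := modelMatches("*a100*", "NVIDIA A100-SXM4-40GB")
	fmt.Println(ok) // true
}
```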
At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.
What does this mean? What will the `Request` struct look like, exactly? Is this the difference between what is used in claims vs. what is passed over the gRPC protocol from the DRA plugin to the kubelet?
Yes, that's the difference. The "capacity type" is used when publishing capacity of a node, the "parameter type" when asking for resources via the vendor CRD.
Makes sense. I think it's covered in the proposal, but take a look tomorrow and see what you think. The proposal still has ResourceClaim with a reference to a vendor-specific CR. We translate that into a vendor-neutral request object using CEL (which can operate on arbitrary CRs by treating them as `Unstructured`). Various selection criteria could be included there (they are not discussed in there now, beyond a boolean true/false).
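As a rough sketch of that translation idea, assuming `github.com/google/cel-go` (the expression and field names below are invented for illustration, not taken from the KEP), a CEL program can be evaluated directly against the unstructured content of a vendor CR:

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare the vendor CR's spec as a dynamically typed variable.
	env, err := cel.NewEnv(cel.Variable("spec", cel.DynType))
	if err != nil {
		panic(err)
	}

	// A hypothetical selection expression a DRA driver controller might generate.
	ast, iss := env.Compile(`spec.memory >= 16 && spec.model.contains("A100")`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}
	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}

	// unstructured.Unstructured exposes the object as map[string]interface{},
	// which can be passed straight into Eval.
	spec := map[string]interface{}{"memory": 40, "model": "NVIDIA A100-SXM4-40GB"}
	out, _, err := prg.Eval(map[string]interface{}{"spec": spec})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Value()) // true
}
```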
I didn't click thru everything yet, but tell me if I am off-target here: This seems like a place where we need to evaluate semi-arbitrary expressions. One might say we need to agree on a common language for those expressions, with extensible semantics.
On globbing: IMO globbing indicates that you have a piece of stringly-typed data, which would almost always be better as N discrete items. This is why label selectors DO NOT support globbing, and instead demand that you factor your data into labels. AND STILL, people choose to parse names, and inevitably it is causing pain and misery.
Coming up with a string which composes the model, vendor, subtype, etc. should be a very last resort. Globbing is a smell.
I agree that we need a more comprehensive expression language than `metav1.LabelSelector`. `metav1.LabelSelector` only gives us `<term1> && <term2> && ...` where each term is one label check.
In @johnbelamaric's proposal, each hardware instance is described by more than one numeric model and the vendor CRD with the claim parameter then gets mapped to one parameter type per model. Each model is then checked independently and all must match an instance.
At first glance this looks appealing because one can "compose" the description of the hardware from different building blocks. But now consider mode switching. Each mode may have labels that need to match the selector provided by the user. The "selector model" is unaware of "modes", so it cannot be used. The "mode switch model" has to re-implement it. Alternatively, more details of the mode switch model need to be hoisted up into the `resource.k8s.io` API. This is a slippery slope where the common API becomes more and more complex.

Therefore I prefer to stick with "one model per driver" and instead implement label selection as a shared library that can be used by the different models. Both the "counter model" and the "mode switch model" would support it, with the difference that the "counter model" publishes a flat set of labels as part of its capacity while the "mode switch model" has labels for the entire instance and each mode.
Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name returned by `nvidia-smi`.
Examples of such model names include the following:
Tesla V100-SXM2-16GB
NVIDIA T4 32GB
NVIDIA A100-SXM4-40GB
NVIDIA H100 PCIe
As some background, it's common for people to prototype their workloads in `docker` on a specific GPU model before moving them to Kubernetes. When doing so, they are keenly aware of the GPU model they are running on, and want to ensure that they continue to run on this same GPU model once deployed under Kubernetes.
Now keep in mind, this is already possible today using the extended resource exposed by the current device plugin API (i.e. `nvidia.com/gpu`) in conjunction with a node label that advertises the GPU model (assuming all GPUs on a node are homogeneous, of course).
It would look something like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: gpu-example
    image: nvidia/cuda
    resources:
      limits:
        nvidia.com/gpu: 2
  nodeSelector:
    nvidia.com/gpu.model: NVIDIA A100-SXM4-40GB
```
Independent of the fact that the selection above occurs at the node level rather than the device level (which DRA proposes to solve), matching on an exact model name is problematic for two reasons:
- As evidenced above, the format of the model name is inconsistent -- meaning that users have to remember the differences between them when putting their selectors together.
- There is not just a single model name for, say, an A100 card on a given machine. The model name encodes other information, such as whether it's on a motherboard that connects devices within a PCIe hierarchy or via SXM, or how much memory the card has on it (e.g. 40GB, 80GB, or 96GB).
Some examples of (2) include:
NVIDIA A100-PCIE-40GB
NVIDIA A100-SXM4-40GB
NVIDIA A100-PCIE-64GB
NVIDIA A100-SXM-64GB
NVIDIA A100-PCIE-80GB
NVIDIA A100-SXM4-80GB
It would make the UX much nicer for one to be able to select on something like `"*A100*"` in cases where they only care about getting access to an A100 (and don't care about the interconnect or the amount of memory -- they should be using a separate selector for memory anyway).

For cases where they do care about being more specific, they still can be, by either matching on the full model name or a different subset of it using an alternate glob expression.
Given the (revised) proposal we have of using a DRA controller to translate any vendor-specific selection parameters into selectors that the scheduler will understand (apologies, I don't know the right terms here yet to be more specific), I imagine the scheduler would be able to take additional modifier expressions that tell it to ignore case when performing its globbing operation, for example, and the DRA controller could decide if it wants to apply these modifiers or not when performing the translation.
Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name
People want to glob, substring, prefix, or regex match kube selectors, too. Just say no.
The model name encodes other information
You have a stringly-typed structure. Ideally you would decompose this into actual atoms of information, and let the user select on those.
It would make the UX much nicer for one to be able to select on something like "A100" in cases where they only care about getting access to an A100
So name the concept that "A100" represents. "family"? "generation"? A glommed together name is great for dumping in a logfile, but we have data structures for a reason.
they should be using a separate selector for memory anyway
exactly
Anyway, I probably won't die on the hill of not allowing globbing, but I will push back strongly.
/cc
Some questions and comments but this is very much on the right track with respect to the concerns folks have expressed.
From a paperwork perspective, I think we should include all the models in this KEP, rather than separate KEPs. Even if we don't implement all the models in 1.30, we can decide which ones are required to go to beta in the beta criteria.
I think we can keep this separate from the larger DRA KEP for now at least.
I do think we should consider how far we can push this. If we can eliminate the need for the driver-by-driver back and forth communication via PodSchedulingContext that would go a long way towards alleviating concerns on the overall DRA KEP. I realize (at least I think this is true) that that process also covers node-side allocation failures pushing a pod back into the scheduling queue, and this wouldn't cover that. I need to think about that more.
To sort of answer my own question, I realized I got myself confused. We need three things, not two:
I am assuming here that the capacity objects themselves, as stored in the NodeResourceCapacity, have a name or other identifier that we match to the capacity request object. For how you "apply" the request to the current capacity, we could either have it be inherent in the numerical model type, or we can add an expression (and perhaps validation) object for that purpose too.
I'm open to merging things. I'm just a bit worried that it will become harder to associate use cases and PRR sections with the right feature. I can try and we'll see what the outcome will be. Related to this, how many feature gates do we want for this? We could make this granular (current proposal):
Do we agree on this?
We lose the following functionality if we remove PodSchedulingContext:
IMHO these are important enough to keep the core DRA as originally envisioned.
Triggering pod rescheduling in kubelet is going to be a separate KEP from @ffromani. DRA is avoiding the problem as much as possible by making sure that resources are set aside before the scheduling decision.
Correct. We agreed to keep this part separate as it is related but not a dependency for DRA progress. Of course the needs and potential benefits for DRA enabled by rescheduling are going to be a very major theme in the pod rescheduling conversation. I'm currently processing all the feedback received from the conversations with various SIGs, and will update the design document ASAP.
Correct. But remember that we need them for each numeric model.
This made it a bad example 😢 The extended counter model with selectors will be a better example, because then capacity and parameter types will be different.
They all have an "instance ID" chosen by the driver and that gets repeated in the allocation result, so when the scheduler picks one resource instance, the driver will know which one.
That's the current proposal, with some Go package implementing that logic. Because it's builtin, it doesn't need to be configurable.
/label lead-opted-in
oh, that belongs on the issue, not the PR
My comments are a little all over the place, but I don't want to erase them. I think the priority here is to get something that helps us prove this overall model. As such, we might want to just defer things with a lot of nuance and instead focus on landing broad strokes, knowing full well that we will get some of it wrong and that it will lack the requisite sophistication.
There will be followup, so this can safely be less than MVP, so we can work with autoscaling and scheduling to make sure it works (and if not, we don't lose so much :)
I also think we need to get back in sync about selection via CEL. We probably didn't converge as much as I thought we had :)
parameters (i.e. parameters that are *not* needed for allocation but *are*
needed for configuration).

* With such a claim in place, DRA drivers "resolve" the contents of any
For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?
Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.
For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?
That was my original thought too. But Patrick pointed out that this would mean the vendor controllers all need write access to the claim status. We decided that the simplest way, from the point of view of RBAC, was the intermediate object, the `ResourceClaimParameters`. The vendor controllers can create and update those, which include the resolved claim parameters, plus the config data needed by the driver, stored in a `RawExtension` for pass-through. The scheduler only needs read access to that resource, rather than to any vendor-specific resources.
Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.
Similar RBAC issues then, as now anything that needs to follow that reference needs RBAC to those vendor-specific objects.
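A rough sketch of what such an intermediate `ResourceClaimParameters` object could look like, based only on the description above; the field names are illustrative and not necessarily the final API:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// ResourceClaimParameters is a sketch of the intermediate object: written by
// the vendor controller, read by the scheduler.
type ResourceClaimParameters struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// GeneratedFrom points back to the vendor-specific object that was
	// translated into this vendor-neutral form.
	GeneratedFrom *ResourceClaimParametersReference

	// DriverRequests holds the resolved request plus opaque pass-through
	// configuration, one entry per DRA driver.
	DriverRequests []DriverRequests
}

// DriverRequests is the per-driver part of the translation.
type DriverRequests struct {
	// DriverName identifies the DRA driver these requests are for.
	DriverName string

	// VendorParameters carries configuration that is irrelevant for
	// scheduling, captured as it looked at translation time and passed
	// through to the driver on the node.
	VendorParameters runtime.RawExtension

	// The resolved, structured request itself would go here; it depends on
	// which structured model the driver uses, so it is omitted in this sketch.
}

// ResourceClaimParametersReference identifies the source object.
type ResourceClaimParametersReference struct {
	APIGroup string
	Kind     string
	Name     string
}
```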
What if the resolved form of the vendor-specific params was in the status of the claim?
Besides the permission issue mentioned by @johnbelamaric already there's also the "many claims referencing the same parameter object" aspect. This will happen with claim templates created by app controllers. Asking the DRA driver controller to constantly check for new claims and repeating the same information again and again seems sub-optimal to me.
OTOH, it has one potential advantage: the translation could be specific to the claim. But in the scenario above with one claim template per app, there wouldn't be any difference between the generated claims, so I think this doesn't matter that much.
Let's stick with the current approach.
object used in the request.

In this KEP, we define a single "semantic model" called
`PartitionableResources`, and use it throughout the discussion to help describe
I don't think "Partitionable" adds enough value to justify the typing difficulty it imposes. I'll think about names as I do this read-thru
Divisible?
Segmented?
ModeSwitching?

### Publishing node resources

The kubelet publishes NodeResourceSlices to the API server with content
Clarification:
Is this EXCLUSIVELY about kubelet, or would it make sense to expand this to something like:
"""
NodeResourceSlices are published to the API server to describe resources which are available on each node. Often they will be produced by kubelet, with information from on-node agents, but they can also be published by other agents. Access control through the node authorizer ...."
"""
My intent at some point was to allow admins to create NodeResourceSlices for pod groups, as information for the autoscaler. @MaciekPytel preferred to have that information in the autoscaler configuration or to get it from cloud providers, so at this point only kubelet is expected to publish these objects.
Either way, kubelet would have been responsible for all objects with the same nodeName field as the node on which kubelet runs. I'll make that clearer:
The kubelet publishes NodeResourceSlices to the API server with content provided by DRA drivers running on its own node. It's also responsible for deleting or updating stale objects. The `nodeName` field determines which objects each kubelet instance is responsible for.
Is the following valid or invalid:
The cloud-provider knows all about the machines and hardware. They run a controller which creates NodeResourceSlices for each machine instead of kubelet.
I guess, at the end of it all, kubelet has to feed device allocation decisions back to the runtime, so it's sort of moot, but I just hate being over-prescriptive.
I think it's valid. Especially considering that these are per-driver. So, kubelet and the on-node agents may know about some things, but for some drivers (e.g., vlan / vrf / routing policy configs on the ToR or distribution routers) they may actually not know, and a different agent populates this.
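For illustration, a non-kubelet publisher could create such objects with the dynamic client. Everything below (group/version/resource name, field layout) is an assumption and may well differ from the eventual alpha API:

```go
package publisher

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// sliceGVR is an assumption; the actual group/version/resource is still alpha.
var sliceGVR = schema.GroupVersionResource{
	Group:    "resource.k8s.io",
	Version:  "v1alpha2",
	Resource: "noderesourceslices",
}

// publishSlice creates one NodeResourceSlice for a node/driver pair. The field
// layout (nodeName/driverName at the top level) is illustrative only.
func publishSlice(ctx context.Context, cfg *rest.Config, nodeName, driverName string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	slice := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "resource.k8s.io/v1alpha2",
		"kind":       "NodeResourceSlice",
		"metadata": map[string]interface{}{
			"generateName": nodeName + "-" + driverName + "-",
		},
		"nodeName":   nodeName,
		"driverName": driverName,
	}}
	_, err = client.Resource(sliceGVR).Create(ctx, slice, metav1.CreateOptions{})
	return err
}
```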

NodeResourceSlices are published separately for each driver, using a version of
the API that matches what that driver is using. This implies that the kubelet, DRA
driver, and apiserver must all support the *same* API version. It might be
This is a scary paragraph which I don't fully understand. Version skew is a VERY real thing, and anything that demands exact same versions is too brittle.
I don't think the paragraph is accurate. Round-tripping should work here just as well as anywhere else. So the functionality will revert to the oldest version understood by all components (in effect).
This goes back to the discussion about strong typing vs. letting kubelet pass through opaque data.
What I was trying to point out here is that kubelet and the DRA drivers on the node must agree on the version of the data and its representation when the DRA driver passes it over gRPC. If kubelet is using "resource.k8s.io/v1alpha2", then the driver cannot give it data defined by "resource.k8s.io/v1beta1". It has to be "resource.k8s.io/v1alpha2".
I'll list that as the current approach. I have an idea how to support version skew better, but it's not required for 1.30 (just one alpha version...).
API version negotiation for plugins is always a thing, but saying "the kubelet ... and apiserver must all support the same API version" is a little much. It's 100% normal that you can't use a new Pod API feature properly if kubelet is back-rev.
      String: 1.2.3
    - name: runtimeVersion
      String: 11.1.42
  - name: t1000-gpu
Is this correct YAML? I read this as a list-item under `commonAttributes`.
Edit: reading on I see what's happening and I don't love it.
    apiGroup: dra.example.com
    uid: foobar-uid

vendorParameters:
I'm lost on why we are copying that CR into this and what schema you will use to represent it?
These are configuration parameters that are not needed for scheduling / CA decisions, but need to be passed to the kubelet to do configuration on the node as part of resource "preparation". I believe the idea was to use a `RawExtension` to represent it.
Yes, exactly. So, the vendor CR contains various configuration parameters that are totally unrelated to scheduling and resource allocation. Those can remain opaque to K8s, but they need to make their way from the class and claim all the way down to the on-node driver. They also need to be captured as they existed at the time of allocation. So, the translation of the CR to vendor-neutral form also stuffs away any of these "config" related values from the class and claim into a passthrough RawExtension. This is then copied at the time of allocation into the resource claim status, along with the rest of the allocation information.
If a user subsequently updates the CR, it doesn't matter, because we have captured the config data as it was at the time of allocation. The updated CR will be used for future allocations but not for existing ones.
hmm, I think I get it. I don't love it, but I get it, I think.
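A tiny sketch of the "capture at allocation time" idea described above; the `allocationResult` type is hypothetical, only `runtime.RawExtension` and its `DeepCopy` are real API:

```go
package sketch

import "k8s.io/apimachinery/pkg/runtime"

// allocationResult is a hypothetical, simplified stand-in for the allocation
// information written into the ResourceClaim status.
type allocationResult struct {
	// VendorConfig holds the opaque, driver-specific configuration exactly as
	// it looked when the claim was allocated.
	VendorConfig *runtime.RawExtension
}

// snapshotVendorConfig deep-copies the pass-through parameters so that later
// edits to the vendor CR do not affect an existing allocation.
func snapshotVendorConfig(passthrough *runtime.RawExtension) allocationResult {
	if passthrough == nil {
		return allocationResult{}
	}
	return allocationResult{VendorConfig: passthrough.DeepCopy()}
}
```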
// +optional
metav1.ObjectMeta

Spec NodeResourceSliceSpec
Well, will there be status?
which of those resources are reserved for claims.

However, despite the finalizer on the claims it could happen that a
sufficiently determined user deletes a claim while it is in use. Therefore
"well intentioned but poorly informed" is my go-to phrasing
Partitionable *PartitionableResources
}

type PartitionableResources struct {
I'll go one further - can we jettison anything we're unsure about to reduce scope? We're certainly going to do an alpha2, so I think it's valuable to just be sure the gears mesh overall...

### Immediate allocation

Because there is no separate controller anymore, claims with immediate
We need to look at this anew before beta. Is this still representing user intention well?
claims with immediate allocation remain allocated when no longer in use
I think this is the important bit -- what use cases might require this (and why)?
I don't have any "immediately" on hand (pun intended!).
AIUI, part of the motivation for immediate allocation was for expensive initialization. Suppose you need to flash a card, which takes a long time, then want to run a whole bunch of workloads against it in series (rather than in parallel). @pohly ?
I think that use case is still served better by immediate allocation, because it still ensures that this expensively configured hardware is kept in that configuration after it has been used by one pod. The name just doesn't fit that well anymore.
Even in the "existing" DRA model, no configuration happens until the first pod that uses a claim lands on the node where it has been allocated. That first pod will always be delayed by the time to configure in both scenarios.
The real difference with this KEP is that now the allocation itself won't happen until a pod that references the claim is scheduled. Whereas before, the DRA controller would pick an arbitrary node to allocate the claim on as soon as the claim object itself was created, and the scheduler was forced to schedule any pods referencing that claim on that node.
Are we saying that "allocation mode" is defunct?
I think this is a different topic. I had suggested adding an optional nodeName to resource claim in an earlier model. I cannot abide the "randomly pick a node". The nodeName would allow immediate allocation without that, and allow then for pods to be attached to that same (shared) claim.
But I don't think that's key in this KEP, I think that's a discussion for how we evolve base DRA in 1.31.
I agree we should hold off on this, but i would also like to express my disdain for "randomly pick a node". It's the main reason we never built support for this mode in the existing DRA driver for GPUs.
Overall response to Tim's comments: I said it in one of the inline comments, but I'll repeat it here. I think we may be better off if we go with some simplified version of the model.

As I see it, the point of alpha is to reduce overall risk and learn from doing. The thing I think 1.30 should focus on for this KEP is the overall concept of the semantic model, how the CRs are processed, how the scheduler, job controllers, cluster autoscaler and other components interpret that, draw down on the allocations, store and reconcile those allocations, etc. I think we should keep the actual model really simple as a placeholder, and then have a solid model to propose in the 1.31 design phase.

I think we should set aside a LOT of time at the contributor summit (and in subsequent days if folks are still in Paris) to really hash through those models in real time, in person.
What if we strip the model down, i.e. have it provide a flat list of opaque strings with no attributes attached, each string representing a unique resource that is allocatable on the node? This would be a direct mapping of the existing device plugin model to a DRA semantic model. This would mean we can punt on talking about any sort of selection API (because all resources are equal), and focus solely on the vendor-specific CRDs providing configuration parameters. We lose the ability to do any sort of dynamic partitioning of resources, but it is still a net gain over the existing device plugin API, since users will have the ability to do controlled sharing via resource claims and custom configuration through claim parameters.
I've pushed my updates where I tried to address the editorial changes (wording, take out implementation details for autoscaler). What we need to decide now is which semantic model we want to describe in the KEP, if we can't agree on simply treating this as preliminary and "defined during implementation". It wouldn't be the first time that a KEP gets accepted with one API and the implementation's API then looks very different.... just saying. More in #4384 (comment).
I just pushed my own PR against the branch for this PR that describes a much more simplified model (and reiterates many times over that this is a hypothetical model, not one we are proposing to use): |
"Structured parameters" describe resources in a format that enables the Kubernetes components to make scheduling decisions. Co-authored-by: Kevin Klues <[email protected]>
9462d5d to 624bec4
I merged @klueska's update and then did another rename of "semantic parameters" into "structured parameters": that is a better name because we define some structure for parameters and the associated mechanics, but not the actual semantics.
Thanks @pohly for your patience and diligence in seeing this KEP through. While I still think there is room for improvement (and there are still some unresolved differences in opinion between the primary reviewers on how this framework will eventually take shape), we need to start somewhere. The KEP is in good enough shape and we are aligned enough, that I believe we can get started working on this in 1.30. /lgtm
/approve from sig-scheduling
I think this older approach is gone. But, for the record, having scheduler watch arbitrary CRDs, even if just to pass them through, would have been a no go for me.
- name: drivers
  attributes:
  - name: driverVersion
    String: 1.2.3
Suggested change: `String: 1.2.3` → `string: 1.2.3`
why not `version` though? What tells CEL that this can be compared using semantic versioning?
You are right, this should have been "version". That, and only that field is something that CEL will be able to use in a semantic version comparison.
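To make the distinction concrete, here is a sketch of a "one field set" attribute along the lines discussed in this thread; the type and field names are illustrative, not the final API. Only the `version` variant would get semver comparison semantics in CEL:

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// NodeResourceAttribute is a sketch of a "one field set" attribute: exactly
// one of the value fields is set, and which one it is tells CEL which
// comparison semantics apply.
type NodeResourceAttribute struct {
	Name string `json:"name"`

	// StringValue is compared as an opaque string (equality only).
	StringValue *string `json:"string,omitempty"`
	// VersionValue must be a semantic version (semver 2.0.0) and supports
	// ordered comparisons in CEL.
	VersionValue *string `json:"version,omitempty"`
	// QuantityValue supports numeric comparisons.
	QuantityValue *resource.Quantity `json:"quantity,omitempty"`
	// BoolValue supports equality checks only.
	BoolValue *bool `json:"bool,omitempty"`
}
```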
  name: someArbitraryName
  namespace: user-namespace

generatedFrom:
why do we need this if the name, namespace and uid are already part of the vendorParameters below?
Because the vendorParameters are an opaque RawExtension. It's not required that they contain the original object.

Instead of defining a vendor-specific CRD, DRA driver authors (or
administrators) could decide to allow users to create and reference
`ResourceClaimParameters` directly within their `ResourceClaims`. This would
What does the ResourceClaim look like?
How does it reference the claim?
The claim would reference the parameter object with "resource.k8s.io" as APIGroup, "ResourceClaimParameters" as kind, and then the name of the object. This object then gets used directly.
Like a normal DRA driver controller, the scheduler also sets a finalizer to
ensure that users cannot accidentally delete the allocated claim while a pod
is about to start which depends on it. That finalizer is
"structured.dra.k8s.io/delete-protection".
why not just `dra.k8s.io/delete-protection`?
That also works.
/approve from SIG Node perspective, based on the review details from @johnbelamaric and others.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dchen1107, johnbelamaric, klueska, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
Noticed a few typos.
example, a cloud provider controller could populate this based upon information
from the cloud provider API.

In the kubelet case, each kubelet publishes kubelet publishes a set of
s/kubelet publishes kubelet publishes/kubelet publishes/
object used in the request.

This KEP is specifically focused on defining the framework necessary to enable
different "structured models" to be added to Kuberenetes over time. It is out of
s/Kuberenetes/Kubernetes/

If a driver wanted to use a different structured model to represent its resources,
a new structured model would need to be defined inside Kuberenetes, and a field
would need to be added to this struct at the same level as
s/Kuberenetes/Kubernetes/
tells them that they cannot handle the object because the API has been extended.

Drivers can use different structured models by publishing multiple
`NodeResourceSlice` objects, as long as each model represents a distinct set of
"distinct" as in "non-overlapping"?
Where "gpu-0" represents one type of card and "gpu-1" represents another (with
the attributes hanging off each serving to "define" their individual
simpler:
s/attributes hanging off each serving to "define"/related attributes defining/

For each node, one or more NodeResourceSlice objects get created. The kubelet
publishes them with the node as the owner, so they get deleted when a node goes
down and then gets removed.
s/ and then gets removed//
// allocated resources, others can be arbitrary setup parameters.
VendorClaimParameters *runtime.RawExtension

// NodeName is the name of the node providing the necessary resources.
Inconsistent indent.
remaining structured difference compared to delayed allocation is that claims
with immediate allocation remain allocated when no longer in use.

### Simulation with CA
For readability (CA could be e.g. Claim Allocation):
s/CA/Cluster Autoscaler/
- All watchers of node objects get the information even if they don't need it.
- It puts complex alpha quality fields into a GA API.

### Injecting vendor logic into CA
s/CA/Cluster Autoscaler/
several different ways:

- Call out to a vendor server via some RPC mechanism (similar to scheduler
  webhooks). The risk here is that simulation becomes to slow. Configuration
s/to/too/