Wrong AccessibilityRequirement passed in CreateVolumeRequest #221
Look at the first entry of the preferred topology field to see which node the scheduler picked. Requisite is supposed to contain all valid choices.
@msau42 Thanks for your response. I suspect the issue is in the way the valid choices are being prepared; currently the code is:
@msau42 optimizing the case where the legitimate result is "all" or "many" nodes is one aspect that needs to be considered. But @avalluri's point is that the current code produces the wrong result for a scenario where the volume has to be created exactly on the one node chosen for the pod. Do you agree? @avalluri, perhaps you can extend https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/topology_test.go with a test case for your scenario?
Maybe there's a difference in interpretation of the spec. Requisite is supposed to contain all possible choices for the topology that have not been filtered out through StorageClass.AllowedTopologies; the driver must pick a subset of requisite. Preferred gives a suggested ordering of the requisite entries to pick from. If you have delayed binding set, then the first choice will be what the scheduler chose.
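As a concrete illustration of that contract, here is a minimal sketch (not the external-provisioner's actual code) of the accessibility requirements it might send with delayed binding, using the CSI Go bindings; the zone key and values are made up for illustration:

```go
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func main() {
	// Requisite: every topology segment that survives filtering through
	// StorageClass.AllowedTopologies; the driver must pick a subset of these.
	requisite := []*csi.Topology{
		{Segments: map[string]string{"topology.example.com/zone": "zone1"}},
		{Segments: map[string]string{"topology.example.com/zone": "zone2"}},
	}

	// Preferred: the same entries, reordered so that the segment of the node
	// picked by the scheduler (delayed binding) comes first.
	preferred := []*csi.Topology{
		{Segments: map[string]string{"topology.example.com/zone": "zone2"}}, // scheduler's pick
		{Segments: map[string]string{"topology.example.com/zone": "zone1"}},
	}

	req := &csi.CreateVolumeRequest{
		Name: "pvc-1234",
		AccessibilityRequirements: &csi.TopologyRequirement{
			Requisite: requisite,
			Preferred: preferred,
		},
	}
	fmt.Println(req.GetAccessibilityRequirements())
}
```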
This is to handle the scenario where your driver may be spread across multiple topologies. For example, it can replicate data across many nodes or zones. The scheduler only picks one of the nodes, but your driver needs to know what the valid choices are for the secondary domains.
There is a test case for this, but to my surprise the expectation is a bit different:
The above test case has that expectation. If it is true, then the driver could choose any subset of the Requisite set. What I expect in this case is that the provisioner will choose only the topology of the selected node. @msau42, can you please correct me if my expectation is wrong.
@avalluri what you quoted is the case where no node has been selected yet, so the result is indeed that running in either zone1 or zone2 would be acceptable. But right below it is the same case with a selected node: external-provisioner/pkg/controller/topology_test.go Lines 864 to 880 in 19b7dc3
And that one still allows running in zone2, although that is not compatible with the selected node. I agree that this looks fishy. @verult?
So this unit test is misleading because it only checks requisite, which selectedNode has no impact on. Requisite is the set of topology values that includes all of the values in the cluster, reduced by StorageClass.AllowedTopologies. Preferred is where selectedNode comes into play: the first entry in preferred should contain the topology from the selectedNode.
@msau42 I am a bit confused. So you mean that, as per the above test case (@pohly, thanks for pointing to the right one), requisite is not restricted to the selected node? In which case is that full requisite set actually useful?
There's a use case for volumes that can span/replicate across multiple topologies, i.e. multiple zones or multiple nodes. However, the scheduler only picks one node, i.e. one topology. The driver can treat the scheduler's choice as the primary zone/node, but then the driver needs a way to know the other available topologies in the cluster for choosing its secondary topologies. That is why we cannot just restrict requisite topologies to only the node selected by the scheduler. Requisite is the set of topologies that are available in the cluster, and preferred is an ordering that is influenced by the scheduler's decision.
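A rough sketch of how a replicating driver might consume such a requirement, assuming it wants a given number of distinct domains with the scheduler's pick as the primary; this is illustrative only, not code from any real driver:

```go
package driver

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// pickTopologies chooses the domains for a volume with `replicas` copies:
// the primary is the first preferred entry (the scheduler's pick under
// delayed binding), the secondaries are taken from the remaining requisite
// entries.
func pickTopologies(req *csi.TopologyRequirement, replicas int) []*csi.Topology {
	if req == nil || len(req.GetPreferred()) == 0 {
		return nil
	}
	chosen := []*csi.Topology{req.GetPreferred()[0]} // primary domain
	for _, t := range req.GetRequisite() {
		if len(chosen) >= replicas {
			break
		}
		if !sameSegments(t.GetSegments(), chosen[0].GetSegments()) {
			chosen = append(chosen, t) // secondary domain
		}
	}
	return chosen
}

func sameSegments(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if b[k] != v {
			return false
		}
	}
	return true
}
```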
I respect this requirement, though I couldn't understand why the driver can't get the same information, i.e. the cluster-level topology, from the SP, as the driver is the one who defined this topology. One thing missing here is how the driver can tell whether the provided preferred topology was chosen by the CO or by the external-provisioner. It would be nice if we had some way to differentiate this.
As an example, for GCE PDs, the storage provider knows about all the zones in a region, but it does not know which of those zones the cluster is actually running in.
Can you explain why you need to distinguish between chosen by CO and chosen by external-provisioner? External-provisioner respects the Kubernetes decision if it gave one, and it maintains backwards compatibility with the old StatefulSet hashing method if you don't enable delayed binding. So Preferred does already carry the CO requirement in priority order.
Thanks @msau42 for the clarification, now I understand why this design was chosen.
Our driver deals with local storage, and we treat every node as being in its own zone. We would like to support volume replication where the Pod can choose the replication (we are still open on how that would be specified). In this case, the provisioner is currently sending the topologies of all nodes.
For your case, can you make it explicit that replication is requested by having an additional StorageClass parameter for the number of replicas, instead of inferring it from restricted topologies? Users may not care exactly which nodes are chosen: "I want 2 replicas, but choose any 2 nodes you think are best." That also opens it up for the storage provider to make a smarter decision based on capacity, etc.
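Such a parameter would arrive in CreateVolumeRequest.Parameters, since CSI forwards StorageClass parameters verbatim. A small sketch, assuming a hypothetical `replicas` parameter name:

```go
package driver

import (
	"fmt"
	"strconv"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// replicaCount reads a hypothetical "replicas" StorageClass parameter from
// the CreateVolume request, defaulting to a single copy when it is absent.
func replicaCount(req *csi.CreateVolumeRequest) (int, error) {
	const param = "replicas" // illustrative parameter name, not a convention
	v, ok := req.GetParameters()[param]
	if !ok {
		return 1, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("invalid %q parameter: %q", param, v)
	}
	return n, nil
}
```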
Let's ignore replication for now; I believe that isn't relevant here. I think some of the confusion comes from a different understanding of what topology means and how strict it is. I'm getting the feeling that topology is currently treated as a suggestion, but not a hard requirement. That's why, even though Kubernetes knows which node a pod using the volume will run on, the CSI driver is still free to create the volume somewhere else.

For local storage, we have to replace "zone" with "host", and now adhering to that topology becomes a hard requirement. A CSI driver can still do that, it just has to know that it must create the volume in the first topology listed in preferred.

I tried to define some usage models for local volumes here: intel/pmem-csi#160 The case that is affected by this ambiguity is the one where the volume has to be created on exactly the node that was picked for the pod.
I think the driver should just always create in whatever the 1st topology is in the preferred. If delayed binding is set, then it will be what the scheduler chose. If not, then it will be random.
We just verified (intel/pmem-csi#92 (comment)) that scheduling indeed fails if the volume gets created on a node other than the one the scheduler picked for the pod.
I'm not happy with the suggestion that the CSI driver should "just always create in whatever the 1st topology is in the preferred". That adds a meaning to the field which isn't part of the CSI spec; the spec only defines preferred as a preference, not a hard requirement.
The spec explicitly lists the case that a driver can only make the volume available in some of the required segments ("If x<n, then the SP SHALL choose x unique topologies from the list of requisite topologies. If it is unable to do so, the SP MUST fail the CreateVolume call."). So suppose the driver replies to such a request by making the volume available on host-2 because local space on host-3 is exhausted, with host-3 being the node picked for the pod. That's allowed by the spec, because that node is among the required ones. But then attaching the volume fails. It would be better to let the CreateVolume call fail instead.

So why does the external-provisioner send a requisite list that contains more than the topology of the selected node?

Note that there is an easy workaround: just don't use …
Currently, both requisite and preferred are filled with the topologies of all nodes. As per the spec, the SP should attempt to create the volume in the preferred topologies in order, from first to last.
And then what? As shown in "example 1", the expectation is that the CSI driver falls back to provisioning the volume outside of the "preferred" set if it has to, so the same problem would still occur. If the CO (Kubernetes + external-provisioner) wants to enforce that the volume gets provisioned where needed, or not at all, then requisite must be limited to the topology of the selected node.
These are the different topology use cases I can think of:
- "External" topology, where the storage topology != node topology. We don't support this well in Kubernetes today, but there have been requests to support it. We can discuss this as a separate issue/topic.
- "Hyper-converged" topology, where the storage topology == node topology.
The problem today is that CSI does not have a way to specify how many topology domains a plugin supports. Even if it could, it's still problematic for drivers such as PD, where some PDs only support a single zone and other PDs support multiple zones. So to support this use case today, we have the current logic, which just gives all possible choices, with preferred as hints from the scheduling decision, and relies on the driver to reduce that set to the number of topology domains that it needs. As pointed out above, this causes confusion for the simple case.

What if we add a new flag to external-provisioner where a driver can specify how many topology domains it supports? By default, it can be 1 for the simple (and most common) case. That flag will restrict how many entries are populated in the CSI topology requisite + preferred. If late binding is enabled, then it will only contain the selected node. If not, it will be chosen according to a hashing function (== StatefulSet hashing, for backwards compatibility). cc @verult @davidz627
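A rough sketch of how such a flag could act on the topology requirement before it is sent to the driver; the function and flag are hypothetical, not an existing external-provisioner option:

```go
package provisioner

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// capTopologyDomains limits the requirement to the number of topology
// domains the driver claims to support.
func capTopologyDomains(req *csi.TopologyRequirement, domains int) *csi.TopologyRequirement {
	if req == nil || domains <= 0 || len(req.GetPreferred()) <= domains {
		return req
	}
	// Preferred already lists the scheduler's pick (or the hash-based pick)
	// first, so keeping its first N entries and reusing them as requisite
	// restricts provisioning to N domains; with domains == 1 and delayed
	// binding only the selected node's topology remains.
	capped := req.GetPreferred()[:domains]
	return &csi.TopologyRequirement{Requisite: capped, Preferred: capped}
}
```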
That makes sense to me.
While choosing this topology/node, does the external-provisioner consider the available capacity? I mean, what should the driver do if there is no free space available on the chosen node?
I too agree with this. The only reason I suggested to keep filling requisite with all node topologies was …
My CSI driver deals with storage local to the node, hence I would like to limit the topology segment to an individual node by passing something like the example below in NodeGetInfoResponse.
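A hypothetical example of such a NodeGetInfo implementation, using the CSI Go bindings; the topology key name is purely illustrative and not the driver's actual key:

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct {
	nodeID string
}

// NodeGetInfo reports a topology segment that is unique to this node, so the
// volume's accessible topology is confined to a single host.
func (ns *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId: ns.nodeID,
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				"example.com/node": ns.nodeID, // same key on every node, unique value per node
			},
		},
	}, nil
}
```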
I am using delayed binding (WaitForFirstConsumer) so that the volume gets provisioned only after the scheduler picks a node for the pod that uses the claim.
My expectation was that CreateVolumeRequest.AccessibilityRequirements would be filled with the topology of the selected node, but it is filled with all the nodes in the cluster where the driver is running. My observation is that the code in aggregateTopologies() considers only the topology keys provided in the NodeInfo of the selected node when selecting nodes with similar topology, and it ignores the key's value. This results in ending up with all the nodes (where my driver runs), as all nodes have topology constraints with the same key but with unique values.
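A simplified sketch of the behavior described above (not the provisioner's actual aggregateTopologies() code): matching nodes only on the keys of the selected node's topology selects every node that carries the key, whatever its value.

```go
package provisioner

// aggregateByKeys mimics matching nodes only on the *keys* of the selected
// node's topology: any node that carries the same key, regardless of its
// value, contributes its own segment. With one key and a unique value per
// node, that yields one segment per node in the cluster.
func aggregateByKeys(selectedNodeKeys []string, nodeSegments []map[string]string) []map[string]string {
	var out []map[string]string
	for _, seg := range nodeSegments {
		matches := true
		for _, key := range selectedNodeKeys {
			if _, ok := seg[key]; !ok {
				matches = false
				break
			}
		}
		if matches {
			out = append(out, seg)
		}
	}
	return out
}
```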
Environment: