---
title: Parallel SR-IOV configuration
authors:
  - e0ne
reviewers:
  - adrianchiris
  - SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
Allow the SR-IOV Network Operator to configure more than one node at a time.

## Motivation
The SR-IOV Network Operator configures SR-IOV one node at a time and one NIC at a time. That means we may need to wait
hours or even days to configure all NICs on large cluster deployments. With multi-NIC deployments the operator
configures NICs one by one on each node, which leads to a lot of unnecessary node drain calls during SR-IOV
configuration. This proposal also moves all draining logic into a centralized place, which will reduce the chances of
race conditions and the bugs previously encountered in sriov-network-config-daemon during draining.

### Use Cases

### Goals
* The number of drainable nodes should be 1 by default
* The number of drainable nodes should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools. In this case the pool should be
  selected by its priority, as implemented for SriovNetworkNodePolicy
* Move all drain-related logic into a centralized place

### Non-Goals
Parallel NIC configuration on the same node is out of scope for this proposal.

## Proposal
Introduce a node pool drain configuration and a controller to meet the goals above.

### Workflow Description
A new Drain controller will be introduced to manage the node drain and cordon procedures. That means we no longer need
to perform the drain and use a `drain lock` in the config daemon. The overall drain process will be covered by the
following states:

```golang
const (
	DrainIdle        = "Idle"
	DrainRequired    = "Drain_Required"
	DrainMcpPaused   = "Draining_MCP_Paused"
	Draining         = "Draining"
	DrainingComplete = "Draining_Complete"
)
```

The Drain controller will watch for node annotation, `SriovNetworkNodeState` and `SriovNetworkPoolConfig` changes:
```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworkpoolconfigs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch;update;patch
```
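
A minimal sketch of the controller wiring, assuming controller-runtime v0.15-style builder APIs; the reconciler name
and the `sriovnetworkv1` import path are assumptions, not the final implementation:

```golang
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// SetupWithManager wires the Drain controller to the objects it watches:
// SriovNetworkNodeState as the primary resource, plus Nodes and
// SriovNetworkPoolConfigs as secondary triggers.
func (r *DrainReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&sriovnetworkv1.SriovNetworkNodeState{}).
		Watches(&corev1.Node{}, &handler.EnqueueRequestForObject{}).
		Watches(&sriovnetworkv1.SriovNetworkPoolConfig{}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```
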
Once `SriovNetworkNodeState.DrainStatus` is marked as `Drain_Required` by the config daemon, the controller will start
the draining procedure. The reconcile loop will check whether the current count of `Draining` nodes is below
`maxUnavailable` and, if so, mark the node as `Draining` to start the draining procedure. If the current count of
`Draining` nodes equals `maxUnavailable`, no `SriovNetworkNodeState` updates will be applied in the current reconcile
loop.
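
A minimal sketch of this gating logic, assuming the reconciler embeds a controller-runtime client and a hypothetical
`maxUnavailableFor` helper that resolves the pool configuration selecting this node:

```golang
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

func (r *DrainReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	state := &sriovnetworkv1.SriovNetworkNodeState{}
	if err := r.Get(ctx, req.NamespacedName, state); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if state.Status.DrainStatus != DrainRequired {
		return ctrl.Result{}, nil
	}

	// Count the nodes that are already draining.
	stateList := &sriovnetworkv1.SriovNetworkNodeStateList{}
	if err := r.List(ctx, stateList); err != nil {
		return ctrl.Result{}, err
	}
	draining := 0
	for _, s := range stateList.Items {
		if s.Status.DrainStatus == Draining {
			draining++
		}
	}

	// maxUnavailableFor is an assumed helper resolving the node's pool limit.
	if draining >= r.maxUnavailableFor(ctx, state) {
		// The limit is reached; do not update this state, try again later.
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}

	state.Status.DrainStatus = Draining
	return ctrl.Result{}, r.Status().Update(ctx, state)
}
```
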
Once draining and configuration are finished, the config daemon will update `SriovNetworkNodeState.DrainStatus` to
`Idle`.

We can add a validating webhook to block configuration changes that lower the limit while more nodes are draining.
Without such a webhook, the controller will proceed with the drain procedure for the current `Draining` nodes until
their count decreases to the specified `maxUnavailable` number.
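
A possible shape for that check, sketched as a controller-runtime `CustomValidator`; the webhook type and the
`countDrainingNodes` helper are assumptions:

```golang
import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// ValidateUpdate rejects lowering the parallelism limit below the number of
// nodes that are currently draining.
func (w *PoolConfigWebhook) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
	cfg, ok := newObj.(*sriovnetworkv1.SriovNetworkPoolConfig)
	if !ok {
		return nil, fmt.Errorf("expected a SriovNetworkPoolConfig, got %T", newObj)
	}
	draining, err := w.countDrainingNodes(ctx, cfg) // assumed helper
	if err != nil {
		return nil, err
	}
	limit := cfg.Spec.DrainConfig.MaxParallelNodeConfiguration
	if limit != 0 && draining > limit { // 0 means no limit
		return nil, fmt.Errorf("%d nodes are draining, cannot lower maxParallelNodeConfiguration to %d", draining, limit)
	}
	return nil, nil
}
```
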
We need this logic in the controller to keep it centralized and to avoid race conditions during
`SriovNetworkNodeState` updates.

The config daemon will be responsible for setting `SriovNetworkNodeState.DrainStatus=Drain_Required` and
`SriovNetworkNodeState.DrainStatus=Draining_Complete` only, which simplifies its implementation.
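
For illustration, the daemon-side updates could be as small as this sketch; the `Daemon` type and its embedded client
are assumptions:

```golang
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

type Daemon struct {
	client client.Client
}

// requestDrain asks the Drain controller to schedule this node for draining.
func (d *Daemon) requestDrain(ctx context.Context, state *sriovnetworkv1.SriovNetworkNodeState) error {
	state.Status.DrainStatus = DrainRequired
	return d.client.Status().Update(ctx, state)
}

// completeDrain tells the Drain controller the node can be uncordoned.
func (d *Daemon) completeDrain(ctx context.Context, state *sriovnetworkv1.SriovNetworkNodeState) error {
	state.Status.DrainStatus = DrainingComplete
	return d.client.Status().Update(ctx, state)
}
```
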
### API Extensions

#### Extend existing CR SriovNetworkNodeState
SriovNetworkNodeState already contains the current state of the configuration progress in its `status` field, so we
don't need to store the drain status in a node annotation. `SriovNetworkNodeStateStatus` will be extended to contain
the drain progress instead:

```golang
// SriovNetworkNodeStateStatus defines the observed state of SriovNetworkNodeState
type SriovNetworkNodeStateStatus struct {
	Interfaces InterfaceExts `json:"interfaces,omitempty"`
	// +kubebuilder:validation:Enum=Idle;Drain_Required;Draining_MCP_Paused;Draining;Draining_Complete
	DrainStatus   string `json:"drainStatus,omitempty"`
	SyncStatus    string `json:"syncStatus,omitempty"`
	LastSyncError string `json:"lastSyncError,omitempty"`
}
```

#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift to provide the configuration for OVS Hardware Offloading.
We can extend it to add the configuration for the drain pool. E.g.:

```golang
// DrainConfigSpec defines the node pool drain configuration
type DrainConfigSpec struct {
	// Number of nodes that can be configured in parallel
	// 0 means no limit, all nodes will be configured in parallel
	// +kubebuilder:default:=1
	// +kubebuilder:validation:Minimum:=0
	MaxParallelNodeConfiguration int `json:"maxParallelNodeConfiguration,omitempty"`
}

// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=99
	// Priority of the SriovNetworkPoolConfig, higher priority policies can override lower ones.
	Priority int `json:"priority,omitempty"`
	// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
	OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
	// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
	NodeSelectorTerms *v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
	// DrainConfig holds the drain configuration for this pool
	DrainConfig DrainConfigSpec `json:"drainConfig,omitempty"`
}
```

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  drainConfig:
    maxParallelNodeConfiguration: 5
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-2
    - matchExpressions:
        - key: other-label
          operator: "Exists"
  ovsHardwareOffloadConfig: {}
```
The default pool should always be created with the lowest priority:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 0
  drainConfig:
    maxParallelNodeConfiguration: 1
```
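
When a node matches several pools, the highest-priority pool wins. A sketch of that selection, assuming a hypothetical
`poolMatchesNode` helper that evaluates the pool's `nodeSelectorTerms` against the node:

```golang
import (
	corev1 "k8s.io/api/core/v1"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// selectPoolFor returns the matching pool with the highest priority,
// mirroring how SriovNetworkNodePolicy priorities are resolved.
func selectPoolFor(node *corev1.Node, pools []sriovnetworkv1.SriovNetworkPoolConfig) *sriovnetworkv1.SriovNetworkPoolConfig {
	var selected *sriovnetworkv1.SriovNetworkPoolConfig
	for i := range pools {
		pool := &pools[i]
		if !poolMatchesNode(pool, node) { // assumed helper evaluating nodeSelectorTerms
			continue
		}
		if selected == nil || pool.Spec.Priority > selected.Spec.Priority {
			selected = pool
		}
	}
	// nil if no pool matches; the default pool is expected to match all nodes.
	return selected
}
```
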
### Implementation Details/Notes/Constraints
The implementation of this feature is very complex and requires a lot of changes to different parts of the SR-IOV
Network Operator. To not introduce breaking changes we have to split this effort into several phases:
* implement the new Drain controller without user-facing API changes:
  it will proceed with only one node configuration at a time and doesn't require API changes
* drop the usage of the `sriovnetwork.openshift.io/state` node annotation and move its values into the
  `SriovNetworkNodeState` object
* introduce new API changes to support pools of nodes to be drained in parallel:
  at this phase we introduce the new fields in SriovNetworkPoolConfig and modify the Drain controller to watch the
  specified CRs and proceed with draining in parallel per node pool configuration

All phases should be implemented one by one in separate PRs in the order above.

### Upgrade & Downgrade considerations | ||
After operator upgrade we have to support `sriovnetwork.openshift.io/state` node annotation to | ||
`SriovNetworkNodeState.DrainStatus` migration. This logic will be implemented in a Drain controller and should be | ||
supported at least until next major version of the operator. | ||
|
||
Node reconcile loop will check for `sriovnetwork.openshift.io/state` annotation and move its value to | ||
`SriovNetworkNodeState.DrainStatus` field. `sriovnetwork.openshift.io/state` annotation will be deleted by operator once | ||
`SriovNetworkNodeState` is updated. | ||
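
A sketch of that migration step, assuming it runs inside the Drain controller's node reconcile loop:

```golang
import (
	"context"

	corev1 "k8s.io/api/core/v1"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

const stateAnnotation = "sriovnetwork.openshift.io/state"

// migrateAnnotation copies the legacy node annotation into
// SriovNetworkNodeState.Status.DrainStatus and then removes it.
func (r *DrainReconciler) migrateAnnotation(ctx context.Context, node *corev1.Node, state *sriovnetworkv1.SriovNetworkNodeState) error {
	value, found := node.Annotations[stateAnnotation]
	if !found {
		return nil // nothing to migrate
	}
	state.Status.DrainStatus = value
	if err := r.Status().Update(ctx, state); err != nil {
		return err
	}
	// Delete the annotation only after the status update succeeded, so a
	// failure keeps the legacy source of truth intact.
	delete(node.Annotations, stateAnnotation)
	return r.Update(ctx, node)
}
```
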
### Alternative APIs
#### Option 1: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include the drain pools configuration. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
    - name: default
      maxUnavailable: 1
      priority: 0 # the lowest priority
    - name: pool-1
      maxUnavailable: 5
      priority: 44
      # empty nodeSelectorTerms means 'all nodes'
      nodeSelectorTerms:
        - matchExpressions:
            - key: some-label
              operator: In
              values:
                - val-1
                - val-2
        - matchExpressions:
            - key: other-label
              operator: "Exists"
```

We didn't choose this option because SriovOperatorConfig contains Config Daemon-specific options only, while the drain
configuration is node-specific.

#### Option 2: New CRD
Add a new `SriovDrainConfig` CRD with the fields mentioned in the previous option. E.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 0 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
```

We didn't choose this option because the already defined `SriovNetworkPoolConfig` CRD could be used for the needed
configuration.

### Test Plan
* Unit tests will be implemented for the new Drain controller.
* E2E, manual, or automated functional testing should include the following test cases:
  * verify that we actually configure SR-IOV on `maxUnavailable` nodes at the same time
  * verify that we don't configure more than `maxUnavailable` nodes at the same time
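
A rough e2e-style sketch of the last two cases; all helpers are assumptions about the test framework, not existing
code:

```golang
import (
	"testing"
	"time"
)

// TestParallelDrainLimit reconfigures SR-IOV on all nodes and verifies that
// the number of simultaneously draining nodes never exceeds the pool limit.
func TestParallelDrainLimit(t *testing.T) {
	const maxUnavailable = 5
	applyPolicyToAllNodes(t) // assumed helper triggering reconfiguration

	deadline := time.Now().Add(30 * time.Minute)
	for time.Now().Before(deadline) {
		if n := countNodeStatesWithDrainStatus(t, "Draining"); n > maxUnavailable { // assumed helper
			t.Fatalf("%d nodes draining at once, limit is %d", n, maxUnavailable)
		}
		if allNodeStatesIdle(t) { // assumed helper: every DrainStatus is back to Idle
			return // finished without ever exceeding the limit
		}
		time.Sleep(10 * time.Second)
	}
	t.Fatal("timed out waiting for SR-IOV configuration to finish")
}
```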