---
title: Parallel SR-IOV configuration
authors:
- e0ne
reviewers:
- adrianchiris
- SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
Allow the SR-IOV Network Operator to configure more than one node at a time.

## Motivation
The SR-IOV Network Operator configures SR-IOV one node at a time and one NIC at a time. That means we'll need to wait
hours or even days to configure all NICs on large cluster deployments. With multi-NIC deployments the Operator configures
NICs one by one on each node, which leads to a lot of unnecessary node drain calls during SR-IOV configuration. Moving
all draining logic to a centralized place will also reduce the chance of race conditions and bugs that were previously
encountered in sriov-network-config-daemon during draining.

### Use Cases

### Goals
* Number of drainable nodes should be 1 by default
* Number of drainable nodes should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools. In this case, the pool should be
  selected by its priority, as implemented for SriovNetworkNodePolicy
* Move all drain-related logic into a centralized place


### Non-Goals
Parallel NIC configuration on the same node is out of the scope of this proposal.

## Proposal
Introduce a node pool drain configuration and a controller to meet the goals above.


### Workflow Description
A new Drain controller will be introduced to manage node drain and cordon procedures. That means the config daemon no
longer needs to perform the drain or hold a `drain lock`. The overall drain process will be covered by the following states:

```golang
const (
	DrainIdle        = "Idle"
	DrainRequired    = "Drain_Required"
	DrainMcpPaused   = "Draining_MCP_Paused"
	Draining         = "Draining"
	DrainingComplete = "Draining_Complete"
)
```

The Drain controller will watch for node annotation, `SriovNetworkNodeState`, and `SriovNetworkPoolConfig` changes:
```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworkpoolconfigs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch;update;patch
```
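
The registration of these watches could look roughly like the sketch below. This is only an illustration, not the final implementation: the reconciler name, the API package import path, and the controller-runtime version (v0.15+ `Watches` signature) are assumptions.

```golang
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	// assumed import path for the operator's API types
	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// DrainReconciler is a sketch of the new Drain controller.
type DrainReconciler struct {
	client.Client
}

// Reconcile would host the drain state machine (Idle -> Drain_Required -> Draining -> ...).
func (r *DrainReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager wires the watches listed in the RBAC markers above:
// SriovNetworkNodeState is the primary resource; node and SriovNetworkPoolConfig
// changes also trigger reconciliation. A real controller would likely use
// handler.EnqueueRequestsFromMapFunc to map node/pool events to the affected
// SriovNetworkNodeState objects; EnqueueRequestForObject keeps the sketch short.
func (r *DrainReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&sriovnetworkv1.SriovNetworkNodeState{}).
		Watches(&corev1.Node{}, &handler.EnqueueRequestForObject{}).
		Watches(&sriovnetworkv1.SriovNetworkPoolConfig{}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}
```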

Once the config daemon marks `SriovNetworkNodeState.DrainStatus` as `Drain_Required`, the controller will start the
draining procedure. The reconcile loop will check whether the current number of `Draining` nodes is below
`maxUnavailable` and, if so, mark the node as `Draining` to start the draining procedure. If the current number of
`Draining` nodes already equals `maxUnavailable`, no `SriovNetworkNodeState` updates will be applied in that reconcile loop.
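
A possible shape of that capacity check is sketched below, using the pool field defined later in the API Extensions section. It assumes the `context` and API imports from the previous sketch; `nodeInPool` is a hypothetical selector-matching helper.

```golang
// canStartDraining reports whether another node in the given pool may move to
// Draining without exceeding the pool's parallelism limit.
func (r *DrainReconciler) canStartDraining(ctx context.Context, pool *sriovnetworkv1.SriovNetworkPoolConfig) (bool, error) {
	nodeStates := &sriovnetworkv1.SriovNetworkNodeStateList{}
	if err := r.List(ctx, nodeStates); err != nil {
		return false, err
	}
	draining := 0
	for _, ns := range nodeStates.Items {
		// Count only nodes of this pool that are currently in a draining state.
		if nodeInPool(&ns, pool) &&
			(ns.Status.DrainStatus == "Draining" || ns.Status.DrainStatus == "Draining_MCP_Paused") {
			draining++
		}
	}
	limit := pool.Spec.DrainConfig.MaxParallelNodeConfiguration
	if limit == 0 {
		// 0 means no limit: all nodes may be configured in parallel.
		return true, nil
	}
	return draining < limit, nil
}
```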

Once draining and configuration are finished, the config daemon will update `SriovNetworkNodeState.DrainStatus` to
`Idle`.

We can add a validating webhook to block lowering the limit while nodes are draining. Without such a webhook, the
controller will continue the drain procedure for the nodes that are already `Draining` until their count drops to the
specified `maxUnavailable` number.
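
If such a webhook is added, its core check could be as simple as the following sketch; the function name, the assumed `fmt` import, and the way the current number of draining nodes is obtained are illustrative only.

```golang
// validateDrainConfigUpdate rejects lowering the parallelism limit below the
// number of nodes that are already draining in the pool.
func validateDrainConfigUpdate(oldCfg, newCfg *sriovnetworkv1.SriovNetworkPoolConfig, currentlyDraining int) error {
	newLimit := newCfg.Spec.DrainConfig.MaxParallelNodeConfiguration
	if newLimit == 0 {
		// 0 means no limit, so lowering to 0 never blocks draining nodes.
		return nil
	}
	if newLimit < oldCfg.Spec.DrainConfig.MaxParallelNodeConfiguration && currentlyDraining > newLimit {
		return fmt.Errorf("cannot lower maxParallelNodeConfiguration to %d while %d nodes are draining",
			newLimit, currentlyDraining)
	}
	return nil
}
```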

We need this logic in the controller so that it is centralized and avoids race conditions during
`SriovNetworkNodeState` updates.

The config daemon will be responsible only for setting `SriovNetworkNodeState.DrainStatus=Drain_Required` and
`SriovNetworkNodeState.DrainStatus=Draining_Complete`, which simplifies its implementation.
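
On the daemon side this could then be reduced to a small status update, roughly as below; the helper name and the `types`/`client` imports are assumptions, the status value comes from the states defined above.

```golang
// requestDrain is what the config daemon would call before reconfiguring a node:
// it only flips DrainStatus and leaves the actual drain to the Drain controller.
func requestDrain(ctx context.Context, c client.Client, name types.NamespacedName) error {
	nodeState := &sriovnetworkv1.SriovNetworkNodeState{}
	if err := c.Get(ctx, name, nodeState); err != nil {
		return err
	}
	nodeState.Status.DrainStatus = "Drain_Required"
	// DrainStatus lives in the status subresource, so only the status is updated.
	return c.Status().Update(ctx, nodeState)
}
```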

### API Extensions

#### Extend existing CR SriovNetworkNodeState
SriovNetworkNodeState already contains the current state of the configuration progress in a `status` field, so we don't
need to store the drain status in a node annotation. `SriovNetworkNodeStateStatus` will be extended to track drain
progress instead of the node annotation:

```golang
// SriovNetworkNodeStateStatus defines the observed state of SriovNetworkNodeState
type SriovNetworkNodeStateStatus struct {
	Interfaces InterfaceExts `json:"interfaces,omitempty"`
	// +kubebuilder:validation:Enum=Idle;Drain_Required;Draining_MCP_Paused;Draining;Draining_Complete
	DrainStatus   string `json:"drainStatus,omitempty"`
	SyncStatus    string `json:"syncStatus,omitempty"`
	LastSyncError string `json:"lastSyncError,omitempty"`
}
```

#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift to provide configuration for
OVS Hardware Offloading. We can extend it with configuration for the drain
pool, e.g.:

```golang
// DrainConfigSpec defines node pool drain configuration
type DrainConfigSpec struct {
	// Number of nodes that can be configured in parallel.
	// 0 means no limit, all nodes will be configured in parallel.
	// +kubebuilder:default:=1
	// +kubebuilder:validation:Minimum:=0
	MaxParallelNodeConfiguration int `json:"maxParallelNodeConfiguration,omitempty"`
}

// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=99
	// Priority of the SriovNetworkPoolConfig; higher priority policies can override lower ones.
	Priority int `json:"priority,omitempty"`
	// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
	OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
	// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
	NodeSelectorTerms []v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
	DrainConfig       DrainConfigSpec       `json:"drainConfig,omitempty"`
}
```

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  drainConfig:
    maxParallelNodeConfiguration: 5
  nodeSelectorTerms:
  - matchExpressions:
    - key: some-label
      operator: In
      values:
      - val-2
  - matchExpressions:
    - key: other-label
      operator: "Exists"
  ovsHardwareOffloadConfig: {}
```
A default pool should always be created with the lowest priority:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 0
  drainConfig:
    maxParallelNodeConfiguration: 1
```
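
When a node matches more than one pool, the priority-based selection could look like the sketch below, following the examples above where a larger priority value wins and the default pool has priority 0; `poolMatchesNode` is a hypothetical `nodeSelectorTerms` matching helper.

```golang
// selectPoolForNode returns the matching pool with the highest priority, or nil
// if no pool (including the default one) matches the node.
func selectPoolForNode(node *corev1.Node, pools []sriovnetworkv1.SriovNetworkPoolConfig) *sriovnetworkv1.SriovNetworkPoolConfig {
	var selected *sriovnetworkv1.SriovNetworkPoolConfig
	for i := range pools {
		pool := &pools[i]
		if !poolMatchesNode(pool, node) { // hypothetical label/selector matching helper
			continue
		}
		if selected == nil || pool.Spec.Priority > selected.Spec.Priority {
			selected = pool
		}
	}
	return selected
}
```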
### Implementation Details/Notes/Constraints
The implementation of this feature is complex and requires changes to many parts of the SR-IOV Network
Operator. To avoid introducing breaking changes we have to split this effort into several phases:
* implement the new Drain controller without user-facing API changes:
  it will proceed with only one node configuration at a time and doesn't require API changes
* drop the `sriovnetwork.openshift.io/state` node annotation and move its values into the `SriovNetworkNodeState` object
* introduce new API changes to support draining a pool of nodes in parallel:
  in this phase we introduce new fields in SriovNetworkPoolConfig and modify the Drain controller to watch
  the specified CRs and proceed with draining in parallel per node pool configuration

All phases should be implemented one by one, in separate PRs, in the order above.

### Upgrade & Downgrade considerations
After an operator upgrade we have to support migrating the `sriovnetwork.openshift.io/state` node annotation to the
`SriovNetworkNodeState.DrainStatus` field. This logic will be implemented in the Drain controller and should be
supported at least until the next major version of the operator.

The node reconcile loop will check for the `sriovnetwork.openshift.io/state` annotation and move its value to the
`SriovNetworkNodeState.DrainStatus` field. The `sriovnetwork.openshift.io/state` annotation will be deleted by the
operator once `SriovNetworkNodeState` is updated.
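
A sketch of that migration step is shown below; the annotation key comes from this proposal, while the method name, constant name, and error handling are illustrative.

```golang
const drainAnnotationKey = "sriovnetwork.openshift.io/state"

// migrateDrainAnnotation copies the legacy annotation value into the new status
// field and removes the annotation only after the status update has succeeded.
func (r *DrainReconciler) migrateDrainAnnotation(ctx context.Context, node *corev1.Node, nodeState *sriovnetworkv1.SriovNetworkNodeState) error {
	value, found := node.Annotations[drainAnnotationKey]
	if !found {
		return nil // nothing to migrate
	}
	nodeState.Status.DrainStatus = value
	if err := r.Status().Update(ctx, nodeState); err != nil {
		return err
	}
	delete(node.Annotations, drainAnnotationKey)
	return r.Update(ctx, node)
}
```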

### Alternative APIs
#### Option 1: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include drain pool configuration, e.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
  - name: default
    maxUnavailable: 1
    priority: 0 # the lowest priority
  - name: pool-1
    maxUnavailable: 5
    priority: 44
    # empty nodeSelectorTerms means 'all nodes'
    nodeSelectorTerms:
    - matchExpressions:
      - key: some-label
        operator: In
        values:
        - val-1
        - val-2
    - matchExpressions:
      - key: other-label
        operator: "Exists"
```

We didn't choose this option because SriovOperatorConfig contains Config Daemon-specific options only, while the drain
configuration is node-specific.

#### Option 2: New CRD
Add a new `DrainConfiguration` CRD (e.g. `SriovDrainConfig`) with the fields mentioned in the previous options, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 0 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
  - matchExpressions:
    - key: some-label
      operator: In
```

We didn't choose this option because the `SriovNetworkPoolConfig` CRD is already defined and can be used for the needed
configuration.

### Test Plan
* Unit tests will be implemented for the new Drain controller.
* E2E, manual, or automated functional testing should include the following test cases:
  * verify that we actually configure SR-IOV on `maxUnavailable` nodes at the same time
  * check that we don't configure more than `maxUnavailable` nodes at the same time
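
For the parallelism test cases, an E2E check could repeatedly poll an assertion like the sketch below while a policy is being applied; the helper name and `fmt`/`client` imports are illustrative, not existing test helpers of the operator.

```golang
// assertDrainLimitRespected fails when more nodes than allowed are draining at once.
func assertDrainLimitRespected(ctx context.Context, c client.Client, maxUnavailable int) error {
	nodeStates := &sriovnetworkv1.SriovNetworkNodeStateList{}
	if err := c.List(ctx, nodeStates); err != nil {
		return err
	}
	draining := 0
	for _, ns := range nodeStates.Items {
		if ns.Status.DrainStatus == "Draining" || ns.Status.DrainStatus == "Draining_MCP_Paused" {
			draining++
		}
	}
	if draining > maxUnavailable {
		return fmt.Errorf("found %d nodes draining, expected at most %d", draining, maxUnavailable)
	}
	return nil
}
```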
