Skip to content

k8snetworkplumbingwg/rdma-cni

Repository files navigation

License Go Report Card Build&Tests Coverage Status

RDMA CNI plugin

CNI compliant plugin for network namespace aware RDMA interfaces.

RDMA CNI plugin allows network namespace isolation for RDMA workloads in a containerized environment.

Overview

RDMA CNI plugin is intended to be run as a chained CNI plugin (introduced in CNI Specifications v0.3.0). It ensures isolation of RDMA traffic from other workloads in the system by moving the associated RDMA interfaces of the provided network interface to the container's network namespace path.

The main use-case (for now...) is for containerized SR-IOV workloads orchestrated by Kubernetes that perform RDMA and wish to leverage network namespace isolation of RDMA devices introduced in linux kernel 5.3.0.

Requirements

Hardware

SR-IOV capable NIC which supports RDMA.

Supported Hardware

Mellanox Network adapters

ConnectX®-4 and above

Operating System

Linux distribution

Kernel

Kernel based on 5.3.0 or newer, RDMA modules loaded in the system. rdma-core package provides means to automatically load relevant modules on system start.

Note: For deployments that use Mellanox out-of-tree driver (Mellanox OFED), Mellanox OFED version 4.7 or newer is required. In this case it is not required to use a Kernel based on 5.3.0 or newer.

Pacakges

iproute2 package based on kernel 5.3.0 or newer installed on the system.

Note: It is recommended that the required packages are installed by your system's package manager.

Note: For deployments using Mellanox OFED, iproute2 package is bundled with the driver under /opt/mellanox/iproute2/

Deployment requirements (Kubernetes)

Please refer to the relevant link on how to deploy each component. For a Kubernetes deployment, each SR-IOV capable worker node should have:

Note:: Kubernetes version 1.16 or newer is required for deploying as daemonset

RDMA CNI configurations

{
  "cniVersion": "0.3.1",
  "type": "rdma",
  "args": {
    "cni": {
      "debug": true
    }
  }
}

Note: "args" keyword is optional.

Deployment

System configuration

It is recommended to set RDMA subsystem namespace awareness mode to exclusive on OS boot.

Set RDMA subsystem namespace awareness mode to exclusive via ib_core module parameter:

~$ echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf

Set RDMA subsystem namespace awareness mode to exclusive via rdma tool:

~$ rdma system set netns exclusive

Note: When changing RDMA subsystem netns mode, kernel requires that no network namespaces to exist in the system.

Deploy RDMA CNI

~$ kubectl apply -f ./deployment/rdma-cni-daemonset.yaml

Deploy workload

Pod definition can be found in the example below.

~$ kubectl apply -f ./examples/my_rdma_test_pod.yaml

Pod example:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
  containers:
    - name: rdma-app
      image: centos/tools
      imagePullPolicy: IfNotPresent
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 300000; done;" ]
      resources:
        requests:
          mellanox.com/sriov_rdma: '1'
        limits:
          mellanox.com/sriov_rdma: '1'

SR-IOV Network Device Plugin ConfigMap example

The following yaml defines an RDMA enabled SR-IOV resource pool named: mellanox.com/sriov_rdma

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
           "resourcePrefix": "mellanox.com",
           "resourceName": "sriov_rdma",
           "selectors": {
               "isRdma": true,
               "vendors": ["15b3"],
               "pfNames": ["enp4s0f0"]
           }
        }
      ]
    }

Network CRD example

The following yaml defines a network, sriov-network, associated with an rdma enabled resurce, mellanox.com/sriov_rdma.

The CNI plugins that will be executed in a chain are for Pods that request this network are: sriov, rdma CNIs

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rdma-net
  annotations:
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/sriov_rdma
spec:
  config: '{
             "cniVersion": "0.3.1",
             "name": "sriov-rdma-net",
             "plugins": [{
                          "type": "sriov",
                          "ipam": {
                            "type": "host-local",
                            "subnet": "10.56.217.0/24",
                            "routes": [{
                              "dst": "0.0.0.0/0"
                            }],
                            "gateway": "10.56.217.1"
                          }
                        },
                        {
                          "type": "rdma"
                        }]
           }'

Development

It is recommended to use the same go version as defined in .travis.yml to avoid potential build related issues during development (newer version will most likely work as well).

Build from source

~$ git clone https://github.com/k8snetworkplumbingwg/rdma-cni.git
~$ cd rdma-cni
~$ make

Upon a successful build, rdma binary can be found under ./build. For small deployments (e.g a kubernetes test cluster/AIO K8s deployment) you can:

  1. Copy rdma binary to the CNI dir in each worker node.
  2. Build container image, push to your own image repo then modify the deployment template and deploy.

Run tests:

~$ make tests

Build image:

~$ make image

Limitations

Ethernet

RDMA workloads utilizing RDMA Connection Manager (CM)

For Mellanox Hardware, due to kernel limitation, it is required to pre-allocate MACs for all VFs in the deployment if an RDMA workload wishes to utilize RMDA CM to establish connection.

This is done in the following manner:

Set VF administrative MAC address :

$ ip link set <pf-netdev> vf <vf-index> mac <mac-address>

Unbind/Bind VF driver :

$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo <vf-pci-address> > /sys/bus/pci/drivers/mlx5_core/bind

Example:

$ ip link set enp4s0f0 vf 3 mac 02:03:00:00:48:56
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo 0000:03:00.5 > /sys/bus/pci/drivers/mlx5_core/bind

Doing so will populate the VF's node and pord GUID required for RDMA CM to establish connection.