Today, there exist a few gaps between Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers that prevent them from working together smoothly.
First, it’s cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with Filesystem volume mode, Virtio-fs is the only way to surface it inside a Kata Containers guest VM. However, mounting the filesystem (FS) directly within the guest operating system (OS) is often preferable, offering better performance, native FS features, and stronger security than the Virtio-fs mechanism.
Second, it’s difficult, if not impossible, to resize a PV online with Kata Containers. While a PV can be expanded on the host OS, the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume. Currently, there is no way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.
Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS, as is normally done for runc containers. Instead, they can be implemented by the Kata Containers agent inside the guest OS, but that requires the CSI driver to pass the relevant information to the Kata Containers runtime.
An ideal long-term solution would be to have the kubelet coordinate the communication between the CSI driver and the container runtime, as described in KEP-2857. However, as the KEP is still under review, we would like to propose a short/medium-term solution to unblock our use case.
The proposed solution is built on top of a previous proposal described by Eric Ernst. The previous proposal has two gaps:
- Writing a `csiPlugin.json` file to the volume root path introduced a security risk: a malicious user can gain unauthorized access to a block device by writing their own `csiPlugin.json` to that location through an ephemeral CSI plugin.
- The proposal didn't describe how to establish a mapping between a volume and a Kata sandbox, which is needed to implement the CSI volume resize and volume stats collection APIs.
This document particularly focuses on how to address these two gaps.
This proposal comes with the following assumptions and limitations:

- A block device volume will only be used by one Pod on a node at a time, which we believe is the most common pattern in Kata Containers use cases. It’s also unsafe to have the same block device attached to more than one Kata pod. In the context of Kubernetes, the `PersistentVolumeClaim` (PVC) needs to have its `accessMode` set to `ReadWriteOncePod`.
- More advanced Kubernetes volume features, such as `fsGroup`, `fsGroupChangePolicy`, and `subPath`, are not supported.
At a high level, the proposal works as follows:

- The user specifies a PV as a direct-assigned volume. How a PV is specified as a direct-assigned volume is left for each CSI implementation to decide. There are a few options for reference:
  - A storage class parameter specifies whether it's a direct-assigned volume. This avoids any lookups of PVC or Pod information from the CSI plugin (as the external provisioner takes care of these). However, all PVs in a storage class with the parameter set will have host mounts skipped.
  - Use a PVC annotation. This approach requires the external provisioner to run with `--extra-create-metadata` so that the CSI plugin can look up the PVC annotations from the API server. Pro: the API server lookup of annotations is only required during creation of the PV. Con: the CSI plugin will always skip host mounting of the PV.
  - The CSI plugin can also look up the Pod's `runtimeclass` during `NodePublish`, as shown in the sketch after this list. This approach can be found in the Alibaba CSI plugin.
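For the third option, a hedged sketch of such a lookup. It assumes the driver's `CSIDriver` object sets `podInfoOnMount: true`, so the kubelet passes the Pod's name and namespace in the volume context; `isDirectVolume` is a hypothetical helper, not part of this proposal:

```go
import (
	"context"
	"strings"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isDirectVolume decides whether to skip the host mount by checking whether
// the consuming Pod runs under a Kata runtime class. The pod name/namespace
// keys are populated by the kubelet when podInfoOnMount is enabled.
func isDirectVolume(ctx context.Context, client kubernetes.Interface, req *csi.NodePublishVolumeRequest) (bool, error) {
	name := req.GetVolumeContext()["csi.storage.k8s.io/pod.name"]
	namespace := req.GetVolumeContext()["csi.storage.k8s.io/pod.namespace"]
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	rc := pod.Spec.RuntimeClassName
	return rc != nil && strings.HasPrefix(*rc, "kata"), nil
}
```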
- The CSI node driver delegates the handling of a direct-assigned volume to the Kata Containers runtime. The CSI node driver APIs need to be modified to pass volume mount information to, and collect volume information from, the Kata Containers runtime by invoking `kata-runtime` command line commands (a sketch of the CSI side follows the `MountInfo` definition below):
  - NodePublishVolume -- It invokes `kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]` to propagate the volume mount information to the Kata Containers runtime for it to carry out the filesystem mount operation. The `volumePath` is the `target_path` in the CSI `NodePublishVolumeRequest`. The `mountInfo` is a serialized JSON string.
  - NodeGetVolumeStats -- It invokes `kata-runtime direct-volume stats --volume-path [volumePath]` to retrieve the filesystem stats of a direct-assigned volume.
  - NodeExpandVolume -- It invokes `kata-runtime direct-volume resize --volume-path [volumePath] --size [size]` to send a resize request to the Kata Containers runtime to resize the direct-assigned volume.
  - NodeUnstageVolume -- It invokes `kata-runtime direct-volume remove --volume-path [volumePath]` to remove the persisted metadata of a direct-assigned volume.
The `mountInfo` object is defined as follows:
```go
type MountInfo struct {
	// The type of the volume (i.e. block)
	VolumeType string `json:"volume-type"`
	// The device backing the volume.
	Device string `json:"device"`
	// The filesystem type to be mounted on the volume.
	FsType string `json:"fstype"`
	// Additional metadata to pass to the agent regarding this volume.
	Metadata map[string]string `json:"metadata,omitempty"`
	// Additional mount options.
	Options []string `json:"options,omitempty"`
}
```
Note: given that the `mountInfo` is persisted to disk by the Kata runtime, it shouldn't contain any secrets (such as an SMB mount password).
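To illustrate the CSI side, a `NodePublishVolume` implementation could serialize the `MountInfo` defined above and delegate to `kata-runtime` roughly as follows. This is a hypothetical sketch (`publishDirectVolume` and its error handling are not part of this proposal):

```go
import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// publishDirectVolume is a hypothetical helper that a NodePublishVolume
// implementation could call for a direct-assigned volume: it serializes
// the MountInfo defined above and hands it to the Kata Containers runtime.
func publishDirectVolume(targetPath string, info MountInfo) error {
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	out, err := exec.Command("kata-runtime", "direct-volume", "add",
		"--volume-path", targetPath,
		"--mount-info", string(data)).CombinedOutput()
	if err != nil {
		return fmt.Errorf("direct-volume add failed: %w: %s", err, out)
	}
	return nil
}
```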
Instead of the CSI node driver writing the mount info into a `csiPlugin.json` file under the volume root, as described in the original proposal, here we propose that the CSI node driver passes the mount information to the Kata Containers runtime through a new `kata-runtime` command line command. The `kata-runtime` then writes the mount information to a `mountInfo.json` file in a predefined location (`/run/kata-containers/shared/direct-volumes/[volume_path]/`).
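Conceptually, the `add` command persists the metadata along these lines (a simplified sketch; the real implementation may encode or sanitize the volume path differently):

```go
import (
	"os"
	"path/filepath"
)

// Base directory where the Kata runtime persists direct-assigned volume
// metadata, per this proposal.
const directVolumesDir = "/run/kata-containers/shared/direct-volumes"

// addDirectVolume sketches the effect of `kata-runtime direct-volume add`:
// persist the mount information under the per-volume directory so the
// runtime can find it when a container using the volume is started.
func addDirectVolume(volumePath, mountInfo string) error {
	dir := filepath.Join(directVolumesDir, volumePath)
	if err := os.MkdirAll(dir, 0700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "mountInfo.json"), []byte(mountInfo), 0600)
}
```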
When the Kata Containers runtime starts a container, it determines whether a volume mount is a direct-assigned volume by checking for a `mountInfo` file under the computed Kata `direct-volumes` directory. If one exists, the runtime parses the `mountInfo` file and updates the mount spec with its data. The updated mount spec is then passed to the Kata agent in the guest VM together with the other mounts. The Kata Containers runtime also creates a file named after the sandbox id under the `direct-volumes/[volume_path]/` directory. This sandbox id file establishes a mapping between the volume and the sandbox using it. Later, when the Kata Containers runtime handles the `stats` and `resize` commands, it uses the sandbox id to identify the endpoint of the corresponding `containerd-shim-kata-v2`.
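For illustration, recovering the sandbox id can be as simple as listing the per-volume directory (a sketch reusing the hypothetical `directVolumesDir` constant from above):

```go
import (
	"fmt"
	"os"
	"path/filepath"
)

// sandboxIDForVolume returns the id of the sandbox currently using a
// direct-assigned volume. Every entry in the per-volume directory other
// than mountInfo.json is a sandbox id file written by the runtime.
func sandboxIDForVolume(volumePath string) (string, error) {
	entries, err := os.ReadDir(filepath.Join(directVolumesDir, volumePath))
	if err != nil {
		return "", err
	}
	for _, entry := range entries {
		if entry.Name() != "mountInfo.json" {
			return entry.Name(), nil
		}
	}
	return "", fmt.Errorf("no sandbox is using volume %s", volumePath)
}
```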
`containerd-shim-kata-v2` provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: `/direct-volume/stats` and `/direct-volume/resize`:
Example:
```bash
$ curl --unix-socket "$shim_socket_path" -I -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
$ curl --unix-socket "$shim_socket_path" -I -X POST 'http://localhost/direct-volume/resize' -d '{"volumePath": "[volumePath]", "Size": "123123"}'
```
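The same endpoints can be reached programmatically. For example, a Go caller could dial the shim's Unix domain socket like this (a sketch; the socket path and the URL-safe encoding of the volume path are assumptions):

```go
import (
	"context"
	"net"
	"net/http"
)

// queryVolumeStats sends a GET to the shim's /direct-volume/stats handler
// over the sandbox's Unix domain socket.
func queryVolumeStats(shimSocketPath, urlSafeVolumePath string) (*http.Response, error) {
	client := &http.Client{
		Transport: &http.Transport{
			// The host part of the URL is ignored; always dial the shim socket.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", shimSocketPath)
			},
		},
	}
	return client.Get("http://localhost/direct-volume/stats/" + urlSafeVolumePath)
}
```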
The shim then forwards the corresponding request to the `kata-agent` to carry out the operation inside the guest VM. For the `resize` operation, the Kata runtime also needs to notify the hypervisor to resize the block device (e.g., call `block_resize` in QEMU).
The mount spec of a direct-assigned volume is passed to the `kata-agent` through the existing `Storage` GRPC object. Two new APIs and the supporting GRPC objects are added to the GRPC protocol between the shim and the agent for resizing volumes and collecting volume stats:
```protobuf
rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);

message VolumeStatsRequest {
    // The volume path on the guest outside the container.
    string volume_guest_path = 1;
}

message ResizeVolumeRequest {
    // Full VM guest path of the volume (outside the container).
    string volume_guest_path = 1;
    uint64 size = 2;
}

// This should be kept in sync with the CSI NodeGetVolumeStatsResponse
// (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto).
message VolumeStatsResponse {
    // This field is OPTIONAL.
    repeated VolumeUsage usage = 1;
    // Information about the current condition of the volume.
    // This field is OPTIONAL.
    // This field MUST be specified if the VOLUME_CONDITION node
    // capability is supported.
    VolumeCondition volume_condition = 2;
}

message VolumeUsage {
    enum Unit {
        UNKNOWN = 0;
        BYTES = 1;
        INODES = 2;
    }
    // The available capacity in the specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 available = 1;
    // The total capacity in the specified Unit. This field is REQUIRED.
    // The value of this field MUST NOT be negative.
    uint64 total = 2;
    // The used capacity in the specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 used = 3;
    // Units by which values are measured. This field is REQUIRED.
    Unit unit = 4;
}

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {
    // Normal volumes are available for use and operating optimally.
    // An abnormal volume does not meet these criteria.
    // This field is REQUIRED.
    bool abnormal = 1;
    // The message describing the condition of the volume.
    // This field is REQUIRED.
    string message = 2;
}
```
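For reference, the agent can derive the `VolumeUsage` fields from `statfs(2)` on `volume_guest_path`. A minimal sketch, written in Go for consistency with the other examples here (the Kata agent itself is not implemented in Go):

```go
import "golang.org/x/sys/unix"

// usage mirrors one VolumeUsage entry from the message above.
type usage struct {
	available, total, used uint64
}

// volumeUsage computes the BYTES- and INODES-unit usage entries for a
// mounted volume, matching the semantics of the VolumeUsage message.
func volumeUsage(volumeGuestPath string) (bytes, inodes usage, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(volumeGuestPath, &st); err != nil {
		return
	}
	bsize := uint64(st.Bsize)
	bytes = usage{
		available: st.Bavail * bsize,
		total:     st.Blocks * bsize,
		used:      (st.Blocks - st.Bfree) * bsize,
	}
	inodes = usage{
		available: st.Ffree,
		total:     st.Files,
		used:      st.Files - st.Ffree,
	}
	return
}
```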
Given the following definitions:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: ebs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    skip-hostmount: "true"
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOncePod
  volumeMode: Filesystem
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4
```
Let’s assume that the changes described above have been made in the `aws-ebs-csi-driver` node driver.
### Node publish volume

- In the node CSI driver, the `NodePublishVolume` API invokes: `kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"Device\": \"/dev/sdf\", \"fstype\": \"ext4\"}"`.
- The Kata runtime writes the mount-info JSON to a file called `mountInfo.json` under `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
### Node unstage volume

- In the node CSI driver, the `NodeUnstageVolume` API invokes: `kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"`.
- The Kata runtime deletes the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
### Use the volume in a sandbox

- Upon the request to start a container, the `containerd-shim-kata-v2` examines the container spec and iterates through the mounts. For each mount, if there is a `mountInfo.json` file under `/run/kata-containers/shared/direct-volumes/[mount source path]`, it generates a `storage` GRPC object after overwriting the mount spec with the information in `mountInfo.json`.
- The shim sends the storage objects to the kata-agent through TTRPC.
- The shim writes a file with the sandbox id as its name under `/run/kata-containers/shared/direct-volumes/[mount source path]`.
- The kata-agent mounts the storage objects for the container.
### Node expand volume

- In the node CSI driver, the `NodeExpandVolume` API invokes: `kata-runtime direct-volume resize --volume-path "/kubelet/a/b/c/d/sdf" --size 8Gi`.
- The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
- The Kata runtime identifies the shim instance through the sandbox id and sends it a GRPC request to resize the volume.
- The shim handles the request, asks the hypervisor to resize the block device, and sends a GRPC request to the Kata agent to resize the filesystem.
- The Kata agent receives the request and resizes the filesystem.
### Node get volume stats

- In the node CSI driver, the `NodeGetVolumeStats` API invokes: `kata-runtime direct-volume stats --volume-path "/kubelet/a/b/c/d/sdf"`.
- The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
- The Kata runtime identifies the shim instance through the sandbox id and sends it a GRPC request to get the volume stats.
- The shim handles the request and forwards it to the Kata agent.
- The Kata agent receives the request and returns the filesystem stats.